Update Dataset Versions

Create new dataset versions by building on existing ones—add new files, exclude specific files, or start fresh while maintaining version lineage.


When to Update Dataset Versions

Use incremental versioning when you need to:

  • Add new data — Append newly processed files to existing dataset

  • Remove bad data — Exclude specific files while keeping the rest

  • Swap files — Replace specific files without recreating entire dataset

  • A/B test data — Create variant datasets by excluding different subsets


Basic Update Pattern

The standard approach creates a new version based on an existing one:

import json

metadata = {
    "new_file.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-dataset/v3",           # New version to create
            'from': "dataset://my-dataset/v2",          # Base version
            'start_fresh': False,                       # Include files from v2
            'exclude': ['bad_file.csv', 'old_file.csv'] # Remove these specific files
        }]
    }
}

# Save your new output file (new_data is a pandas DataFrame created earlier)
new_data.to_csv('/valohai/outputs/new_file.csv')

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:

  • Creates v3 based on v2

  • Includes all files from v2 except bad_file.csv and old_file.csv

  • Adds new_file.csv to the new version
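
Every example on this page writes the metadata sidecar the same way, so you can wrap the loop in a small helper. This is just a convenience sketch; the save_dataset_metadata name is ours, not part of any Valohai library:

import json

def save_dataset_metadata(metadata, path="/valohai/outputs/valohai.metadata.jsonl"):
    """Write a {filename: metadata} dict as Valohai's JSONL sidecar file."""
    with open(path, "w") as outfile:
        for file_name, file_metadata in metadata.items():
            json.dump({"file": file_name, "metadata": file_metadata}, outfile)
            outfile.write("\n")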


Update Parameters

uri (required)

The new dataset version to create.

'uri': "dataset://my-dataset/v3"

from (optional)

Base the new version on an existing version.

'from': "dataset://my-dataset/v2"

If omitted: New version starts empty (only includes files in current metadata).
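
For example, a minimal metadata entry that omits from creates a version containing only the current execution's output (file and dataset names are illustrative):

metadata = {
    "only_file.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-dataset/v1"  # No 'from': version holds only this file
        }]
    }
}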


start_fresh (optional, default: False)

Controls whether to include files from the base version.

# Include all files from base (default)
'start_fresh': False

# Exclude all files from base
'start_fresh': True

exclude (optional)

List of filenames to exclude from the base version.

'exclude': ['file1.csv', 'file2.csv', 'file3.csv']

Important: Filenames are exact matches (case-sensitive).
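
Because matching is exact, it can be safer to build the exclude list from the basenames you actually observe than to type them by hand. A sketch, assuming the base version is mounted as an input named base-dataset (as in the debug snippet further below):

import os
from valohai import inputs  # valohai-utils

# Exact (case-sensitive) filenames present in the base version
base_names = {os.path.basename(p) for p in inputs('base-dataset').paths()}

# Only exclude names that really exist in the base version
candidates = ['bad_file.csv', 'Bad_File.csv']
exclude = [name for name in candidates if name in base_names]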


targeting_aliases (optional)

Update dataset aliases to point to the new version.

'targeting_aliases': ['production', 'stable']

See Dataset Aliases for details.


Common Update Patterns

Add New Files to Existing Dataset

Keep all existing files, add new ones:

import json

# Process and save new files (process_data is your own processing function)
for i in range(10, 15):
    new_data = process_data(i)
    new_data.to_csv(f'/valohai/outputs/data_{i:03d}.csv')

# Add new files to dataset
metadata = {}
for i in range(10, 15):
    metadata[f"data_{i:03d}.csv"] = {
        "valohai.dataset-versions": [{
            'uri': "dataset://training-data/v2",
            'from': "dataset://training-data/v1",
            'start_fresh': False  # Keep all files from v1
        }]
    }

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = All files from v1 + data_010.csv through data_014.csv


Remove Specific Files

Exclude files that failed validation:

import json

# Files that failed quality checks
failed_files = ['corrupted_001.csv', 'outlier_042.csv', 'invalid_099.csv']

# Create new version without failed files
metadata = {
    "validation_report.txt": {  # Can add a new file too
        "valohai.dataset-versions": [{
            'uri': "dataset://clean-data/v2",
            'from': "dataset://clean-data/v1",
            'start_fresh': False,
            'exclude': failed_files  # Remove these
        }]
    }
}

# Save validation report
with open('/valohai/outputs/validation_report.txt', 'w') as f:
    f.write(f"Removed {len(failed_files)} files\n")

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = All files from v1 except the three failed files + validation report

⚠️ The validation_report.txt file itself can't be excluded from the new dataset version, even if it is added to the exclude list!


Replace Specific Files

Exclude old files and add updated versions:

import json

# Save updated files (updated_data_001 / updated_data_005 are the reprocessed DataFrames)
updated_data_001.to_csv('/valohai/outputs/data_001.csv')  # New version of data_001
updated_data_005.to_csv('/valohai/outputs/data_005.csv')  # New version of data_005

metadata = {
    "data_001.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    },
    "data_005.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v3 = All files from v2 except old data_001.csv and data_005.csv + new versions


Start Fresh While Keeping Lineage

Create a new version lineage without inheriting files:

import json

# Create a completely new dataset version (new_data is a pandas DataFrame)
new_data.to_csv('/valohai/outputs/new_approach.csv')

metadata = {
    "new_approach.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://experiments/v2",
            'from': "dataset://experiments/v1",
            'start_fresh': True  # Don't include any files from v1
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = Only new_approach.csv (no files from v1)

  • Version history shows v2 was based on v1 (for tracking)

Use case: You want to track version lineage (this experiment followed that one) but don't want to inherit data.


Create A/B Test Variants

Split dataset into two variants for comparison:

import json

# Save a marker file for each variant
with open('/valohai/outputs/variant_a_marker.txt', 'w') as f:
    f.write("Variant A configuration\n")

# Variant A: Exclude group B files
metadata = {
    "variant_a_marker.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://ab-test/variant-a",
            'from': "dataset://ab-test/baseline",
            'start_fresh': False,
            'exclude': ['group_b_001.csv', 'group_b_002.csv', 'group_b_003.csv'],
            'targeting_aliases': ['current-a']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Run a second execution for Variant B that excludes group A files instead.
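
A sketch of what that Variant B execution could write, mirroring the Variant A metadata above (the group A filenames are illustrative):

import json

with open('/valohai/outputs/variant_b_marker.txt', 'w') as f:
    f.write("Variant B configuration\n")

# Variant B: Exclude group A files instead
metadata = {
    "variant_b_marker.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://ab-test/variant-b",
            'from': "dataset://ab-test/baseline",
            'start_fresh': False,
            'exclude': ['group_a_001.csv', 'group_a_002.csv', 'group_a_003.csv'],
            'targeting_aliases': ['current-b']
        }]
    }
}

# Write valohai.metadata.jsonl exactly as in the Variant A example above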


Update via Web UI

The UI provides a simpler way to create new versions based on existing ones.

  1. Open your dataset in Data → Datasets

  2. Find the version you want to base on

  3. Click the ... menu at the end of the version row

  4. Select "Create new version from this version"

  5. The new version starts with all files from the base version

  6. Add new files or remove unwanted files

  7. Name the new version

  8. Save

Limitations: The UI doesn't support the exclude parameter—you must manually remove files one by one.


Combining Multiple Updates

You can reference multiple base versions or create complex update logic:

import json

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {
                'uri': "dataset://combined/v1",
                'from': "dataset://source-a/latest",
                'start_fresh': False
            },
            {
                'uri': "dataset://combined/v1",  # Same target version
                'from': "dataset://source-b/latest",
                'start_fresh': False
            }
        ]
    }
}

Result: combined/v1 includes files from both source-a/latest and source-b/latest.
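
When merging more than two sources, the entries can be generated in a loop. A sketch with hypothetical dataset names:

sources = [
    "dataset://source-a/latest",
    "dataset://source-b/latest",
    "dataset://source-c/latest",
]

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {'uri': "dataset://combined/v1", 'from': src, 'start_fresh': False}
            for src in sources
        ]
    }
}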


Best Practices

Use Descriptive Version Names

# Good: Clear what changed
'uri': "dataset://my-data/v2-removed-outliers"
'uri': "dataset://my-data/v3-added-aug-2024"

# Avoid: Generic names
'uri': "dataset://my-data/new"
'uri': "dataset://my-data/fixed"

Document Changes

Add a marker file explaining what changed:

# Create changelog file
with open('/valohai/outputs/CHANGELOG.txt', 'w') as f:
    f.write("Version v3 changes:\n")
    f.write("- Removed 15 corrupted files\n")
    f.write("- Added 50 new validated samples\n")
    f.write("- Reprocessed all images with new augmentation\n")

# Include in metadata (corrupted_files is your list of removed filenames)
metadata = {
    "CHANGELOG.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': corrupted_files
        }]
    }
}

Validate Before Creating New Version

import os
import json

# Verify files exist before adding to dataset
output_dir = '/valohai/outputs/'
expected_files = ['data_001.csv', 'data_002.csv', 'data_003.csv']

for filename in expected_files:
    filepath = os.path.join(output_dir, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Expected file missing: {filename}")
    
    # Verify file is not empty
    if os.path.getsize(filepath) == 0:
        raise ValueError(f"File is empty: {filename}")

# Now safe to create metadata
metadata = {
    filename: {
        "valohai.dataset-versions": [{
            'uri': "dataset://validated-data/v2",
            'from': "dataset://validated-data/v1",
            'start_fresh': False
        }]
    }
    for filename in expected_files
}

Use Aliases for Promotion Workflow

# Development → Staging → Production pipeline
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [{
            'uri': "dataset://models/v5",
            'from': "dataset://models/v4",
            'start_fresh': False,
            'targeting_aliases': ['staging']  # Promote to staging first
        }]
    }
}

# After validation, manually update 'production' alias to point to v5

Common Issues & Fixes

Files Not Excluded

Symptom: Files listed in exclude still appear in the new version

Causes & Fixes:

  • Typo in filename → Filenames must match exactly (case-sensitive)

  • File path included → Use just filename, not full path

  • start_fresh: True → exclude doesn't apply when start_fresh is True

Debug:

# Print the exact filenames from the base version
import os
from valohai import inputs  # valohai-utils

for file_path in inputs('base-dataset').paths():
    print(f"Filename: {os.path.basename(file_path)}")

Base Version Files Not Included

Symptom: New version is empty or missing files from base version

Causes & Fixes:

  • start_fresh: True → Set to False to include base files

  • Missing from parameter → Specify which version to base on

  • Wrong base version URI → Verify version exists and spelling is correct


New Version Not Created

Symptom: Execution completes but dataset version doesn't appear

How to diagnose:

  1. Open execution in Valohai UI

  2. Click Alerts tab

  3. Look for dataset creation errors

Common causes:

  • Invalid JSON in metadata → Validate JSON syntax

  • Missing output files → Ensure all files in metadata were actually saved

  • Base version doesn't exist → Verify from version exists
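
One way to catch the first two causes before the execution ends is to read the metadata file back and parse every line; this sketch checks only JSON syntax and output existence:

import json
import os

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path) as infile:
    for line_no, line in enumerate(infile, start=1):
        entry = json.loads(line)  # Raises on invalid JSON
        file_path = os.path.join("/valohai/outputs", entry["file"])
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Line {line_no}: missing output {entry['file']}")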



Next Steps

  • Practice updating a dataset by excluding specific files

  • Set up a validation → production promotion workflow using aliases

  • Create A/B test variants from a baseline dataset

  • Learn about dataset packaging for large collections
