Update Dataset Versions

Create new dataset versions by building on existing ones—add new files, exclude specific files, or start fresh while maintaining version lineage.


When to Update Dataset Versions

Use incremental versioning when you need to:

  • Add new data — Append newly processed files to existing dataset

  • Remove bad data — Exclude specific files while keeping the rest

  • Swap files — Replace specific files without recreating entire dataset

  • A/B test data — Create variant datasets by excluding different subsets


Basic Update Pattern

The standard approach creates a new version based on an existing one:

import json

metadata = {
    "new_file.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-dataset/v3",           # New version to create
            'from': "dataset://my-dataset/v2",          # Base version
            'start_fresh': False,                       # Include files from v2
            'exclude': ['bad_file.csv', 'old_file.csv'] # Remove these specific files
        }]
    }
}

# Save your new output file (new_data is a pandas DataFrame created earlier)
new_data.to_csv('/valohai/outputs/new_file.csv')

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:

  • Creates v3 based on v2

  • Includes all files from v2 except bad_file.csv and old_file.csv

  • Adds new_file.csv to the new version
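
Every example on this page writes the metadata sidecar the same way, so you can wrap the loop in a small helper. This is just a convenience sketch; the save_dataset_metadata name is ours, not part of any Valohai library:

import json

def save_dataset_metadata(metadata, path="/valohai/outputs/valohai.metadata.jsonl"):
    """Write a {filename: metadata} dict as Valohai's JSONL sidecar file."""
    with open(path, "w") as outfile:
        for file_name, file_metadata in metadata.items():
            json.dump({"file": file_name, "metadata": file_metadata}, outfile)
            outfile.write("\n")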


Update Parameters

uri (required)

The new dataset version to create.

'uri': "dataset://my-dataset/v3"

from (optional)

Base the new version on an existing version.

'from': "dataset://my-dataset/v2"

If omitted: New version starts empty (only includes files in current metadata).
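
For example, a minimal metadata entry that omits from creates a version containing only the current execution's output (file and dataset names are illustrative):

metadata = {
    "only_file.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-dataset/v1"  # No 'from': version holds only this file
        }]
    }
}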


start_fresh (optional, default: False)

Controls whether to include files from the base version.

# Include all files from base (default)
'start_fresh': False

# Exclude all files from base
'start_fresh': True

exclude (optional)

List of filenames to exclude from the base version.

'exclude': ['file1.csv', 'file2.csv', 'file3.csv']

Important: Filenames are exact matches (case-sensitive).
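
Because matching is exact, it can be safer to build the exclude list from the basenames you actually observe than to type them by hand. A sketch, assuming the base version is mounted as an input named base-dataset (as in the debug snippet further below):

import os
from valohai import inputs  # valohai-utils

# Exact (case-sensitive) filenames present in the base version
base_names = {os.path.basename(p) for p in inputs('base-dataset').paths()}

# Only exclude names that really exist in the base version
candidates = ['bad_file.csv', 'Bad_File.csv']
exclude = [name for name in candidates if name in base_names]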


targeting_aliases (optional)

Update dataset aliases to point to the new version.

'targeting_aliases': ['production', 'stable']

See Dataset Aliases for details.


Common Update Patterns

Add New Files to Existing Dataset

Keep all existing files, add new ones:

import json

# Process and save new files (process_data is your own processing function)
for i in range(10, 15):
    new_data = process_data(i)
    new_data.to_csv(f'/valohai/outputs/data_{i:03d}.csv')

# Add new files to dataset
metadata = {}
for i in range(10, 15):
    metadata[f"data_{i:03d}.csv"] = {
        "valohai.dataset-versions": [{
            'uri': "dataset://training-data/v2",
            'from': "dataset://training-data/v1",
            'start_fresh': False  # Keep all files from v1
        }]
    }

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = All files from v1 + data_010.csv through data_014.csv


Remove Specific Files

Exclude files that failed validation:

import json

# Files that failed quality checks
failed_files = ['corrupted_001.csv', 'outlier_042.csv', 'invalid_099.csv']

# Create new version without failed files
metadata = {
    "validation_report.txt": {  # Can add a new file too
        "valohai.dataset-versions": [{
            'uri': "dataset://clean-data/v2",
            'from': "dataset://clean-data/v1",
            'start_fresh': False,
            'exclude': failed_files  # Remove these
        }]
    }
}

# Save validation report
with open('/valohai/outputs/validation_report.txt', 'w') as f:
    f.write(f"Removed {len(failed_files)} files\n")

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = All files from v1 except the three failed files + validation report

⚠️ The validation_report.txt file itself can't be excluded from the new dataset version, even if it is added to the exclude list!


Replace Specific Files

Exclude old files and add updated versions:

import json

# Save updated files (updated_data_001 / updated_data_005 are the reprocessed DataFrames)
updated_data_001.to_csv('/valohai/outputs/data_001.csv')  # New version of data_001
updated_data_005.to_csv('/valohai/outputs/data_005.csv')  # New version of data_005

metadata = {
    "data_001.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    },
    "data_005.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v3 = All files from v2 except old data_001.csv and data_005.csv + new versions


Start Fresh While Keeping Lineage

Create a new version lineage without inheriting files:

import json

# Create a completely new dataset version (new_data is a pandas DataFrame)
new_data.to_csv('/valohai/outputs/new_approach.csv')

metadata = {
    "new_approach.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://experiments/v2",
            'from': "dataset://experiments/v1",
            'start_fresh': True  # Don't include any files from v1
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result:

  • v2 = Only new_approach.csv (no files from v1)

  • Version history shows v2 was based on v1 (for tracking)

Use case: You want to track version lineage (this experiment followed that one) but don't want to inherit data.


Create A/B Test Variants

Split dataset into two variants for comparison:

import json

# Save a marker file for each variant
with open('/valohai/outputs/variant_a_marker.txt', 'w') as f:
    f.write("Variant A configuration\n")

# Variant A: Exclude group B files
metadata = {
    "variant_a_marker.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://ab-test/variant-a",
            'from': "dataset://ab-test/baseline",
            'start_fresh': False,
            'exclude': ['group_b_001.csv', 'group_b_002.csv', 'group_b_003.csv'],
            'targeting_aliases': ['current-a']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Run a second execution for Variant B that excludes group A files instead.
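
A sketch of what that Variant B execution could write, mirroring the Variant A metadata above (the group A filenames are illustrative):

import json

with open('/valohai/outputs/variant_b_marker.txt', 'w') as f:
    f.write("Variant B configuration\n")

# Variant B: Exclude group A files instead
metadata = {
    "variant_b_marker.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://ab-test/variant-b",
            'from': "dataset://ab-test/baseline",
            'start_fresh': False,
            'exclude': ['group_a_001.csv', 'group_a_002.csv', 'group_a_003.csv'],
            'targeting_aliases': ['current-b']
        }]
    }
}

# Write valohai.metadata.jsonl exactly as in the Variant A example above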


Update via Web UI

The UI provides a simpler way to create new versions based on existing ones.

  1. Open your dataset in Data → Datasets

  2. Find the version you want to base on

  3. Click the ... menu at the end of the version row

  4. Select "Create new version from this version"

  5. The new version starts with all files from the base version

  6. Add new files or remove unwanted files

  7. Name the new version

  8. Save

Limitations: The UI doesn't support the exclude parameter—you must manually remove files one by one.


Combining Multiple Updates

You can reference multiple base versions or create complex update logic:

import json

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {
                'uri': "dataset://combined/v1",
                'from': "dataset://source-a/latest",
                'start_fresh': False
            },
            {
                'uri': "dataset://combined/v1",  # Same target version
                'from': "dataset://source-b/latest",
                'start_fresh': False
            }
        ]
    }
}

Result: combined/v1 includes files from both source-a/latest and source-b/latest.
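
When merging more than two sources, the entries can be generated in a loop. A sketch with hypothetical dataset names:

sources = [
    "dataset://source-a/latest",
    "dataset://source-b/latest",
    "dataset://source-c/latest",
]

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {'uri': "dataset://combined/v1", 'from': src, 'start_fresh': False}
            for src in sources
        ]
    }
}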


Best Practices

Use Descriptive Version Names

# Good: Clear what changed
'uri': "dataset://my-data/v2-removed-outliers"
'uri': "dataset://my-data/v3-added-aug-2024"

# Avoid: Generic names
'uri': "dataset://my-data/new"
'uri': "dataset://my-data/fixed"

Document Changes

Add a marker file explaining what changed:

# Create changelog file
with open('/valohai/outputs/CHANGELOG.txt', 'w') as f:
    f.write("Version v3 changes:\n")
    f.write("- Removed 15 corrupted files\n")
    f.write("- Added 50 new validated samples\n")
    f.write("- Reprocessed all images with new augmentation\n")

# Include in metadata (corrupted_files is your list of removed filenames)
metadata = {
    "CHANGELOG.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': corrupted_files
        }]
    }
}

Validate Before Creating New Version

import os
import json

# Verify files exist before adding to dataset
output_dir = '/valohai/outputs/'
expected_files = ['data_001.csv', 'data_002.csv', 'data_003.csv']

for filename in expected_files:
    filepath = os.path.join(output_dir, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Expected file missing: {filename}")
    
    # Verify file is not empty
    if os.path.getsize(filepath) == 0:
        raise ValueError(f"File is empty: {filename}")

# Now safe to create metadata
metadata = {
    filename: {
        "valohai.dataset-versions": [{
            'uri': "dataset://validated-data/v2",
            'from': "dataset://validated-data/v1",
            'start_fresh': False
        }]
    }
    for filename in expected_files
}

Use Aliases for Promotion Workflow

# Development → Staging → Production pipeline
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [{
            'uri': "dataset://models/v5",
            'from': "dataset://models/v4",
            'start_fresh': False,
            'targeting_aliases': ['staging']  # Promote to staging first
        }]
    }
}

# After validation, manually update 'production' alias to point to v5

Common Issues & Fixes

Files Not Excluded

Symptom: Files listed in exclude still appear in the new version

Causes & Fixes:

  • Typo in filename → Filenames must match exactly (case-sensitive)

  • File path included → Use just filename, not full path

  • start_fresh: True → exclude doesn't apply when start_fresh is True

Debug:

# Print the exact filenames from the base version
import os
from valohai import inputs  # valohai-utils

for file_path in inputs('base-dataset').paths():
    print(f"Filename: {os.path.basename(file_path)}")

Base Version Files Not Included

Symptom: New version is empty or missing files from base version

Causes & Fixes:

  • start_fresh: True → Set to False to include base files

  • Missing from parameter → Specify which version to base on

  • Wrong base version URI → Verify version exists and spelling is correct


New Version Not Created

Symptom: Execution completes but dataset version doesn't appear

How to diagnose:

  1. Open execution in Valohai UI

  2. Click Alerts tab

  3. Look for dataset creation errors

Common causes:

  • Invalid JSON in metadata → Validate JSON syntax

  • Missing output files → Ensure all files in metadata were actually saved

  • Base version doesn't exist → Verify from version exists
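
One way to catch the first two causes before the execution ends is to read the metadata file back and parse every line; this sketch checks only JSON syntax and output existence:

import json
import os

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path) as infile:
    for line_no, line in enumerate(infile, start=1):
        entry = json.loads(line)  # Raises on invalid JSON
        file_path = os.path.join("/valohai/outputs", entry["file"])
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Line {line_no}: missing output {entry['file']}")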



Next Steps

  • Practice updating a dataset by excluding specific files

  • Set up a validation → production promotion workflow using aliases

  • Create A/B test variants from a baseline dataset

  • Learn about dataset packaging for large collections
