Update Dataset Versions

Create new dataset versions by building on existing ones—add new files, exclude specific files, or start fresh while maintaining version lineage.


When to Update Dataset Versions

Use incremental versioning when you need to:

  • Add new data: Append newly processed files to an existing dataset

  • Remove bad data: Exclude specific files while keeping the rest

  • Swap files: Replace specific files without recreating the entire dataset

  • A/B test data: Create variant datasets by excluding different subsets


Basic Update Pattern

The standard approach creates a new version based on an existing one:

import json

metadata = {
    "new_file.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v3",  # New version to create
                "from": "dataset://my-dataset/v2",  # Base version
                "start_fresh": False,  # Include files from v2
                "exclude": ["bad_file.csv", "old_file.csv"],  # Remove these specific files
            },
        ],
    },
}

# Save your new output file
# (new_data is assumed to be a pandas DataFrame created earlier in your code)
new_data.to_csv("/valohai/outputs/new_file.csv")

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:

  • Creates v3 based on v2

  • Includes all files from v2 except bad_file.csv and old_file.csv

  • Adds new_file.csv to the new version


Update Parameters

uri (required)

The new dataset version to create.


from (optional)

Base the new version on an existing version.

If omitted: New version starts empty (only includes files in current metadata).


start_fresh (optional, default: False)

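Controls whether files from the base version are included. With False (the default), the new version inherits the files of the from version. With True, the new version starts empty but still records the from version as its base for lineage tracking.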


exclude (optional)

List of filenames to exclude from the base version.

Important: Filenames are exact matches (case-sensitive).


targeting_aliases (optional)

Update dataset aliases to point to the new version.

See Dataset Aliases for details.


Common Update Patterns

Add New Files to Existing Dataset

Keep all existing files, add new ones:
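A minimal sketch of this pattern, assuming a dataset named my-dataset and five new output files named data_010.csv through data_014.csv. Attaching the same version entry to every new file is an assumption of this sketch. Save each file under /valohai/outputs/ and write the metadata to valohai.metadata.jsonl as in the basic update pattern.

```python
# Assumed names: the dataset "my-dataset" and the data_010..data_014 files
# are illustrative; adjust them to your project.
new_files = [f"data_{i:03d}.csv" for i in range(10, 15)]

metadata = {
    file_name: {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v2",   # new version to create
                "from": "dataset://my-dataset/v1",  # keep every file from v1
            },
        ],
    }
    for file_name in new_files
}
```

No exclude list is given, so nothing from v1 is dropped.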

Result:

  • v2 = All files from v1 + data_010.csv through data_014.csv


Remove Specific Files

Exclude files that failed validation:
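A minimal sketch, assuming a dataset named my-dataset and three hypothetical failed files. The metadata is attached to validation_report.txt, the one output file this execution saves.

```python
# Hypothetical names of the files that failed validation.
failed_files = ["data_003.csv", "data_007.csv", "data_012.csv"]

metadata = {
    "validation_report.txt": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v2",
                "from": "dataset://my-dataset/v1",
                "exclude": failed_files,  # exact, case-sensitive filenames
            },
        ],
    },
}
```

Save validation_report.txt to /valohai/outputs/ before writing the metadata file.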

Result:

  • v2 = All files from v1 except the three failed files + validation report

⚠️ The validation_report.txt file can't be excluded from the new dataset version, even if you add it to the exclude list.


Replace Specific Files

Exclude old files and add updated versions:
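A minimal sketch, assuming a dataset named my-dataset and that the corrected files are saved under the same names as the ones they replace. Sharing one version entry between both output files is an assumption of this sketch.

```python
# Files whose old copies are dropped from v2 and whose new copies go into v3.
replaced_files = ["data_001.csv", "data_005.csv"]

version_entry = {
    "uri": "dataset://my-dataset/v3",
    "from": "dataset://my-dataset/v2",
    "exclude": replaced_files,  # removes the old copies inherited from v2
}

# Each newly saved file carries the same version entry.
metadata = {
    file_name: {"valohai.dataset-versions": [version_entry]}
    for file_name in replaced_files
}
```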

Result:

  • v3 = All files from v2 except old data_001.csv and data_005.csv + new versions


Start Fresh While Keeping Lineage

Create a new version lineage without inheriting files:
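A minimal sketch, assuming a dataset named my-dataset and a single output file named new_approach.csv. The from key names the base version while start_fresh stops its files from being inherited.

```python
metadata = {
    "new_approach.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v2",
                "from": "dataset://my-dataset/v1",  # recorded as the base version
                "start_fresh": True,                # do not inherit files from v1
            },
        ],
    },
}
```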

Result:

  • v2 = Only new_approach.csv (no files from v1)

  • Version history shows v2 was based on v1 (for tracking)

Use case: You want to track version lineage (this experiment followed that one) but don't want to inherit data.


Create A/B Test Variants

Split dataset into two variants for comparison:
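A minimal sketch for Variant A. The file groups, the baseline and variant-a version names, and the variant_a_notes.txt output file are all hypothetical, and variant_metadata is a helper written for this example rather than part of any library.

```python
# Hypothetical file groups; replace with the files in your baseline version.
group_a_files = ["segment_a_01.csv", "segment_a_02.csv"]
group_b_files = ["segment_b_01.csv", "segment_b_02.csv"]


def variant_metadata(marker_file, variant_uri, base_uri, excluded_files):
    """Attach a variant version to one output file, dropping excluded_files."""
    return {
        marker_file: {
            "valohai.dataset-versions": [
                {
                    "uri": variant_uri,
                    "from": base_uri,
                    "exclude": list(excluded_files),
                },
            ],
        },
    }


# Variant A keeps the group A files by excluding the group B files.
metadata = variant_metadata(
    "variant_a_notes.txt",
    "dataset://my-dataset/variant-a",
    "dataset://my-dataset/baseline",
    group_b_files,
)
```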

Run a second execution for Variant B that excludes group A files instead.


Update via Web UI

The UI provides a simpler way to create new versions based on existing ones.

  1. Open your dataset in Data → Datasets

  2. Find the version you want to use as the base

  3. Click the ... menu at the end of the version row

  4. Select "Create new version from this version"

  5. The new version starts with all files from the base version

  6. Add new files or remove unwanted files

  7. Name the new version

  8. Save

Limitations: The UI doesn't support the exclude parameter—you must manually remove files one by one.


Combining Multiple Updates

You can reference multiple base versions or create complex update logic:

Result: combined/v1 includes files from both source-a/latest and source-b/latest.


Best Practices

Use Descriptive Version Names
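One possible convention is a date plus a short description of the change. The sketch below assumes that convention. The version_uri helper is hypothetical, and whether every character it produces is accepted in a version name should be checked against your Valohai setup.

```python
from datetime import date


def version_uri(dataset, change, day):
    """Build a version URI such as dataset://my-dataset/2024-01-15-some-change."""
    return f"dataset://{dataset}/{day.isoformat()}-{change}"


uri = version_uri("my-dataset", "removed-failed-validation", date(2024, 1, 15))
print(uri)  # dataset://my-dataset/2024-01-15-removed-failed-validation
```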


Document Changes

Add a marker file explaining what changed:
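A minimal sketch, assuming a marker file named CHANGES.txt. The file name and the write_change_note helper are illustrative. The helper writes the note and returns metadata that adds the note to the new version.

```python
import os
import tempfile


def write_change_note(outputs_dir, note, version_entry, file_name="CHANGES.txt"):
    """Write a change note and return metadata attaching it to the new version."""
    with open(os.path.join(outputs_dir, file_name), "w") as handle:
        handle.write(note.rstrip() + "\n")
    return {file_name: {"valohai.dataset-versions": [version_entry]}}


# Local dry run in a temporary directory; in an execution pass "/valohai/outputs".
with tempfile.TemporaryDirectory() as outputs_dir:
    metadata = write_change_note(
        outputs_dir,
        "v3: replaced data_001.csv and data_005.csv with corrected exports.",
        {"uri": "dataset://my-dataset/v3", "from": "dataset://my-dataset/v2"},
    )
    with open(os.path.join(outputs_dir, "CHANGES.txt")) as handle:
        note_text = handle.read()
```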


Validate Before Creating New Version
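One way to do this is to build the dataset metadata only after every new file passes a check, so a failed check never produces a version. The sketch below assumes CSV files and a hypothetical required schema. Both helper functions are written for this example.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "label"}  # hypothetical schema for this example


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return sorted(set(required) - set(header))


def metadata_if_valid(files, version_entry):
    """files maps output file names to CSV text; raise if any file is invalid."""
    failures = {}
    for name, text in files.items():
        missing = missing_columns(text)
        if missing:
            failures[name] = missing
    if failures:
        raise ValueError(f"validation failed, no version created: {failures}")
    return {name: {"valohai.dataset-versions": [version_entry]} for name in files}
```

If metadata_if_valid raises, skip writing valohai.metadata.jsonl so that no new version is requested.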


Use Aliases for Promotion Workflow
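A minimal sketch of the first step of such a workflow. It assumes that targeting_aliases takes a list of alias names, so confirm the exact format on the Dataset Aliases page. The staging alias, the version names, and the promoted_version helper are hypothetical.

```python
def promoted_version(uri, base_uri, aliases):
    """Version entry that also points the given aliases at the new version."""
    return {
        "uri": uri,
        "from": base_uri,
        "targeting_aliases": list(aliases),  # assumed format: list of alias names
    }


metadata = {
    "validated_data.csv": {
        "valohai.dataset-versions": [
            promoted_version(
                "dataset://my-dataset/v4",
                "dataset://my-dataset/v3",
                ["staging"],
            ),
        ],
    },
}
```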


Common Issues & Fixes

Files Not Excluded

Symptom: Specified files in exclude list still appear in new version

Causes & Fixes:

  • Typo in filename → Filenames must match exactly (case-sensitive)

  • File path included → Use just filename, not full path

  • start_fresh: True → exclude doesn't apply with start_fresh: True

Debug:
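A small check like the following can catch the causes above before the metadata is written. It is a sketch rather than a Valohai tool: check_exclude_list is a hypothetical helper, and base_files is a list of the base version's filenames that you supply yourself, for example copied from the UI.

```python
import os


def check_exclude_list(exclude, base_files, start_fresh=False):
    """Return human-readable problems found in an exclude list."""
    problems = []
    if start_fresh and exclude:
        problems.append("start_fresh is True, so exclude has no effect")
    for name in exclude:
        base_name = os.path.basename(name)
        if base_name != name:
            problems.append(f"{name!r} contains a path; use {base_name!r}")
        elif name not in base_files:
            # Look for a match that differs only in letter case.
            close = [b for b in base_files if b.lower() == name.lower()]
            hint = f"; did you mean {close[0]!r}?" if close else ""
            problems.append(f"{name!r} is not in the base version{hint}")
    return problems


for problem in check_exclude_list(
    ["Bad_File.csv", "data/old_file.csv", "ok.csv"],
    ["bad_file.csv", "old_file.csv", "ok.csv"],
):
    print(problem)
```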


Base Version Files Not Included

Symptom: New version is empty or missing files from base version

Causes & Fixes:

  • start_fresh: True → Set to False to include base files

  • Missing from parameter → Specify which version to base on

  • Wrong base version URI → Verify version exists and spelling is correct


New Version Not Created

Symptom: Execution completes but dataset version doesn't appear

How to diagnose:

  1. Open execution in Valohai UI

  2. Click Alerts tab

  3. Look for dataset creation errors

Common causes:

  • Invalid JSON in metadata → Validate JSON syntax

  • Missing output files → Ensure all files in metadata were actually saved

  • Base version doesn't exist → Verify from version exists
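The first two causes can be checked locally before the execution ends. The sketch below assumes the metadata file layout from the basic update pattern, with one JSON object per line and a file key in each. check_metadata is a hypothetical helper, not part of Valohai.

```python
import json
import os
import tempfile


def check_metadata(outputs_dir):
    """Return a list of problems found in valohai.metadata.jsonl."""
    problems = []
    metadata_path = os.path.join(outputs_dir, "valohai.metadata.jsonl")
    with open(metadata_path) as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {number}: invalid JSON ({error.msg})")
                continue
            file_name = record.get("file") if isinstance(record, dict) else None
            if not file_name:
                problems.append(f"line {number}: no 'file' key")
            elif not os.path.exists(os.path.join(outputs_dir, file_name)):
                problems.append(f"line {number}: output file {file_name!r} was not saved")
    return problems


# Local dry run; in an execution, call check_metadata("/valohai/outputs").
with tempfile.TemporaryDirectory() as outputs_dir:
    with open(os.path.join(outputs_dir, "new_file.csv"), "w") as handle:
        handle.write("id\n1\n")
    with open(os.path.join(outputs_dir, "valohai.metadata.jsonl"), "w") as handle:
        handle.write(json.dumps({"file": "new_file.csv", "metadata": {}}) + "\n")
        handle.write(json.dumps({"file": "missing.csv", "metadata": {}}) + "\n")
        handle.write("{'file': 'broken.csv'}\n")  # single quotes are not valid JSON
    problems = check_metadata(outputs_dir)
```

In this dry run, problems holds two entries: one for the unsaved missing.csv and one for the line that is not valid JSON.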



Next Steps

  • Practice updating a dataset by excluding specific files

  • Set up a validation → production promotion workflow using aliases

  • Create A/B test variants from a baseline dataset

  • Learn about dataset packaging for large collections
