Update Dataset Versions
Create new dataset versions by building on existing ones—add new files, exclude specific files, or start fresh while maintaining version lineage.
When to Update Dataset Versions
Use incremental versioning when you need to:
Add new data — Append newly processed files to existing dataset
Remove bad data — Exclude specific files while keeping the rest
Swap files — Replace specific files without recreating entire dataset
A/B test data — Create variant datasets by excluding different subsets
Basic Update Pattern
The standard approach creates a new version based on an existing one:
import json

metadata = {
    "new_file.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-dataset/v3",   # New version to create
            'from': "dataset://my-dataset/v2",  # Base version
            'start_fresh': False,               # Include files from v2
            'exclude': ['bad_file.csv', 'old_file.csv']  # Remove these specific files
        }]
    }
}

# Save your new output file
new_data.to_csv('/valohai/outputs/new_file.csv')

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:
Creates v3 based on v2
Includes all files from v2 except bad_file.csv and old_file.csv
Adds new_file.csv to the new version
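Before relying on the sidecar file, it can help to read it back and confirm every line is valid JSON with the expected keys. A minimal sanity check (the helper name `check_metadata_jsonl` is illustrative, not part of any Valohai library):

```python
import json

def check_metadata_jsonl(path):
    """Parse each line of a metadata JSONL file and verify the expected keys."""
    entries = []
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            if "file" not in record or "metadata" not in record:
                raise ValueError(f"Line {line_number} is missing 'file' or 'metadata'")
            entries.append(record)
    return entries
```

Calling this on /valohai/outputs/valohai.metadata.jsonl right after writing the file catches malformed lines while the execution is still running.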
Update Parameters
uri (required)
The new dataset version to create.
'uri': "dataset://my-dataset/v3"

from (optional)
Base the new version on an existing version.
'from': "dataset://my-dataset/v2"
If omitted: the new version starts empty (only includes files in the current metadata).

start_fresh (optional, default: False)
Controls whether to include files from the base version.
# Include all files from base (default)
'start_fresh': False
# Exclude all files from base
'start_fresh': True

exclude (optional)
List of filenames to exclude from the base version.
'exclude': ['file1.csv', 'file2.csv', 'file3.csv']
Important: Filenames are exact matches (case-sensitive).

targeting_aliases (optional)
Update dataset aliases to point to the new version.
'targeting_aliases': ['production', 'stable']
See Dataset Aliases for details.
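The parameters above can be wrapped in a small helper so each output file gets a consistent entry. A sketch; `dataset_version_entry` is a hypothetical convenience function, not part of any Valohai library:

```python
def dataset_version_entry(uri, base=None, start_fresh=False, exclude=None, aliases=None):
    """Build one file's metadata from the documented dataset-version parameters."""
    entry = {"uri": uri, "start_fresh": start_fresh}
    if base is not None:
        entry["from"] = base
    if exclude:
        entry["exclude"] = list(exclude)
    if aliases:
        entry["targeting_aliases"] = list(aliases)
    return {"valohai.dataset-versions": [entry]}
```

For example, `metadata["new_file.csv"] = dataset_version_entry("dataset://my-dataset/v3", base="dataset://my-dataset/v2", exclude=["bad_file.csv"])` reproduces the basic pattern shown earlier.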
Common Update Patterns
Add New Files to Existing Dataset
Keep all existing files, add new ones:
import json

# Process and save new files
for i in range(10, 15):
    new_data = process_data(i)
    new_data.to_csv(f'/valohai/outputs/data_{i:03d}.csv')

# Add new files to dataset
metadata = {}
for i in range(10, 15):
    metadata[f"data_{i:03d}.csv"] = {
        "valohai.dataset-versions": [{
            'uri': "dataset://training-data/v2",
            'from': "dataset://training-data/v1",
            'start_fresh': False  # Keep all files from v1
        }]
    }

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result: v2 = all files from v1 + data_010.csv through data_014.csv
Remove Specific Files
Exclude files that failed validation:
import json

# Files that failed quality checks
failed_files = ['corrupted_001.csv', 'outlier_042.csv', 'invalid_099.csv']

# Create new version without failed files
metadata = {
    "validation_report.txt": {  # Can add a new file too
        "valohai.dataset-versions": [{
            'uri': "dataset://clean-data/v2",
            'from': "dataset://clean-data/v1",
            'start_fresh': False,
            'exclude': failed_files  # Remove these
        }]
    }
}

# Save validation report
with open('/valohai/outputs/validation_report.txt', 'w') as f:
    f.write(f"Removed {len(failed_files)} files\n")

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result: v2 = all files from v1 except the three failed files, plus the validation report

⚠️ The validation_report.txt file can't be excluded from the version it is being added to, even if it appears in the exclude list!
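Since a file added in the current execution can't also be excluded, a quick overlap check before writing the metadata avoids silent surprises. A minimal sketch, assuming `metadata` follows the structure used throughout this page:

```python
def find_self_excludes(metadata):
    """Return output files that also appear in their own exclude list."""
    conflicts = []
    for file_name, file_metadata in metadata.items():
        for version in file_metadata.get("valohai.dataset-versions", []):
            if file_name in version.get("exclude", []):
                conflicts.append(file_name)
    return conflicts
```

Any filename this returns is being added and excluded at the same time, and the exclusion will not take effect.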
Replace Specific Files
Exclude old files and add updated versions:
import json

# Save updated files
updated_001.to_csv('/valohai/outputs/data_001.csv')  # New version of data_001
updated_005.to_csv('/valohai/outputs/data_005.csv')  # New version of data_005

metadata = {
    "data_001.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    },
    "data_005.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-data/v3",
            'from': "dataset://my-data/v2",
            'start_fresh': False,
            'exclude': ['data_001.csv', 'data_005.csv']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result: v3 = all files from v2 except the old data_001.csv and data_005.csv, plus the new versions
Start Fresh (Link Versions Without Files)
Create a new version lineage without inheriting files:
import json

# Create completely new dataset version
new_data.to_csv('/valohai/outputs/new_approach.csv')

metadata = {
    "new_approach.csv": {
        "valohai.dataset-versions": [{
            'uri': "dataset://experiments/v2",
            'from': "dataset://experiments/v1",
            'start_fresh': True  # Don't include any files from v1
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Result: v2 = only new_approach.csv (no files from v1)
Version history shows v2 was based on v1 (for tracking)
Use case: You want to track version lineage (this experiment followed that one) but don't want to inherit data.
Create A/B Test Variants
Split dataset into two variants for comparison:
import json

# Save a marker file for each variant
with open('/valohai/outputs/variant_a_marker.txt', 'w') as f:
    f.write("Variant A configuration\n")

# Variant A: Exclude group B files
metadata = {
    "variant_a_marker.txt": {
        "valohai.dataset-versions": [{
            'uri': "dataset://ab-test/variant-a",
            'from': "dataset://ab-test/baseline",
            'start_fresh': False,
            'exclude': ['group_b_001.csv', 'group_b_002.csv', 'group_b_003.csv'],
            'targeting_aliases': ['current-a']
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Run a second execution for Variant B that excludes group A files instead.
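Rather than hard-coding each variant, both executions can share one function and pass the variant name and exclusion list as parameters. A sketch under that assumption (the group file lists are placeholders):

```python
def variant_metadata(marker_file, variant, exclude_files):
    """Build A/B variant metadata; 'variant' names the new version and its alias."""
    return {
        marker_file: {
            "valohai.dataset-versions": [{
                "uri": f"dataset://ab-test/variant-{variant}",
                "from": "dataset://ab-test/baseline",
                "start_fresh": False,
                "exclude": exclude_files,
                "targeting_aliases": [f"current-{variant}"],
            }]
        }
    }
```

The Variant A run would call `variant_metadata("variant_a_marker.txt", "a", group_b_files)` and the Variant B run would swap in the group A files.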
Update via Web UI
The UI provides a simpler way to create new versions based on existing ones.
Open your dataset in Data → Datasets
Find the version you want to base on
Click the ... menu at the end of the version row
Select "Create new version from this version"
The new version starts with all files from the base version
Add new files or remove unwanted files
Name the new version
Save
Limitations: The UI doesn't support the exclude parameter—you must manually remove files one by one.
Combining Multiple Updates
You can reference multiple base versions or create complex update logic:
import json

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {
                'uri': "dataset://combined/v1",
                'from': "dataset://source-a/latest",
                'start_fresh': False
            },
            {
                'uri': "dataset://combined/v1",  # Same target version
                'from': "dataset://source-b/latest",
                'start_fresh': False
            }
        ]
    }
}

Result: combined/v1 includes files from both source-a/latest and source-b/latest.
Best Practices
Use Descriptive Version Names
# Good: Clear what changed
'uri': "dataset://my-data/v2-removed-outliers"
'uri': "dataset://my-data/v3-added-aug-2024"
# Avoid: Generic names
'uri': "dataset://my-data/new"
'uri': "dataset://my-data/fixed"

Document Changes
Add a marker file explaining what changed:
# Create changelog file
with open('/valohai/outputs/CHANGELOG.txt', 'w') as f:
f.write("Version v3 changes:\n")
f.write("- Removed 15 corrupted files\n")
f.write("- Added 50 new validated samples\n")
f.write("- Reprocessed all images with new augmentation\n")
# Include in metadata
metadata = {
"CHANGELOG.txt": {
"valohai.dataset-versions": [{
'uri': "dataset://my-data/v3",
'from': "dataset://my-data/v2",
'start_fresh': False,
'exclude': corrupted_files
}]
}
}Validate Before Creating New Version
import os
import json

# Verify files exist before adding to dataset
output_dir = '/valohai/outputs/'
expected_files = ['data_001.csv', 'data_002.csv', 'data_003.csv']

for filename in expected_files:
    filepath = os.path.join(output_dir, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Expected file missing: {filepath}")
    # Verify file is not empty
    if os.path.getsize(filepath) == 0:
        raise ValueError(f"File is empty: {filepath}")

# Now safe to create metadata
metadata = {
    filename: {
        "valohai.dataset-versions": [{
            'uri': "dataset://validated-data/v2",
            'from': "dataset://validated-data/v1",
            'start_fresh': False
        }]
    }
    for filename in expected_files
}

Use Aliases for Promotion Workflow
# Development → Staging → Production pipeline
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [{
            'uri': "dataset://models/v5",
            'from': "dataset://models/v4",
            'start_fresh': False,
            'targeting_aliases': ['staging']  # Promote to staging first
        }]
    }
}
# After validation, manually update the 'production' alias to point to v5

Common Issues & Fixes
Files Not Excluded
Symptom: Specified files in exclude list still appear in new version
Causes & Fixes:
Typo in filename → Filenames must match exactly (case-sensitive)
File path included → Use just filename, not full path
start_fresh: True → exclude doesn't apply when start_fresh is True (no base files are inherited)
Debug:
# Print exact filenames from base version
import os
from valohai import inputs

for file_path in inputs('base-dataset').paths():
    print(f"Filename: {os.path.basename(file_path)}")

Base Version Files Not Included
Symptom: New version is empty or missing files from base version
Causes & Fixes:
start_fresh: True → Set to False to include base files
Missing from parameter → Specify which version to base on
Wrong base version URI → Verify the version exists and the spelling is correct
New Version Not Created
Symptom: Execution completes but dataset version doesn't appear
How to diagnose:
Open execution in Valohai UI
Click Alerts tab
Look for dataset creation errors
Common causes:
Invalid JSON in metadata → Validate JSON syntax
Missing output files → Ensure all files in metadata were actually saved
Base version doesn't exist → Verify the from version exists
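The first two causes can be caught before the execution finishes by cross-checking the metadata file against the outputs directory. A sketch, assuming the standard paths used on this page (`check_outputs_match_metadata` is an illustrative helper, not a Valohai API):

```python
import json
import os

def check_outputs_match_metadata(metadata_path, outputs_dir):
    """Return files referenced in the metadata JSONL that were never saved."""
    missing = []
    with open(metadata_path) as f:
        for line in f:
            record = json.loads(line)  # fails fast on invalid JSON
            if not os.path.exists(os.path.join(outputs_dir, record["file"])):
                missing.append(record["file"])
    return missing
```

Running this at the end of the job, e.g. `check_outputs_match_metadata('/valohai/outputs/valohai.metadata.jsonl', '/valohai/outputs/')`, surfaces both invalid JSON and missing output files while the logs are still easy to inspect.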
Related Pages
Create and Manage Datasets — Core dataset concepts and creation
Add Context to Your Data Files — Metadata system overview
Load Data in Jobs — Use dataset versions as inputs
Next Steps
Practice updating a dataset by excluding specific files
Set up a validation → production promotion workflow using aliases
Create A/B test variants from a baseline dataset
Learn about dataset packaging for large collections