Data Management
Valohai automatically versions every file in your ML workflow, from raw datasets to trained models, ensuring complete reproducibility and traceability without manual intervention.
Why automatic versioning matters
Machine learning teams face three critical data challenges:
Reproducing experiments: Without proper versioning, recreating a model from 6 months ago becomes impossible. Which exact dataset version was used? What preprocessing was applied?
Tracking data lineage: Understanding how your production model was created requires tracing through multiple data transformations, from raw images to augmented training sets to the final model artifacts.
Managing dataset iterations: Image datasets with millions of files evolve constantly. Teams add new samples, fix labels, and create subsets. Manual tracking quickly becomes unmanageable.
Valohai solves these challenges by automatically versioning every file that passes through the platform.
How versioning works
Every file in Valohai is immutable and permanently stored. When an execution creates a new output:
Files never overwrite existing versions
Each file gets a unique identifier
Files remain accessible unless explicitly purged
This happens automatically when you use Valohai-supported data stores like AWS S3, Azure Blob Storage, Google Cloud Storage, OCI Object Storage or OpenStack Swift.
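The three guarantees above can be illustrated with a toy in-memory store. This is only a sketch of the idea, not Valohai's implementation: the `ImmutableStore` class and its `save`, `read`, and `purge` methods are hypothetical names chosen for this example.

```python
import uuid


class ImmutableStore:
    """Toy in-memory store: every save creates a new, uniquely identified version."""

    def __init__(self):
        self._files = {}  # file_id -> (name, content)

    def save(self, name: str, content: bytes) -> str:
        # A fresh identifier per save means an existing version is never overwritten,
        # even when the file name is the same.
        file_id = uuid.uuid4().hex
        self._files[file_id] = (name, content)
        return file_id

    def read(self, file_id: str) -> bytes:
        return self._files[file_id][1]

    def purge(self, file_id: str) -> None:
        # Files disappear only on an explicit request.
        del self._files[file_id]


store = ImmutableStore()
first = store.save("model.pkl", b"weights-v1")
second = store.save("model.pkl", b"weights-v2")
```

Saving `model.pkl` twice yields two identifiers, and both versions stay readable until one of them is purged.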
Handling massive datasets
For projects with millions of files, Valohai offers:
Dataset packaging: Bundle thousands of files into a single archive for faster job starts
On-demand inputs: Start processing immediately without waiting for all files to download
Additional caching layers: Allow multiple machines to access already downloaded data
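To show why packaging helps, the sketch below bundles a directory of many small files into one tar archive using only the Python standard library, so a job fetches a single object instead of thousands. It illustrates the general technique, not Valohai's packaging feature; `package_directory` and the file names are assumptions made for this example.

```python
import tarfile
import tempfile
from pathlib import Path


def package_directory(source_dir: Path, archive_path: Path) -> int:
    """Bundle every file under source_dir into one tar archive; return the file count."""
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w") as archive:
        for path in files:
            # Store paths relative to the dataset root so the archive is relocatable.
            archive.add(path, arcname=path.relative_to(source_dir).as_posix())
    return len(files)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "images"
    root.mkdir()
    # Placeholder files standing in for a large image dataset.
    for i in range(1000):
        (root / f"img_{i:04d}.txt").write_bytes(b"pixel data placeholder")

    archive_path = Path(tmp) / "images.tar"
    packaged_count = package_directory(root, archive_path)
    with tarfile.open(archive_path) as archive:
        member_names = archive.getnames()
```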
Execution tracking
Every execution in Valohai captures complete context for reproducibility.
Alongside the environment, code, and parameters it used and the metadata it produced, Valohai keeps track of every file the execution used or produced.
Lineage visualization
Trace any file backward and forward through your pipeline to understand:
Which execution created this model?
What datasets were used for training?
Which deployments are using this model?

The trace view handles millions of files efficiently, showing summarized dataset views with drill-down capabilities.
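Backward and forward tracing amounts to walking a graph in which executions connect input files to output files. The sketch below is a minimal model of that idea, not Valohai's API or data model; `LineageGraph` and the execution and file names are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage graph: executions link the files they read to the files they write."""

    def __init__(self):
        self.producer = {}                   # file -> execution that created it
        self.consumers = defaultdict(list)   # file -> executions that used it as input
        self.inputs = {}                     # execution -> input files
        self.outputs = {}                    # execution -> output files

    def record(self, execution, inputs, outputs):
        self.inputs[execution] = list(inputs)
        self.outputs[execution] = list(outputs)
        for f in inputs:
            self.consumers[f].append(execution)
        for f in outputs:
            self.producer[f] = execution

    def upstream_files(self, file):
        """Every file that directly or indirectly contributed to `file`."""
        seen, stack = set(), [file]
        while stack:
            execution = self.producer.get(stack.pop())
            if execution is None:
                continue  # a raw file that no execution produced
            for parent in self.inputs[execution]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def downstream_files(self, file):
        """Every file that was directly or indirectly derived from `file`."""
        seen, stack = set(), [file]
        while stack:
            for execution in self.consumers.get(stack.pop(), []):
                for child in self.outputs[execution]:
                    if child not in seen:
                        seen.add(child)
                        stack.append(child)
        return seen


graph = LineageGraph()
graph.record("preprocess", inputs=["raw-images"], outputs=["augmented-set"])
graph.record("train", inputs=["augmented-set"], outputs=["model"])
```

Here `graph.producer["model"]` answers "which execution created this model?", and `graph.upstream_files("model")` answers "what data was used for training?".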
Organizing with tags and aliases
Tags
Group related models, files, and executions for easy filtering:
experiment-phase-1
production-candidate
quarterly-report
Tags help teams navigate hundreds of experiments without losing important runs.

Aliases
Create human-readable pointers to specific file versions:
production-config → points to the latest production-ready configuration file
clean-dataset-v2 → references your latest preprocessed data
Aliases automatically version themselves: when you update production-config, Valohai keeps the complete history.
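An alias that keeps its own history can be pictured as a named pointer whose updates are appended rather than overwritten. The following is a conceptual sketch only; `AliasRegistry`, its methods, and the file identifiers are hypothetical and do not correspond to Valohai's API.

```python
class AliasRegistry:
    """Toy alias registry: each alias is a pointer with an append-only history."""

    def __init__(self):
        self._history = {}  # alias name -> list of file ids, oldest first

    def set(self, alias: str, file_id: str) -> None:
        # Re-pointing an alias appends to its history instead of replacing it.
        self._history.setdefault(alias, []).append(file_id)

    def resolve(self, alias: str) -> str:
        # The alias always resolves to the most recent target.
        return self._history[alias][-1]

    def history(self, alias: str) -> list:
        return list(self._history[alias])


aliases = AliasRegistry()
aliases.set("production-config", "file-001")
aliases.set("production-config", "file-002")
current = aliases.resolve("production-config")
past = aliases.history("production-config")
```

Because earlier targets stay in the history, rolling back is a matter of pointing the alias at a previous file identifier again.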

Datasets for complex collections
Datasets group related files into versioned collections, which suits training data that evolves over time.
Common use cases:
Image classification datasets with thousands of photos
Multi-modal data (images + labels + metadata)
Train/validation/test splits
Smart versioning without duplication
When you modify a dataset, adding new files, removing outdated ones, or replacing specific items, Valohai creates a new version without duplicating unchanged files. The platform only stores references to existing files plus any new additions.
This means:
Adding 1,000 images to a million-image dataset doesn't duplicate the million
Removing mislabeled samples creates a new clean version without copying data
Multiple dataset versions can share the same underlying files
Each dataset modification creates a new version.
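Versioning by reference can be sketched by treating each dataset version as a set of file identifiers: a new version reuses the identifiers of unchanged files and never copies file contents. This is an illustrative model under that assumption, not Valohai's storage format; the `Dataset` class and the `img-…` identifiers are hypothetical.

```python
class Dataset:
    """Toy dataset: each version is an immutable set of file identifiers."""

    def __init__(self, name: str):
        self.name = name
        self.versions = []  # list of frozensets, oldest first

    def create_version(self, add=(), remove=()) -> int:
        # Start from the previous version's references; file contents are never copied.
        base = self.versions[-1] if self.versions else frozenset()
        new_version = (base - frozenset(remove)) | frozenset(add)
        self.versions.append(new_version)
        return len(self.versions) - 1


photos = Dataset("animal-photos")
v0 = photos.create_version(add=[f"img-{i}" for i in range(1000)])
v1 = photos.create_version(add=["img-1000", "img-1001"])  # add new samples
v2 = photos.create_version(remove=["img-7"])              # drop a mislabeled sample

# Files referenced by both of the first two versions.
shared = photos.versions[v0] & photos.versions[v1]
```

Every earlier version remains intact, so `photos.versions[v1]` still contains `img-7` after the cleaned version `v2` is created.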
Real-world scenarios
Rolling back a production model
# Reference a specific model version by alias
vh execution run train.py --input model=datum://cat-model-20251005
Managing dataset iterations
Create initial dataset: animal-photos-v1
Add new images → automatically becomes animal-photos-v1.1
Fix mislabeled data → animal-photos-v1.2
Create subset for experiments → animal-photos-subset-v1
Each version remains accessible for comparison and rollback.
Next steps
Configure your data store to enable automatic versioning
Create your first dataset for organizing training data
Work with tags and aliases for better organization
Upload and version files programmatically