Data Management

Valohai automatically versions every file in your ML workflow, from raw datasets to trained models, ensuring complete reproducibility and traceability without manual intervention.

Why automatic versioning matters

Machine learning teams face three critical data challenges:

Reproducing experiments: Without proper versioning, recreating a model from 6 months ago becomes impossible. Which exact dataset version was used? What preprocessing was applied?

Tracking data lineage: Understanding how your production model was created requires tracing through multiple data transformations, from raw images to augmented training sets to the final model artifacts.

Managing dataset iterations: Image datasets with millions of files evolve constantly. Teams add new samples, fix labels, and create subsets. Manual tracking quickly becomes unmanageable.

Valohai solves these challenges by automatically versioning every file that passes through the platform.

How versioning works

Every file in Valohai is immutable and kept until you explicitly purge it. When an execution creates a new output:

Files never overwrite existing versions
Each file gets a unique identifier
Files remain accessible unless explicitly purged

This happens automatically when you use Valohai-supported data stores like AWS S3, Azure Blob Storage, Google Cloud Storage, OCI Object Storage or OpenStack Swift.
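The rules above (no overwrites, a unique identifier per file, removal only by explicit purge) can be illustrated with a small in-memory sketch. This is not Valohai's implementation; the `ImmutableFileStore` class and its methods are hypothetical names chosen for illustration only.

```python
import hashlib
import uuid


class ImmutableFileStore:
    """Toy in-memory store: every save creates a new version and never overwrites."""

    def __init__(self):
        self._versions = {}  # version id -> record

    def save(self, name: str, content: bytes) -> str:
        # A fresh identifier per save means an existing version is never replaced.
        version_id = uuid.uuid4().hex
        self._versions[version_id] = {
            "name": name,
            "content": content,
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        return version_id

    def read(self, version_id: str) -> bytes:
        return self._versions[version_id]["content"]

    def versions_of(self, name: str) -> list:
        # Dicts preserve insertion order, so versions come back oldest first.
        return [vid for vid, rec in self._versions.items() if rec["name"] == name]

    def purge(self, version_id: str) -> None:
        # Explicit removal is the only way a version disappears.
        del self._versions[version_id]


store = ImmutableFileStore()
first = store.save("model.pkl", b"weights-v1")
second = store.save("model.pkl", b"weights-v2")  # same name, new version

print(store.read(first))                  # the first version is still readable
print(len(store.versions_of("model.pkl")))
```

Saving a file under an existing name adds a second version rather than replacing the first, which is what makes an old execution's outputs retrievable later.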

Handling massive datasets

For projects with millions of files, Valohai offers:

Dataset packaging: Bundle thousands of files into a single archive for faster job starts
On-demand inputs: Start processing immediately without waiting for all files to download
Additional caching layers: Allow multiple machines to access already downloaded data
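To see why packaging helps, consider that one archive is fetched as a single object instead of thousands of small ones. The sketch below bundles a directory with Python's standard `tarfile` module. It only illustrates the idea and does not reflect how Valohai builds its packages; the `package_files` helper and the file names are assumptions.

```python
import tarfile
import tempfile
from pathlib import Path


def package_files(source_dir: Path, archive_path: Path) -> int:
    """Bundle every file under source_dir into one tar archive; return the file count."""
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w") as archive:
        for path in files:
            # Store paths relative to the dataset root so the archive is portable.
            archive.add(path, arcname=path.relative_to(source_dir).as_posix())
    return len(files)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    images = root / "images"
    images.mkdir()
    # Stand-in for a dataset made of many small files.
    for i in range(1000):
        (images / f"img_{i:04d}.txt").write_bytes(b"pixel data")

    archive_path = root / "images.tar"
    packaged = package_files(images, archive_path)

    # Reopen the archive to confirm every file made it in.
    with tarfile.open(archive_path) as archive:
        member_count = len(archive.getnames())

print(packaged, member_count)
```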

Execution tracking

Every execution in Valohai captures complete context for reproducibility.

Alongside the environment, code, parameters, and metadata of an execution, Valohai keeps track of every file the execution used or produced.

Lineage visualization

Trace any file backward and forward through your pipeline to understand:

Which execution created this model?
What datasets were used for training?
Which deployments are using this model?

The trace view handles millions of files efficiently, showing summarized dataset views with drill-down capabilities.
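Conceptually, tracing is a graph walk over execution records, where each record lists the files it read and the files it wrote. The sketch below shows that walk on a tiny hand-written graph. The `EXECUTIONS` data and the function names are hypothetical and are not part of any Valohai API.

```python
from collections import deque

# Hypothetical execution records: each lists the files it read and wrote.
EXECUTIONS = {
    "preprocess": {"inputs": ["raw-images"], "outputs": ["clean-images"]},
    "train": {"inputs": ["clean-images", "labels"], "outputs": ["model"]},
    "evaluate": {"inputs": ["model", "clean-images"], "outputs": ["report"]},
}


def producer_of(file_id):
    """Return the execution that created file_id, or None for a raw file."""
    for name, record in EXECUTIONS.items():
        if file_id in record["outputs"]:
            return name
    return None


def upstream_files(file_id):
    """All files that contributed, directly or indirectly, to file_id."""
    seen, queue = set(), deque([file_id])
    while queue:
        execution = producer_of(queue.popleft())
        if execution is None:
            continue  # raw file with no producing execution
        for parent in EXECUTIONS[execution]["inputs"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def downstream_files(file_id):
    """All files derived, directly or indirectly, from file_id."""
    seen, queue = set(), deque([file_id])
    while queue:
        current = queue.popleft()
        for record in EXECUTIONS.values():
            if current in record["inputs"]:
                for child in record["outputs"]:
                    if child not in seen:
                        seen.add(child)
                        queue.append(child)
    return seen


print(producer_of("model"))
print(sorted(upstream_files("model")))
print(sorted(downstream_files("raw-images")))
```

Walking backward from `model` answers "what data was used for training?", and walking forward from `raw-images` shows everything that would be affected if that raw data turned out to be faulty.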

Organizing with tags and aliases

Aliases

Create human-readable pointers to specific file versions:

production-config → points to the latest production-ready configuration file
clean-dataset-v2 → references your latest preprocessed data

Aliases are versioned as well: when you point production-config at a new file, Valohai keeps the complete history of what the alias referenced before.
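An alias can be thought of as a named pointer whose past targets are never discarded. The following sketch models that behavior in plain Python. The `AliasRegistry` class and the file identifiers are made up for illustration and do not mirror Valohai's internals.

```python
class AliasRegistry:
    """Toy alias table: each alias keeps an append-only list of its targets."""

    def __init__(self):
        self._history = {}  # alias name -> list of file ids, oldest first

    def set(self, alias: str, file_id: str) -> None:
        # Repointing appends to the history instead of replacing the old target.
        self._history.setdefault(alias, []).append(file_id)

    def resolve(self, alias: str) -> str:
        # The current target is the most recently assigned file.
        return self._history[alias][-1]

    def history(self, alias: str) -> list:
        return list(self._history[alias])


aliases = AliasRegistry()
aliases.set("production-config", "file-001")
aliases.set("production-config", "file-002")  # update the alias

print(aliases.resolve("production-config"))
print(aliases.history("production-config"))
```

Because earlier targets stay in the history, pointing the alias back at a previous file is just another `set` call.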

Datasets for complex collections

Datasets group related files into versioned collections—perfect for managing training data that evolves over time.

Common use cases:

Image classification datasets with thousands of photos
Multi-modal data (images + labels + metadata)
Train/validation/test splits

Smart versioning without duplication

When you modify a dataset by adding new files, removing outdated ones, or replacing specific items, Valohai creates a new version without duplicating unchanged files. The new version stores only references to existing files plus any new additions.

This means:

Adding 1,000 images to a million-image dataset doesn't duplicate the million
Removing mislabeled samples creates a new clean version without copying data
Multiple dataset versions can share the same underlying files

Each dataset modification creates a new version.
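The reference-based approach can be sketched by treating a dataset version as a mapping from path to file identifier. Creating a new version copies the references, not the file contents. The `new_version` helper, the paths, and the identifiers below are hypothetical and serve only to illustrate the idea.

```python
def new_version(base: dict, add: dict = None, remove=()) -> dict:
    """Build a new dataset version (path -> file id) from an existing one.

    Only references are copied; the stored files themselves are untouched.
    """
    removed = set(remove)
    files = {path: fid for path, fid in base.items() if path not in removed}
    files.update(add or {})
    return files


v1 = {
    "cat/001.jpg": "file-a",
    "cat/002.jpg": "file-b",
    "dog/001.jpg": "file-c",
}
v2 = new_version(v1, add={"dog/002.jpg": "file-d"})  # add a new image
v3 = new_version(v2, remove=["cat/002.jpg"])         # drop a mislabeled sample

# Count the distinct stored files versus the references held by all versions.
stored_files = set(v1.values()) | set(v2.values()) | set(v3.values())
total_references = len(v1) + len(v2) + len(v3)

print(len(stored_files), total_references)
```

Three versions hold ten references between them but only four distinct files exist in storage, and `v1` is still intact for comparison or rollback.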

Real-world scenarios

Rolling back a production model

# Reference a specific model version by alias
vh execution run train.py --input model=datum://cat-model-20251005

Managing dataset iterations

Create initial dataset: animal-photos-v1
Add new images → automatically becomes animal-photos-v1.1
Fix mislabeled data → animal-photos-v1.2
Create subset for experiments → animal-photos-subset-v1

Each version remains accessible for comparison and rollback.

Next steps

Configure your data store to enable automatic versioning
Create your first dataset for organizing training data
Work with tags and aliases for better organization
Upload and version files programmatically

