Data Management

Valohai automatically versions every file in your ML workflow, from raw datasets to trained models, ensuring complete reproducibility and traceability without manual intervention.

Why automatic versioning matters

Machine learning teams face three critical data challenges:

Reproducing experiments: Without proper versioning, recreating a model from six months ago becomes nearly impossible. Which exact dataset version was used? What preprocessing was applied?

Tracking data lineage: Understanding how your production model was created requires tracing through multiple data transformations, from raw images to augmented training sets to the final model artifacts.

Managing dataset iterations: Image datasets with millions of files evolve constantly. Teams add new samples, fix labels, and create subsets. Manual tracking quickly becomes unmanageable.

Valohai solves these challenges by automatically versioning every file that passes through the platform.

How versioning works

Every file in Valohai is immutable and is kept until you explicitly purge it. When an execution creates a new output:

  • Files never overwrite existing versions

  • Each file gets a unique identifier

  • Files remain accessible unless explicitly purged

This happens automatically when you use Valohai-supported data stores like AWS S3, Azure Blob Storage, Google Cloud Storage, OCI Object Storage or OpenStack Swift.
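The three guarantees above can be illustrated with a small in-memory model. This is a conceptual sketch only, not Valohai's implementation or API: the class `ImmutableFileStore` and its methods are hypothetical names chosen for illustration, and real storage lives in your configured object store.

```python
import hashlib
import uuid


class ImmutableFileStore:
    """Toy append-only file store (hypothetical, for illustration only)."""

    def __init__(self):
        self._files = {}  # identifier -> stored record

    def save(self, name: str, content: bytes) -> str:
        # Each save receives a fresh identifier, so an earlier
        # file with the same name is never overwritten.
        file_id = uuid.uuid4().hex
        self._files[file_id] = {
            "name": name,
            "content": content,
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        return file_id

    def load(self, file_id: str) -> bytes:
        return self._files[file_id]["content"]

    def purge(self, file_id: str) -> None:
        # Files disappear only through an explicit purge.
        del self._files[file_id]


store = ImmutableFileStore()
first = store.save("model.pkl", b"weights from run 1")
second = store.save("model.pkl", b"weights from run 2")
```

Saving `model.pkl` twice yields two distinct identifiers, and both versions stay loadable until one of them is purged.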

Handling massive datasets

For projects with millions of files, Valohai offers execution tracking and lineage visualization.

Execution tracking

Every execution in Valohai captures complete context for reproducibility.

Alongside the environment, code, parameters, and metadata of each execution, Valohai tracks every file the execution uses or produces.

Lineage visualization

Trace any file backward and forward through your pipeline to understand:

  • Which execution created this model?

  • What datasets were used for training?

  • Which deployments are using this model?

The trace view handles millions of files efficiently, showing summarized dataset views with drill-down capabilities.
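Backward and forward tracing amounts to walking a directed graph of files, executions, and deployments. The sketch below is a simplified illustration under that assumption; `LineageGraph` and the node names are hypothetical and do not reflect how Valohai stores lineage internally.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage graph (hypothetical, for illustration only)."""

    def __init__(self):
        self._consumers = defaultdict(set)  # node -> nodes that used it
        self._producers = defaultdict(set)  # node -> nodes it was built from

    def add_edge(self, source: str, target: str) -> None:
        self._consumers[source].add(target)
        self._producers[target].add(source)

    @staticmethod
    def _walk(start: str, edges) -> set:
        # Depth-first traversal that collects every reachable node.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return seen

    def upstream(self, node: str) -> set:
        """Everything that contributed to this node (trace backward)."""
        return self._walk(node, self._producers)

    def downstream(self, node: str) -> set:
        """Everything that depends on this node (trace forward)."""
        return self._walk(node, self._consumers)


graph = LineageGraph()
graph.add_edge("raw-images", "preprocess-execution")
graph.add_edge("preprocess-execution", "augmented-set")
graph.add_edge("augmented-set", "train-execution")
graph.add_edge("train-execution", "model.pkl")
graph.add_edge("model.pkl", "deployment-prod")
```

Here `graph.upstream("model.pkl")` answers which executions and datasets produced the model, and `graph.downstream("model.pkl")` answers which deployments use it.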

Organizing with tags and aliases

Tags

Group related models, files, and executions for easy filtering:

  • experiment-phase-1

  • production-candidate

  • quarterly-report

Tags help teams navigate hundreds of experiments without losing important runs.

Aliases

Create human-readable pointers to specific file versions:

  • production-config → points to the latest production-ready configuration file

  • clean-dataset-v2 → references your latest preprocessed data

Aliases are versioned automatically: when you update production-config, Valohai keeps the complete history.
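One way to picture an alias is as a named pointer that records every target it has had. The following is a minimal sketch of that idea; the `Alias` class and the file identifiers are hypothetical and are not part of any Valohai API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alias:
    """Toy alias that remembers every file it has pointed to (hypothetical)."""

    name: str
    history: List[str] = field(default_factory=list)

    def point_to(self, file_id: str) -> None:
        # Updating the alias appends to the history instead of replacing it.
        self.history.append(file_id)

    @property
    def current(self) -> Optional[str]:
        # The most recent target, or None if the alias was never set.
        return self.history[-1] if self.history else None


alias = Alias("production-config")
alias.point_to("config-file-a")
alias.point_to("config-file-b")
```

After the second update, `alias.current` resolves to the newer file while `alias.history` still lists both targets, which is what makes a rollback possible.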

Datasets for complex collections

Datasets group related files into versioned collections, which makes them well suited to training data that evolves over time.

Common use cases:

  • Image classification datasets with thousands of photos

  • Multi-modal data (images + labels + metadata)

  • Train/validation/test splits

Smart versioning without duplication

When you modify a dataset by adding new files, removing outdated ones, or replacing specific items, Valohai creates a new version without duplicating unchanged files. The platform stores only references to existing files plus any new additions.

This means:

  • Adding 1,000 images to a million-image dataset doesn't duplicate the million

  • Removing mislabeled samples creates a new clean version without copying data

  • Multiple dataset versions can share the same underlying files

Each dataset modification creates a new version.
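Reference-based versioning can be sketched as dataset versions that hold sets of file identifiers rather than copies of the files. This is an illustrative model under that assumption; the `Dataset` class, its `create_version` method, and the identifiers are hypothetical.

```python
class Dataset:
    """Toy dataset whose versions store file references only (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.versions = {}  # version name -> frozenset of file identifiers

    def create_version(self, version, base=None, add=(), remove=()):
        # Start from the base version's references; no file data is copied.
        files = set(self.versions[base]) if base else set()
        files |= set(add)
        files -= set(remove)
        # Versions are frozen so that later changes need a new version.
        self.versions[version] = frozenset(files)
        return self.versions[version]


photos = Dataset("animal-photos")
v1 = photos.create_version("v1", add=[f"file-{i}" for i in range(5)])
v2 = photos.create_version("v2", base="v1", add=["file-5"], remove=["file-2"])

# Identifiers present in both versions point at the same stored files.
shared = v1 & v2
```

In this example the two versions share four of their five files, and creating `v2` leaves `v1` unchanged.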

Real-world scenarios

Rolling back a production model

# Reference a specific model version by alias
vh execution run train.py --input model=datum://cat-model-20251005

Managing dataset iterations

  1. Create initial dataset: animal-photos-v1

  2. Add new images → automatically becomes animal-photos-v1.1

  3. Fix mislabeled data → animal-photos-v1.2

  4. Create subset for experiments → animal-photos-subset-v1

Each version remains accessible for comparison and rollback.
