Data

Valohai treats data as a first-class citizen. Every file you load or save gets versioned, tracked, and linked to the code that created it — automatically.

No more "which dataset did I use for this model?" or "where did this file come from?" Your data has a complete audit trail from raw input to final output.


How Valohai Handles Data

Everything is Versioned

Every file saved from an execution gets a unique datum:// link that points to an immutable version. Use these links as inputs in future jobs, and Valohai guarantees you'll always get the exact same file.

Example:

inputs:
  - name: training-data
    default: datum://01234567-89ab-cdef-0123-456789abcdef

This single link captures:

  • The exact file content

  • When it was created

  • Which execution produced it

  • What code and parameters were used

Cloud Storage Without the Complexity

Connect your S3, Azure Blob Storage, GCS, OVH Object Storage, or Oracle bucket once. After that, files appear as local paths in your code: no boto3, no authentication logic, just pd.read_csv('/valohai/inputs/dataset/data.csv').

Valohai handles:

  • Authentication and credential rotation

  • Cross-region, cross-cloud transfers and caching

  • Access control between projects

  • Metadata and lineage tracking

You control:

  • Where data lives (your cloud account)

  • Bucket policies and compliance

  • Cost and retention

⚠️ Platform setup required: Data scientists don't need to do any setup themselves, but platform teams must configure data stores first.

From Files to Datasets

Individual files work great for single models or CSVs. But when you're managing hundreds of images or train/validation splits, datasets keep everything organized.

Datasets group related files into versioned collections:

# Instead of manually adding 1000 individual files
inputs:
  - name: training-images
    default: dataset://imagenet/train-v3  # One reference, 1000 files

Update all files together, track changes between versions, and use aliases like production or staging to promote datasets through your workflow.
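You can also build a dataset version directly from an execution by tagging outputs in their metadata sidecar. A minimal sketch, assuming the valohai.dataset-versions metadata key is available and using a placeholder dataset name and version label:

import json

# Declare that the output file train.csv should become part of a new dataset version.
# The dataset name and version label are placeholders; adjust them to your project.
metadata = {
    "valohai.dataset-versions": ["dataset://customer-churn/v4"],
}

# Sidecar file written next to the output it describes
# (see "Add Context with Metadata" below).
with open('/valohai/outputs/train.csv.metadata.json', 'w') as f:
    json.dump(metadata, f)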


Working with Data in Valohai

Save Files from Your Code

Write files to /valohai/outputs/ and Valohai uploads them automatically:

# Save any file to outputs
import pandas as pd

df.to_csv('/valohai/outputs/processed_data.csv')
model.save('/valohai/outputs/model.pkl')

That's it. No upload logic, no authentication. Files appear in your project's Data tab with full lineage.
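Subdirectories under /valohai/outputs/ work the same way. Continuing the example above, and assuming the directory structure is preserved on upload, you can group related outputs together:

import json
import os

# Group related outputs under a subdirectory of /valohai/outputs/.
out_dir = '/valohai/outputs/evaluation'
os.makedirs(out_dir, exist_ok=True)

df.to_csv(os.path.join(out_dir, 'predictions.csv'))

with open(os.path.join(out_dir, 'metrics.json'), 'w') as f:
    json.dump({'accuracy': 0.95}, f)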

Add Context with Metadata

Attach tags, aliases, and custom properties to files for search, filtering, and audit trails:

import json

metadata = {
    "valohai.tags": ["validated", "production"],
    "valohai.alias": "latest-model",
    "accuracy": 0.95,
    "training_date": "2024-01-15"
}

# Save metadata alongside your output
with open('/valohai/outputs/model.pkl.metadata.json', 'w') as f:
    json.dump(metadata, f)

Now search by tag, reference by alias, or query custom properties through the API.
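The same sidecar pattern scales to many outputs. A sketch that tags every file already written under /valohai/outputs/ (the tag value is just an example):

import json
import os

OUTPUTS = '/valohai/outputs'

for filename in os.listdir(OUTPUTS):
    if filename.endswith('.metadata.json'):
        continue  # don't write sidecars for sidecars
    sidecar = os.path.join(OUTPUTS, f'{filename}.metadata.json')
    with open(sidecar, 'w') as f:
        json.dump({'valohai.tags': ['experiment-42']}, f)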

Load Files in Your Code

Reference data in your YAML, and Valohai downloads it before your code runs:

- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: dataset
        default: s3://mybucket/mydata/project-a/data.csv

Access it like a local file:

import pandas as pd

df = pd.read_csv('/valohai/inputs/dataset/data.csv')

Inputs can come from S3, Azure, GCS, OVH, or Oracle storage, public URLs, datum:// links from previous executions, and Valohai datasets and models (dataset:// and model:// links).
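When an input resolves to many files, such as a dataset:// reference, they all land in a directory named after the input. A short sketch that walks through them:

import glob

# Every file of the 'dataset' input is downloaded under this directory.
for path in sorted(glob.glob('/valohai/inputs/dataset/*')):
    print(path)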


Data Storage Options

Object Storage (For Versioned Data)

Configure S3, Azure Blob Storage, GCS, OVH Object Storage, or an Oracle bucket as your primary data store. Best for:

  • Versioned inputs and outputs

  • Reproducible pipelines

  • Audit trails and compliance

  • Multi-region access

Set up your data store →

Network Storage (For Shared Data)

Mount AWS EFS, Google Filestore, or on-premises NFS when you need:

  • Shared scratch space across executions

  • Access to existing on-prem datasets

  • Shared cache layer for large files or large numbers of files (tens or hundreds of thousands)

Trade-off: Network mounts are fast, but they can contain data that Valohai doesn't track. Use such data with care, because it can break reproducibility.

Mount network storage →

Databases (For Structured Queries)

Query BigQuery, Redshift, or Snowflake directly from executions. Best for:

  • Pulling training data from data warehouses

  • Running feature engineering on SQL tables

  • Joining external datasets

Query databases →
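As a sketch of the pattern, here with BigQuery and assuming the google-cloud-bigquery client is installed in your image, credentials are exposed to the execution (for example via environment variables managed by your platform team), and the table name is a placeholder:

from google.cloud import bigquery

# Pull training data from the warehouse...
client = bigquery.Client()
df = client.query('SELECT * FROM analytics.churn_features').to_dataframe()

# ...and save the snapshot as an output so it's versioned and traceable.
df.to_csv('/valohai/outputs/churn_features.csv', index=False)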


Data Organization Patterns

For ML Experiments

Use datasets with aliases for environment promotion:

inputs:
  - name: training-data
    default: dataset://customer-churn/production
  - name: validation-data
    default: dataset://customer-churn/staging

Update aliases when promoting datasets—no code changes needed.

For Production Pipelines

Use datum:// links with aliases for immutable references:

inputs:
  - name: model-weights
    default: datum://abc123...  # Exact version, always reproducible

For Large-Scale Data

Package files into tar archives before creating datasets:

import tarfile

with tarfile.open('/valohai/outputs/images.tar', 'w') as tar:
    tar.add('/valohai/outputs/images/', arcname='images')

Downloading one 10GB tar is much faster than 100,000 individual files.
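On the consuming side, unpack the archive before reading the images. A sketch, assuming the archive arrives through an input named images:

import tarfile

# The tar archive is downloaded like any other input file.
with tarfile.open('/valohai/inputs/images/images.tar') as tar:
    tar.extractall('/tmp/images')  # unpack locally before training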

💡 If packaging isn't an option and you still need a large amount of data, check out the available caching strategies that can speed up access to your data.


When to Use What

Need                           Solution                         Why
Single model file              datum:// link                    Immutable, versioned, traceable
Training/validation split      Dataset with versions            Files versioned together
Image classification folders   Dataset with keep-directories    Preserves folder structure
Existing on-prem data          Network mount                    Data stays on premises
Query data warehouse           Database connector               No data movement needed
Environment promotion          Dataset aliases                  Update alias, not code


Next Steps

For Data Scientists:

  1. Create datasets for multi-file workflows

  2. Add metadata for searchability

For Platform Teams:

  1. Configure cloud storage as your data store

  2. Connect databases for SQL access

💡 Platform setup required: Before data scientists can save files, platform teams must configure at least one data store.
