Data
Valohai treats data as a first-class citizen. Every file you load or save gets versioned, tracked, and linked to the code that created it — automatically.
No more "which dataset did I use for this model?" or "where did this file come from?" Your data has a complete audit trail from raw input to final output.
How Valohai Handles Data
Everything is Versioned
Every file saved from an execution gets a unique datum:// link that points to an immutable version. Use these links as inputs in future jobs, and Valohai guarantees you'll always get the exact same file.
Example:
inputs:
  - name: training-data
    default: datum://01234567-89ab-cdef-0123-456789abcdef
This single link captures:
The exact file content
When it was created
Which execution produced it
What code and parameters were used
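In the job that consumes this input, the file is simply downloaded to /valohai/inputs/training-data/ before your code starts. A minimal sketch, assuming the datum points to a CSV named data.csv:
import pandas as pd

# Valohai resolves the datum:// link and downloads the file here before execution
df = pd.read_csv('/valohai/inputs/training-data/data.csv')
print(df.shape)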
Cloud Storage Without the Complexity
Connect your S3, Azure Blob Storage, GCS, OVH Object Storage, or Oracle bucket once. After that, files appear as local paths in your code: no boto3, no authentication logic, just pd.read_csv('/valohai/inputs/dataset/data.csv').
Valohai handles:
Authentication and credential rotation
Cross-region, cross-cloud transfers and caching
Access control between projects
Metadata and lineage tracking
You control:
Where data lives (your cloud account)
Bucket policies and compliance
Cost and retention
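Because inputs arrive as plain files on disk, you can treat the input directory like any other local folder. A small sketch, assuming an input named dataset:
import os

# List whatever Valohai downloaded for the "dataset" input
input_dir = '/valohai/inputs/dataset/'
for filename in os.listdir(input_dir):
    print(os.path.join(input_dir, filename))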
⚠️ Platform setup required: Data scientists can start using data without any extra setup, but platform teams need to configure a data store first.
From Files to Datasets
Individual files work great for single models or CSVs. But when you're managing hundreds of images or train/validation splits, datasets keep everything organized.
Datasets group related files into versioned collections:
# Instead of manually adding 1000 individual files
inputs:
  - name: training-images
    default: dataset://imagenet/train-v3  # One reference, 1000 files
Update all files together, track changes between versions, and use aliases like production or staging to promote datasets through your workflow.
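In code, every file of the selected dataset version is downloaded into the input's directory, so you can iterate over them like local files. A sketch assuming the training-images input above contains JPEG files:
from pathlib import Path

# All files from dataset://imagenet/train-v3 land under this input directory
image_dir = Path('/valohai/inputs/training-images')
image_paths = sorted(image_dir.rglob('*.jpg'))  # assumes JPEG images
print(f'Found {len(image_paths)} training images')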
Working with Data in Valohai
Save Files from Your Code
Write files to /valohai/outputs/ and Valohai uploads them automatically:
# Save any file to outputs
import pandas as pd
df.to_csv('/valohai/outputs/processed_data.csv')
model.save('/valohai/outputs/model.pkl')
That's it. No upload logic, no authentication. Files appear in your project's Data tab with full lineage.
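You can also organize outputs into subdirectories before they are uploaded; a minimal sketch, assuming subfolders under /valohai/outputs/ are picked up along with their structure:
import json
import os

# Group related outputs in a subfolder; the example metric is a placeholder
os.makedirs('/valohai/outputs/reports', exist_ok=True)
with open('/valohai/outputs/reports/metrics.json', 'w') as f:
    json.dump({'rmse': 0.42}, f)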
Add Context with Metadata
Attach tags, aliases, and custom properties to files for search, filtering, and audit trails:
import json
metadata = {
    "valohai.tags": ["validated", "production"],
    "valohai.alias": "latest-model",
    "accuracy": 0.95,
    "training_date": "2024-01-15"
}

# Save metadata alongside your output
with open('/valohai/outputs/model.pkl.metadata.json', 'w') as f:
    json.dump(metadata, f)
Now search by tag, reference by alias, or query custom properties through the API.
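The same sidecar convention scales to any number of outputs; here is a sketch that writes one <filename>.metadata.json per file (the filenames and tag are hypothetical):
import json

for filename in ['model.pkl', 'scaler.pkl']:  # hypothetical output files
    sidecar_path = f'/valohai/outputs/{filename}.metadata.json'
    with open(sidecar_path, 'w') as f:
        json.dump({'valohai.tags': ['experiment-42']}, f)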
Load Files in Your Code
Reference data in your YAML, and Valohai downloads it before your code runs:
- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: dataset
        default: s3://mybucket/mydata/project-a/data.csv
Access it like a local file:
import pandas as pd
df = pd.read_csv('/valohai/inputs/dataset/data.csv')
Inputs support S3, Azure, GCS, OVH, and Oracle URLs, public URLs, datum:// links from previous executions, and Valohai datasets and models (dataset:// and model:// links).
Data Storage Options
Cloud Object Storage (Recommended)
Configure S3, Azure Blob Storage, GCS, OVH, or Oracle Object Storage as your primary data store. Best for:
Versioned inputs and outputs
Reproducible pipelines
Audit trails and compliance
Multi-region access
Network Storage (For Shared Data)
Mount AWS EFS, Google Filestore, or on-premises NFS when you need:
Shared scratch space across executions
Access to existing on-prem datasets
Shared cache layer for large files or large numbers of files (tens or hundreds of thousands)
Trade-off: Network mounts are fast, but they can contain data that Valohai doesn't track, and relying on untracked data can break reproducibility.
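Once a mount is configured for the environment, it appears as an ordinary directory inside the execution. A minimal sketch, with /mnt/shared-data as a purely hypothetical mount path:
import pandas as pd

# /mnt/shared-data is a hypothetical NFS/EFS mount set up by the platform team;
# files read from it are not versioned or tracked by Valohai
df = pd.read_csv('/mnt/shared-data/legacy/customers.csv')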
Databases (For Structured Queries)
Query BigQuery, Redshift, or Snowflake directly from executions. Best for:
Pulling training data from data warehouses
Running feature engineering on SQL tables
Joining external datasets
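As an example, a training step can pull its data straight from the warehouse and then save the result as a versioned output. The sketch below assumes BigQuery, credentials already available in the execution environment, and a hypothetical analytics.features table:
from google.cloud import bigquery

# Query the warehouse directly from the execution
client = bigquery.Client()
df = client.query("SELECT * FROM analytics.features WHERE split = 'train'").to_dataframe()

# Saving the result to outputs turns the snapshot into a versioned, reusable file
df.to_csv('/valohai/outputs/train_features.csv', index=False)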
Data Organization Patterns
For ML Experiments
Use datasets with aliases for environment promotion:
inputs:
  - name: training-data
    default: dataset://customer-churn/production
  - name: validation-data
    default: dataset://customer-churn/staging
Update aliases when promoting datasets; no code changes needed.
For Production Pipelines
Use datum:// links with aliases for immutable references:
inputs:
  - name: model-weights
    default: datum://abc123...  # Exact version, always reproducible
For Large-Scale Data
Package files into tar archives before creating datasets:
import tarfile
with tarfile.open('/valohai/outputs/images.tar', 'w') as tar:
    tar.add('/valohai/outputs/images/', arcname='images')
Downloading one 10GB tar is much faster than 100,000 individual files.
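On the consuming side, the archive is unpacked at the start of the job; a short sketch, assuming the tar was wired in as an input named images:
import tarfile

# Unpack the single downloaded archive into a local working directory
with tarfile.open('/valohai/inputs/images/images.tar', 'r') as tar:
    tar.extractall('/tmp/images')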
💡 If packaging is not an option and you do need a large amount of data, check out the available caching strategies that can speed up access to your data.
When to Use What
| Scenario | Use | Why |
| --- | --- | --- |
| Single model file | datum:// link | Immutable, versioned, traceable |
| Training/validation split | Dataset with versions | Files versioned together |
| Image classification folders | Dataset with keep-directories | Preserves folder structure |
| Existing on-prem data | Network mount | Data stays on premises |
| Query data warehouse | Database connector | No data movement needed |
| Environment promotion | Dataset aliases | Update alias, not code |
Next Steps
For Data Scientists:
Save files from your code to outputs
Load files in your code using inputs
Create datasets for multi-file workflows
Add metadata for searchability
For Platform Teams:
Configure cloud storage as your data store
Connect databases for SQL access
Set up network mounts if needed
💡 Platform setup required: Before data scientists can save files, platform teams must configure at least one data store.