Create and Manage Datasets
Datasets are versioned collections of files that simplify working with multiple related files. Use datasets for training/validation splits, image classification folders, or any workflow requiring coordinated file groups.
The Problem with Individual Files
💡 Quick recap: In Valohai, individual files are called datums. Each datum has a unique datum:// link you can use as an input. See Load Data in Jobs for details.
Datums work well for single files, but they become cumbersome when you manage collections:
Updating 50 image files? You'd need to update 50 datum links
Maintaining train/validation/test splits? Hard to keep them synchronized
Versioning related files together? No built-in way to track the group
Example problem:
# Managing individual files becomes tedious
inputs:
  - name: train-images
    default:
      - datum://abc123...
      - datum://def456...
      - datum://ghi789...
      # ... 47 more files

Datasets Solve This
Datasets group related files into versioned collections.
Same workflow, cleaner:
inputs:
  - name: train-images
    default: dataset://my-images/train-v2

Key benefits:
Group related files — One reference points to entire collection
Version together — Update all files as a unit
Track changes — See what changed between versions
Immutable versions — Each version is locked once created
Flexible access — Use latest, specific versions, or aliases
Datasets vs Datums
Reference — Datum: a single file. Dataset: a collection of files.
URI format — Datum: datum://file-id. Dataset: dataset://name/version.
Use when — Datum: one model file, one CSV. Dataset: image folders, data splits, multi-file outputs.
Versioning — Datum: each file versioned independently. Dataset: files versioned together as a group.
Updates — Datum: create a new datum. Dataset: create a new dataset version.
When to Use Datasets
Training/Validation/Test Splits
Keep data splits synchronized:
inputs:
  - name: train-data
    default: dataset://customer-churn/train-v3
  - name: validation-data
    default: dataset://customer-churn/validation-v3
  - name: test-data
    default: dataset://customer-churn/test-v3

When you update the data, create new versions (v4) and all splits stay aligned.
Image Classification
Organize images by class:
dataset://imagenet/train-v1 contains:
├── cats/
│   ├── cat001.jpg
│   ├── cat002.jpg
│   └── ...
├── dogs/
│   ├── dog001.jpg
│   ├── dog002.jpg
│   └── ...
└── birds/
    ├── bird001.jpg
    └── ...

Learn more about directory structure: see Directory Structure in Datasets below.
Multi-File Model Artifacts
Package related model files together:
dataset://bert-model/production contains:
├── model.bin
├── config.json
├── vocab.txt
└── tokenizer_config.json

Create a Dataset
Datasets have two levels:
Dataset — The container with a name (e.g., my-images)
Dataset Version — A specific collection of files (e.g., v1, v2, latest)
You must create both.
Create via Code (Recommended)
Create datasets programmatically when saving execution outputs.
Basic Dataset Creation
import json
# Define which files belong to the dataset version
metadata = {
    "train_image_001.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"]
    },
    "train_image_002.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"]
    },
    "train_image_003.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"]
    }
}

# Save all your output files first
for i in range(1, 4):
    # Your code to save images
    image.save(f'/valohai/outputs/train_image_{i:03d}.jpg')

# Save dataset metadata in a single JSONL file
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:
If the dataset my-images doesn't exist, it's created automatically
Version v1 is created with the three specified files
The files are available as dataset://my-images/v1
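A later step can then consume the new version like any other input. A minimal valohai.yaml sketch (the step name, image, and command here are illustrative):

- step:
    name: train-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: train-images
        default: dataset://my-images/v1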
Create Training/Validation Split
import json
# Save your split files
train_data.to_csv('/valohai/outputs/train.csv')
val_data.to_csv('/valohai/outputs/validation.csv')
test_data.to_csv('/valohai/outputs/test.csv')
# Assign files to dataset versions
metadata = {
    "train.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/train-v2"]
    },
    "validation.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/validation-v2"]
    },
    "test.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/test-v2"]
    }
}

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

Now use dataset://customer-data/train-v2 in your training pipeline.
Legacy Approach (Sidecar Files)
The older approach used individual .metadata.json files per output:
import json
# Save output file
save_path = '/valohai/outputs/data_file.csv'
data.to_csv(save_path)
# Save metadata in sidecar file
metadata = {
    "valohai.dataset-versions": ["dataset://my-dataset/v1"]
}
metadata_path = '/valohai/outputs/data_file.csv.metadata.json'
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)

This still works, but the JSONL approach is recommended for better organization when handling multiple files.
Create via Web UI
Step 1: Create the Dataset Container
Open your project
Navigate to Data → Datasets tab
Click "Create dataset"
Enter a Name (e.g., my-images)
Select Owner:
Your account — Private to you
Your organization — Shared with team
Click "Create"
Step 2: Create a Dataset Version
Click on your dataset name
Click "Create new version"
Select files to include:
Search by filename, tags, or data store
Click "Add" or "Add Selected" for multiple files
Add or remove files until satisfied
Enter a version name (e.g., v1, train-split-2024-q1)
Click "Save new version"
Important: Once saved, dataset versions are immutable. You cannot edit them—only create new versions.
Use Datasets as Inputs
Reference datasets in your pipeline using dataset:// URIs.
In valohai.yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/v2
      - name: validation-images
        default: dataset://my-images/validation-v2

URI Formats
# Specific version
default: dataset://my-images/v2

# Latest version (always points to newest)
default: dataset://my-images/latest

# Custom alias (see Dataset Aliases section below)
default: dataset://my-images/production

In Code
All files from the dataset are downloaded to the input directory:
import os

# List all files in the dataset
input_dir = '/valohai/inputs/training-images/'
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    print(f"Processing {filename}")
    # Your processing logic

Learn more: Load Data in Jobs
Dataset Versioning
Dataset versions are immutable once created. This ensures reproducibility—an execution using dataset://my-images/v2 will always get the exact same files.
Version Naming
Choose clear, descriptive version names:
# Good: Descriptive and sortable
"v1", "v2", "v3"
"train-2024-01-15"
"baseline-split"
"production-2024-q1"
# Avoid: Ambiguous or hard to track
"latest" # Valohai reserved keyword
"final"
"new"
"temp"Version History
Version History
Track all versions in the Valohai UI:
Navigate to Data → Datasets
Click on your dataset
View the Versions table showing:
Version name
Creation date
Number of files
Creator
Update Existing Versions
You cannot edit a dataset version after creation. To modify:
Create a new version based on the old one
Add or remove files
Save with a new version name
For complex updates (excluding specific files, starting from existing versions), see Update Dataset Versions.
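From code, you can express "based on the old one" with the same 'uri' and 'from' metadata keys used in the aliases example below. A minimal sketch (dataset and file names are illustrative; see Update Dataset Versions for the full options):

import json

# Sketch only: create v3 based on the existing v2, listing one new file
metadata = {
    "extra_images.tar": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-images/v3",
            'from': "dataset://my-images/v2",
        }]
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")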
Dataset Aliases
Aliases let you reference dataset versions with human-readable names instead of hardcoding version names in your code.
The latest Alias
Every dataset automatically has a latest alias pointing to the newest version:
inputs:
  - name: training-data
    default: dataset://my-dataset/latest  # Always uses newest version

No setup required — latest updates automatically when you create new versions.
Custom Aliases
Create your own aliases for environment management or workflow stages.
Example use cases:
- dataset://my-images/production    # Current production dataset
- dataset://my-images/staging       # Being tested
- dataset://my-images/baseline      # Original benchmark dataset
- dataset://my-images/experiment-42 # Specific experiment version

Create Alias via Web UI
Open your dataset
Navigate to the Aliases tab
Click "Create new dataset version alias"
Enter alias name (e.g., production)
Select the dataset version to point to
Save
Update an alias:
Find the alias in the Aliases tab
Click "Edit"
Select a different version
Save
The UI tracks alias history—see when it was changed and what it pointed to before.
Create Alias via Code
Set aliases when creating dataset versions:
import json
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [{
            'uri': "dataset://my-models/v5",
            'from': "dataset://my-models/v4",
            'targeting_aliases': ['production', 'stable']  # Creates/updates these aliases
        }]
    }
}

save_path = '/valohai/outputs/model.pkl'
model.save(save_path)

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")

What happens:
Creates version v5 based on v4
Updates the production alias to point to v5
Updates the stable alias to point to v5
If the aliases don't exist, they're created
Use Aliases in Pipelines
- step:
    name: train-production-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: training-data
        default: dataset://customer-data/production
      - name: validation-data
        default: dataset://customer-data/staging

When you promote a dataset version to production, just update the alias—no code changes needed.
Alias Best Practices
Environment-based:
'targeting_aliases': ['dev', 'staging', 'production']

Workflow stages:
'targeting_aliases': ['preprocessing-done', 'validated', 'ready-for-training']

Experiment tracking:
'targeting_aliases': ['baseline', 'experiment-current', 'best-so-far']

Directory Structure in Datasets
How files are organized in /valohai/inputs/ depends on how they were saved originally.
Flat Structure
If source files were saved without directories:
/valohai/inputs/my-dataset/
├── image001.jpg
├── image002.jpg
└── image003.jpg

Access in code:
import os

input_dir = '/valohai/inputs/my-dataset/'
files = os.listdir(input_dir)
for filename in files:
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)

Nested Structure
If source files used keep-directories when loading:
/valohai/inputs/my-dataset/
├── cats/
│   ├── cat001.jpg
│   └── cat002.jpg
├── dogs/
│   ├── dog001.jpg
│   └── dog002.jpg
└── birds/
    └── bird001.jpg

Access in code:
import os

input_dir = '/valohai/inputs/my-dataset/'

# Process by subdirectory (class label)
for class_name in os.listdir(input_dir):
    class_dir = os.path.join(input_dir, class_name)
    if os.path.isdir(class_dir):
        print(f"Processing class: {class_name}")
        for filename in os.listdir(class_dir):
            filepath = os.path.join(class_dir, filename)
            process_image(filepath, label=class_name)

The structure depends on:
How files were originally uploaded or generated
The keep-directories setting when files were saved as outputs
See Ingest & Save Files for details on preserving directory structure
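If you don't know in advance whether a dataset arrives flat or nested, a recursive walk covers both cases. A minimal sketch, assuming the same input name as above:

import os

input_dir = '/valohai/inputs/my-dataset/'

# os.walk visits the root and every subdirectory, so this handles
# both flat and nested layouts in one pass
for root, _dirs, files in os.walk(input_dir):
    for filename in files:
        filepath = os.path.join(root, filename)
        # For nested layouts, the relative path (e.g., cats/cat001.jpg)
        # carries the class label; for flat layouts it's just the filename
        rel_path = os.path.relpath(filepath, input_dir)
        print(f"Found {rel_path}")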
Performance: Package Files Together
⚠️ Important for large datasets: Downloading millions of individual small files is slow, even with fast networks.
The Problem
Slow: 2 million individual 10KB files
⏱️ Download time: hours, due to per-file overhead

The Solution
Package related files together before creating datasets:
import tarfile

# Package images into a tar file (no compression needed)
with tarfile.open('/valohai/outputs/images.tar', 'w') as tar:
    tar.add('/valohai/outputs/images/', arcname='images')

# Add the packaged file to a dataset
metadata = {
    "images.tar": {
        "valohai.dataset-versions": ["dataset://my-images/v1"]
    }
}

Benefits:
Fast: Single file download
Atomic: All-or-nothing download
No compression overhead (tar without gzip)
Preserves directory structure
In your training code:
import os
import tarfile

# Extract once at the start of the execution
with tarfile.open('/valohai/inputs/images/images.tar', 'r') as tar:
    tar.extractall('/tmp/images/')

# Now process the extracted files
for filename in os.listdir('/tmp/images/'):
    process_image(os.path.join('/tmp/images/', filename))

💡 When to package: If your dataset has >10,000 small files, strongly consider packaging them. The one-time extraction cost is much faster than downloading thousands of individual files.
Common Issues & Fixes
Dataset Version Not Created
Symptom: Execution completes successfully but dataset version doesn't appear
How to diagnose:
Open the execution in Valohai UI
Click the Alerts tab (top of execution page)
Look for dataset creation errors or warnings

Common causes:
Invalid version name → Use alphanumeric characters, hyphens, and underscores only
Metadata file not saved → Verify valohai.metadata.jsonl exists in outputs
JSON syntax error → Validate the JSON format
Wrong metadata structure → Check the {"file": "...", "metadata": {...}} format
Wrong Files in Dataset
Symptom: Dataset version contains unexpected files or missing files
Causes & Fixes:
Typo in filename → Filenames in metadata must match output files exactly
Files not saved before metadata → Save all output files before writing metadata
Wrong dataset URI in metadata → Double-check dataset name and version
Debug:
import os
import json

# List what was actually saved
print("Output files:", os.listdir('/valohai/outputs/'))

# Verify the metadata content
with open('/valohai/outputs/valohai.metadata.jsonl', 'r') as f:
    for line in f:
        print("Metadata entry:", json.loads(line))

Can't Use Dataset in Execution
Symptom: Input shows dataset://... but execution fails with "not found"
Causes & Fixes:
Typo in dataset URI → Check dataset name and version spelling
Version doesn't exist → Verify version was created in Data → Datasets tab
Wrong project → Dataset must be in same project as execution
Permission issue → Check dataset ownership (private vs organization)