Package Datasets for Faster Downloads
Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.
The Problem: Small Files Are Slow
Downloading many small files is slow, even with fast networks and cloud storage.
❌ 500k images (10KB each) -> downloading individually takes ~12h; per-file overhead: API calls, authentication, metadata checks ...
✅ The same images packaged into a single file -> download takes ~10-15 minutes; one API call, one continuous download stream
Why this matters:
Each file requires separate API calls and authentication
Network latency adds up across thousands of files
Cloud storage rate limits can slow batch downloads
Training jobs wait hours for data before starting
The Solution: Automatic Dataset Packaging
Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.
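As a quick preview (the full workflow is covered step by step below), opting a file into packaging is a single flag in the dataset metadata your execution already writes. A minimal sketch; the dataset URI is illustrative:

import json

# Minimal sketch: one output file opted into a packaged dataset version.
# "dataset://my-images/train-v1" is an illustrative URI.
row = {
    "file": "image_000001.jpg",
    "metadata": {
        "valohai.dataset-versions": [
            {"uri": "dataset://my-images/train-v1", "packaging": True}
        ]
    },
}
with open('/valohai/outputs/valohai.metadata.jsonl', 'a') as f:
    f.write(json.dumps(row) + '\n')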
Benefits:
10-100x faster downloads for datasets with many small files
Automatic — Happens during dataset creation
Transparent — Your code doesn't change
Cached — Packages are reused across executions
No compression overhead — Fast extraction
When to Use Dataset Packaging
Ideal Use Cases
Image classification datasets:
❌ 1 million images x 50KB each = slow individual downloads
✅ Packaged into a ~50GB file = fast download of a single file
Time-series data:
❌ 500k CSV files x 5KB each = hours of downloading
✅ Packaged into a ~2.5GB file = minutes of downloading
Benefits appear when you have:
10,000+ files in a dataset version
File sizes under 1MB each
Frequent reuse of the same dataset
Long data download times blocking training
When Not to Use
Don't package if:
You have fewer than 1,000 files (overhead not worth it)
Files are already large (>10MB each)
You only use the dataset once
Files are already packaged (e.g., existing tar/zip archives)
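If you are unsure which side of the line a dataset falls on, a rough heuristic against the thresholds above can help. This is a hypothetical helper (should_enable_packaging is not part of any Valohai library), sketched for a local or staging copy of the data:

import os

def should_enable_packaging(directory, min_files=10_000, max_avg_bytes=1_000_000):
    """Rule of thumb: many files (10,000+) that are small on average (< 1MB)."""
    sizes = [
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(directory)
        for name in files
    ]
    if len(sizes) < min_files:
        return False
    return sum(sizes) / len(sizes) < max_avg_bytes

# Example: check a staging directory before deciding to set "packaging": True
print(should_enable_packaging('/data/staging'))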
Packaging vs Manual Tar Files
You might wonder: "Why not just tar files myself?"
Manual tar (covered in Create and Manage Datasets): one-time packaging, full control over structure, works today for all dataset types.
Automatic packaging (this feature): repeated dataset updates, programmatic workflows, and cases where you want Valohai to handle packaging for you.
Key difference:
Manual tar: you create images.tar, add it to the dataset, and extract it in your code.
Automatic packaging: you create the dataset normally; Valohai packages it automatically and extracts it transparently.
Use both:
# Manual tar for stable baseline data
metadata = {
    "baseline_images.tar": {
        "valohai.dataset-versions": ["dataset://images/baseline-v1"]
    }
}

# Automatic packaging for frequently updated data
metadata = {
    "new_image_001.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://images/daily-v2",
            "packaging": True  # Valohai packages automatically
        }]
    }
}

Enable Dataset Packaging
To enable Dataset Packaging, include the following environment variable in your execution:
VH_ENABLE_DATASET_VERSION_PACKAGING with a truthy value (1, yes, or true).
💡 Take a look at the other environment variables that control the worker machine agent's behavior, and at how to include them by default in every execution.
How to Package a Dataset Version
Step 1: Enable in Execution Configuration
When creating your execution (via UI, API, or CLI), set the environment variable:
VH_ENABLE_DATASET_VERSION_PACKAGING=1

In valohai.yaml:
- step:
    name: create-packaged-dataset
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

In UI:
When creating the execution, go to "Environment Variables"
Add:
VH_ENABLE_DATASET_VERSION_PACKAGING=1

Step 2: Create Dataset with Packaging Flag
Add "packaging": True to your dataset version metadata:
import json
import os

# Process and save your files
output_dir = '/valohai/outputs/'
for i in range(100000):
    # Your data processing
    data = process_image(i)
    filename = f'image_{i:06d}.jpg'
    save_image(data, os.path.join(output_dir, filename))

# Create metadata with packaging enabled
metadata = {}
for i in range(100000):
    filename = f'image_{i:06d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-images/train-v1",
            "packaging": True  # Enable automatic packaging
        }]
    }

# Save metadata
metadata_path = '/valohai/outputs/valohai.metadata.jsonl'
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

What happens:
Execution saves 100,000 image files
Valohai packages all images into .vhpkgzip file(s)
Package uploaded alongside the individual files
Dataset version train-v1 is created with a package reference
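For reference, each row written to valohai.metadata.jsonl by the script above looks like this:

{"file": "image_000000.jpg", "metadata": {"valohai.dataset-versions": [{"uri": "dataset://my-images/train-v1", "packaging": true}]}}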
Step 3: Use the Packaged Dataset
Use the dataset normally in your training pipeline:
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1

No code changes needed — your training script reads files from /valohai/inputs/training-images/ exactly as before:
import os

# Works the same whether packaged or not
input_dir = '/valohai/inputs/training-images/'
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)

Complete Example: Time-Series Dataset
Process daily sensor data and package for fast access:
import json
import os

import pandas as pd

# Step 1: Process data files
output_dir = '/valohai/outputs/'
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')
metadata = {}

for date in dates:
    # Process sensor data for this date
    sensor_data = load_sensor_data(date)

    # Save as CSV
    filename = f'sensor_{date.strftime("%Y%m%d")}.csv'
    filepath = os.path.join(output_dir, filename)
    sensor_data.to_csv(filepath, index=False)

    # Add to metadata with packaging enabled
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://sensor-data/year-2024",
            "packaging": True
        }]
    }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Created dataset with {len(metadata)} files")
print("Packaging will happen automatically")

valohai.yaml configuration:
- step:
    name: prepare-sensor-data
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

- step:
    name: train-forecasting-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: sensor-data
        default: dataset://sensor-data/year-2024

Complete Example: Image Classification
Prepare image classification dataset with folder structure:
import json
import os

from PIL import Image

# Step 1: Process and save images by class
output_dir = '/valohai/outputs/'
classes = ['cats', 'dogs', 'birds']
metadata = {}

for class_name in classes:
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)

    # Process images for this class
    for i in range(10000):
        # Your image processing
        img = process_image(class_name, i)

        # Save with class folder structure
        filename = f'{class_name}/{class_name}_{i:05d}.jpg'
        filepath = os.path.join(output_dir, filename)
        img.save(filepath)

        # Add to metadata with packaging
        metadata[filename] = {
            "valohai.dataset-versions": [{
                "uri": "dataset://imagenet-subset/train-v1",
                "packaging": True
            }]
        }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Packaged {len(metadata)} images across {len(classes)} classes")

Directory structure is preserved:
/valohai/inputs/training-images/
├── cats/
│ ├── cats_00000.jpg
│ ├── cats_00001.jpg
│ └── ...
├── dogs/
│ ├── dogs_00000.jpg
│ └── ...
└── birds/
└── ...
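Because this class-per-folder layout survives packaging unchanged, loaders that expect it keep working as-is. A minimal sketch assuming a PyTorch/torchvision training script (torchvision is an assumption here, not a requirement of packaging):

import torch
from torchvision import datasets, transforms

# The extracted input keeps the folder-per-class layout, so ImageFolder
# can read it straight from the Valohai input directory.
dataset = datasets.ImageFolder(
    root='/valohai/inputs/training-images/',
    transform=transforms.ToTensor(),
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
print(dataset.classes)  # ['birds', 'cats', 'dogs'] (sorted alphabetically)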
How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.
During Dataset Creation
Execution runs with VH_ENABLE_DATASET_VERSION_PACKAGING=1
Files saved to /valohai/outputs/
Metadata processed — Valohai detects "packaging": True
Automatic packaging:
All files for the dataset version are bundled into .vhpkgzip file(s)
The package format is an uncompressed zip (fast extraction)
Large datasets may be split into multiple packages
Upload:
Individual files uploaded to storage
Package file(s) uploaded alongside them
Dataset version created with a package reference
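Since the package format is a plain uncompressed zip, you can inspect a package you have downloaded yourself with the standard library alone. A minimal sketch; the file name below is illustrative, the real one follows the <dataset-name>.<dataset-version>.vhpkgzip pattern shown in the upload logs:

import zipfile

# Illustrative local path to a downloaded package file (assumption: you have
# fetched the .vhpkgzip datum to your machine for inspection).
package_path = 'my-images.train-v1.vhpkgzip'

with zipfile.ZipFile(package_path) as pkg:
    # List the first few members to confirm file names and sizes
    for info in pkg.infolist()[:5]:
        print(info.filename, info.file_size)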
During Execution Using Packaged Dataset
Execution starts needing dataset://my-images/train-v1
Valohai checks if packages exist for this dataset version
Download:
Package file(s) downloaded instead of individual files
Dramatically fewer API calls (1-10 vs thousands/millions)
Automatic extraction:
Package extracted to cache directory
Files appear in /valohai/inputs/ as normal
Your code runs — No changes needed, files work identically
Caching:
Extracted files cached for future executions
Same package reused across multiple runs
Verify Packaging Worked
Check Execution Logs
Look for packaging messages in the execution that created the dataset:
(dataset version package 1/1) uploading <dataset-name>.<dataset-version>.vhpkgzip (15 GB)
(dataset version package 1/1) uploading to <destination-store> bucket <destination-bucket>
(dataset version package 1/1) upload complete (datum <datum-id>)

Expert Mode UI
View packages in the Valohai UI:
Open the dataset version in Data → Datasets
Press Ctrl + Shift + X (or Cmd + Shift + X on Mac) to enable Expert Mode
View the Packages section showing the .vhpkgzip files
Typical performance comparisons:

Smaller dataset of small files:
Individual files: 2-3 hours download, no extraction step
Packaged: 10-15 minutes download, plus 2-3 minutes extraction
Speedup: ~10x faster; extraction is a one-time cost

Larger dataset of small files:
Individual files: 4-6 hours download, no extraction step
Packaged: 10-15 minutes download, plus 5-8 minutes extraction
Speedup: ~20x faster; extraction is a one-time cost

Dataset of large files:
Individual files: 30-40 minutes download, no extraction step
Packaged: 25-35 minutes download, plus 2-3 minutes extraction
Speedup: minimal; not worth packaging
Current Limitations
Programmatic Creation Only
Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).
Not yet supported:
Datasets created via Web UI
Datasets created via API
Workaround: Create datasets programmatically in a dedicated data preparation execution.
Requires Environment Variable
You must set VH_ENABLE_DATASET_VERSION_PACKAGING=1 when creating the dataset version.
This requirement will be removed in a future update when packaging becomes the default behavior.
Best Practices
Start with Small Test Dataset
Before packaging millions of files, test with a smaller subset:
# Test with 1,000 files first
metadata = {}
for i in range(1000):  # Not 100,000
    filename = f'test_image_{i:05d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://test-packaging/small-test",
            "packaging": True
        }]
    }

Verify packaging works, then scale to the full dataset.
Name Dataset Versions Clearly
Indicate when datasets are packaged:
# Good: Clear versioning
"uri": "dataset://images/train-v2-packaged"
"uri": "dataset://sensors/2024-q1-optimized"

# Avoid: Unclear what changed
"uri": "dataset://images/new"
"uri": "dataset://sensors/updated"

Monitor First Execution
The first execution using a packaged dataset will:
1. Download the package
2. Extract all files (one-time cost)
3. Cache the extracted files for future use
Subsequent executions skip steps 1-2 and use cached files.
Watch logs to verify extraction completes successfully.
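If you want the check inside the job rather than in the logs, a small sketch like the one below can fail fast when the extracted input looks incomplete; EXPECTED_FILES and the input name are assumptions to adapt to your dataset:

import os

# Hypothetical sanity check at the top of train.py.
EXPECTED_FILES = 30000  # set to the file count of the dataset version you created
input_dir = '/valohai/inputs/training-images/'
n_files = sum(len(files) for _, _, files in os.walk(input_dir))
print(f"Found {n_files} input files")
assert n_files == EXPECTED_FILES, f"Expected {EXPECTED_FILES} files, found {n_files}"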
Combine with Dataset Versioning
Create new packaged versions as data evolves:
# Initial dataset
metadata = {
    "file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

# Updated dataset (new files added)
metadata = {
    "new_file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v2",
            "from": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

Each version is packaged separately with its own optimized package.
Next Steps
Test with a small dataset (1,000-10,000 files)
Measure performance improvement for your use case
Scale to production datasets