Package Datasets for Faster Downloads
Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.
The Problem: Small Files Are Slow
Downloading many small files is slow, even with fast networks and cloud storage.
❌ 500k images (10 KB each): downloading individually takes ~12 hours. Per-file overhead: API calls, authentication, metadata checks, ...
✅ A million images packaged: downloading the single package file takes ~10-15 minutes. One API call, one continuous download stream.
Why this matters:
Each file requires separate API calls and authentication
Network latency adds up across thousands of files
Cloud storage rate limits can slow batch downloads
Training jobs wait hours for data before starting
The Solution: Automatic Dataset Packaging
Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.
Benefits:
10-100x faster downloads for datasets with many small files
Automatic — Happens during dataset creation
Transparent — Your code doesn't change
Cached — Packages are reused across executions
No compression overhead — Fast extraction
When to Use Dataset Packaging
Ideal Use Cases
Image classification datasets:
❌ 1 million images x 50 KB each = slow individual downloads
✅ Packaged into a ~50 GB file = fast download of a single file
Time-series data:
❌ 500k CSV files x 5 KB each = hours of downloading
✅ Packaged into a ~2.5 GB file = minutes of downloading
Benefits appear when you have:
10,000+ files in a dataset version
File sizes under 1MB each
Frequent reuse of the same dataset
Long data download times blocking training
When Not to Use
Don't package if:
You have fewer than 1,000 files (overhead not worth it)
Files are already large (>10MB each)
You only use the dataset once
Files are already packaged (e.g., existing tar/zip archives)
Packaging vs Manual Tar Files
You might wonder: "Why not just tar files myself?"
Manual tar (covered in Create and Manage Datasets)
One-time packaging, full control over structure, works today for all dataset types
Automatic packaging (this feature)
Repeated dataset updates, programmatic workflows, want Valohai to handle it
Key difference:
Manual tar: you create images.tar yourself, add it to the dataset, and extract it in your own code (see the sketch below)
Automatic packaging: you create the dataset normally; Valohai packages it automatically and extracts it transparently
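For contrast, here is a minimal sketch of the manual approach using Python's standard tarfile module. The paths and the input name dataset are illustrative, not prescribed by Valohai.

```python
import tarfile

# Manual packaging step: bundle a local directory of images into one archive
# and save it as an execution output (paths are illustrative).
with tarfile.open("/valohai/outputs/images.tar", "w") as archive:
    archive.add("images", arcname="images")

# Manual extraction step, inside the training code: unpack the archive that
# arrived as an input before reading the individual files.
with tarfile.open("/valohai/inputs/dataset/images.tar") as archive:
    archive.extractall("/tmp/images")
```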
Use both: the approaches are complementary; manual tar suits one-off archives you manage yourself, while automatic packaging suits dataset versions you create programmatically and update repeatedly.
Enable Dataset Packaging
To enable Dataset Packaging, include the following environment variable in your execution:
VH_ENABLE_DATASET_VERSION_PACKAGING with a truthy value (1, yes, or true).
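As an optional safeguard, a data-preparation script can check the flag at runtime before tagging files for packaging. This is a minimal sketch; only the variable name and its truthy values come from this page, the check itself is not required.

```python
import os

# Optional sanity check: warn early if the packaging flag is missing,
# since the dataset version would then be created without packages.
flag = os.environ.get("VH_ENABLE_DATASET_VERSION_PACKAGING", "").strip().lower()
if flag not in ("1", "yes", "true"):
    print(
        "Warning: VH_ENABLE_DATASET_VERSION_PACKAGING is not set to a truthy "
        "value; this dataset version will not be packaged."
    )
```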
💡 See the other environment variables that control the worker machine agent's behavior, and how to include them by default in every execution.
How to Package a Dataset Version
Step 1: Enable in Execution Configuration
When creating your execution (via UI, API, or CLI), set the environment variable:
In valohai.yaml: add VH_ENABLE_DATASET_VERSION_PACKAGING with the value 1 to the step's environment-variables section.
In the UI: when creating the execution, go to "Environment Variables" and add:
VH_ENABLE_DATASET_VERSION_PACKAGING=1
Step 2: Create Dataset with Packaging Flag
Add "packaging": True to your dataset version metadata:
What happens:
The execution saves 100,000 image files
Valohai packages all images into .vhpkg zip file(s)
The package is uploaded alongside the individual files
The dataset version train-v1 is created with a reference to the package
Step 3: Use the Packaged Dataset
Use the dataset normally in your training pipeline, for example by pointing a step input to dataset://my-images/train-v1.
No code changes needed: your training script reads files from /valohai/inputs/training-images/ exactly as before.
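For illustration, a minimal sketch of the consuming side; training-images is the input name implied by the path above.

```python
import os

# The package is downloaded and extracted for you; the input directory
# contains the original individual files.
input_dir = "/valohai/inputs/training-images"

image_paths = sorted(
    os.path.join(root, name)
    for root, _dirs, files in os.walk(input_dir)
    for name in files
)
print(f"Found {len(image_paths)} training images")
```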
Complete Example: Time-Series Dataset
Process daily sensor data and package for fast access:
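A sketch of the preparation script, using synthetic daily readings in place of a real data source; the dataset name sensor-data and version daily-v1 are illustrative, and the sidecar format is the same assumption as in Step 2.

```python
import csv
import json
import random
from pathlib import Path

OUTPUT_DIR = Path("/valohai/outputs")
DATASET_VERSION = "dataset://sensor-data/daily-v1"  # illustrative name

entries = []
for day in range(1, 366):
    # One small CSV per day; replace the synthetic values with real readings.
    out_path = OUTPUT_DIR / f"sensor-2024-{day:03d}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["hour", "value"])
        writer.writerows((hour, random.random()) for hour in range(24))
    entries.append({
        "file": out_path.name,
        "metadata": {
            "valohai.dataset-versions": [
                {"uri": DATASET_VERSION, "packaging": True}
            ]
        },
    })

# One sidecar file tags all the daily CSVs into the same packaged version.
with open(OUTPUT_DIR / "valohai.metadata.jsonl", "w") as sidecar:
    for entry in entries:
        sidecar.write(json.dumps(entry) + "\n")
```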
The valohai.yaml for this example defines a data-preparation step that sets VH_ENABLE_DATASET_VERSION_PACKAGING=1 and a training step that takes the resulting dataset version as an input.
Complete Example: Image Classification
Prepare image classification dataset with folder structure:
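A sketch of the preparation side, with placeholder files standing in for real images; the class names and the dataset URI are illustrative, and the sidecar format is the same assumption as in Step 2.

```python
import json
from pathlib import Path

OUTPUT_DIR = Path("/valohai/outputs")
DATASET_VERSION = "dataset://my-images/train-v1"  # illustrative name

# Save files into class subfolders; relative paths such as cats/0001.jpg
# are kept inside the package and restored on extraction.
entries = []
for class_name in ("cats", "dogs"):
    class_dir = OUTPUT_DIR / class_name
    class_dir.mkdir(parents=True, exist_ok=True)
    for i in range(3):  # stand-in for your real image files
        image_path = class_dir / f"{i:04d}.jpg"
        image_path.write_bytes(b"placeholder image bytes")
        entries.append({
            "file": str(image_path.relative_to(OUTPUT_DIR)),
            "metadata": {
                "valohai.dataset-versions": [
                    {"uri": DATASET_VERSION, "packaging": True}
                ]
            },
        })

with open(OUTPUT_DIR / "valohai.metadata.jsonl", "w") as sidecar:
    for entry in entries:
        sidecar.write(json.dumps(entry) + "\n")
```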
Directory structure is preserved: files saved under class subfolders in /valohai/outputs/ appear under the same relative paths in your input directory after extraction.
How It Works Behind the Scenes
Understanding what happens helps with debugging and optimization.
During Dataset Creation
The execution runs with VH_ENABLE_DATASET_VERSION_PACKAGING=1
Files are saved to /valohai/outputs/
Metadata is processed: Valohai detects "packaging": True
Automatic packaging:
All files for the dataset version are bundled into .vhpkg zip file(s)
The package format is an uncompressed zip (fast extraction)
Large datasets may be split into multiple packages
Upload:
Individual files are uploaded to storage
Package file(s) are uploaded alongside them
The dataset version is created with a reference to the package
During Execution Using Packaged Dataset
The execution starts, needing dataset://my-images/train-v1
Valohai checks whether packages exist for this dataset version
Download:
Package file(s) are downloaded instead of individual files
Dramatically fewer API calls (1-10 instead of thousands or millions)
Automatic extraction:
The package is extracted to a cache directory
Files appear in /valohai/inputs/ as normal
Your code runs: no changes needed, the files work identically
Caching:
Extracted files are cached for future executions
The same package is reused across multiple runs
Verify Packaging Worked
Check Execution Logs
Look for packaging messages in the logs of the execution that created the dataset version.
Expert Mode UI
View packages in the Valohai UI:
Open the dataset version in Data → Datasets
Press Ctrl + Shift + X (or Cmd + Shift + X on Mac) to enable Expert Mode
View the Packages section showing the .vhpkg zip file(s)
Performance Comparison
Exact timings depend on the number and size of files; three representative comparisons:

Example 1:
Individual files: 2-3 hours download, no extraction
Packaged: 10-15 minutes download, 2-3 minutes extraction
Speedup: ~10x faster, with a one-time extraction cost

Example 2:
Individual files: 4-6 hours download, no extraction
Packaged: 10-15 minutes download, 5-8 minutes extraction
Speedup: ~20x faster, with a one-time extraction cost

Example 3:
Individual files: 30-40 minutes download, no extraction
Packaged: 25-35 minutes download, 2-3 minutes extraction
Speedup: minimal; packaging is not worth it here
Current Limitations
Programmatic Creation Only
Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).
Not yet supported:
Datasets created via Web UI
Datasets created via API
Workaround: Create datasets programmatically in a dedicated data preparation execution.
Requires Environment Variable
You must set VH_ENABLE_DATASET_VERSION_PACKAGING=1 when creating the dataset version.
This requirement will be removed in a future update when packaging becomes the default behavior.
Best Practices
Start with Small Test Dataset
Before packaging millions of files, test with a smaller subset:
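One way to do this is to tag only a small sample of the saved outputs into a separate test version first. The version name below is illustrative, and the sidecar format is the same assumption as in Step 2.

```python
import json
from pathlib import Path

OUTPUT_DIR = Path("/valohai/outputs")
TEST_VERSION = "dataset://my-images/packaging-test-v1"  # illustrative name

# Tag only the first 1,000 output files into a small test version.
sample = sorted(
    p for p in OUTPUT_DIR.rglob("*")
    if p.is_file() and p.name != "valohai.metadata.jsonl"
)[:1000]

with open(OUTPUT_DIR / "valohai.metadata.jsonl", "w") as sidecar:
    for path in sample:
        entry = {
            "file": str(path.relative_to(OUTPUT_DIR)),
            "metadata": {
                "valohai.dataset-versions": [
                    {"uri": TEST_VERSION, "packaging": True}
                ]
            },
        }
        sidecar.write(json.dumps(entry) + "\n")
```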
Verify packaging works, then scale to full dataset.
Name Dataset Versions Clearly
Indicate in the version name when a dataset is packaged, for example train-v1-packaged rather than train-v1.
Monitor First Execution
The first execution using a packaged dataset will:
Download package
Extract all files (one-time cost)
Cache for future use
Subsequent executions skip steps 1-2 and use cached files.
Watch logs to verify extraction completes successfully.
Combine with Dataset Versioning
Create new packaged versions as data evolves:
Each version is packaged separately with its own optimized package.
Next Steps
Test with a small dataset (1,000-10,000 files)
Measure performance improvement for your use case
Scale to production datasets