Package Datasets for Faster Downloads

Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.


The Problem: Small Files Are Slow

Downloading many small files is slow, even with fast networks and cloud storage.

Downloading 500k images (10 KB each) one by one takes roughly 12 hours, because every file carries its own overhead: API calls, authentication, metadata checks, and so on.

Packed into a single file, a million images download in roughly 10-15 minutes: one API call and one continuous download stream.

Why this matters:

  • Each file requires separate API calls and authentication

  • Network latency adds up across thousands of files

  • Cloud storage rate limits can slow batch downloads

  • Training jobs wait hours for data before starting
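
A quick back-of-the-envelope calculation shows how the ~12 hour figure above comes almost entirely from per-file overhead rather than payload size. The ~85 ms overhead per file used below is an assumed average for illustration; real values vary by storage provider and region.

# Rough estimate of download time dominated by per-file overhead.
files = 500_000
per_file_overhead_s = 0.085  # assumed: API call + auth + metadata round trips per file
file_size_bytes = 10_000     # 10 KB per file
bandwidth_bytes_s = 100 * 1024 * 1024  # ~100 MB/s link

payload_s = files * file_size_bytes / bandwidth_bytes_s
total_hours = (files * per_file_overhead_s + payload_s) / 3600
print(f"Payload alone: {payload_s:.0f} s, total with overhead: {total_hours:.1f} h")
# Payload alone: ~48 s, total with overhead: ~11.8 h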


The Solution: Automatic Dataset Packaging

Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.

Benefits:

  • 10-100x faster downloads for datasets with many small files

  • Automatic — Happens during dataset creation

  • Transparent — Your code doesn't change

  • Cached — Packages are reused across executions

  • No compression overhead — Fast extraction


When to Use Dataset Packaging

Ideal Use Cases

Image classification datasets:

1 million images x 50 KB each means slow individual downloads; packaged into a single ~50 GB file, it becomes one fast download.

Time-series data:

500k CSV files x 5 KB each means hours of downloading; packaged into a single ~2.5 GB file, it downloads in minutes.

Benefits appear when you have:

  • 10,000+ files in a dataset version

  • File sizes under 1MB each

  • Frequent reuse of the same dataset

  • Long data download times blocking training
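
If you are unsure whether your dataset crosses these thresholds, a small script can check the output directory before you commit to packaging. This is only a heuristic sketch mirroring the guidelines above, not an official rule.

import os

def packaging_looks_useful(directory, min_files=10_000, max_avg_size=1_000_000):
    """Heuristic: many files with a small average size benefit most from packaging."""
    sizes = []
    for root, _, files in os.walk(directory):
        for name in files:
            sizes.append(os.path.getsize(os.path.join(root, name)))
    if not sizes:
        return False
    avg_size = sum(sizes) / len(sizes)
    return len(sizes) >= min_files and avg_size <= max_avg_size

print(packaging_looks_useful('/valohai/outputs/'))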


When Not to Use

Don't package if:

  • You have fewer than 1,000 files (overhead not worth it)

  • Files are already large (>10MB each)

  • You only use the dataset once

  • Files are already packaged (e.g., existing tar/zip archives)


Packaging vs Manual Tar Files

You might wonder: "Why not just tar files myself?"

| Approach | When to Use |
| --- | --- |
| Manual tar (covered in Create and Manage Datasets) | One-time packaging, full control over structure, works today for all dataset types |
| Automatic packaging (this feature) | Repeated dataset updates, programmatic workflows, want Valohai to handle it |

Key difference:

  • Manual tar: You create images.tar, add it to dataset, extract in your code

  • Automatic packaging: You create dataset normally, Valohai packages automatically, extracts transparently

Use both:

# Manual tar for stable baseline data
metadata = {
    "baseline_images.tar": {
        "valohai.dataset-versions": ["dataset://images/baseline-v1"]
    }
}

# Automatic packaging for frequently updated data
metadata = {
    "new_image_001.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://images/daily-v2",
            "packaging": True  # Valohai packages automatically
        }]
    }
}

Enable Dataset Packaging

To enable dataset packaging, set the environment variable VH_ENABLE_DATASET_VERSION_PACKAGING to a truthy value (1, yes, or true) in your execution.

💡 Take a look at the other environment variables that control the worker machine agent's behavior, and how to include them by default in every execution.


How to Package a Dataset Version

Step 1: Enable in Execution Configuration

When creating your execution (via UI, API, or CLI), set the environment variable:

VH_ENABLE_DATASET_VERSION_PACKAGING=1

In valohai.yaml:

- step:
    name: create-packaged-dataset
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

In UI:

  • When creating execution, go to "Environment Variables"

  • Add: VH_ENABLE_DATASET_VERSION_PACKAGING = 1
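
If you want your data-preparation script to fail fast when the variable was forgotten (for example, on a re-run created from the UI without it), you can add a small guard at the top. This check is a convenience sketch, not something Valohai requires.

import os

# Accept the same truthy values the feature accepts: 1, yes, true.
flag = os.environ.get('VH_ENABLE_DATASET_VERSION_PACKAGING', '').strip().lower()
if flag not in ('1', 'yes', 'true'):
    raise RuntimeError(
        'VH_ENABLE_DATASET_VERSION_PACKAGING is not set; '
        'this dataset version would be created without packaging.'
    )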


Step 2: Create Dataset with Packaging Flag

Add "packaging": True to your dataset version metadata:

import json
import os

# Process and save your files
output_dir = '/valohai/outputs/'
for i in range(100000):
    # Your data processing
    data = process_image(i)
    filename = f'image_{i:06d}.jpg'
    save_image(data, os.path.join(output_dir, filename))

# Create metadata with packaging enabled
metadata = {}
for i in range(100000):
    filename = f'image_{i:06d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-images/train-v1",
            "packaging": True  # Enable automatic packaging
        }]
    }

# Save metadata
metadata_path = '/valohai/outputs/valohai.metadata.jsonl'
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

What happens:

  1. Execution saves 100,000 image files

  2. Valohai packages all images into .vhpkgzip file(s)

  3. Package uploaded alongside individual files

  4. Dataset version train-v1 is created with package reference
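
For very large datasets you don't have to build the whole metadata dictionary in memory; the sidecar file can be written line by line in the same loop that produces the files. A sketch of that variant (process_image and save_image remain placeholders for your own processing):

import json
import os

output_dir = '/valohai/outputs/'
dataset_entry = {
    "valohai.dataset-versions": [{
        "uri": "dataset://my-images/train-v1",
        "packaging": True
    }]
}

with open(os.path.join(output_dir, 'valohai.metadata.jsonl'), 'w') as f:
    for i in range(100000):
        filename = f'image_{i:06d}.jpg'
        save_image(process_image(i), os.path.join(output_dir, filename))  # placeholders
        json.dump({"file": filename, "metadata": dataset_entry}, f)
        f.write('\n')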


Step 3: Use the Packaged Dataset

Use the dataset normally in your training pipeline:

- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1

No code changes needed — Your training script reads files from /valohai/inputs/training-images/ exactly as before:

import os

# Works the same whether packaged or not
input_dir = '/valohai/inputs/training-images/'
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
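
Note that os.listdir only sees the top level of the input directory. If your dataset version preserves subdirectories (as in the image classification example further down), walk the tree instead; a minimal sketch:

from pathlib import Path

input_dir = Path('/valohai/inputs/training-images/')
for filepath in input_dir.rglob('*.jpg'):  # recurses into subfolders
    process_image(filepath)  # placeholder, as above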

Complete Example: Time-Series Dataset

Process daily sensor data and package for fast access:

import json
import pandas as pd
import os

# Step 1: Process data files
output_dir = '/valohai/outputs/'
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')

metadata = {}

for date in dates:
    # Process sensor data for this date
    sensor_data = load_sensor_data(date)
    
    # Save as CSV
    filename = f'sensor_{date.strftime("%Y%m%d")}.csv'
    filepath = os.path.join(output_dir, filename)
    sensor_data.to_csv(filepath, index=False)
    
    # Add to metadata with packaging enabled
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://sensor-data/year-2024",
            "packaging": True
        }]
    }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Created dataset with {len(metadata)} files")
print("Packaging will happen automatically")

valohai.yaml configuration:

- step:
    name: prepare-sensor-data
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

- step:
    name: train-forecasting-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: sensor-data
        default: dataset://sensor-data/year-2024
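
A minimal sketch of what train.py might look like on the consuming side, assuming the daily CSVs share the same columns; the input name sensor-data matches the step above.

import glob
import pandas as pd

# Read every daily CSV from the extracted input directory into one frame.
files = sorted(glob.glob('/valohai/inputs/sensor-data/*.csv'))
data = pd.concat((pd.read_csv(path) for path in files), ignore_index=True)

print(f"Loaded {len(files)} files, {len(data)} rows")
# ...fit your forecasting model on `data` here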

Complete Example: Image Classification

Prepare image classification dataset with folder structure:

import json
import os
from PIL import Image

# Step 1: Process and save images by class
output_dir = '/valohai/outputs/'
classes = ['cats', 'dogs', 'birds']

metadata = {}

for class_name in classes:
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)
    
    # Process images for this class
    for i in range(10000):
        # Your image processing
        img = process_image(class_name, i)
        
        # Save with class folder structure
        filename = f'{class_name}/{class_name}_{i:05d}.jpg'
        filepath = os.path.join(output_dir, filename)
        img.save(filepath)
        
        # Add to metadata with packaging
        metadata[filename] = {
            "valohai.dataset-versions": [{
                "uri": "dataset://imagenet-subset/train-v1",
                "packaging": True
            }]
        }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Packaged {len(metadata)} images across {len(classes)} classes")

Directory structure is preserved:

/valohai/inputs/training-images/
├── cats/
│   ├── cats_00000.jpg
│   ├── cats_00001.jpg
│   └── ...
├── dogs/
│   ├── dogs_00000.jpg
│   └── ...
└── birds/
    └── ...
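
Because the class folders survive packaging and extraction unchanged, standard folder-per-class loaders work as-is. For example, if you train with torchvision (the train step earlier uses a pytorch/pytorch image), ImageFolder can read the extracted inputs directly; a minimal sketch:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Folder names (birds/, cats/, dogs/) become class labels automatically.
dataset = datasets.ImageFolder('/valohai/inputs/training-images/', transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

print(dataset.classes)  # ['birds', 'cats', 'dogs']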

How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.

During Dataset Creation

  1. Execution runs with VH_ENABLE_DATASET_VERSION_PACKAGING=1

  2. Files saved to /valohai/outputs/

  3. Metadata processed — Valohai detects "packaging": True

  4. Automatic packaging:

    • All files for the dataset version are bundled into .vhpkgzip file(s)

    • Package format is an uncompressed zip, so extraction is fast (see the illustration after this list)

    • Large datasets may be split into multiple packages

  5. Upload:

    • Individual files uploaded to storage

    • Package file(s) uploaded alongside them

  6. Dataset version created with package reference
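
The snippet below only illustrates the "uncompressed zip" idea from step 4 with Python's standard zipfile module: stored (uncompressed) archives bundle many files into one object, and extracting them is essentially a byte copy. It is not Valohai's internal packaging code.

import zipfile
from pathlib import Path

# Create a couple of dummy files to bundle.
for name in ('image_000001.jpg', 'image_000002.jpg'):
    Path(name).write_bytes(b'\x00' * 1024)

# ZIP_STORED bundles files without compression, which keeps extraction cheap.
with zipfile.ZipFile('bundle.zip', 'w', compression=zipfile.ZIP_STORED) as zf:
    zf.write('image_000001.jpg')
    zf.write('image_000002.jpg')

# Extraction is a straight copy of the stored bytes, no decompression step.
with zipfile.ZipFile('bundle.zip') as zf:
    zf.extractall('extracted/')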


During Execution Using Packaged Dataset

  1. Execution starts needing dataset://my-images/train-v1

  2. Valohai checks if packages exist for this dataset version

  3. Download:

    • Package file(s) downloaded instead of individual files

    • Dramatically fewer API calls (1-10 vs thousands/millions)

  4. Automatic extraction:

    • Package extracted to cache directory

    • Files appear in /valohai/inputs/ as normal

  5. Your code runs — No changes needed, files work identically

  6. Caching:

    • Extracted files cached for future executions

    • Same package reused across multiple runs


Verify Packaging Worked

Check Execution Logs

Look for packaging messages in the execution that created the dataset:

(dataset version package 1/1) uploading <dataset-name>.<dataset-version>.vhpkgzip (15 GB)
(dataset version package 1/1) uploading to <destination-store> bucket <destination-bucket>
(dataset version package 1/1) upload complete (datum <datum-id>)

Expert Mode UI

View packages in the Valohai UI:

  1. Open the dataset version in Data → Datasets

  2. Press Ctrl + Shift + X (or Cmd + Shift + X on Mac) to enable Expert Mode

  3. View the Packages section showing .vhpkgzip files
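
You can also sanity-check extraction from inside the consuming execution by counting the files that arrived; a small sketch (adjust the input name to match your step):

import os

input_dir = '/valohai/inputs/training-images/'
count = sum(len(files) for _, _, files in os.walk(input_dir))
print(f"Found {count} files in {input_dir}")  # should match the dataset version's file count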

Performance Expectations

Exact gains depend on file count, file size, and your storage provider. The comparisons below show three representative outcomes.

~10x speedup:

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 2-3 hours | None |
| Packaged | 10-15 minutes | 2-3 minutes extraction |
| Speedup | ~10x faster | One-time cost |

~20x speedup:

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 4-6 hours | None |
| Packaged | 10-15 minutes | 5-8 minutes extraction |
| Speedup | ~20x faster | One-time cost |

Minimal speedup (packaging not worth it):

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 30-40 minutes | None |
| Packaged | 25-35 minutes | 2-3 minutes extraction |
| Speedup | Minimal | Not worth it |


Current Limitations

Programmatic Creation Only

Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).

Not yet supported:

  • Datasets created via Web UI

  • Datasets created via API

Workaround: Create datasets programmatically in a dedicated data preparation execution.


Requires Environment Variable

You must set VH_ENABLE_DATASET_VERSION_PACKAGING=1 when creating the dataset version.

This requirement will be removed in a future update when packaging becomes the default behavior.


Best Practices

Start with Small Test Dataset

Before packaging millions of files, test with a smaller subset:

# Test with 1,000 files first
metadata = {}
for i in range(1000):  # Not 100,000
    filename = f'test_image_{i:05d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://test-packaging/small-test",
            "packaging": True
        }]
    }

Verify packaging works, then scale to full dataset.


Name Dataset Versions Clearly

Indicate when datasets are packaged:

# Good: Clear versioning
"uri": "dataset://images/train-v2-packaged"
"uri": "dataset://sensors/2024-q1-optimized"

# Avoid: Unclear what changed
"uri": "dataset://images/new"
"uri": "dataset://sensors/updated"

Monitor First Execution

The first execution using a packaged dataset will:

  1. Download package

  2. Extract all files (one-time cost)

  3. Cache for future use

Subsequent executions skip steps 1-2 and use cached files.

Watch logs to verify extraction completes successfully.


Combine with Dataset Versioning

Create new packaged versions as data evolves:

# Initial dataset
metadata = {
    "file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

# Updated dataset (new files added)
metadata = {
    "new_file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v2",
            "from": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

Each version is packaged separately with its own optimized package.


Next Steps

  • Test with a small dataset (1,000-10,000 files)

  • Measure performance improvement for your use case

  • Scale to production datasets
