Package Datasets for Faster Downloads

Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.


The Problem: Small Files Are Slow

Downloading many small files is slow, even with fast networks and cloud storage.

Downloading 500k images (10 KB each) one by one takes roughly 12 hours, because every file carries its own overhead: API calls, authentication, metadata checks, and so on.

Packed into a single file, a million images download in roughly 10-15 minutes: one API call and one continuous download stream.

Why this matters:

  • Each file requires separate API calls and authentication

  • Network latency adds up across thousands of files

  • Cloud storage rate limits can slow batch downloads

  • Training jobs wait hours for data before starting
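
A quick back-of-the-envelope calculation shows how the ~12 hour figure above comes almost entirely from per-file overhead rather than payload size. The ~85 ms overhead per file used below is an assumed average for illustration; real values vary by storage provider and region.

# Rough estimate of download time dominated by per-file overhead.
files = 500_000
per_file_overhead_s = 0.085  # assumed: API call + auth + metadata round trips per file
file_size_bytes = 10_000     # 10 KB per file
bandwidth_bytes_s = 100 * 1024 * 1024  # ~100 MB/s link

payload_s = files * file_size_bytes / bandwidth_bytes_s
total_hours = (files * per_file_overhead_s + payload_s) / 3600
print(f"Payload alone: {payload_s:.0f} s, total with overhead: {total_hours:.1f} h")
# Payload alone: ~48 s, total with overhead: ~11.8 h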


The Solution: Automatic Dataset Packaging

Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.

Benefits:

  • 10-100x faster downloads for datasets with many small files

  • Automatic — Happens during dataset creation

  • Transparent — Your code doesn't change

  • Cached — Packages are reused across executions

  • No compression overhead — Fast extraction


When to Use Dataset Packaging

Ideal Use Cases

Image classification datasets:

1 million images x 50 KB each means slow individual downloads; packaged into a single ~50 GB file, it becomes one fast download.

Time-series data:

500k CSV files x 5 KB each means hours of downloading; packaged into a single ~2.5 GB file, it downloads in minutes.

Benefits appear when you have:

  • 10,000+ files in a dataset version

  • File sizes under 1MB each

  • Frequent reuse of the same dataset

  • Long data download times blocking training
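
If you are unsure whether your dataset crosses these thresholds, a small script can check the output directory before you commit to packaging. This is only a heuristic sketch mirroring the guidelines above, not an official rule.

import os

def packaging_looks_useful(directory, min_files=10_000, max_avg_size=1_000_000):
    """Heuristic: many files with a small average size benefit most from packaging."""
    sizes = []
    for root, _, files in os.walk(directory):
        for name in files:
            sizes.append(os.path.getsize(os.path.join(root, name)))
    if not sizes:
        return False
    avg_size = sum(sizes) / len(sizes)
    return len(sizes) >= min_files and avg_size <= max_avg_size

print(packaging_looks_useful('/valohai/outputs/'))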


When Not to Use

Don't package if:

  • You have fewer than 1,000 files (overhead not worth it)

  • Files are already large (>10MB each)

  • You only use the dataset once

  • Files are already packaged (e.g., existing tar/zip archives)


Packaging vs Manual Tar Files

You might wonder: "Why not just tar files myself?"

| Approach | When to Use |
| --- | --- |
| Manual tar (covered in Create and Manage Datasets) | One-time packaging, full control over structure, works today for all dataset types |
| Automatic packaging (this feature) | Repeated dataset updates, programmatic workflows, want Valohai to handle it |

Key difference:

  • Manual tar: You create images.tar, add it to dataset, extract in your code

  • Automatic packaging: You create dataset normally, Valohai packages automatically, extracts transparently

Use both:

# Manual tar for stable baseline data
metadata = {
    "baseline_images.tar": {
        "valohai.dataset-versions": ["dataset://images/baseline-v1"]
    }
}

# Automatic packaging for frequently updated data
metadata = {
    "new_image_001.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://images/daily-v2",
            "packaging": True  # Valohai packages automatically
        }]
    }
}

Enable Dataset Packaging

To enable dataset packaging, set the environment variable VH_ENABLE_DATASET_VERSION_PACKAGING to a truthy value (1, yes, or true) in your execution.

💡 Take a look at the other environment variables that control the worker machine agent's behavior, and how to include them by default in every execution.


How to Package a Dataset Version

Step 1: Enable in Execution Configuration

When creating your execution (via UI, API, or CLI), set the environment variable:

VH_ENABLE_DATASET_VERSION_PACKAGING=1

In valohai.yaml:

- step:
    name: create-packaged-dataset
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

In UI:

  • When creating execution, go to "Environment Variables"

  • Add: VH_ENABLE_DATASET_VERSION_PACKAGING = 1
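
If you want your data-preparation script to fail fast when the variable was forgotten (for example, on a re-run created from the UI without it), you can add a small guard at the top. This check is a convenience sketch, not something Valohai requires.

import os

# Accept the same truthy values the feature accepts: 1, yes, true.
flag = os.environ.get('VH_ENABLE_DATASET_VERSION_PACKAGING', '').strip().lower()
if flag not in ('1', 'yes', 'true'):
    raise RuntimeError(
        'VH_ENABLE_DATASET_VERSION_PACKAGING is not set; '
        'this dataset version would be created without packaging.'
    )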


Step 2: Create Dataset with Packaging Flag

Add "packaging": True to your dataset version metadata:

import json
import os

# Process and save your files
output_dir = '/valohai/outputs/'
for i in range(100000):
    # Your data processing
    data = process_image(i)
    filename = f'image_{i:06d}.jpg'
    save_image(data, os.path.join(output_dir, filename))

# Create metadata with packaging enabled
metadata = {}
for i in range(100000):
    filename = f'image_{i:06d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-images/train-v1",
            "packaging": True  # Enable automatic packaging
        }]
    }

# Save metadata
metadata_path = '/valohai/outputs/valohai.metadata.jsonl'
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

What happens:

  1. Execution saves 100,000 image files

  2. Valohai packages all images into .vhpkgzip file(s)

  3. Package uploaded alongside individual files

  4. Dataset version train-v1 is created with package reference
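
For very large datasets you don't have to build the whole metadata dictionary in memory; the sidecar file can be written line by line in the same loop that produces the files. A sketch of that variant (process_image and save_image remain placeholders for your own processing):

import json
import os

output_dir = '/valohai/outputs/'
dataset_entry = {
    "valohai.dataset-versions": [{
        "uri": "dataset://my-images/train-v1",
        "packaging": True
    }]
}

with open(os.path.join(output_dir, 'valohai.metadata.jsonl'), 'w') as f:
    for i in range(100000):
        filename = f'image_{i:06d}.jpg'
        save_image(process_image(i), os.path.join(output_dir, filename))  # placeholders
        json.dump({"file": filename, "metadata": dataset_entry}, f)
        f.write('\n')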


Step 3: Use the Packaged Dataset

Use the dataset normally in your training pipeline:

- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1

No code changes needed — Your training script reads files from /valohai/inputs/training-images/ exactly as before:

import os

# Works the same whether packaged or not
input_dir = '/valohai/inputs/training-images/'
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
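
Note that os.listdir only sees the top level of the input directory. If your dataset version preserves subdirectories (as in the image classification example further down), walk the tree instead; a minimal sketch:

from pathlib import Path

input_dir = Path('/valohai/inputs/training-images/')
for filepath in input_dir.rglob('*.jpg'):  # recurses into subfolders
    process_image(filepath)  # placeholder, as above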

Complete Example: Time-Series Dataset

Process daily sensor data and package for fast access:

import json
import pandas as pd
import os

# Step 1: Process data files
output_dir = '/valohai/outputs/'
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')

metadata = {}

for date in dates:
    # Process sensor data for this date
    sensor_data = load_sensor_data(date)
    
    # Save as CSV
    filename = f'sensor_{date.strftime("%Y%m%d")}.csv'
    filepath = os.path.join(output_dir, filename)
    sensor_data.to_csv(filepath, index=False)
    
    # Add to metadata with packaging enabled
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://sensor-data/year-2024",
            "packaging": True
        }]
    }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Created dataset with {len(metadata)} files")
print("Packaging will happen automatically")

valohai.yaml configuration:

- step:
    name: prepare-sensor-data
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

- step:
    name: train-forecasting-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: sensor-data
        default: dataset://sensor-data/year-2024
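
A minimal sketch of what train.py might look like on the consuming side, assuming the daily CSVs share the same columns; the input name sensor-data matches the step above.

import glob
import pandas as pd

# Read every daily CSV from the extracted input directory into one frame.
files = sorted(glob.glob('/valohai/inputs/sensor-data/*.csv'))
data = pd.concat((pd.read_csv(path) for path in files), ignore_index=True)

print(f"Loaded {len(files)} files, {len(data)} rows")
# ...fit your forecasting model on `data` here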

Complete Example: Image Classification

Prepare image classification dataset with folder structure:

import json
import os
from PIL import Image

# Step 1: Process and save images by class
output_dir = '/valohai/outputs/'
classes = ['cats', 'dogs', 'birds']

metadata = {}

for class_name in classes:
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)
    
    # Process images for this class
    for i in range(10000):
        # Your image processing
        img = process_image(class_name, i)
        
        # Save with class folder structure
        filename = f'{class_name}/{class_name}_{i:05d}.jpg'
        filepath = os.path.join(output_dir, filename)
        img.save(filepath)
        
        # Add to metadata with packaging
        metadata[filename] = {
            "valohai.dataset-versions": [{
                "uri": "dataset://imagenet-subset/train-v1",
                "packaging": True
            }]
        }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write('\n')

print(f"Packaged {len(metadata)} images across {len(classes)} classes")

Directory structure is preserved:

/valohai/inputs/training-images/
├── cats/
│   ├── cats_00000.jpg
│   ├── cats_00001.jpg
│   └── ...
├── dogs/
│   ├── dogs_00000.jpg
│   └── ...
└── birds/
    └── ...
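
Because the class folders survive packaging and extraction unchanged, standard folder-per-class loaders work as-is. For example, if you train with torchvision (the train step earlier uses a pytorch/pytorch image), ImageFolder can read the extracted inputs directly; a minimal sketch:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Folder names (birds/, cats/, dogs/) become class labels automatically.
dataset = datasets.ImageFolder('/valohai/inputs/training-images/', transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

print(dataset.classes)  # ['birds', 'cats', 'dogs']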

How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.

During Dataset Creation

  1. Execution runs with VH_ENABLE_DATASET_VERSION_PACKAGING=1

  2. Files saved to /valohai/outputs/

  3. Metadata processed — Valohai detects "packaging": True

  4. Automatic packaging:

    • All files for the dataset version are bundled into .vhpkgzip file(s)

    • Package format is an uncompressed zip, so extraction is fast (see the illustration after this list)

    • Large datasets may be split into multiple packages

  5. Upload:

    • Individual files uploaded to storage

    • Package file(s) uploaded alongside them

  6. Dataset version created with package reference
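
The snippet below only illustrates the "uncompressed zip" idea from step 4 with Python's standard zipfile module: stored (uncompressed) archives bundle many files into one object, and extracting them is essentially a byte copy. It is not Valohai's internal packaging code.

import zipfile
from pathlib import Path

# Create a couple of dummy files to bundle.
for name in ('image_000001.jpg', 'image_000002.jpg'):
    Path(name).write_bytes(b'\x00' * 1024)

# ZIP_STORED bundles files without compression, which keeps extraction cheap.
with zipfile.ZipFile('bundle.zip', 'w', compression=zipfile.ZIP_STORED) as zf:
    zf.write('image_000001.jpg')
    zf.write('image_000002.jpg')

# Extraction is a straight copy of the stored bytes, no decompression step.
with zipfile.ZipFile('bundle.zip') as zf:
    zf.extractall('extracted/')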


During Execution Using Packaged Dataset

  1. Execution starts needing dataset://my-images/train-v1

  2. Valohai checks if packages exist for this dataset version

  3. Download:

    • Package file(s) downloaded instead of individual files

    • Dramatically fewer API calls (1-10 vs thousands/millions)

  4. Automatic extraction:

    • Package extracted to cache directory

    • Files appear in /valohai/inputs/ as normal

  5. Your code runs — No changes needed, files work identically

  6. Caching:

    • Extracted files cached for future executions

    • Same package reused across multiple runs


Verify Packaging Worked

Check Execution Logs

Look for packaging messages in the execution that created the dataset:

(dataset version package 1/1) uploading <dataset-name>.<dataset-version>.vhpkgzip (15 GB)
(dataset version package 1/1) uploading to <destination-store> bucket <destination-bucket>
(dataset version package 1/1) upload complete (datum <datum-id>)

Expert Mode UI

View packages in the Valohai UI:

  1. Open the dataset version in Data → Datasets

  2. Press Ctrl + Shift + X (or Cmd + Shift + X on Mac) to enable Expert Mode

  3. View the Packages section showing .vhpkgzip files
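
You can also sanity-check extraction from inside the consuming execution by counting the files that arrived; a small sketch (adjust the input name to match your step):

import os

input_dir = '/valohai/inputs/training-images/'
count = sum(len(files) for _, _, files in os.walk(input_dir))
print(f"Found {count} files in {input_dir}")  # should match the dataset version's file count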

Performance Expectations

Exact gains depend on file count, file size, and your storage provider. The comparisons below show three representative outcomes.

~10x speedup:

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 2-3 hours | None |
| Packaged | 10-15 minutes | 2-3 minutes extraction |
| Speedup | ~10x faster | One-time cost |

~20x speedup:

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 4-6 hours | None |
| Packaged | 10-15 minutes | 5-8 minutes extraction |
| Speedup | ~20x faster | One-time cost |

Minimal speedup (packaging not worth it):

| Method | Download Time | Setup Overhead |
| --- | --- | --- |
| Individual files | 30-40 minutes | None |
| Packaged | 25-35 minutes | 2-3 minutes extraction |
| Speedup | Minimal | Not worth it |


Current Limitations

Programmatic Creation Only

Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).

Not yet supported:

  • Datasets created via Web UI

  • Datasets created via API

Workaround: Create datasets programmatically in a dedicated data preparation execution.


Requires Environment Variable

You must set VH_ENABLE_DATASET_VERSION_PACKAGING=1 when creating the dataset version.

This requirement will be removed in a future update when packaging becomes the default behavior.


Best Practices

Start with Small Test Dataset

Before packaging millions of files, test with a smaller subset:

# Test with 1,000 files first
metadata = {}
for i in range(1000):  # Not 100,000
    filename = f'test_image_{i:05d}.jpg'
    metadata[filename] = {
        "valohai.dataset-versions": [{
            "uri": "dataset://test-packaging/small-test",
            "packaging": True
        }]
    }

Verify packaging works, then scale to full dataset.


Name Dataset Versions Clearly

Indicate when datasets are packaged:

# Good: Clear versioning
"uri": "dataset://images/train-v2-packaged"
"uri": "dataset://sensors/2024-q1-optimized"

# Avoid: Unclear what changed
"uri": "dataset://images/new"
"uri": "dataset://sensors/updated"

Monitor First Execution

The first execution using a packaged dataset will:

  1. Download package

  2. Extract all files (one-time cost)

  3. Cache for future use

Subsequent executions skip steps 1-2 and use cached files.

Watch logs to verify extraction completes successfully.


Combine with Dataset Versioning

Create new packaged versions as data evolves:

# Initial dataset
metadata = {
    "file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

# Updated dataset (new files added)
metadata = {
    "new_file.jpg": {
        "valohai.dataset-versions": [{
            "uri": "dataset://my-data/v2",
            "from": "dataset://my-data/v1",
            "packaging": True
        }]
    }
}

Each version is packaged separately with its own optimized package.


Next Steps

  • Test with a small dataset (1,000-10,000 files)

  • Measure performance improvement for your use case

  • Scale to production datasets
