Package Datasets for Faster Downloads

Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.


The Problem: Small Files Are Slow

Downloading many small files is slow, even with fast networks and cloud storage.

500k images (10KB each), downloaded individually -> ~12 hours. Per-file overhead: API calls, authentication, metadata checks, ...

The same images packed into a single file -> ~10-15 minutes. One API call, one continuous download stream.

Why this matters:

  • Each file requires separate API calls and authentication

  • Network latency adds up across thousands of files

  • Cloud storage rate limits can slow batch downloads

  • Training jobs wait hours for data before starting


The Solution: Automatic Dataset Packaging

Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.

Benefits:

  • 10-100x faster downloads for datasets with many small files

  • Automatic — Happens during dataset creation

  • Transparent — Your code doesn't change

  • Cached — Packages are reused across executions

  • No compression overhead — Fast extraction


When to Use Dataset Packaging

Ideal Use Cases

Image classification datasets:

1 million images x 50KB each = slow individual downloads
Packaged into a ~50GB file = fast download of a single file

Time-series data:

500k CSV files x 5KB each = hours of downloading
Packaged into a ~2.5GB file = minutes of downloading

Benefits appear when you have:

  • 10,000+ files in a dataset version

  • File sizes under 1MB each

  • Frequent reuse of the same dataset

  • Long data download times blocking training


When Not to Use

Don't package if:

  • You have fewer than 1,000 files (overhead not worth it)

  • Files are already large (>10MB each)

  • You only use the dataset once

  • Files are already packaged (e.g., existing tar/zip archives)


Packaging vs Manual Tar Files

You might wonder: "Why not just tar files myself?"

  • Manual tar (covered in Create and Manage Datasets): one-time packaging, full control over structure, works today for all dataset types

  • Automatic packaging (this feature): repeated dataset updates, programmatic workflows, and when you want Valohai to handle it

Key difference:

  • Manual tar: you create images.tar, add it to the dataset, and extract it in your code

  • Automatic packaging: you create the dataset normally; Valohai packages it automatically and extracts it transparently

You can also use both in the same project: keep manual tars where they already exist, and enable automatic packaging for new dataset versions.


Enable Dataset Packaging

To enable dataset packaging, set the environment variable VH_ENABLE_DATASET_VERSION_PACKAGING to a truthy value (1, yes, or true) in your execution.

💡 See the other environment variables that control the worker agent's behavior, and how to include them by default in every execution.


How to Package a Dataset Version

Step 1: Enable in Execution Configuration

When creating your execution (via UI, API, or CLI), set the environment variable:

In valohai.yaml:
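A minimal sketch of the step definition (the step name, image, and command are placeholders for your own):

```yaml
- step:
    name: prepare-dataset
    image: python:3.11
    command: python prepare.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"
```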

In UI:

  • When creating execution, go to "Environment Variables"

  • Add: VH_ENABLE_DATASET_VERSION_PACKAGING = 1


Step 2: Create Dataset with Packaging Flag

Add "packaging": True to your dataset version metadata:

What happens:

  1. Execution saves 100,000 image files

  2. Valohai packages all images into .vhpkgzip file(s)

  3. Package uploaded alongside individual files

  4. Dataset version train-v1 is created with package reference


Step 3: Use the Packaged Dataset

Use the dataset normally in your training pipeline:
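For instance, a sketch of the consuming step (the step and input names are illustrative; the input name determines the path under /valohai/inputs/):

```yaml
- step:
    name: train-model
    image: python:3.11
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1
```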

No code changes needed — Your training script reads files from /valohai/inputs/training-images/ exactly as before:
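A minimal sketch of the consuming side:

```python
from pathlib import Path

# Extracted package contents appear as ordinary files under the input directory.
input_dir = Path("/valohai/inputs/training-images")
image_paths = sorted(p for p in input_dir.rglob("*") if p.is_file())
print(f"Found {len(image_paths)} training images")
```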


Complete Example: Time-Series Dataset

Process daily sensor data and package for fast access:

valohai.yaml configuration:
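A sketch of what the step might look like (names and the input URI are placeholders; the preparation script writes sidecar metadata with "packaging": True as shown in Step 2):

```yaml
- step:
    name: prepare-sensor-data
    image: python:3.11
    command: python prepare_sensors.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"
    inputs:
      - name: raw-sensor-data
        default: s3://my-bucket/sensors/*.csv
```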


Complete Example: Image Classification

Prepare an image classification dataset with a class-per-folder structure:
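One possible preparation script (a sketch; the raw_images source directory and dataset URI are hypothetical):

```python
import json
import shutil
from pathlib import Path

OUTPUT_DIR = Path("/valohai/outputs")
DATASET_VERSION = "dataset://my-images/train-v1"  # placeholder URI

# Copy images into class subfolders under /valohai/outputs/;
# subdirectories become part of each file's path in the dataset.
for src in Path("raw_images").rglob("*.jpg"):  # e.g. raw_images/cats/cat_0001.jpg
    dst = OUTPUT_DIR / src.parent.name / src.name
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dst)
    dst.with_name(dst.name + ".metadata.json").write_text(json.dumps({
        "valohai.dataset-versions": [{"uri": DATASET_VERSION, "packaging": True}],
    }))
```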

Directory structure is preserved:
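With the hypothetical cats/dogs layout above, the consuming execution would see:

```
/valohai/inputs/training-images/
├── cats/
│   ├── cat_0001.jpg
│   └── ...
└── dogs/
    ├── dog_0001.jpg
    └── ...
```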


How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.

During Dataset Creation

  1. Execution runs with VH_ENABLE_DATASET_VERSION_PACKAGING=1

  2. Files saved to /valohai/outputs/

  3. Metadata processed — Valohai detects "packaging": True

  4. Automatic packaging:

    • All files for the dataset version are bundled into .vhpkgzip file(s)

    • Package format is an uncompressed zip (fast extraction)

    • Large datasets may be split into multiple packages

  5. Upload:

    • Individual files uploaded to storage

    • Package file(s) uploaded alongside them

  6. Dataset version created with package reference


During Execution Using Packaged Dataset

  1. Execution starts needing dataset://my-images/train-v1

  2. Valohai checks if packages exist for this dataset version

  3. Download:

    • Package file(s) downloaded instead of individual files

    • Dramatically fewer API calls (1-10 vs thousands/millions)

  4. Automatic extraction:

    • Package extracted to cache directory

    • Files appear in /valohai/inputs/ as normal

  5. Your code runs — No changes needed, files work identically

  6. Caching:

    • Extracted files cached for future executions

    • Same package reused across multiple runs


Verify Packaging Worked

Check Execution Logs

Look for packaging messages in the logs of the execution that created the dataset.


Expert Mode UI

View packages in the Valohai UI:

  1. Open the dataset version in Data → Datasets

  2. Press Ctrl + Shift + X (or Cmd + Shift + X on Mac) to enable Expert Mode

  3. View the Packages section showing .vhpkgzip files


Performance Comparison

Representative numbers; actual times depend on file count, file sizes, and network:

Method             Download Time     Setup Overhead
Individual files   2-3 hours         None
Packaged           10-15 minutes     2-3 minutes extraction
Speedup            ~10x faster       One-time cost

Method             Download Time     Setup Overhead
Individual files   4-6 hours         None
Packaged           10-15 minutes     5-8 minutes extraction
Speedup            ~20x faster       One-time cost

Method             Download Time     Setup Overhead
Individual files   30-40 minutes     None
Packaged           25-35 minutes     2-3 minutes extraction
Speedup            Minimal           Not worth it


Current Limitations

Programmatic Creation Only

Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).

Not yet supported:

  • Datasets created via Web UI

  • Datasets created via API

Workaround: Create datasets programmatically in a dedicated data preparation execution.


Requires Environment Variable

You must set VH_ENABLE_DATASET_VERSION_PACKAGING=1 when creating the dataset version.

This requirement will be removed in a future update when packaging becomes the default behavior.


Best Practices

Start with Small Test Dataset

Before packaging millions of files, test with a smaller subset:
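A sketch of one way to do this (the source glob and dataset URI are placeholders):

```python
import json
from pathlib import Path

OUTPUT_DIR = Path("/valohai/outputs")

# Hypothetical: copy only the first 10,000 files into a test version first
test_files = sorted(Path("raw_data").glob("*.csv"))[:10_000]
for src in test_files:
    dst = OUTPUT_DIR / src.name
    dst.write_bytes(src.read_bytes())
    dst.with_name(dst.name + ".metadata.json").write_text(json.dumps({
        "valohai.dataset-versions": [{"uri": "dataset://sensors/test-v1", "packaging": True}],
    }))
```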

Verify packaging works, then scale to the full dataset.


Name Dataset Versions Clearly

Indicate when datasets are packaged:
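For example, a hypothetical suffix convention:

```
dataset://my-images/train-v1-packaged
dataset://my-images/train-v2-packaged
```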


Monitor First Execution

The first execution using a packaged dataset will:

  1. Download package

  2. Extract all files (one-time cost)

  3. Cache for future use

Subsequent executions skip steps 1-2 and use cached files.

Watch logs to verify extraction completes successfully.


Combine with Dataset Versioning

Create new packaged versions as data evolves:
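For instance, a recurring preparation run might target a new version each time (a sketch; the URI pattern is hypothetical, and the metadata is written as a sidecar file as in Step 2):

```python
import datetime

# One packaged version per run, e.g. dataset://my-images/train-2025-06-01
version_uri = f"dataset://my-images/train-{datetime.date.today().isoformat()}"
metadata = {
    "valohai.dataset-versions": [
        {"uri": version_uri, "packaging": True},
    ],
}
```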

Each version is packaged separately with its own optimized package.


Next Steps

  • Test with a small dataset (1,000-10,000 files)

  • Measure performance improvement for your use case

  • Scale to production datasets
