# Package Datasets for Faster Downloads

Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.

***

### The Problem: Small Files Are Slow

Downloading many small files is slow, even with fast networks and cloud storage.

:x: 500k images (10KB each) -> Download takes \~12h\
Overhead per file: API calls, authentication, metadata checks ...

:white\_check\_mark: A million images, packaged -> Single-file download takes \~10-15 minutes\
Single API call, continuous download stream

**Why this matters:**

* Each file requires separate API calls and authentication
* Network latency adds up across thousands of files
* Cloud storage rate limits can slow batch downloads
* Training jobs wait hours for data before starting
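
To see why per-file overhead dominates, the example figures above can be worked through in a few lines. This is a back-of-the-envelope sketch that uses only the numbers quoted above (500k files, 10KB each, about 12 hours); the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic using the example figures above.
num_files = 500_000
file_size_kb = 10
total_download_hours = 12

total_seconds = total_download_hours * 3600
total_size_gb = num_files * file_size_kb / 1_000_000  # KB -> GB (decimal units)
seconds_per_file = total_seconds / num_files
effective_mb_per_s = total_size_gb * 1000 / total_seconds

print(f"Total payload: {total_size_gb:.1f} GB")
print(f"Time per file: {seconds_per_file * 1000:.1f} ms")
print(f"Effective throughput: {effective_mb_per_s:.2f} MB/s")
```

The whole payload is only about 5 GB, so the roughly 86 ms spent per file is almost entirely request overhead rather than data transfer.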

***

### The Solution: Automatic Dataset Packaging

Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of the individual files, then extracts the files transparently.

**Benefits:**

* **10-100x faster downloads** for datasets with many small files
* **Automatic** — Happens during dataset creation
* **Transparent** — Your code doesn't change
* **Cached** — Packages are reused across executions
* **No compression overhead** — Fast extraction

***

### When to Use Dataset Packaging

#### Ideal Use Cases

**Image classification datasets:**

:x: 1 million images x 50KB each = slow individual downloads\
:white\_check\_mark: Packaged into \~50GB file = fast download of a single file

**Time-series data:**

:x: 500k CSV files x 5KB each = hours of downloading\
:white\_check\_mark: Packaged into \~2.5GB file = minutes of downloading

**Benefits appear when you have:**

* 10,000+ files in a dataset version
* File sizes under 1MB each
* Frequent reuse of the same dataset
* Long data download times blocking training

***

#### When Not to Use

**Don't package if:**

* You have fewer than 1,000 files (overhead not worth it)
* Files are already large (>10MB each)
* You only use the dataset once
* Files are already packaged (e.g., existing tar/zip archives)
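
The thresholds above can be turned into a quick check before you commit to packaging. The helper below is a hypothetical sketch, not part of Valohai: `packaging_verdict` and its return values are made up for illustration, and using the median as the "typical" file size is an assumption. It does not cover the reuse or already-archived criteria, which you need to judge yourself.

```python
from statistics import median

MB = 1024 * 1024


def packaging_verdict(file_sizes_bytes):
    """Rough heuristic from the thresholds above (hypothetical helper).

    Returns "package", "skip", or "measure" for the in-between cases.
    """
    count = len(file_sizes_bytes)
    if count < 1_000:
        return "skip"  # too few files: overhead not worth it
    typical_size = median(file_sizes_bytes)
    if typical_size > 10 * MB:
        return "skip"  # files are already large
    if count >= 10_000 and typical_size < 1 * MB:
        return "package"  # many small files: the ideal case
    return "measure"  # in between: test on a subset first


print(packaging_verdict([50_000] * 20_000))
```

For 20,000 files of about 50KB each, the sketch returns `"package"`.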

***

### Packaging vs Manual Tar Files

You might wonder: "Why not just tar files myself?"

| Approach                                                                                      | When to Use                                                                        |
| --------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| **Manual tar** (covered in [Create and Manage Datasets](/data/datasets/creating-datasets.md)) | One-time packaging, full control over structure, works today for all dataset types |
| **Automatic packaging** (this feature)                                                        | Repeated dataset updates, programmatic workflows, want Valohai to handle it        |

**Key difference:**

* Manual tar: You create `images.tar`, add it to dataset, extract in your code
* Automatic packaging: You create dataset normally, Valohai packages automatically, extracts transparently

**Use both:**

```python
# Manual tar for stable baseline data
baseline_metadata = {
    "baseline_images.tar": {
        "valohai.dataset-versions": ["dataset://images/baseline-v1"],
    },
}

# Automatic packaging for frequently updated data
daily_metadata = {
    "new_image_001.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://images/daily-v2",
                "packaging": True,  # Valohai packages automatically
            },
        ],
    },
}
```

***

### Enable Dataset Packaging

To enable dataset packaging, set the environment variable `VH_ENABLE_DATASET_VERSION_PACKAGING` to a truthy value (`1`, `yes`, or `true`) in your execution.

> 💡 See the [other environment variables](/executions/system-environment-variables.md) that control the worker machine agent's behavior, and how to [include them by default](/user-and-organization-management/getting-started/environment-variables.md) in every execution.
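
A data preparation script can warn early when the variable is missing. The sketch below assumes the variable is visible to your code inside the execution; `packaging_enabled` is a hypothetical helper, and matching the value case-insensitively is a leniency added here, not documented behavior.

```python
import os

# The truthy values listed above.
TRUTHY_VALUES = {"1", "yes", "true"}


def packaging_enabled(environ=os.environ):
    """Return True if the packaging variable holds one of the truthy values."""
    value = environ.get("VH_ENABLE_DATASET_VERSION_PACKAGING", "")
    # Lowercasing is an assumption made for convenience in this sketch.
    return value.strip().lower() in TRUTHY_VALUES


if not packaging_enabled():
    print(
        "Warning: VH_ENABLE_DATASET_VERSION_PACKAGING is not set to a truthy "
        "value; the dataset version may not be packaged."
    )
```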

***

### How to Package a Dataset Version

#### Step 1: Enable in Execution Configuration

When creating your execution (via UI, API, or CLI), set the environment variable:

```shell
VH_ENABLE_DATASET_VERSION_PACKAGING=1
```

**In valohai.yaml:**

```yaml
- step:
    name: create-packaged-dataset
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"
```

**In UI:**

* When creating an execution, go to "Environment Variables"
* Add: `VH_ENABLE_DATASET_VERSION_PACKAGING` = `1`

<figure><img src="/files/t2gjCQuCFbo15tBEU8gX" alt=""><figcaption></figcaption></figure>

***

#### Step 2: Create Dataset with Packaging Flag

Add `"packaging": True` to your dataset version metadata:

```python
import json
import os

# Process and save your files
# (process_image and save_image are placeholders for your own functions)
output_dir = "/valohai/outputs/"
for i in range(100000):
    # Your data processing
    data = process_image(i)
    filename = f"image_{i:06d}.jpg"
    save_image(data, os.path.join(output_dir, filename))

# Create metadata with packaging enabled
metadata = {}
for i in range(100000):
    filename = f"image_{i:06d}.jpg"
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-images/train-v1",
                "packaging": True,  # Enable automatic packaging
            },
        ],
    }

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")
```

**What happens:**

1. Execution saves 100,000 image files
2. Valohai packages all images into `.vhpkgzip` file(s)
3. Package uploaded alongside individual files
4. Dataset version `train-v1` is created with package reference
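
The metadata-writing part of the script above can be factored into a small reusable function. This is a minimal sketch: `write_packaged_metadata` is a hypothetical helper name, and it writes the same `valohai.metadata.jsonl` entries as the example above without building an intermediate dictionary.

```python
import json
import os


def write_packaged_metadata(filenames, dataset_uri, output_dir="/valohai/outputs/"):
    """Write one JSON line per file, each with the packaging flag enabled."""
    metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
    with open(metadata_path, "w") as f:
        for filename in filenames:
            entry = {
                "file": filename,
                "metadata": {
                    "valohai.dataset-versions": [
                        {"uri": dataset_uri, "packaging": True},
                    ],
                },
            }
            f.write(json.dumps(entry) + "\n")
    return metadata_path


# Example call inside an execution (same names as the script above):
# write_packaged_metadata(
#     [f"image_{i:06d}.jpg" for i in range(100000)],
#     "dataset://my-images/train-v1",
# )
```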

***

#### Step 3: Use the Packaged Dataset

Use the dataset normally in your training pipeline:

```yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1
```

**No code changes needed** — Your training script reads files from `/valohai/inputs/training-images/` exactly as before:

```python
import os

# Works the same whether packaged or not
input_dir = "/valohai/inputs/training-images/"
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
```

***

### Complete Example: Time-Series Dataset

Process daily sensor data and package for fast access:

```python
import json
import pandas as pd
import os

# Step 1: Process data files
output_dir = "/valohai/outputs/"
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")

metadata = {}

for date in dates:
    # Process sensor data for this date
    # (load_sensor_data is a placeholder for your own loader returning a DataFrame)
    sensor_data = load_sensor_data(date)

    # Save as CSV
    filename = f"sensor_{date.strftime('%Y%m%d')}.csv"
    filepath = os.path.join(output_dir, filename)
    sensor_data.to_csv(filepath, index=False)

    # Add to metadata with packaging enabled
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://sensor-data/year-2024",
                "packaging": True,
            },
        ],
    }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")

print(f"Created dataset with {len(metadata)} files")
print("Packaging will happen automatically")
```

**valohai.yaml configuration:**

```yaml
- step:
    name: prepare-sensor-data
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

- step:
    name: train-forecasting-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: sensor-data
        default: dataset://sensor-data/year-2024
```

***

### Complete Example: Image Classification

Prepare image classification dataset with folder structure:

```python
import json
import os
from PIL import Image

# Step 1: Process and save images by class
output_dir = "/valohai/outputs/"
classes = ["cats", "dogs", "birds"]

metadata = {}

for class_name in classes:
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)

    # Process images for this class
    for i in range(10000):
        # Your image processing (process_image is a placeholder returning a PIL image)
        img = process_image(class_name, i)

        # Save with class folder structure
        filename = f"{class_name}/{class_name}_{i:05d}.jpg"
        filepath = os.path.join(output_dir, filename)
        img.save(filepath)

        # Add to metadata with packaging
        metadata[filename] = {
            "valohai.dataset-versions": [
                {
                    "uri": "dataset://imagenet-subset/train-v1",
                    "packaging": True,
                },
            ],
        }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")

print(f"Packaged {len(metadata)} images across {len(classes)} classes")
```

**Directory structure is preserved:**

```
/valohai/inputs/training-images/
├── cats/
│   ├── cats_00000.jpg
│   ├── cats_00001.jpg
│   └── ...
├── dogs/
│   ├── dogs_00000.jpg
│   └── ...
└── birds/
    └── ...
```
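
Because the class folders are preserved, a training script can derive labels from the directory names. The following is a minimal sketch; `list_labeled_images` is a hypothetical helper, and the `.jpg` extension and the input path match the example above.

```python
from pathlib import Path


def list_labeled_images(input_dir):
    """Return sorted (image path, class name) pairs, one per .jpg file."""
    root = Path(input_dir)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.glob("*.jpg")):
            # The folder name is used as the class label.
            pairs.append((image_path, class_dir.name))
    return pairs


# Example call inside an execution:
# samples = list_labeled_images("/valohai/inputs/training-images/")
```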

***

### How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.

#### During Dataset Creation

1. **Execution runs** with `VH_ENABLE_DATASET_VERSION_PACKAGING=1`
2. **Files saved** to `/valohai/outputs/`
3. **Metadata processed** — Valohai detects `"packaging": True`
4. **Automatic packaging:**
   * All files for the dataset version are bundled into `.vhpkgzip` file(s)
   * Package format is an uncompressed zip (fast extraction)
   * Large datasets may be split into multiple packages
5. **Upload:**
   * Individual files uploaded to storage
   * Package file(s) uploaded alongside them
6. **Dataset version created** with package reference
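
To illustrate what "uncompressed zip" means, the sketch below builds one with Python's standard `zipfile` module using `ZIP_STORED`. It is not Valohai's implementation, and the internal layout of `.vhpkgzip` files is not described here; `build_stored_zip` is a hypothetical helper for illustration only.

```python
import zipfile
from pathlib import Path


def build_stored_zip(source_dir, archive_path):
    """Bundle every file under source_dir into a zip without compression.

    archive_path should live outside source_dir so the archive does not
    try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir to keep the folder structure.
                archive.write(path, arcname=path.relative_to(source).as_posix())
    return archive_path
```

With `ZIP_STORED`, each entry's stored size equals its original size, so extraction is a plain copy with no decompression step.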

***

#### During Execution Using Packaged Dataset

1. **Execution starts** needing `dataset://my-images/train-v1`
2. **Valohai checks** if packages exist for this dataset version
3. **Download:**
   * Package file(s) downloaded instead of individual files
   * Dramatically fewer API calls (1-10 vs thousands/millions)
4. **Automatic extraction:**
   * Package extracted to cache directory
   * Files appear in `/valohai/inputs/` as normal
5. **Your code runs** — No changes needed, files work identically
6. **Caching:**
   * Extracted files cached for future executions
   * Same package reused across multiple runs

***

### Verify Packaging Worked

#### Check Execution Logs

Look for packaging messages in the execution that created the dataset:

```
(dataset version package 1/1) uploading <dataset-name>.<dataset-version>.vhpkgzip (15 GB)
(dataset version package 1/1) uploading to <destination-store> bucket <destination-bucket>
(dataset version package 1/1) upload complete (datum <datum-id>)
```
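
If you have the log text available, the check can be automated. The sketch below assumes the log lines follow the sample above exactly; `all_packages_uploaded` is a hypothetical helper, and how you obtain the log text is left to you.

```python
import re

# Matches lines such as:
# (dataset version package 1/1) upload complete (datum <datum-id>)
UPLOAD_DONE = re.compile(r"\(dataset version package (\d+)/(\d+)\) upload complete")


def all_packages_uploaded(log_text):
    """Return True if every announced package has an 'upload complete' line."""
    completed = set()
    total = None
    for match in UPLOAD_DONE.finditer(log_text):
        completed.add(int(match.group(1)))
        total = int(match.group(2))
    return total is not None and completed == set(range(1, total + 1))
```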

***

#### Expert Mode UI

View packages in the Valohai UI:

1. Open the dataset version in **Data → Datasets**
2. Press **Ctrl + Shift + X** (or **Cmd + Shift + X** on Mac) to enable Expert Mode
3. View the **Packages** section showing `.vhpkgzip` files

The following tables compare download times for individual files and packaged datasets in three scenarios. In the last one, packaging brings little benefit.

| Method           | Download Time    | Setup Overhead         |
| ---------------- | ---------------- | ---------------------- |
| Individual files | 2-3 hours        | None                   |
| Packaged         | 10-15 minutes    | 2-3 minutes extraction |
| **Speedup**      | **\~10x faster** | One-time cost          |

| Method           | Download Time    | Setup Overhead         |
| ---------------- | ---------------- | ---------------------- |
| Individual files | 4-6 hours        | None                   |
| Packaged         | 10-15 minutes    | 5-8 minutes extraction |
| **Speedup**      | **\~20x faster** | One-time cost          |

| Method           | Download Time | Setup Overhead         |
| ---------------- | ------------- | ---------------------- |
| Individual files | 30-40 minutes | None                   |
| Packaged         | 25-35 minutes | 2-3 minutes extraction |
| **Speedup**      | **Minimal**   | Not worth it           |
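
The speedup figures above consider download time only. The sketch below also counts the one-time extraction on the first run; taking the midpoint of each range is an assumption made for illustration, and the scenario labels simply refer to the order of the tables.

```python
def midpoint(low, high):
    """Midpoint of a (low, high) range in minutes."""
    return (low + high) / 2


def effective_speedup(individual_min, packaged_min, extraction_min):
    """Speedup of packaged download plus extraction over individual files."""
    packaged_total = midpoint(*packaged_min) + midpoint(*extraction_min)
    return midpoint(*individual_min) / packaged_total


# (individual files, packaged download, extraction), all in minutes
scenarios = {
    "first table": ((120, 180), (10, 15), (2, 3)),
    "second table": ((240, 360), (10, 15), (5, 8)),
    "third table": ((30, 40), (25, 35), (2, 3)),
}

for name, (individual, packaged, extraction) in scenarios.items():
    print(f"{name}: {effective_speedup(individual, packaged, extraction):.1f}x")
```

With midpoints, the first run comes out at roughly 10x, 16x, and 1.1x respectively. Later runs that reuse the cached files skip the extraction cost.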

***

### Current Limitations

#### Programmatic Creation Only

Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).

**Not yet supported:**

* Datasets created via Web UI
* Datasets created via API

**Workaround:** Create datasets programmatically in a dedicated data preparation execution.

***

#### Requires Environment Variable

You must set `VH_ENABLE_DATASET_VERSION_PACKAGING=1` when creating the dataset version.

**This requirement will be removed** in a future update when packaging becomes the default behavior.

***

### Best Practices

#### Start with Small Test Dataset

Before packaging millions of files, test with a smaller subset:

```python
# Test with 1,000 files first
metadata = {}
for i in range(1000):  # Not 100,000
    filename = f"test_image_{i:05d}.jpg"
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://test-packaging/small-test",
                "packaging": True,
            },
        ],
    }
```

Verify packaging works, then scale to full dataset.

***

#### Name Dataset Versions Clearly

Indicate when datasets are packaged:

```python
# Good: Clear versioning
clear_uris = [
    "dataset://images/train-v2-packaged",
    "dataset://sensors/2024-q1-optimized",
]

# Avoid: Unclear what changed
unclear_uris = [
    "dataset://images/new",
    "dataset://sensors/updated",
]
```

***

#### Monitor First Execution

The first execution using a packaged dataset will:

1. Download package
2. Extract all files (one-time cost)
3. Cache for future use

Subsequent executions skip steps 1-2 and use cached files.

**Watch logs** to verify extraction completes successfully.

***

#### Combine with Dataset Versioning

Create new packaged versions as data evolves:

```python
# Initial dataset
metadata = {
    "file.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v1",
                "packaging": True,
            },
        ],
    },
}

# Updated dataset (new files added)
metadata = {
    "new_file.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v2",
                "from": "dataset://my-data/v1",
                "packaging": True,
            },
        ],
    },
}
```

Each version is packaged separately with its own optimized package.

***

### Next Steps

* Test with a small dataset (1,000-10,000 files)
* Measure performance improvement for your use case
* Scale to production datasets

