# Package Datasets for Faster Downloads

Automatically package your dataset versions into optimized files for dramatically faster downloads. Designed for datasets with thousands or millions of small files.

***

### The Problem: Small Files Are Slow

Downloading many small files is slow, even with fast networks and cloud storage.

:x: 500k images (10KB each) -> Download takes \~12h\
Overhead per file: API calls, authentication, metadata checks ...

:white\_check\_mark: The same images, packaged -> single-file download takes \~10-15 minutes\
Single API call, continuous download stream

**Why this matters:**

* Each file requires separate API calls and authentication
* Network latency adds up across thousands of files
* Cloud storage rate limits can slow batch downloads
* Training jobs wait hours for data before starting

***

### The Solution: Automatic Dataset Packaging

Dataset version packaging automatically bundles all files in a dataset version into optimized package files. When you use a packaged dataset as input, Valohai downloads the package instead of individual files, then extracts them transparently.

**Benefits:**

* **10-100x faster downloads** for datasets with many small files
* **Automatic** — Happens during dataset creation
* **Transparent** — Your code doesn't change
* **Cached** — Packages are reused across executions
* **No compression overhead** — Fast extraction

***

### When to Use Dataset Packaging

#### Ideal Use Cases

**Image classification datasets:**

:x: 1 million images x 50KB each = slow individual downloads\
:white\_check\_mark: Packaged into \~50GB file = fast download of a single file

**Time-series data:**

:x: 500k CSV files x 5KB each = hours of downloading\
:white\_check\_mark: Packaged into \~2.5GB file = minutes of downloading

**Benefits appear when you have:**

* 10,000+ files in a dataset version
* File sizes under 1MB each
* Frequent reuse of the same dataset
* Long data download times blocking training

***

#### When Not to Use

**Don't package if:**

* You have fewer than 1,000 files (overhead not worth it)
* Files are already large (>10MB each)
* You only use the dataset once
* Files are already packaged (e.g., existing tar/zip archives)

***

### Packaging vs Manual Tar Files

You might wonder: "Why not just tar files myself?"

| Approach                                                                                                           | When to Use                                                                        |
| ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |
| **Manual tar** (covered in [Create and Manage Datasets](https://docs.valohai.com/data/datasets/creating-datasets)) | One-time packaging, full control over structure, works today for all dataset types |
| **Automatic packaging** (this feature)                                                                             | Repeated dataset updates, programmatic workflows, want Valohai to handle it        |

**Key difference:**

* Manual tar: You create `images.tar`, add it to the dataset, and extract it in your code (see the sketch below)
* Automatic packaging: You create dataset normally, Valohai packages automatically, extracts transparently
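
A minimal sketch of the manual-tar approach on the consuming side; the input name `baseline-images` and the extraction path are illustrative assumptions, not fixed names:

```python
import tarfile

# Hypothetical input named "baseline-images" pointing at the manually created baseline_images.tar
tar_path = "/valohai/inputs/baseline-images/baseline_images.tar"

# Extract once before your code reads the individual files
with tarfile.open(tar_path) as tar:
    tar.extractall(path="/tmp/baseline_images")
```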

**Use both:**

```python
# Manual tar for stable baseline data
metadata = {
    "baseline_images.tar": {
        "valohai.dataset-versions": ["dataset://images/baseline-v1"],
    },
}

# Automatic packaging for frequently updated data
metadata = {
    "new_image_001.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://images/daily-v2",
                "packaging": True,  # Valohai packages automatically
            },
        ],
    },
}
```

***

### Enable Dataset Packaging

To enable **Dataset Packaging**, include the following environment variable in your execution:\
`VH_ENABLE_DATASET_VERSION_PACKAGING` with a truthy value (`1`, `yes`, or `true`).

> 💡 Take a look at [other environment variables](https://docs.valohai.com/executions/system-environment-variables) that control the worker machine agent's behavior, and how to make them [included by default](https://docs.valohai.com/user-and-organization-management/getting-started/environment-variables) in every execution.

***

### How to Package a Dataset Version

#### Step 1: Enable in Execution Configuration

When creating your execution (via UI, API, or CLI), set the environment variable:

```shell
VH_ENABLE_DATASET_VERSION_PACKAGING=1
```

**In valohai.yaml:**

```yaml
- step:
    name: create-packaged-dataset
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"
```

**In UI:**

* When creating execution, go to "Environment Variables"
* Add: `VH_ENABLE_DATASET_VERSION_PACKAGING` = `1`

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2Fgit-blob-c357c805aeaf2da45ac7318b0984fd51a44cb484%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

***

#### Step 2: Create Dataset with Packaging Flag

Add `"packaging": True` to your dataset version metadata:

```python
import json
import os

# Process and save your files
output_dir = "/valohai/outputs/"
for i in range(100000):
    # Your data processing (process_image and save_image are placeholders for your own code)
    data = process_image(i)
    filename = f"image_{i:06d}.jpg"
    save_image(data, os.path.join(output_dir, filename))

# Create metadata with packaging enabled
metadata = {}
for i in range(100000):
    filename = f"image_{i:06d}.jpg"
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-images/train-v1",
                "packaging": True,  # Enable automatic packaging
            },
        ],
    }

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")
```
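
For reference, each line the loop above writes to `valohai.metadata.jsonl` is one JSON object per file, like this:

```
{"file": "image_000000.jpg", "metadata": {"valohai.dataset-versions": [{"uri": "dataset://my-images/train-v1", "packaging": true}]}}
```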

**What happens:**

1. Execution saves 100,000 image files
2. Valohai packages all images into `.vhpkgzip` file(s)
3. Package uploaded alongside individual files
4. Dataset version `train-v1` is created with package reference

***

#### Step 3: Use the Packaged Dataset

Use the dataset normally in your training pipeline:

```yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/train-v1
```

**No code changes needed** — Your training script reads files from `/valohai/inputs/training-images/` exactly as before:

```python
import os

# Works the same whether packaged or not
input_dir = "/valohai/inputs/training-images/"
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
```

***

### Complete Example: Time-Series Dataset

Process daily sensor data and package for fast access:

```python
import json
import pandas as pd
import os

# Step 1: Process data files
output_dir = "/valohai/outputs/"
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")

metadata = {}

for date in dates:
    # Process sensor data for this date (load_sensor_data is a placeholder for your own loader)
    sensor_data = load_sensor_data(date)

    # Save as CSV
    filename = f"sensor_{date.strftime('%Y%m%d')}.csv"
    filepath = os.path.join(output_dir, filename)
    sensor_data.to_csv(filepath, index=False)

    # Add to metadata with packaging enabled
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://sensor-data/year-2024",
                "packaging": True,
            },
        ],
    }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")

print(f"Created dataset with {len(metadata)} files")
print("Packaging will happen automatically")
```

**valohai.yaml configuration:**

```yaml
- step:
    name: prepare-sensor-data
    image: python:3.9
    command: python prepare_data.py
    environment-variables:
      - name: VH_ENABLE_DATASET_VERSION_PACKAGING
        default: "1"

- step:
    name: train-forecasting-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: sensor-data
        default: dataset://sensor-data/year-2024
```
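
On the consuming side, the training step can read every daily CSV straight from the input directory. A minimal sketch, assuming the input name `sensor-data` from the step above:

```python
import glob
import os

import pandas as pd

# Read every daily CSV from the input directory; this works the same
# whether the dataset version was packaged or downloaded file by file.
input_dir = "/valohai/inputs/sensor-data/"
csv_paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
sensor_df = pd.concat((pd.read_csv(path) for path in csv_paths), ignore_index=True)
print(f"Loaded {len(sensor_df)} rows from {len(csv_paths)} daily files")
```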

***

### Complete Example: Image Classification

Prepare image classification dataset with folder structure:

```python
import json
import os
from PIL import Image

# Step 1: Process and save images by class
output_dir = "/valohai/outputs/"
classes = ["cats", "dogs", "birds"]

metadata = {}

for class_name in classes:
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)

    # Process images for this class
    for i in range(10000):
        # Your image processing (process_image is a placeholder for your own code)
        img = process_image(class_name, i)

        # Save with class folder structure
        filename = f"{class_name}/{class_name}_{i:05d}.jpg"
        filepath = os.path.join(output_dir, filename)
        img.save(filepath)

        # Add to metadata with packaging
        metadata[filename] = {
            "valohai.dataset-versions": [
                {
                    "uri": "dataset://imagenet-subset/train-v1",
                    "packaging": True,
                },
            ],
        }

# Step 2: Save metadata
metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")

print(f"Packaged {len(metadata)} images across {len(classes)} classes")
```

**Directory structure is preserved:**

```
/valohai/inputs/training-images/
├── cats/
│   ├── cats_00000.jpg
│   ├── cats_00001.jpg
│   └── ...
├── dogs/
│   ├── dogs_00000.jpg
│   └── ...
└── birds/
    └── ...
```
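
Because the folder structure survives packaging, training code can recover the class label from the directory name. A minimal sketch, assuming the input is named `training-images` in your step definition:

```python
import os

input_dir = "/valohai/inputs/training-images/"

# Walk the extracted input; the class folders (cats/, dogs/, birds/) are intact.
for root, _dirs, files in os.walk(input_dir):
    for filename in files:
        filepath = os.path.join(root, filename)
        label = os.path.basename(root)  # the folder name doubles as the class label
        print(label, filepath)          # replace with your own processing
```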

***

### How It Works Behind the Scenes

Understanding what happens helps with debugging and optimization.

#### During Dataset Creation

1. **Execution runs** with `VH_ENABLE_DATASET_VERSION_PACKAGING=1`
2. **Files saved** to `/valohai/outputs/`
3. **Metadata processed** — Valohai detects `"packaging": True`
4. **Automatic packaging:**
   * All files for the dataset version are bundled into `.vhpkgzip` file(s)
   * Package format is an uncompressed zip (fast extraction)
   * Large datasets may be split into multiple packages
5. **Upload:**
   * Individual files uploaded to storage
   * Package file(s) uploaded alongside them
6. **Dataset version created** with package reference

***

#### During Execution Using Packaged Dataset

1. **Execution starts** needing `dataset://my-images/train-v1`
2. **Valohai checks** if packages exist for this dataset version
3. **Download:**
   * Package file(s) downloaded instead of individual files
   * Dramatically fewer API calls (1-10 vs thousands/millions)
4. **Automatic extraction:**
   * Package extracted to cache directory
   * Files appear in `/valohai/inputs/` as normal
5. **Your code runs** — No changes needed, files work identically
6. **Caching:**
   * Extracted files cached for future executions
   * Same package reused across multiple runs

***

### Verify Packaging Worked

#### Check Execution Logs

Look for packaging messages in the execution that created the dataset:

```
(dataset version package 1/1) uploading <dataset-name>.<dataset-version>.vhpkgzip (15 GB)
(dataset version package 1/1) uploading to <destination-store> bucket <destination-bucket>
(dataset version package 1/1) upload complete (datum <datum-id>)
```

***

#### Expert Mode UI

View packages in the Valohai UI:

1. Open the dataset version in **Data → Datasets**
2. Press **Ctrl + Shift + X** (or **Cmd + Shift + X** on Mac) to enable Expert Mode
3. View the **Packages** section showing `.vhpkgzip` files

***

### Expected Performance

Illustrative comparisons for different dataset profiles (actual times depend on file count, file sizes, and network):

**Many small files:**

| Method           | Download Time    | Setup Overhead         |
| ---------------- | ---------------- | ---------------------- |
| Individual files | 2-3 hours        | None                   |
| Packaged         | 10-15 minutes    | 2-3 minutes extraction |
| **Speedup**      | **\~10x faster** | One-time cost          |

**Very many small files:**

| Method           | Download Time    | Setup Overhead         |
| ---------------- | ---------------- | ---------------------- |
| Individual files | 4-6 hours        | None                   |
| Packaged         | 10-15 minutes    | 5-8 minutes extraction |
| **Speedup**      | **\~20x faster** | One-time cost          |

**Fewer, larger files:**

| Method           | Download Time | Setup Overhead         |
| ---------------- | ------------- | ---------------------- |
| Individual files | 30-40 minutes | None                   |
| Packaged         | 25-35 minutes | 2-3 minutes extraction |
| **Speedup**      | **Minimal**   | Not worth it           |

***

### Current Limitations

#### Programmatic Creation Only

Dataset packaging currently works only for dataset versions created programmatically (via execution metadata).

**Not yet supported:**

* Datasets created via Web UI
* Datasets created via API

**Workaround:** Create datasets programmatically in a dedicated data preparation execution.

***

#### Requires Environment Variable

You must set `VH_ENABLE_DATASET_VERSION_PACKAGING=1` when creating the dataset version.

**This requirement will be removed** in a future update when packaging becomes the default behavior.

***

### Best Practices

#### Start with Small Test Dataset

Before packaging millions of files, test with a smaller subset:

```python
import json
import os

# Test with 1,000 files first
output_dir = "/valohai/outputs/"
metadata = {}
for i in range(1000):  # Not 100,000
    filename = f"test_image_{i:05d}.jpg"
    # ... save the test file to output_dir, as in Step 2 ...
    metadata[filename] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://test-packaging/small-test",
                "packaging": True,
            },
        ],
    }

# Save the sidecar metadata, as in Step 2
with open(os.path.join(output_dir, "valohai.metadata.jsonl"), "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")
```

Verify packaging works, then scale to full dataset.

***

#### Name Dataset Versions Clearly

Indicate when datasets are packaged:

```python
# Good: Clear versioning
"uri": "dataset://images/train-v2-packaged"
"uri": "dataset://sensors/2024-q1-optimized"

# Avoid: Unclear what changed
"uri": "dataset://images/new"
"uri": "dataset://sensors/updated"
```

***

#### Monitor First Execution

The first execution using a packaged dataset will:

1. Download package
2. Extract all files (one-time cost)
3. Cache for future use

Subsequent executions skip steps 1-2 and use cached files.

**Watch logs** to verify extraction completes successfully.

***

#### Combine with Dataset Versioning

Create new packaged versions as data evolves:

```python
# Initial dataset
metadata = {
    "file.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v1",
                "packaging": True,
            },
        ],
    },
}

# Updated dataset (new files added)
metadata = {
    "new_file.jpg": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v2",
                "from": "dataset://my-data/v1",
                "packaging": True,
            },
        ],
    },
}
```

Each version is packaged separately with its own optimized package.

***

### Next Steps

* Test with a small dataset (1,000-10,000 files)
* Measure performance improvement for your use case
* Scale to production datasets
