# Create and Manage Datasets

Datasets are versioned collections of files that simplify working with multiple related files. Use datasets for training/validation splits, image classification folders, or any workflow requiring coordinated file groups.

***

### The Problem with Individual Files

> 💡 **Quick recap:** In Valohai, individual files are called **datums**. Each datum has a unique `datum://` link you can use as an input. See [Load Data in Jobs](https://docs.valohai.com/data/data-versioning/load-files-in-jobs) for details.

Datums work well for single files, but managing collections with them quickly becomes unwieldy:

* Updating 50 image files? You'd need to update 50 datum links
* Maintaining train/validation/test splits? Hard to keep them synchronized
* Versioning related files together? No built-in way to track the group

**Example problem:**

```yaml
# Managing individual files becomes tedious
inputs:
  - name: train-images
    default:
      - datum://abc123...
      - datum://def456...
      - datum://ghi789...
      # ... 47 more files
```

***

### Datasets Solve This

Datasets group related files into versioned collections.

**Same workflow, cleaner:**

```yaml
inputs:
  - name: train-images
    default: dataset://my-images/train-v2
```

**Key benefits:**

* **Group related files** — One reference points to entire collection
* **Version together** — Update all files as a unit
* **Track changes** — See what changed between versions
* **Immutable versions** — Each version is locked once created
* **Flexible access** — Use `latest`, specific versions, or aliases

***

### Datasets vs Datums

| Feature        | Datum                             | Dataset                                        |
| -------------- | --------------------------------- | ---------------------------------------------- |
| **Reference**  | Single file                       | Collection of files                            |
| **URI format** | `datum://file-id`                 | `dataset://name/version`                       |
| **Use when**   | One model file, one CSV           | Image folders, data splits, multi-file outputs |
| **Versioning** | Each file versioned independently | Files versioned together as a group            |
| **Updates**    | Create new datum                  | Create new dataset version                     |

***

### When to Use Datasets

#### Training/Validation/Test Splits

Keep data splits synchronized:

```yaml
inputs:
  - name: train-data
    default: dataset://customer-churn/train-v3
  - name: validation-data
    default: dataset://customer-churn/validation-v3
  - name: test-data
    default: dataset://customer-churn/test-v3
```

When you update the data, create new versions for all three splits (e.g., `train-v4`, `validation-v4`, `test-v4`) so they stay aligned.

***

#### Image Classification

Organize images by class:

```
dataset://imagenet/train-v1 contains:
├── cats/
│   ├── cat001.jpg
│   ├── cat002.jpg
│   └── ...
├── dogs/
│   ├── dog001.jpg
│   ├── dog002.jpg
│   └── ...
└── birds/
    ├── bird001.jpg
    └── ...
```

**Learn more about directory structure:** See [Directory Structure in Datasets](#directory-structure-in-datasets) below.

***

#### Multi-File Model Artifacts

Package related model files together:

```
dataset://bert-model/production contains:
├── model.bin
├── config.json
├── vocab.txt
└── tokenizer_config.json
```
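
When this dataset is wired to an input, all four files land in the same input directory. A minimal loading sketch, assuming an input named `bert-model` (`load_model` is a hypothetical stand-in for your framework's loading call):

```python
import os

input_dir = "/valohai/inputs/bert-model/"

# Build paths to each artifact in the version
model_path = os.path.join(input_dir, "model.bin")
config_path = os.path.join(input_dir, "config.json")
vocab_path = os.path.join(input_dir, "vocab.txt")

# Hypothetical loading call; replace with your framework's API
model = load_model(model_path, config=config_path, vocab=vocab_path)
```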

***

### Create a Dataset

Datasets have two levels:

1. **Dataset** — The container with a name (e.g., `my-images`)
2. **Dataset Version** — Specific collection of files (e.g., `v1`, `v2`, `latest`)

You need both: a version cannot exist without its container. When you create versions via code, a missing container is created automatically; in the web UI you create each explicitly.

***

### Create via Code (Recommended)

Create datasets programmatically when saving execution outputs.

#### Basic Dataset Creation

```python
import json

# Define which files belong to the dataset version
metadata = {
    "train_image_001.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
    "train_image_002.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
    "train_image_003.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
}

# Save all your output files first
# (`image` is a placeholder for whatever object your code produces)
for i in range(1, 4):
    image.save(f"/valohai/outputs/train_image_{i:03d}.jpg")

# Save dataset metadata in single JSONL file
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**What happens:**

* If dataset `my-images` doesn't exist, it's created automatically
* Version `v1` is created with the three specified files
* Files are available as `dataset://my-images/v1`
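
To consume the new version, point an input at it in a later step (step and input names are illustrative):

```yaml
- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: train-images
        default: dataset://my-images/v1
```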

***

#### Create Training/Validation Split

```python
import json

# Save your split files
train_data.to_csv("/valohai/outputs/train.csv")
val_data.to_csv("/valohai/outputs/validation.csv")
test_data.to_csv("/valohai/outputs/test.csv")

# Assign files to dataset versions
metadata = {
    "train.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/train-v2"],
    },
    "validation.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/validation-v2"],
    },
    "test.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/test-v2"],
    },
}

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

Now use `dataset://customer-data/train-v2` in your training pipeline.
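
For example (input names are illustrative):

```yaml
inputs:
  - name: train-data
    default: dataset://customer-data/train-v2
  - name: validation-data
    default: dataset://customer-data/validation-v2
  - name: test-data
    default: dataset://customer-data/test-v2
```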

***

#### Legacy Approach (Sidecar Files)

The older approach used individual `.metadata.json` files per output:

```python
import json

# Save output file
save_path = "/valohai/outputs/data_file.csv"
data.to_csv(save_path)

# Save metadata in sidecar file
metadata = {
    "valohai.dataset-versions": ["dataset://my-dataset/v1"],
}

metadata_path = "/valohai/outputs/data_file.csv.metadata.json"
with open(metadata_path, "w") as outfile:
    json.dump(metadata, outfile)
```

**This still works**, but the JSONL approach is recommended: with many output files, one metadata file is easier to generate and review than a sidecar per file.

***

### Create via Web UI

#### Step 1: Create the Dataset Container

1. Open your project
2. Navigate to **Data → Datasets** tab
3. Click **"Create dataset"**
4. Enter a **Name** (e.g., `my-images`)
5. Select **Owner**:
   * **Your account** — Private to you
   * **Your organization** — Shared with team
6. Click **"Create"**

***

#### Step 2: Create a Dataset Version

1. Click on your dataset name
2. Click **"Create new version"**
3. Select files to include:
   * Search by filename, tags, or data store
   * Click **"Add"** or **"Add Selected"** for multiple files
4. Add or remove files until satisfied
5. Enter a **version name** (e.g., `v1`, `train-split-2024-q1`)
6. Click **"Save new version"**

**Important:** Once saved, dataset versions are immutable. You cannot edit them—only create new versions.

***

### Use Datasets as Inputs

Reference datasets in your pipeline using `dataset://` URIs.

#### In valohai.yaml

```yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/v2
      - name: validation-images
        default: dataset://my-images/validation-v2
```

#### URI Formats

```yaml
# Specific version
default: dataset://my-images/v2
```

```yaml
# Latest version (always points to newest)
default: dataset://my-images/latest
```

```yaml
# Custom alias (see Dataset Aliases section below)
default: dataset://my-images/production
```

***

#### In Code

All files from the dataset are downloaded to the input directory:

```python
import os

# List all files in the dataset
input_dir = "/valohai/inputs/training-images/"
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    print(f"Processing {filename}")
    # Your processing logic goes here, using `filepath`
```

**Learn more:** [Load Data in Jobs](https://docs.valohai.com/data/data-versioning/load-files-in-jobs)

***

### Dataset Versioning

Dataset versions are **immutable** once created. This ensures reproducibility—an execution using `dataset://my-images/v2` will always get the exact same files.

#### Version Naming

Choose clear, descriptive version names:

```python
# Good: Descriptive and sortable
"v1", "v2", "v3"
"train-2024-01-15"
"baseline-split"
"production-2024-q1"

# Avoid: Ambiguous or hard to track
"latest"  # Valohai reserved keyword
"final"
"new"
"temp"
```
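
If you generate version names in code, dated names are easy to produce. A minimal sketch, assuming `train.csv` was already saved to `/valohai/outputs/` (the dataset name `customer-data` is illustrative):

```python
import datetime
import json

# Sortable, dated version name, e.g. "train-2024-01-15"
version = f"train-{datetime.date.today().isoformat()}"

metadata = {
    "train.csv": {
        "valohai.dataset-versions": [f"dataset://customer-data/{version}"],
    },
}

with open("/valohai/outputs/valohai.metadata.jsonl", "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```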

***

#### Version History

Track all versions in the Valohai UI:

1. Navigate to **Data → Datasets**
2. Click on your dataset
3. View the **Versions** table showing:
   * Version name
   * Creation date
   * Number of files
   * Creator

***

#### Update Existing Versions

You **cannot edit** a dataset version after creation. To modify:

1. Create a new version based on the old one
2. Add or remove files
3. Save with a new version name

**For complex updates** (excluding specific files, starting from existing versions), see [Update Dataset Versions](https://docs.valohai.com/data/datasets/update-datasets).
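
In code, the `from` field (also used in the alias example below) starts a new version from an existing one, so you only list the files you add. A minimal sketch, assuming `dataset://my-images/v1` exists and `extra_images.tar` is a new output:

```python
import json

metadata = {
    "extra_images.tar": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-images/v2",
                "from": "dataset://my-images/v1",  # inherit all files from v1
            },
        ],
    },
}

with open("/valohai/outputs/valohai.metadata.jsonl", "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```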

***

### Dataset Aliases

Aliases let you reference dataset versions with human-readable names instead of hardcoding version names in your code.

#### The `latest` Alias

Every dataset automatically has a `latest` alias pointing to the newest version:

```yaml
inputs:
  - name: training-data
    default: dataset://my-dataset/latest  # Always uses newest version
```

**No setup required** — `latest` updates automatically when you create new versions.

***

#### Custom Aliases

Create your own aliases for environment management or workflow stages.

**Example use cases:**

```yaml
- dataset://my-images/production      # Current production dataset
- dataset://my-images/staging         # Being tested
- dataset://my-images/baseline        # Original benchmark dataset
- dataset://my-images/experiment-42   # Specific experiment version
```

***

#### Create Alias via Web UI

1. Open your dataset
2. Navigate to the **Aliases** tab
3. Click **"Create new dataset version alias"**
4. Enter alias name (e.g., `production`)
5. Select the dataset version to point to
6. Save

**Update an alias:**

1. Find the alias in the Aliases tab
2. Click **"Edit"**
3. Select a different version
4. Save

The UI keeps a history for each alias, so you can see when it was changed and which version it pointed to before.

***

#### Create Alias via Code

Set aliases when creating dataset versions:

```python
import json

metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-models/v5",
                "from": "dataset://my-models/v4",
                "targeting_aliases": ["production", "stable"],  # Creates/updates these aliases
            },
        ],
    },
}

save_path = "/valohai/outputs/model.pkl"
model.save(save_path)  # `model` is a placeholder for your trained model object

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**What happens:**

* Creates version `v5` based on `v4`
* Updates `production` alias to point to `v5`
* Updates `stable` alias to point to `v5`
* If aliases don't exist, they're created

***

#### Use Aliases in Pipelines

```yaml
- step:
    name: train-production-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: training-data
        default: dataset://customer-data/production
      - name: validation-data
        default: dataset://customer-data/staging
```

When you promote a dataset version to production, just update the alias—no code changes needed.

***

#### Alias Best Practices

**Environment-based:**

```python
"targeting_aliases": ["dev", "staging", "production"]
```

**Workflow stages:**

```python
"targeting_aliases": ["preprocessing-done", "validated", "ready-for-training"]
```

**Experiment tracking:**

```python
"targeting_aliases": ["baseline", "experiment-current", "best-so-far"]
```
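
In context, each of these lists goes in the `targeting_aliases` field of a dataset-version entry, as in the alias example above (file and dataset names are illustrative):

```python
metadata = {
    "train.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://customer-data/v7",
                "targeting_aliases": ["staging", "experiment-current"],
            },
        ],
    },
}
# Then write valohai.metadata.jsonl as shown earlier
```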

***

### Directory Structure in Datasets

How files are organized in `/valohai/inputs/` depends on how they were saved originally.

#### Flat Structure

If source files were saved without directories:

```
/valohai/inputs/my-dataset/
├── image001.jpg
├── image002.jpg
└── image003.jpg
```

**Access in code:**

```python
import os

input_dir = "/valohai/inputs/my-dataset/"
files = os.listdir(input_dir)
for filename in files:
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
```

***

#### Nested Structure

If source files used `keep-directories` when loading:

```
/valohai/inputs/my-dataset/
├── cats/
│   ├── cat001.jpg
│   └── cat002.jpg
├── dogs/
│   ├── dog001.jpg
│   └── dog002.jpg
└── birds/
    └── bird001.jpg
```

**Access in code:**

```python
import os

input_dir = "/valohai/inputs/my-dataset/"

# Process by subdirectory (class label)
for class_name in os.listdir(input_dir):
    class_dir = os.path.join(input_dir, class_name)
    if os.path.isdir(class_dir):
        print(f"Processing class: {class_name}")
        for filename in os.listdir(class_dir):
            filepath = os.path.join(class_dir, filename)
            process_image(filepath, label=class_name)
```

**The structure depends on:**

* How files were originally uploaded or generated
* The `keep-directories` setting on the input when files are loaded
* See [Ingest](https://docs.valohai.com/data/data-versioning/load-files-in-jobs) & [Save Files](https://docs.valohai.com/data/data-versioning/save-files-from-jobs) for details on preserving directory structure

***

### Performance: Package Files Together

> ⚠️ **Important for large datasets:** Downloading millions of individual small files is slow, even with fast networks.

#### The Problem

```
Slow: 2 million individual 10KB files
⏱️  Download time: Hours due to overhead per file
```

#### The Solution

Package related files together before creating datasets:

```python
import json
import tarfile

# Package images into an uncompressed tar ("w" mode): compression adds
# CPU overhead for little gain on already-compressed image formats.
# Stage the images outside /valohai/outputs (here /tmp/images/), because
# everything under /valohai/outputs is uploaded as an individual file,
# which would defeat the packaging.
with tarfile.open("/valohai/outputs/images.tar", "w") as tar:
    tar.add("/tmp/images/", arcname="images")

# Add the packaged file to a dataset version
metadata = {
    "images.tar": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
}

# Write the metadata file as shown earlier
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**Benefits:**

* Fast: Single file download
* Atomic: All-or-nothing download
* No compression overhead (tar without gzip)
* Preserves directory structure

**In your training code:**

```python
import os
import tarfile

# Extract once at the start of the execution; the archive contains a
# top-level "images/" directory (the arcname used when packaging)
with tarfile.open("/valohai/inputs/images/images.tar", "r") as tar:
    tar.extractall("/tmp/")

# Now process the extracted files
extracted_dir = "/tmp/images/"
for filename in os.listdir(extracted_dir):
    process_image(os.path.join(extracted_dir, filename))
```

> 💡 **When to package:** If your dataset has >10,000 small files, strongly consider packaging them. The one-time extraction cost is much faster than downloading thousands of individual files.

***

### Common Issues & Fixes

#### Dataset Version Not Created

**Symptom:** Execution completes successfully but dataset version doesn't appear

**How to diagnose:**

1. Open the execution in Valohai UI
2. Click the **Alerts** tab (top of execution page)
3. Look for dataset creation errors or warnings

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2Fgit-blob-3dca24172f0c33faaf4643b87b165fd625e2525a%2Fimage.png?alt=media" alt=""><figcaption></figcaption></figure>

**Common causes:**

* Invalid version name → Use alphanumeric, hyphens, underscores only
* Metadata file not saved → Verify `valohai.metadata.jsonl` exists in outputs
* JSON syntax error → Validate JSON format
* Wrong metadata structure → Check `{"file": "...", "metadata": {...}}` format

***

#### Wrong Files in Dataset

**Symptom:** Dataset version contains unexpected files or missing files

**Causes & Fixes:**

* Typo in filename → Filenames in metadata must match output files exactly
* Files not saved before metadata → Save all output files before writing metadata
* Wrong dataset URI in metadata → Double-check dataset name and version

**Debug:**

```python
import os
import json

# List what was actually saved
print("Output files:", os.listdir("/valohai/outputs/"))

# Verify metadata content
with open("/valohai/outputs/valohai.metadata.jsonl", "r") as f:
    for line in f:
        print("Metadata entry:", json.loads(line))
```

***

#### Can't Use Dataset in Execution

**Symptom:** Input shows `dataset://...` but execution fails with "not found"

**Causes & Fixes:**

* Typo in dataset URI → Check dataset name and version spelling
* Version doesn't exist → Verify version was created in Data → Datasets tab
* Wrong project → Dataset must be in same project as execution
* Permission issue → Check dataset ownership (private vs organization)
