# Create and Manage Datasets

Datasets are versioned collections of files that simplify working with multiple related files. Use datasets for training/validation splits, image classification folders, or any workflow requiring coordinated file groups.

***

### The Problem with Individual Files

> 💡 **Quick recap:** In Valohai, individual files are called **datums**. Each datum has a unique `datum://` link you can use as an input. See [Load Data in Jobs](/data/data-versioning/load-files-in-jobs.md) for details.

Datums work well for single files, but become complex when managing collections:

* Updating 50 image files? You'd need to update 50 datum links
* Maintaining train/validation/test splits? Hard to keep them synchronized
* Versioning related files together? No built-in way to track the group

**Example problem:**

```yaml
# Managing individual files becomes tedious
inputs:
  - name: train-images
    default:
      - datum://abc123...
      - datum://def456...
      - datum://ghi789...
      # ... 47 more files
```

***

### Datasets Solve This

Datasets group related files into versioned collections.

**Same workflow, cleaner:**

```yaml
inputs:
  - name: train-images
    default: dataset://my-images/train-v2
```

**Key benefits:**

* **Group related files** — One reference points to entire collection
* **Version together** — Update all files as a unit
* **Track changes** — See what changed between versions
* **Immutable versions** — Each version is locked once created
* **Flexible access** — Use `latest`, specific versions, or aliases

***

### Datasets vs Datums

| Feature        | Datum                             | Dataset                                        |
| -------------- | --------------------------------- | ---------------------------------------------- |
| **Reference**  | Single file                       | Collection of files                            |
| **URI format** | `datum://file-id`                 | `dataset://name/version`                       |
| **Use when**   | One model file, one CSV           | Image folders, data splits, multi-file outputs |
| **Versioning** | Each file versioned independently | Files versioned together as a group            |
| **Updates**    | Create new datum                  | Create new dataset version                     |

***

### When to Use Datasets

#### Training/Validation/Test Splits

Keep data splits synchronized:

```yaml
inputs:
  - name: train-data
    default: dataset://customer-churn/train-v3
  - name: validation-data
    default: dataset://customer-churn/validation-v3
  - name: test-data
    default: dataset://customer-churn/test-v3
```

When you update the data, create a new version of every split (`train-v4`, `validation-v4`, `test-v4`) so all splits stay aligned.
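
One way to keep the version tag identical across splits is to build all three URIs from a single value. This is a minimal sketch: `split_uris` is a hypothetical helper, not part of any Valohai library, and the `<split>-<tag>` naming simply mirrors the examples above.

```python
def split_uris(dataset: str, tag: str, splits=("train", "validation", "test")) -> dict:
    """Return one dataset URI per split, all sharing the same version tag."""
    return {split: f"dataset://{dataset}/{split}-{tag}" for split in splits}


uris = split_uris("customer-churn", "v4")
for split, uri in uris.items():
    print(f"{split}: {uri}")
```

Because the tag is passed once, a typo cannot leave one split pointing at an older version.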

***

#### Image Classification

Organize images by class:

```
dataset://imagenet/train-v1 contains:
├── cats/
│   ├── cat001.jpg
│   ├── cat002.jpg
│   └── ...
├── dogs/
│   ├── dog001.jpg
│   ├── dog002.jpg
│   └── ...
└── birds/
    ├── bird001.jpg
    └── ...
```

**Learn more about directory structure:** See [Directory Structure in Datasets](#directory-structure-in-datasets) below.

***

#### Multi-File Model Artifacts

Package related model files together:

```
dataset://bert-model/production contains:
├── model.bin
├── config.json
├── vocab.txt
└── tokenizer_config.json
```

***

### Create a Dataset

Datasets have two levels:

1. **Dataset**: The container with a name (e.g., `my-images`)
2. **Dataset Version**: A specific collection of files (e.g., `v1`, `v2`)

You must create both.
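
The two levels map directly onto the two parts of a `dataset://` URI. The sketch below splits a URI into those parts; `DatasetRef` and `parse_dataset_uri` are hypothetical names used only for illustration, and the function does not contact Valohai or check that the dataset exists.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    name: str
    version: str  # a version name or an alias such as "latest"


def parse_dataset_uri(uri: str) -> DatasetRef:
    """Split 'dataset://<name>/<version>' into its two parts."""
    prefix = "dataset://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dataset URI: {uri!r}")
    name, sep, version = uri[len(prefix):].partition("/")
    if not name or not sep or not version:
        raise ValueError(f"expected dataset://<name>/<version>, got {uri!r}")
    return DatasetRef(name, version)


ref = parse_dataset_uri("dataset://my-images/v2")
print(ref.name, ref.version)
```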

***

### Create via Code (Recommended)

Create datasets programmatically when saving execution outputs.

#### Basic Dataset Creation

```python
import json

# Define which files belong to the dataset version
metadata = {
    "train_image_001.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
    "train_image_002.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
    "train_image_003.jpg": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
}

# Save all your output files first
for i in range(1, 4):
    # Replace with your own code; `image` stands for an image object you created
    image.save(f"/valohai/outputs/train_image_{i:03d}.jpg")

# Save dataset metadata in a single JSONL file
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**What happens:**

* If dataset `my-images` doesn't exist, it's created automatically
* Version `v1` is created with the three specified files
* Files are available as `dataset://my-images/v1`
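
When every file goes into the same dataset version, the metadata can be generated from a list of file names instead of being written out by hand. A minimal sketch, assuming the JSONL format shown above; `write_dataset_metadata` is a hypothetical helper, not a Valohai API.

```python
import json
import os


def write_dataset_metadata(output_dir: str, file_names: list, dataset_uri: str) -> str:
    """Write one JSONL line per file, assigning each file to dataset_uri."""
    metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
    with open(metadata_path, "w") as outfile:
        for file_name in file_names:
            entry = {
                "file": file_name,
                "metadata": {"valohai.dataset-versions": [dataset_uri]},
            }
            outfile.write(json.dumps(entry) + "\n")
    return metadata_path


if os.path.isdir("/valohai/outputs"):  # only present inside an execution
    names = [f"train_image_{i:03d}.jpg" for i in range(1, 4)]
    write_dataset_metadata("/valohai/outputs", names, "dataset://my-images/v1")
```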

***

#### Create Training/Validation Split

```python
import json

# Save your split files
train_data.to_csv("/valohai/outputs/train.csv")
val_data.to_csv("/valohai/outputs/validation.csv")
test_data.to_csv("/valohai/outputs/test.csv")

# Assign files to dataset versions
metadata = {
    "train.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/train-v2"],
    },
    "validation.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/validation-v2"],
    },
    "test.csv": {
        "valohai.dataset-versions": ["dataset://customer-data/test-v2"],
    },
}

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

Now use `dataset://customer-data/train-v2` in your training pipeline.

***

#### Legacy Approach (Sidecar Files)

The older approach used individual `.metadata.json` files per output:

```python
import json

# Save output file
save_path = "/valohai/outputs/data_file.csv"
data.to_csv(save_path)

# Save metadata in sidecar file
metadata = {
    "valohai.dataset-versions": ["dataset://my-dataset/v1"],
}

metadata_path = "/valohai/outputs/data_file.csv.metadata.json"
with open(metadata_path, "w") as outfile:
    json.dump(metadata, outfile)
```

**This still works**, but the JSONL approach is recommended for better organization when handling multiple files.
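
If existing code already writes sidecar files, they can be merged into the JSONL format as a last step. The following is a rough sketch under stated assumptions: `sidecars_to_jsonl` is a hypothetical helper, it only looks at the top level of the output directory, and it relies on the `<file>.metadata.json` naming shown above.

```python
import json
import os

SIDECAR_SUFFIX = ".metadata.json"


def sidecars_to_jsonl(output_dir: str) -> int:
    """Merge every '<file>.metadata.json' in output_dir into valohai.metadata.jsonl."""
    entries = []
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(SIDECAR_SUFFIX):
            continue
        with open(os.path.join(output_dir, name)) as sidecar:
            file_metadata = json.load(sidecar)
        # 'data_file.csv.metadata.json' describes 'data_file.csv'
        entries.append({"file": name[: -len(SIDECAR_SUFFIX)], "metadata": file_metadata})

    with open(os.path.join(output_dir, "valohai.metadata.jsonl"), "w") as outfile:
        for entry in entries:
            outfile.write(json.dumps(entry) + "\n")
    return len(entries)
```

The sketch leaves the sidecar files in place. How an output described by both formats is handled is not covered on this page, so consider deleting the sidecars after merging.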

***

### Create via Web UI

#### Step 1: Create the Dataset Container

1. Open your project
2. Navigate to the **Data → Datasets** tab
3. Click **"Create dataset"**
4. Enter a **Name** (e.g., `my-images`)
5. Select **Owner**:
   * **Your account** — Private to you
   * **Your organization** — Shared with team
6. Click **"Create"**

***

#### Step 2: Create a Dataset Version

1. Click on your dataset name
2. Click **"Create new version"**
3. Select files to include:
   * Search by filename, tags, or data store
   * Click **"Add"** or **"Add Selected"** for multiple files
4. Add or remove files until satisfied
5. Enter a **version name** (e.g., `v1`, `train-split-2024-q1`)
6. Click **"Save new version"**

**Important:** Once saved, dataset versions are immutable. You cannot edit them—only create new versions.

***

### Use Datasets as Inputs

Reference datasets in your pipeline using `dataset://` URIs.

#### In valohai.yaml

```yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.0.0
    command: python train.py
    inputs:
      - name: training-images
        default: dataset://my-images/v2
      - name: validation-images
        default: dataset://my-images/validation-v2
```

#### URI Formats

```yaml
# Specific version
default: dataset://my-images/v2
```

```yaml
# Latest version (always points to newest)
default: dataset://my-images/latest
```

```yaml
# Custom alias (see Dataset Aliases section below)
default: dataset://my-images/production
```

***

#### In Code

All files from the dataset are downloaded to the input directory:

```python
import os

# List all files in the dataset
input_dir = "/valohai/inputs/training-images/"
for filename in os.listdir(input_dir):
    filepath = os.path.join(input_dir, filename)
    print(f"Processing {filename}")
    # Your processing logic
```

**Learn more:** [Load Data in Jobs](/data/data-versioning/load-files-in-jobs.md)

***

### Dataset Versioning

Dataset versions are **immutable** once created. This ensures reproducibility—an execution using `dataset://my-images/v2` will always get the exact same files.

#### Version Naming

Choose clear, descriptive version names:

```python
# Good: Descriptive and sortable
"v1", "v2", "v3"
"train-2024-01-15"
"baseline-split"
"production-2024-q1"

# Avoid: Ambiguous or hard to track
"latest"  # Valohai reserved keyword
"final"
"new"
"temp"
```
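
Date-stamped names such as `train-2024-01-15` can be generated rather than typed. A small sketch; `dated_version` is a hypothetical helper and the `<prefix>-<date>` pattern is just the convention from the list above.

```python
from datetime import date
from typing import Optional


def dated_version(prefix: str, day: Optional[date] = None) -> str:
    """Build a sortable version name such as 'train-2024-01-15'."""
    day = day or date.today()
    return f"{prefix}-{day.isoformat()}"


print(dated_version("train", date(2024, 1, 15)))  # train-2024-01-15
```

ISO dates sort alphabetically in chronological order, which keeps the version list easy to scan.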

***

#### Version History

Track all versions in the Valohai UI:

1. Navigate to **Data → Datasets**
2. Click on your dataset
3. View the **Versions** table showing:
   * Version name
   * Creation date
   * Number of files
   * Creator

***

#### Update Existing Versions

You **cannot edit** a dataset version after creation. To modify:

1. Create a new version based on the old one
2. Add or remove files
3. Save with a new version name

**For complex updates** (excluding specific files, starting from existing versions), see [Update Dataset Versions](/data/datasets/update-datasets.md).

***

### Dataset Aliases

Aliases let you reference dataset versions with human-readable names instead of hardcoding version names in your code.

#### The `latest` Alias

Every dataset automatically has a `latest` alias pointing to the newest version:

```yaml
inputs:
  - name: training-data
    default: dataset://my-dataset/latest  # Always uses newest version
```

**No setup required** — `latest` updates automatically when you create new versions.

***

#### Custom Aliases

Create your own aliases for environment management or workflow stages.

**Example use cases:**

```yaml
- dataset://my-images/production      # Current production dataset
- dataset://my-images/staging         # Being tested
- dataset://my-images/baseline        # Original benchmark dataset
- dataset://my-images/experiment-42   # Specific experiment version
```

***

#### Create Alias via Web UI

1. Open your dataset
2. Navigate to the **Aliases** tab
3. Click **"Create new dataset version alias"**
4. Enter alias name (e.g., `production`)
5. Select the dataset version to point to
6. Save

**Update an alias:**

1. Find the alias in the Aliases tab
2. Click **"Edit"**
3. Select a different version
4. Save

The UI tracks alias history—see when it was changed and what it pointed to before.

***

#### Create Alias via Code

Set aliases when creating dataset versions:

```python
import json

metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-models/v5",
                "from": "dataset://my-models/v4",
                "targeting_aliases": ["production", "stable"],  # Creates/updates these aliases
            },
        ],
    },
}

save_path = "/valohai/outputs/model.pkl"
model.save(save_path)

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**What happens:**

* Creates version `v5` based on `v4`
* Updates `production` alias to point to `v5`
* Updates `stable` alias to point to `v5`
* If aliases don't exist, they're created
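
The dictionary form of a dataset version entry can be assembled by a small function so that optional keys are only included when needed. This is a sketch: `dataset_version_entry` is a hypothetical helper, and it assumes that an entry containing only `uri` is accepted in the same way as the plain string form used earlier.

```python
from typing import Optional, Sequence


def dataset_version_entry(uri: str, base: Optional[str] = None,
                          aliases: Sequence[str] = ()) -> dict:
    """Build one item for the "valohai.dataset-versions" list."""
    entry = {"uri": uri}
    if base is not None:
        entry["from"] = base  # start from an existing version
    if aliases:
        entry["targeting_aliases"] = list(aliases)  # aliases to create or move
    return entry


metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [
            dataset_version_entry(
                "dataset://my-models/v5",
                base="dataset://my-models/v4",
                aliases=["production", "stable"],
            )
        ],
    },
}
```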

***

#### Use Aliases in Pipelines

```yaml
- step:
    name: train-production-model
    image: python:3.9
    command: python train.py
    inputs:
      - name: training-data
        default: dataset://customer-data/production
      - name: validation-data
        default: dataset://customer-data/staging
```

When you promote a dataset version to production, just update the alias—no code changes needed.
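
To see why no code change is needed, it can help to think of an alias as a movable pointer to a version. The toy model below is purely illustrative: `DatasetCatalog` is a hypothetical in-memory class, not how Valohai stores or resolves aliases.

```python
class DatasetCatalog:
    """Toy in-memory stand-in for one dataset's versions and aliases."""

    def __init__(self, versions):
        self.versions = set(versions)
        self.aliases = {}

    def set_alias(self, alias: str, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.aliases[alias] = version

    def resolve(self, ref: str) -> str:
        """Return the concrete version behind an alias or version name."""
        if ref in self.aliases:
            return self.aliases[ref]
        if ref in self.versions:
            return ref
        raise KeyError(f"unknown version or alias: {ref}")


catalog = DatasetCatalog(["train-v1", "train-v2"])
catalog.set_alias("production", "train-v1")
before = catalog.resolve("production")

catalog.set_alias("production", "train-v2")  # the promotion step
after = catalog.resolve("production")

print(before, "->", after)  # train-v1 -> train-v2
```

The reference `production` never changes; only the version it points to does.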

***

#### Alias Best Practices

**Environment-based:**

```python
'targeting_aliases': ['dev', 'staging', 'production']
```

**Workflow stages:**

```python
'targeting_aliases': ['preprocessing-done', 'validated', 'ready-for-training']
```

**Experiment tracking:**

```python
'targeting_aliases': ['baseline', 'experiment-current', 'best-so-far']
```

***

### Directory Structure in Datasets

How files are organized in `/valohai/inputs/` depends on how they were saved originally.

#### Flat Structure

If source files were saved without directories:

```
/valohai/inputs/my-dataset/
├── image001.jpg
├── image002.jpg
└── image003.jpg
```

**Access in code:**

```python
import os

input_dir = "/valohai/inputs/my-dataset/"
files = os.listdir(input_dir)
for filename in files:
    filepath = os.path.join(input_dir, filename)
    process_image(filepath)
```

***

#### Nested Structure

If the files were stored in subdirectories and the input uses `keep-directories` to preserve them:

```
/valohai/inputs/my-dataset/
├── cats/
│   ├── cat001.jpg
│   └── cat002.jpg
├── dogs/
│   ├── dog001.jpg
│   └── dog002.jpg
└── birds/
    └── bird001.jpg
```

**Access in code:**

```python
import os

input_dir = "/valohai/inputs/my-dataset/"

# Process by subdirectory (class label)
for class_name in os.listdir(input_dir):
    class_dir = os.path.join(input_dir, class_name)
    if os.path.isdir(class_dir):
        print(f"Processing class: {class_name}")
        for filename in os.listdir(class_dir):
            filepath = os.path.join(class_dir, filename)
            process_image(filepath, label=class_name)
```

**The structure depends on:**

* How files were originally uploaded or generated
* The `keep-directories` setting on the input that loads the files
* See [Ingest](/data/data-versioning/load-files-in-jobs.md) & [Save Files](/data/data-versioning/save-files-from-jobs.md) for details on preserving directory structure
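
When you are not sure which layout an input will have, a recursive walk handles both cases with the same code. A minimal sketch using only the standard library; `iter_input_files` is a hypothetical helper name.

```python
import os


def iter_input_files(input_dir: str):
    """Yield (relative_path, absolute_path) for every file under input_dir."""
    for root, _dirs, files in os.walk(input_dir):
        for filename in sorted(files):
            absolute = os.path.join(root, filename)
            yield os.path.relpath(absolute, input_dir), absolute


input_dir = "/valohai/inputs/my-dataset/"
for relative, absolute in iter_input_files(input_dir):
    # The parent folder (e.g. "cats") can serve as a label; flat layouts give None
    label = os.path.dirname(relative) or None
    print(relative, label)
```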

***

### Performance: Package Files Together

> ⚠️ **Important for large datasets:** Downloading millions of individual small files is slow, even with fast networks.

#### The Problem

```
Slow: 2 million individual 10KB files
⏱️  Download time: Hours due to overhead per file
```

#### The Solution

Package related files together before creating datasets:

```python
import tarfile

# Package images into tar file (no compression needed)
with tarfile.open("/valohai/outputs/images.tar", "w") as tar:
    tar.add("/valohai/outputs/images/", arcname="images")

# Add packaged file to dataset
metadata = {
    "images.tar": {
        "valohai.dataset-versions": ["dataset://my-images/v1"],
    },
}
```

**Benefits:**

* Fast: Single file download
* Atomic: All-or-nothing download
* No compression overhead (tar without gzip)
* Preserves directory structure

**In your training code:**

```python
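import os
import tarfile

# Extract once at the start of the execution
# (the path assumes the input is named "images")
with tarfile.open("/valohai/inputs/images/images.tar", "r") as tar:
    tar.extractall("/tmp/")

# The archive was created with arcname="images", so files land in /tmp/images/
image_dir = "/tmp/images/"
for filename in os.listdir(image_dir):
    process_image(os.path.join(image_dir, filename))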
```

> 💡 **When to package:** If your dataset has more than 10,000 small files, strongly consider packaging them. Extracting one archive at the start of the execution is much faster than downloading thousands of individual files.
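
The packaging round trip can be tried locally before using it in an execution. The sketch below runs in a temporary directory; `pack` and `unpack` are hypothetical helper names, and the file content is a stand-in for real image data.

```python
import os
import tarfile
import tempfile


def pack(source_dir: str, tar_path: str, arcname: str) -> None:
    """Store source_dir in an uncompressed tar under the top-level name arcname."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(source_dir, arcname=arcname)


def unpack(tar_path: str, target_dir: str) -> None:
    """Extract the archive; only do this with archives you created yourself."""
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(target_dir)


with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "staging")
    os.makedirs(os.path.join(source, "cats"))
    with open(os.path.join(source, "cats", "cat001.jpg"), "wb") as f:
        f.write(b"fake image bytes")

    tar_path = os.path.join(work, "images.tar")
    pack(source, tar_path, arcname="images")

    target = os.path.join(work, "extracted")
    os.makedirs(target)
    unpack(tar_path, target)
    # The arcname becomes the top-level folder inside the target directory
    print(os.path.exists(os.path.join(target, "images", "cats", "cat001.jpg")))
```

Note that the `arcname` chosen when packing determines the folder name you get after extraction, so the reading code has to use the same name.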

***

### Common Issues & Fixes

#### Dataset Version Not Created

**Symptom:** Execution completes successfully but the dataset version doesn't appear

**How to diagnose:**

1. Open the execution in Valohai UI
2. Click the **Alerts** tab (top of execution page)
3. Look for dataset creation errors or warnings

<figure><img src="/files/VhihQzG661QJ8auX6hra" alt=""><figcaption></figcaption></figure>

**Common causes:**

* Invalid version name → Use only alphanumeric characters, hyphens, and underscores
* Metadata file not saved → Verify `valohai.metadata.jsonl` exists in outputs
* JSON syntax error → Validate JSON format
* Wrong metadata structure → Check `{"file": "...", "metadata": {...}}` format
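
Several of the causes above can be caught before the execution ends by checking the metadata lines yourself. This is a sketch, not an official validator: `check_metadata_lines` is a hypothetical helper, and the version name pattern is an assumption derived from the rule in the list above, so the platform's actual rules may differ.

```python
import json
import re

VERSION_NAME = re.compile(r"[A-Za-z0-9_-]+")  # assumed from the rule above


def check_metadata_lines(lines) -> list:
    """Return a list of human-readable problems found in JSONL metadata lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        metadata = entry.get("metadata") if isinstance(entry, dict) else None
        if not isinstance(metadata, dict) or "file" not in entry:
            problems.append(f"line {number}: expected 'file' and 'metadata' keys")
            continue
        for item in metadata.get("valohai.dataset-versions", []):
            uri = item.get("uri", "") if isinstance(item, dict) else str(item)
            version = uri.rpartition("/")[2]
            if not VERSION_NAME.fullmatch(version):
                problems.append(f"line {number}: suspicious version name {version!r}")
    return problems


sample = [
    '{"file": "train.csv", "metadata": {"valohai.dataset-versions": ["dataset://customer-data/train-v2"]}}',
    '{"file": "test.csv", "metadata": {"valohai.dataset-versions": ["dataset://customer-data/test v2"]}}',
    '{"file": "broken.csv", ',
]
for problem in check_metadata_lines(sample):
    print(problem)
```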

***

#### Wrong Files in Dataset

**Symptom:** Dataset version contains unexpected files or is missing files

**Causes & Fixes:**

* Typo in filename → Filenames in metadata must match output files exactly
* Files not saved before metadata → Save all output files before writing metadata
* Wrong dataset URI in metadata → Double-check dataset name and version

**Debug:**

```python
import os
import json

# List what was actually saved
print("Output files:", os.listdir("/valohai/outputs/"))

# Verify metadata content
with open("/valohai/outputs/valohai.metadata.jsonl", "r") as f:
    for line in f:
        print("Metadata entry:", json.loads(line))
```
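
Comparing the two lists by eye gets tedious with many files. The sketch below reports every file name that the metadata mentions but that is not present in the output directory; `files_missing_from_outputs` is a hypothetical helper, not part of a Valohai library.

```python
import json
import os


def files_missing_from_outputs(output_dir: str) -> list:
    """Return file names listed in valohai.metadata.jsonl that are absent from output_dir."""
    metadata_path = os.path.join(output_dir, "valohai.metadata.jsonl")
    missing = []
    with open(metadata_path) as f:
        for line in f:
            if not line.strip():
                continue
            file_name = json.loads(line)["file"]
            if not os.path.isfile(os.path.join(output_dir, file_name)):
                missing.append(file_name)
    return missing


if os.path.isfile("/valohai/outputs/valohai.metadata.jsonl"):
    print("Missing outputs:", files_missing_from_outputs("/valohai/outputs"))
```

An empty list means every file named in the metadata was actually saved.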

***

#### Can't Use Dataset in Execution

**Symptom:** Input shows `dataset://...` but execution fails with "not found"

**Causes & Fixes:**

* Typo in dataset URI → Check dataset name and version spelling
* Version doesn't exist → Verify the version was created in the **Data → Datasets** tab
* Wrong project → Dataset must be in same project as execution
* Permission issue → Check dataset ownership (private vs organization)

