# Update Dataset Versions

Create new dataset versions by building on existing ones—add new files, exclude specific files, or start fresh while maintaining version lineage.

***

### When to Update Dataset Versions

Use incremental versioning when you need to:

* **Add new data:** Append newly processed files to an existing dataset
* **Remove bad data:** Exclude specific files while keeping the rest
* **Swap files:** Replace specific files without recreating the entire dataset
* **A/B test data:** Create variant datasets by excluding different subsets

***

### Basic Update Pattern

The standard approach creates a new version based on an existing one:

```python
import json

metadata = {
    "new_file.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v3",  # New version to create
                "from": "dataset://my-dataset/v2",  # Base version
                "start_fresh": False,  # Include files from v2
                "exclude": ["bad_file.csv", "old_file.csv"],  # Remove these specific files
            },
        ],
    },
}

# Save your new output file
new_data.to_csv("/valohai/outputs/new_file.csv")

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**What happens:**

* Creates `v3` based on `v2`
* Includes all files from `v2` except `bad_file.csv` and `old_file.csv`
* Adds `new_file.csv` to the new version
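
Every example on this page ends with the same loop that writes `valohai.metadata.jsonl`. As a minimal sketch, that loop could be wrapped in two small helpers. The names `metadata_to_jsonl`, `write_metadata`, and `DEFAULT_METADATA_PATH` are hypothetical and not part of any Valohai library; only the path and the `{"file": ..., "metadata": ...}` record shape come from the example above.

```python
import json

# Path used throughout this page for the metadata sidecar file
DEFAULT_METADATA_PATH = "/valohai/outputs/valohai.metadata.jsonl"


def metadata_to_jsonl(metadata):
    """Serialize {filename: file_metadata} into one JSON object per line."""
    lines = [
        json.dumps({"file": file_name, "metadata": file_metadata})
        for file_name, file_metadata in metadata.items()
    ]
    return "".join(line + "\n" for line in lines)


def write_metadata(metadata, path=DEFAULT_METADATA_PATH):
    """Write the JSONL text to the metadata file (hypothetical helper)."""
    with open(path, "w") as outfile:
        outfile.write(metadata_to_jsonl(metadata))


# Same metadata as the basic update pattern above
example = {
    "new_file.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-dataset/v3",
                "from": "dataset://my-dataset/v2",
                "start_fresh": False,
                "exclude": ["bad_file.csv", "old_file.csv"],
            },
        ],
    },
}

# Inspect the text that would be written, without touching the filesystem
print(metadata_to_jsonl(example), end="")
```

Inside an execution, calling `write_metadata(example)` with no second argument would write to the path used on this page. Keeping serialization separate from file writing makes it easy to inspect the output locally first.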

***

### Update Parameters

#### `uri` (required)

The new dataset version to create.

```python
'uri': "dataset://my-dataset/v3"
```

***

#### `from` (optional)

Base the new version on an existing version.

```python
'from': "dataset://my-dataset/v2"
```

**If omitted:** New version starts empty (only includes files in current metadata).
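
For comparison, an entry without `from` might look like the sketch below. The filename `first_file.csv` and the version name `v1` are placeholders chosen for illustration.

```python
import json

# No "from" key: per the note above, the new version contains only
# the files listed in this metadata.
metadata = {
    "first_file.csv": {  # placeholder filename
        "valohai.dataset-versions": [
            {"uri": "dataset://my-dataset/v1"},
        ],
    },
}

# One JSONL record, in the same shape used elsewhere on this page
line = json.dumps({"file": "first_file.csv", "metadata": metadata["first_file.csv"]})
print(line)
```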

***

#### `start_fresh` (optional, default: `False`)

Controls whether to include files from the base version.

```python
# Include all files from base (default)
'start_fresh': False

# Exclude all files from base
'start_fresh': True
```

***

#### `exclude` (optional)

List of filenames to exclude from the base version.

```python
'exclude': ['file1.csv', 'file2.csv', 'file3.csv']
```

**Important:** Filenames are exact matches (case-sensitive).
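
To reason about how `start_fresh` and `exclude` interact, the rules described on this page can be modeled locally with plain sets. This is only a sketch of the documented behavior, not a Valohai API: `preview_version_contents` is a hypothetical helper, and `good_file.csv` is an invented filename.

```python
def preview_version_contents(base_files, new_files, start_fresh=False, exclude=()):
    """Return the sorted filenames the new version is expected to contain.

    Models the rules on this page: base files are inherited unless
    start_fresh is True, exclude removes exact (case-sensitive) matches
    from the base, and files from the current metadata are always added.
    """
    kept = set() if start_fresh else set(base_files) - set(exclude)
    return sorted(kept | set(new_files))


base = ["bad_file.csv", "good_file.csv", "old_file.csv"]

# Exact names are removed from the base
print(preview_version_contents(base, ["new_file.csv"], exclude=["bad_file.csv", "old_file.csv"]))
# ['good_file.csv', 'new_file.csv']

# A case mismatch removes nothing
print(preview_version_contents(base, ["new_file.csv"], exclude=["Bad_File.csv"]))
# ['bad_file.csv', 'good_file.csv', 'new_file.csv', 'old_file.csv']
```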

***

#### `targeting_aliases` (optional)

Update dataset aliases to point to the new version.

```python
'targeting_aliases': ['production', 'stable']
```

See [Dataset Aliases](/data/datasets/creating-datasets.md#dataset-aliases) for details.

***

### Common Update Patterns

#### Add New Files to Existing Dataset

Keep all existing files, add new ones:

```python
import json

# Process and save new files
for i in range(10, 15):
    new_data = process_data(i)
    new_data.to_csv(f"/valohai/outputs/data_{i:03d}.csv")

# Add new files to dataset
metadata = {}
for i in range(10, 15):
    metadata[f"data_{i:03d}.csv"] = {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://training-data/v2",
                "from": "dataset://training-data/v1",
                "start_fresh": False,  # Keep all files from v1
            },
        ],
    }

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**Result:**

* `v2` = All files from `v1` + `data_010.csv` through `data_014.csv`

***

#### Remove Specific Files

Exclude files that failed validation:

```python
import json

# Files that failed quality checks
failed_files = ["corrupted_001.csv", "outlier_042.csv", "invalid_099.csv"]

# Create new version without failed files
metadata = {
    "validation_report.txt": {  # Can add a new file too
        "valohai.dataset-versions": [
            {
                "uri": "dataset://clean-data/v2",
                "from": "dataset://clean-data/v1",
                "start_fresh": False,
                "exclude": failed_files,  # Remove these
            },
        ],
    },
}

# Save validation report
with open("/valohai/outputs/validation_report.txt", "w") as f:
    f.write(f"Removed {len(failed_files)} files\n")

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**Result:**

* `v2` = All files from `v1` except the three failed files + validation report

> :warning: A file declared in the current metadata (here, **validation\_report.txt**) can't be excluded from the new dataset version, even if you add it to the `exclude` list.

***

#### Replace Specific Files

Exclude old files and add updated versions:

```python
import json

# Save updated files
updated_data.to_csv("/valohai/outputs/data_001.csv")  # New version of data_001
updated_data.to_csv("/valohai/outputs/data_005.csv")  # New version of data_005

metadata = {
    "data_001.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v3",
                "from": "dataset://my-data/v2",
                "start_fresh": False,
                "exclude": ["data_001.csv", "data_005.csv"],
            },
        ],
    },
    "data_005.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v3",
                "from": "dataset://my-data/v2",
                "start_fresh": False,
                "exclude": ["data_001.csv", "data_005.csv"],
            },
        ],
    },
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**Result:**

* `v3` = All files from `v2` except old `data_001.csv` and `data_005.csv` + new versions

***

#### Start Fresh (Link Versions Without Files)

Create a new version lineage without inheriting files:

```python
import json

# Create completely new dataset version
new_data.to_csv("/valohai/outputs/new_approach.csv")

metadata = {
    "new_approach.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://experiments/v2",
                "from": "dataset://experiments/v1",
                "start_fresh": True,  # Don't include any files from v1
            },
        ],
    },
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

**Result:**

* `v2` = Only `new_approach.csv` (no files from `v1`)
* Version history shows `v2` was based on `v1` (for tracking)

**Use case:** You want to track version lineage (this experiment followed that one) but don't want to inherit data.

***

#### Create A/B Test Variants

Split dataset into two variants for comparison:

```python
import json

# Save a marker file for each variant
with open("/valohai/outputs/variant_a_marker.txt", "w") as f:
    f.write("Variant A configuration\n")

# Variant A: Exclude group B files
metadata = {
    "variant_a_marker.txt": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://ab-test/variant-a",
                "from": "dataset://ab-test/baseline",
                "start_fresh": False,
                "exclude": ["group_b_001.csv", "group_b_002.csv", "group_b_003.csv"],
                "targeting_aliases": ["current-a"],
            },
        ],
    },
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as outfile:
    for file_name, file_metadata in metadata.items():
        json.dump({"file": file_name, "metadata": file_metadata}, outfile)
        outfile.write("\n")
```

Run a second execution for Variant B that excludes group A files instead.
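
The Variant B metadata could mirror Variant A, as in the sketch below. The group A filenames, the `variant-b` version name, the `variant_b_marker.txt` file, and the `current-b` alias are assumptions made by analogy with the Variant A example.

```python
import json

# Hypothetical group A filenames, mirroring the group B names used for Variant A
group_a_files = ["group_a_001.csv", "group_a_002.csv", "group_a_003.csv"]

# Variant B: exclude group A files
variant_b_metadata = {
    "variant_b_marker.txt": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://ab-test/variant-b",
                "from": "dataset://ab-test/baseline",
                "start_fresh": False,
                "exclude": group_a_files,
                "targeting_aliases": ["current-b"],  # assumed alias name
            },
        ],
    },
}

# Build the JSONL records in the same shape as the Variant A example
jsonl_lines = [
    json.dumps({"file": name, "metadata": meta})
    for name, meta in variant_b_metadata.items()
]
print("\n".join(jsonl_lines))
```

As with Variant A, the second execution would also need to save `variant_b_marker.txt` to `/valohai/outputs/` and write these records to `valohai.metadata.jsonl`.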

***

### Update via Web UI

The UI provides a simpler way to create new versions based on existing ones.

1. Open your dataset in **Data → Datasets**
2. Find the version you want to base on
3. Click the **`...`** menu at the end of the version row
4. Select **"Create new version from this version"**
5. The new version starts with all files from the base version
6. Add new files or remove unwanted files
7. Name the new version
8. Save

**Limitations:** The UI doesn't support the `exclude` parameter—you must manually remove files one by one.

***

### Combining Multiple Updates

A file's metadata can list several entries that target the same new version, each with a different base version:

```python
import json

metadata = {
    "combined_data.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://combined/v1",
                "from": "dataset://source-a/latest",
                "start_fresh": False,
            },
            {
                "uri": "dataset://combined/v1",  # Same target version
                "from": "dataset://source-b/latest",
                "start_fresh": False,
            },
        ],
    },
}
```

**Result:** `combined/v1` includes files from both `source-a/latest` and `source-b/latest`.

***

### Best Practices

#### Use Descriptive Version Names

```python
# Good: Clear what changed
'uri': "dataset://my-data/v2-removed-outliers"
'uri': "dataset://my-data/v3-added-aug-2024"

# Avoid: Generic names
'uri': "dataset://my-data/new"
'uri': "dataset://my-data/fixed"
```

***

#### Document Changes

Add a marker file explaining what changed:

```python
# Create changelog file
with open("/valohai/outputs/CHANGELOG.txt", "w") as f:
    f.write("Version v3 changes:\n")
    f.write("- Removed 15 corrupted files\n")
    f.write("- Added 50 new validated samples\n")
    f.write("- Reprocessed all images with new augmentation\n")

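# Include in metadata. `corrupted_files` is the list of filenames flagged by
# your own validation step; it is not defined in this snippet.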
metadata = {
    "CHANGELOG.txt": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://my-data/v3",
                "from": "dataset://my-data/v2",
                "start_fresh": False,
                "exclude": corrupted_files,
            },
        ],
    },
}
```

***

#### Validate Before Creating New Version

```python
import os
import json

# Verify files exist before adding to dataset
output_dir = "/valohai/outputs/"
expected_files = ["data_001.csv", "data_002.csv", "data_003.csv"]

for filename in expected_files:
    filepath = os.path.join(output_dir, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Expected file missing: {filename}")

    # Verify file is not empty
    if os.path.getsize(filepath) == 0:
        raise ValueError(f"File is empty: {filename}")

# Now safe to create metadata
metadata = {
    filename: {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://validated-data/v2",
                "from": "dataset://validated-data/v1",
                "start_fresh": False,
            },
        ],
    }
    for filename in expected_files
}
```

***

#### Use Aliases for Promotion Workflow

```python
# Development → Staging → Production pipeline
metadata = {
    "model.pkl": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://models/v5",
                "from": "dataset://models/v4",
                "start_fresh": False,
                "targeting_aliases": ["staging"],  # Promote to staging first
            },
        ],
    },
}

# After validation, manually update 'production' alias to point to v5
```

***

### Common Issues & Fixes

#### Files Not Excluded

**Symptom:** Files listed in `exclude` still appear in the new version

**Causes & Fixes:**

* Typo in filename → Filenames must match exactly (case-sensitive)
* File path included → Use just filename, not full path
* `start_fresh: True` → No base files are inherited, so `exclude` has nothing to remove

**Debug:**

```python
# Print exact filenames from base version
import os

from valohai import inputs

for file_path in inputs("base-dataset").paths():
    print(f"Filename: {os.path.basename(file_path)}")
```
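
Once you have the base filenames, the three causes above can be checked mechanically. The sketch below is a hypothetical helper (`check_exclude_list` is not a Valohai function) that compares an `exclude` list against a list of base filenames and reports entries that would not match.

```python
import os


def check_exclude_list(exclude, base_filenames):
    """Return {entry: reason} for exclude entries that will not match a base file."""
    exact = set(base_filenames)
    by_lower = {name.lower(): name for name in base_filenames}
    problems = {}
    for entry in exclude:
        if entry in exact:
            continue  # exact, case-sensitive match: nothing to report
        basename = os.path.basename(entry)
        if basename != entry and basename in exact:
            problems[entry] = f"contains a path; use '{basename}'"
        elif entry.lower() in by_lower:
            problems[entry] = f"case mismatch; did you mean '{by_lower[entry.lower()]}'?"
        else:
            problems[entry] = "no file with this name in the base version"
    return problems


# Illustrative filenames only
base_filenames = ["data_001.csv", "data_002.csv"]
exclude = [
    "data_001.csv",
    "Data_002.csv",
    "/valohai/inputs/base-dataset/data_001.csv",
    "missing.csv",
]

for entry, reason in check_exclude_list(exclude, base_filenames).items():
    print(f"{entry}: {reason}")
```

In an execution, `base_filenames` could be built from the basenames printed by the debug snippet above.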

***

#### Base Version Files Not Included

**Symptom:** New version is empty or missing files from base version

**Causes & Fixes:**

* `start_fresh: True` → Set to `False` to include base files
* Missing `from` parameter → Specify which version to base on
* Wrong base version URI → Verify version exists and spelling is correct

***

#### New Version Not Created

**Symptom:** Execution completes but dataset version doesn't appear

**How to diagnose:**

1. Open the execution in the Valohai UI
2. Click **Alerts** tab
3. Look for dataset creation errors

**Common causes:**

* Invalid JSON in metadata → Validate JSON syntax
* Missing output files → Ensure all files in metadata were actually saved
* Base version doesn't exist → Verify `from` version exists
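
The first two causes can be caught before the execution finishes. The following sketch checks the text of a metadata file line by line: each line must parse as JSON, carry the `file` and `metadata` keys used on this page, and refer to a file that was actually saved. `validate_metadata_jsonl` is a hypothetical helper, and it cannot verify that the `from` version exists.

```python
import json


def validate_metadata_jsonl(text, saved_files):
    """Return a list of problems found in valohai.metadata.jsonl content."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict) or "file" not in record or "metadata" not in record:
            problems.append(f"line {number}: expected 'file' and 'metadata' keys")
            continue
        if record["file"] not in saved_files:
            problems.append(f"line {number}: output file not saved: {record['file']}")
    return problems


# Illustrative content: line 1 names a file that was never saved,
# line 2 is missing a comma between its keys
sample = (
    '{"file": "b.csv", "metadata": {}}\n'
    '{"file": "c.csv" "metadata": {}}\n'
)
for problem in validate_metadata_jsonl(sample, saved_files={"a.csv"}):
    print(problem)
```

In an execution, `saved_files` could be `set(os.listdir("/valohai/outputs"))` and `text` the contents of `valohai.metadata.jsonl`.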

***

### Related Pages

* [Create and Manage Datasets](/data/datasets/creating-datasets.md) — Core dataset concepts and creation
* [Add Context to Your Data Files](/data/data-versioning/metadata-overview.md) — Metadata system overview
* [Load Data in Jobs](/data/data-versioning/load-files-in-jobs.md) — Use dataset versions as inputs

***

### Next Steps

* Practice updating a dataset by excluding specific files
* Set up a validation → production promotion workflow using aliases
* Create A/B test variants from a baseline dataset
* Learn about [dataset packaging](/data/datasets/package-datasets.md) for large collections


