Create and Manage Datasets

Datasets are versioned collections of files that simplify working with multiple related files. Use datasets for training/validation splits, image classification folders, or any workflow requiring coordinated file groups.


The Problem with Individual Files

💡 Quick recap: In Valohai, individual files are called datums. Each datum has a unique datum:// link you can use as an input. See Load Data in Jobs for details.

Datums work well for single files, but become complex when managing collections:

  • Updating 50 image files? You'd need to update 50 datum links

  • Maintaining train/validation/test splits? Hard to keep them synchronized

  • Versioning related files together? No built-in way to track the group

Example problem:

# Managing individual files becomes tedious
inputs:
  - name: train-images
    default:
      - datum://abc123...
      - datum://def456...
      - datum://ghi789...
      # ... 47 more files

Datasets Solve This

Datasets group related files into versioned collections.

Same workflow, cleaner: a single dataset://name/version reference replaces the long list of datum links.

Key benefits:

  • Group related files — One reference points to entire collection

  • Version together — Update all files as a unit

  • Track changes — See what changed between versions

  • Immutable versions — Each version is locked once created

  • Flexible access — Use latest, specific versions, or aliases


Datasets vs Datums

Feature    | Datum                             | Dataset
-----------|-----------------------------------|-----------------------------------------------
Reference  | Single file                       | Collection of files
URI format | datum://file-id                   | dataset://name/version
Use when   | One model file, one CSV           | Image folders, data splits, multi-file outputs
Versioning | Each file versioned independently | Files versioned together as a group
Updates    | Create new datum                  | Create new dataset version


When to Use Datasets

Training/Validation/Test Splits

Keep data splits synchronized:

When you update the data, create new versions (v4) and all splits stay aligned.


Image Classification

Organize images by class:

Learn more about directory structure: See Directory Structure in Datasets below.


Multi-File Model Artifacts

Package related model files together:


Create a Dataset

Datasets have two levels:

  1. Dataset — The container with a name (e.g., my-images)

  2. Dataset Version — Specific collection of files (e.g., v1, v2, latest)

You must create both.


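Create datasets programmatically when saving execution outputs by writing a valohai.metadata.jsonl file alongside them. Each line of that file describes one output file in the form {"file": "...", "metadata": {...}}.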

Basic Dataset Creation
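A minimal sketch of this step, assuming dataset membership is declared through a valohai.dataset-versions metadata property and that outputs live in a directory such as /valohai/outputs. The property name, the helper write_dataset_metadata, and any filenames passed to it are assumptions for illustration, not confirmed by this page.

```python
import json
from pathlib import Path


def write_dataset_metadata(outputs_dir, filenames, dataset_uri):
    """Write one JSONL line per output file, assigning each file to dataset_uri."""
    outputs = Path(outputs_dir)
    outputs.mkdir(parents=True, exist_ok=True)
    metadata_path = outputs / "valohai.metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as handle:
        for name in filenames:
            entry = {
                "file": name,
                # Assumed property name for dataset membership.
                "metadata": {"valohai.dataset-versions": [dataset_uri]},
            }
            handle.write(json.dumps(entry) + "\n")
    return metadata_path
```

Inside an execution, you would call this after saving the output files, for example with three image filenames and the URI dataset://my-images/v1.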

What happens:

  • If dataset my-images doesn't exist, it's created automatically

  • Version v1 is created with the three specified files

  • Files are available as dataset://my-images/v1


Create Training/Validation Split
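One way to sketch a split, assuming tabular rows and the same valohai.dataset-versions metadata property (an assumption, as is the save_split helper). The train-v2 version name comes from this guide; validation-v2 is a hypothetical companion version. Output files are written before the metadata file.

```python
import csv
import json
import random
from pathlib import Path

# Assumed metadata property; validation-v2 is a hypothetical version name.
DATASET_KEY = "valohai.dataset-versions"
SPLIT_URIS = {
    "train.csv": "dataset://customer-data/train-v2",
    "validation.csv": "dataset://customer-data/validation-v2",
}


def save_split(outputs_dir, header, rows, train_fraction=0.8, seed=42):
    """Shuffle rows, write both CSV splits, then write the JSONL metadata."""
    outputs = Path(outputs_dir)
    outputs.mkdir(parents=True, exist_ok=True)

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    splits = {"train.csv": shuffled[:cut], "validation.csv": shuffled[cut:]}

    # Save every output file first.
    for filename, split_rows in splits.items():
        with (outputs / filename).open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(split_rows)

    # Only then describe the files in the metadata file.
    with (outputs / "valohai.metadata.jsonl").open("w", encoding="utf-8") as handle:
        for filename in splits:
            entry = {"file": filename, "metadata": {DATASET_KEY: [SPLIT_URIS[filename]]}}
            handle.write(json.dumps(entry) + "\n")

    return {name: len(split_rows) for name, split_rows in splits.items()}
```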

Now use dataset://customer-data/train-v2 in your training pipeline.


Legacy Approach (Sidecar Files)

The older approach used individual .metadata.json files per output:
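As a rough sketch, a sidecar can be written next to each output like this. It assumes the sidecar holds the metadata object directly (without the "file" wrapper) and uses the valohai.dataset-versions property; both points, and the write_sidecar helper, are assumptions rather than details stated on this page.

```python
import json
from pathlib import Path


def write_sidecar(outputs_dir, filename, dataset_uri):
    """Write <filename>.metadata.json next to an already saved output file."""
    sidecar = Path(outputs_dir) / f"{filename}.metadata.json"
    # Assumed layout: the metadata object itself, one sidecar per output.
    payload = {"valohai.dataset-versions": [dataset_uri]}
    sidecar.write_text(json.dumps(payload), encoding="utf-8")
    return sidecar
```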

This still works, but the JSONL approach is recommended for better organization when handling multiple files.


Create via Web UI

Step 1: Create the Dataset Container

  1. Open your project

  2. Navigate to Data → Datasets tab

  3. Click "Create dataset"

  4. Enter a Name (e.g., my-images)

  5. Select Owner:

    • Your account — Private to you

    • Your organization — Shared with team

  6. Click "Create"


Step 2: Create a Dataset Version

  1. Click on your dataset name

  2. Click "Create new version"

  3. Select files to include:

    • Search by filename, tags, or data store

    • Click "Add" or "Add Selected" for multiple files

  4. Add or remove files until satisfied

  5. Enter a version name (e.g., v1, train-split-2024-q1)

  6. Click "Save new version"

Important: Once saved, dataset versions are immutable. You cannot edit them—only create new versions.


Use Datasets as Inputs

Reference datasets in your pipeline using dataset:// URIs.

In valohai.yaml

URI Formats
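The URIs in this guide follow the pattern dataset://name/version, where the last part is a version name (v1), the latest alias, or a custom alias (production). The sketch below splits such a URI for use in scripts or checks; parse_dataset_uri is a hypothetical helper, not part of any Valohai library.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    name: str
    version: str  # a version name, "latest", or a custom alias


def parse_dataset_uri(uri):
    """Split dataset://name/version into its two parts."""
    prefix = "dataset://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dataset URI: {uri!r}")
    name, separator, version = uri[len(prefix):].partition("/")
    if not name or not separator or not version:
        raise ValueError(f"expected dataset://name/version, got {uri!r}")
    return DatasetRef(name, version)
```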


In Code

All files from the dataset are downloaded to the input directory:
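A minimal sketch for listing those files, assuming each input gets its own subdirectory under /valohai/inputs/ named after the input. The subdirectory layout and the list_input_files helper are assumptions for illustration.

```python
from pathlib import Path


def list_input_files(input_name, inputs_root="/valohai/inputs"):
    """Return every file downloaded for one input, including nested ones."""
    input_dir = Path(inputs_root) / input_name
    return sorted(path for path in input_dir.rglob("*") if path.is_file())
```

For an input named train-images, list_input_files("train-images") would return the paths of all files from the referenced dataset version.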

Learn more: Load Data in Jobs


Dataset Versioning

Dataset versions are immutable once created. This ensures reproducibility—an execution using dataset://my-images/v2 will always get the exact same files.

Version Naming

Choose clear, descriptive version names, such as v1 or train-split-2024-q1.


Version History

Track all versions in the Valohai UI:

  1. Navigate to Data → Datasets

  2. Click on your dataset

  3. View the Versions table showing:

    • Version name

    • Creation date

    • Number of files

    • Creator


Update Existing Versions

You cannot edit a dataset version after creation. To modify:

  1. Create a new version based on the old one

  2. Add or remove files

  3. Save with a new version name

For complex updates (excluding specific files, starting from existing versions), see Update Dataset Versions.


Dataset Aliases

Aliases let you reference dataset versions with human-readable names instead of hardcoding version names in your code.

The latest Alias

Every dataset automatically has a latest alias pointing to the newest version:

No setup required. The latest alias updates automatically when you create new versions.


Custom Aliases

Create your own aliases for environment management or workflow stages.

Example use cases:


Create Alias via Web UI

  1. Open your dataset

  2. Navigate to the Aliases tab

  3. Click "Create new dataset version alias"

  4. Enter alias name (e.g., production)

  5. Select the dataset version to point to

  6. Save

Update an alias:

  1. Find the alias in the Aliases tab

  2. Click "Edit"

  3. Select a different version

  4. Save

The UI tracks alias history—see when it was changed and what it pointed to before.


Create Alias via Code

Set aliases when creating dataset versions:
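A hedged sketch of what such metadata might look like. It assumes a dataset version entry can be an object with uri, from, and targeting_aliases keys under the valohai.dataset-versions property; these key names and the write_version_with_aliases helper are assumptions, so check them against the Update Dataset Versions page before relying on them.

```python
import json
from pathlib import Path


def write_version_with_aliases(outputs_dir, filenames, new_uri, base_uri, aliases):
    """Describe a new dataset version, its base version, and the aliases to move."""
    outputs = Path(outputs_dir)
    outputs.mkdir(parents=True, exist_ok=True)
    version_spec = {
        "uri": new_uri,                      # version to create, e.g. .../v5
        "from": base_uri,                    # assumed key: version to start from
        "targeting_aliases": list(aliases),  # assumed key: aliases to point here
    }
    metadata_path = outputs / "valohai.metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as handle:
        for name in filenames:
            entry = {"file": name, "metadata": {"valohai.dataset-versions": [version_spec]}}
            handle.write(json.dumps(entry) + "\n")
    return metadata_path
```

A call with new_uri ending in /v5, base_uri ending in /v4, and aliases ["production", "stable"] corresponds to the scenario described here.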

What happens:

  • Creates version v5 based on v4

  • Updates production alias to point to v5

  • Updates stable alias to point to v5

  • If aliases don't exist, they're created


Use Aliases in Pipelines

When you promote a dataset version to production, just update the alias—no code changes needed.


Alias Best Practices

Environment-based:

Workflow stages:

Experiment tracking:


Directory Structure in Datasets

How files are organized in /valohai/inputs/ depends on how they were saved originally.

Flat Structure

If source files were saved without directories:

Access in code:


Nested Structure

If source files used keep-directories when loading:

Access in code:

The structure depends on:

  • How files were originally uploaded or generated

  • The keep-directories setting when files were saved as outputs

  • See Ingest & Save Files for details on preserving directory structure
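Because the layout can be flat or nested, code that reads a dataset is more robust when it walks the input directory recursively. The sketch below groups files by their top-level folder, which suits class-per-folder image datasets; files_by_label is a hypothetical helper, and the folder-as-label convention is an assumption.

```python
from pathlib import Path


def files_by_label(input_dir):
    """Group files by top-level folder; files in a flat layout get the label None."""
    root = Path(input_dir)
    grouped = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Nested layout: first path component is the label. Flat layout: no label.
        label = relative.parts[0] if len(relative.parts) > 1 else None
        grouped.setdefault(label, []).append(relative.as_posix())
    return grouped
```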


Performance: Package Files Together

⚠️ Important for large datasets: Downloading millions of individual small files is slow, even with fast networks.

The Problem

The Solution

Package related files together before creating datasets:
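A minimal packaging sketch using Python's standard tarfile module. Mode "w" writes a plain tar with no gzip compression; the package_directory helper and the paths you pass to it are illustrative assumptions.

```python
import tarfile
from pathlib import Path


def package_directory(source_dir, archive_path):
    """Pack source_dir into a single uncompressed tar, keeping its folder layout."""
    source = Path(source_dir)
    # Mode "w" means no compression, so packing and unpacking stay cheap.
    with tarfile.open(str(archive_path), "w") as archive:
        archive.add(str(source), arcname=source.name)
    return Path(archive_path)
```

Save the resulting archive as an output and add that single file to the dataset version instead of the individual files.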

Benefits:

  • Fast: Single file download

  • Atomic: All-or-nothing download

  • No compression overhead (tar without gzip)

  • Preserves directory structure

In your training code:
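A sketch of the extraction step, again with the standard tarfile module. The extract_archive helper and the paths are assumptions; the filter="data" argument is used only when the running Python version supports extraction filters.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a tar archive once and return the extracted file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r") as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            archive.extractall(str(target), filter="data")
        else:
            archive.extractall(str(target))
    return sorted(path for path in target.rglob("*") if path.is_file())
```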

💡 When to package: If your dataset has more than 10,000 small files, strongly consider packaging them. Extracting one archive is much faster than downloading thousands of individual files.


Common Issues & Fixes

Dataset Version Not Created

Symptom: Execution completes successfully but dataset version doesn't appear

How to diagnose:

  1. Open the execution in Valohai UI

  2. Click the Alerts tab (top of execution page)

  3. Look for dataset creation errors or warnings

Common causes:

  • Invalid version name → Use alphanumeric, hyphens, underscores only

  • Metadata file not saved → Verify valohai.metadata.jsonl exists in outputs

  • JSON syntax error → Validate JSON format

  • Wrong metadata structure → Check {"file": "...", "metadata": {...}} format
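The causes above can be checked locally before running an execution. This sketch validates each metadata line for JSON syntax, the {"file": ..., "metadata": {...}} shape, and version names limited to letters, digits, hyphens, and underscores. The valohai.dataset-versions property name and the check_metadata_lines helper are assumptions.

```python
import json
import re

# Alphanumeric characters, hyphens, and underscores only.
VERSION_NAME = re.compile(r"[A-Za-z0-9_-]+")


def check_metadata_lines(lines):
    """Return a list of human-readable problems found in JSONL metadata lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(entry, dict) or "file" not in entry \
                or not isinstance(entry.get("metadata"), dict):
            problems.append(f"line {number}: missing 'file' or 'metadata'")
            continue
        # Assumed property name; entries may be URI strings or objects with "uri".
        for ref in entry["metadata"].get("valohai.dataset-versions", []):
            uri = str(ref.get("uri", "")) if isinstance(ref, dict) else str(ref)
            version = uri.rsplit("/", 1)[-1]
            if not VERSION_NAME.fullmatch(version):
                problems.append(f"line {number}: invalid version name {version!r}")
    return problems
```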


Wrong Files in Dataset

Symptom: Dataset version contains unexpected files or is missing expected files

Causes & Fixes:

  • Typo in filename → Filenames in metadata must match output files exactly

  • Files not saved before metadata → Save all output files before writing metadata

  • Wrong dataset URI in metadata → Double-check dataset name and version

Debug:
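One possible check, sketched under the assumption that outputs and valohai.metadata.jsonl sit in the same directory: compare the filenames listed in the metadata with the files actually saved. The find_metadata_mismatches helper is hypothetical.

```python
import json
from pathlib import Path


def find_metadata_mismatches(outputs_dir):
    """Compare files on disk with the 'file' entries in valohai.metadata.jsonl."""
    outputs = Path(outputs_dir)
    metadata_path = outputs / "valohai.metadata.jsonl"
    saved = {
        path.relative_to(outputs).as_posix()
        for path in outputs.rglob("*")
        if path.is_file() and path != metadata_path
    }
    listed = set()
    for line in metadata_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            listed.add(json.loads(line)["file"])
    return {
        "missing_from_outputs": sorted(listed - saved),  # typos or unsaved files
        "not_in_metadata": sorted(saved - listed),       # files left out of the dataset
    }
```

Running this at the end of a job and printing the result to the execution log shows which names do not match exactly.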


Can't Use Dataset in Execution

Symptom: Input shows dataset://... but execution fails with "not found"

Causes & Fixes:

  • Typo in dataset URI → Check dataset name and version spelling

  • Version doesn't exist → Verify version was created in Data → Datasets tab

  • Wrong project → Dataset must be in same project as execution

  • Permission issue → Check dataset ownership (private vs organization)
