Add Context to Your Files

Your output files shouldn't exist in isolation. Attach experiment details, quality metrics, and production context directly to your files so your team can find, understand, and trust your data.


The Problem

Without metadata, files become black boxes:

  • Which experiment produced this model?

  • What was the validation accuracy?

  • Is this the production-ready version?

  • What preprocessing was applied to this dataset?

Tracking this information in spreadsheets, wikis, or README files breaks down as projects scale. Valohai solves this by collecting experiment and lineage metadata automatically, and letting you attach additional context directly to files.


Three Types of Metadata

Valohai supports three types of metadata, from simple to sophisticated:

1. Tags — Simple Labels

Organize and filter files with text labels.

Use for: Categorization, status tracking, quick filtering

Example: ["validated", "production", "experiment-42"]

Learn more: Organize Files with Tags


2. Aliases — Stable Pointers

Create human-readable shortcuts to specific files that can be updated over time.

Use for: Production references, "latest" pointers, team coordination

Example: datum://model-prod always points to current production model

Learn more: Create File Shortcuts with Aliases


3. Custom Properties — Rich Data

Store any structured data in JSON format.

Use for: Experiment tracking, quality metrics, production metadata

Example: {"accuracy": 0.95, "factory": "EU", "stage": "release"}

Learn more: Track Custom Metadata


Quick Comparison

| Type | Format | Mutable | Example Use Case |
| --- | --- | --- | --- |
| Tags | List of strings | Yes | Mark files as "validated" or "production-ready" |
| Aliases | Single string pointer | Yes (pointer updates) | Point "model-prod" to latest approved model |
| Properties | Any JSON | Yes | Store {"accuracy": 0.95, "hyperparams": {...}} |

💡 Tags and aliases are actually special property keys (valohai.tags and valohai.alias). You can combine all three in the same metadata file.


How to Add Metadata

You have three options for adding metadata to your files. Choose based on when you want to add it and how many files you're processing.

Decision Tree

┌─ Saving 1-2 files?
│  └─→ Use sidecar files (.metadata.json)
│
├─ Saving 3+ files?
│  └─→ Use single metadata file (valohai.metadata.jsonl) ← RECOMMENDED
│
└─ After execution completes?
   └─→ Use API

Method 1: Sidecar Files (1-2 Files)

Save a .metadata.json file alongside each output file.

Naming Rules (Critical!)

The metadata file must have the exact same name as your output file, plus .metadata.json:

Correct:
model.pkl → model.pkl.metadata.json
data.csv → data.csv.metadata.json
results.json → results.json.metadata.json

Wrong:
model.pkl → model.metadata.json (missing .pkl)
model.pkl → metadata.json (missing full filename)
data.csv → data.csv.meta.json (wrong extension)

Python Example

import json

# Your metadata (tags, alias, and custom properties)
metadata = {
    "valohai.tags": ["validated", "production"],
    "valohai.alias": "model-prod",
    "accuracy": 0.95,
    "epochs": 100
}

# Save your output file
save_path = '/valohai/outputs/model.pkl'
model.save(save_path)

# Save metadata file
metadata_path = f'{save_path}.metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f)
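The naming rule is easy to mistype, so it can help to wrap it in a small helper. A minimal sketch (write_sidecar is a local convenience function, not part of any Valohai library):

```python
import json

def write_sidecar(output_path, metadata):
    """Write a .metadata.json sidecar next to an output file.

    Appends the required .metadata.json suffix to the *full*
    output filename, per the naming rules above.
    """
    sidecar_path = f"{output_path}.metadata.json"
    with open(sidecar_path, "w") as f:
        json.dump(metadata, f)
    return sidecar_path

# Usage: write_sidecar('/valohai/outputs/model.pkl', {"accuracy": 0.95})
# creates /valohai/outputs/model.pkl.metadata.json
```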

Method 2: Single Metadata File (3+ Files)

When processing many files, creating individual .metadata.json files is tedious. Use one valohai.metadata.jsonl file instead.

Why This Is Better

Without JSONL (tedious):

100 output files = 200 total files
/valohai/outputs/image_001.jpg
/valohai/outputs/image_001.jpg.metadata.json
/valohai/outputs/image_002.jpg
/valohai/outputs/image_002.jpg.metadata.json
... (98 more pairs)

With JSONL (clean):

100 output files = 101 total files
/valohai/outputs/image_001.jpg
/valohai/outputs/image_002.jpg
... (98 more images)
/valohai/outputs/valohai.metadata.jsonl  ← One file for all metadata

Format Requirements

Filename: Must be exactly valohai.metadata.jsonl

Location: /valohai/outputs/valohai.metadata.jsonl

Format: JSON Lines (JSONL) — one JSON object per line, newline-separated

Each line must have this structure:

{"file": "output_filename.ext", "metadata": {"your": "properties"}}

⚠️ Important: JSONL requires a newline (\n) after each JSON object. Missing newlines will cause parsing errors.

⚠️ If an output file is not saved directly under /valohai/outputs/ but in one or more subdirectories (e.g. /valohai/outputs/subdir/subdir_2/file.txt), those subdirectories must be included in the value of the file field inside valohai.metadata.jsonl.

For example, for the file /valohai/outputs/subdir/subdir_2/file.txt the value of file should be: subdir/subdir_2/file.txt
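With the subdirectory rule applied, a valohai.metadata.jsonl might look like this (filenames and the "stage" property are illustrative):

```
{"file": "file.txt", "metadata": {"stage": "raw"}}
{"file": "subdir/subdir_2/file.txt", "metadata": {"stage": "processed"}}
```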

Python Example

import json

# Process many files
for i in range(100):
    # Save output file
    image_path = f'/valohai/outputs/image_{i:03d}.jpg'
    processed_image.save(image_path)

# Create single metadata file for all outputs
metadata_path = '/valohai/outputs/valohai.metadata.jsonl'
with open(metadata_path, 'w') as f:
    for i in range(100):
        metadata_entry = {
            "file": f"image_{i:03d}.jpg",
            "metadata": {
                "quality_score": scores[i],
                "processing_time": times[i],
                "valohai.tags": ["processed", "batch-2024-Q1"]
            }
        }
        json.dump(metadata_entry, f)
        f.write('\n')  # Critical: newline after each entry

Common JSONL Mistakes

# Wrong: Missing newlines
with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    json.dump({"file": "file1.jpg", "metadata": {...}}, f)
    json.dump({"file": "file2.jpg", "metadata": {...}}, f)  # No \n!

# Correct: Newline after each object
with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    json.dump({"file": "file1.jpg", "metadata": {...}}, f)
    f.write('\n')
    json.dump({"file": "file2.jpg", "metadata": {...}}, f)
    f.write('\n')
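You can catch both mistakes before an execution finishes by reading the file back and checking each line. A minimal sketch (validate_metadata_jsonl is a hypothetical local helper, not a Valohai API):

```python
import json

def validate_metadata_jsonl(path):
    """Check that every line in a JSONL metadata file parses as JSON
    and has the expected {"file": ..., "metadata": ...} structure.

    Returns a list of (line_number, problem) tuples; empty means OK.
    Two objects jammed on one line (a missing newline) fail to parse
    as a single JSON value, so that mistake is caught too.
    """
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"invalid JSON: {e}"))
                continue
            if "file" not in entry or "metadata" not in entry:
                problems.append((lineno, "missing 'file' or 'metadata' key"))
    return problems
```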

Helper Function

Create a reusable helper for your projects:

import json

def save_metadata_jsonl(file_metadata_dict, output_dir='/valohai/outputs'):
    """
    Save metadata for multiple files in JSONL format.
    
    Args:
        file_metadata_dict: Dict mapping filenames to metadata dicts
                           e.g., {"model.pkl": {"accuracy": 0.95}}
    """
    metadata_path = f'{output_dir}/valohai.metadata.jsonl'
    with open(metadata_path, 'w') as f:
        for filename, metadata in file_metadata_dict.items():
            json.dump({"file": filename, "metadata": metadata}, f)
            f.write('\n')

# Usage
file_metadata = {
    "model.pkl": {"accuracy": 0.95, "valohai.alias": "model-prod"},
    "data.csv": {"rows": 10000, "valohai.tags": ["validated"]},
    "results.json": {"experiments": 42}
}

save_metadata_jsonl(file_metadata)

With valohai-utils

The valohai-utils package provides built-in helpers:

import valohai

with valohai.output_properties() as properties:
    for i in range(100):
        filename = f"image_{i:03d}.jpg"
        
        # Save output file
        image.save(valohai.outputs().path(filename))
        
        # Add metadata
        properties.add(
            file=filename,
            properties={
                "quality_score": scores[i],
                "valohai.tags": ["processed"]
            }
        )

Method 3: API (After Execution)

Add or update metadata after execution completes using the Valohai API. Useful for validation workflows, quality gates, or manual approval steps.

Three API Endpoints

| Endpoint | Use When | What It Does |
| --- | --- | --- |
| /api/v0/data/{id}/metadata/ | One file, one metadata set | Apply properties to a single datum |
| /api/v0/data/metadata/apply/ | Multiple files, different metadata | Apply different properties to each datum |
| /api/v0/data/metadata/apply-all/ | Multiple files, same metadata | Apply the same properties to all datums |

Quick Example

import os
import requests

properties = {
    "validation_score": 0.98,
    "approved_by": "data-team",
    "valohai.tags": ["validated"]
}

datum_id = "01234567-89ab-cdef-0123-456789abcdef"

response = requests.post(
    f'https://app.valohai.com/api/v0/data/{datum_id}/metadata/',
    json=properties,
    headers={
        'Authorization': 'Token ' + os.getenv('VH_TOKEN'),
        'Content-Type': 'application/json'
    }
)

Reserved Metadata Keys

Four keys have special meaning in Valohai:

| Key | Type | Purpose |
| --- | --- | --- |
| valohai.tags | List of strings | Creates tags |
| valohai.alias | String | Creates or updates an alias |
| valohai.dataset-versions | List of dataset version URLs | Includes this datum in those dataset versions |
| valohai.model-versions | List of model version URLs | Includes this datum in those model versions |
All other keys are your custom properties.

Example Combining All Three

metadata = {
    # Reserved Valohai keys
    "valohai.tags": ["validated", "production", "resnet50"],
    "valohai.alias": "model-prod",
    "valohai.dataset-versions": ["dataset://big-data/processed"],
    
    # Your custom properties
    "accuracy": 0.95,
    "precision": 0.93,
    "recall": 0.97,
    "epochs": 100,
    "learning_rate": 0.001,
    "dataset_version": "v2.3",
    "training_duration_minutes": 145,
    "experiment_id": "exp-042"
}
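Since a wrong type on a reserved key (e.g. a bare string instead of a list of tags) is easy to miss, it can be worth type-checking the metadata before writing it. A minimal sketch (check_reserved_keys is a local convention for catching typos early, not a Valohai API):

```python
def check_reserved_keys(metadata):
    """Sanity-check the types Valohai expects for its reserved keys.

    Raises ValueError on a mismatch; custom properties are ignored.
    """
    tags = metadata.get("valohai.tags")
    if tags is not None and not (
        isinstance(tags, list) and all(isinstance(t, str) for t in tags)
    ):
        raise ValueError("valohai.tags must be a list of strings")

    alias = metadata.get("valohai.alias")
    if alias is not None and not isinstance(alias, str):
        raise ValueError("valohai.alias must be a string")

    for key in ("valohai.dataset-versions", "valohai.model-versions"):
        urls = metadata.get(key)
        if urls is not None and not (
            isinstance(urls, list) and all(isinstance(u, str) for u in urls)
        ):
            raise ValueError(f"{key} must be a list of URL strings")
```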

Common Issues & Fixes

Metadata Not Appearing

For sidecar files:

  • Wrong filename → Must be output.ext.metadata.json (exact match plus suffix)

  • Not saved to /valohai/outputs/ → Save in same directory as output

  • Invalid JSON → Validate syntax (commas, quotes, brackets)

For JSONL file:

  • Wrong filename → Must be exactly valohai.metadata.jsonl

  • Missing newlines → Add f.write('\n') after each json.dump()

  • Wrong structure → Each line must have {"file": "...", "metadata": {...}}

"Do I Need .metadata.json for Every File?"

No! This is the most common confusion. Here's the comparison:

Sidecar approach (1-2 files):

# Good for small number of outputs
model.save('/valohai/outputs/model.pkl')
with open('/valohai/outputs/model.pkl.metadata.json', 'w') as f:
    json.dump(metadata, f)

JSONL approach (3+ files):

# Much better for many outputs
for i in range(100):
    image.save(f'/valohai/outputs/image_{i}.jpg')

# One metadata file for all 100 images
with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    for i in range(100):
        json.dump({"file": f"image_{i}.jpg", "metadata": {...}}, f)
        f.write('\n')

Result:

  • Sidecar: 100 images = 200 files 😱

  • JSONL: 100 images = 101 files 😊


Next Steps

Now that you understand the metadata system, dive into the specific guides: Organize Files with Tags, Create File Shortcuts with Aliases, and Track Custom Metadata.

