
# Data

Valohai treats data as a first-class citizen. Every file you load or save gets versioned, tracked, and linked to the code that created it — automatically.

No more "which dataset did I use for this model?" or "where did this file come from?" Your data has a complete audit trail from raw input to final output.

***

## How Valohai Handles Data

### Everything is Versioned

Every file saved from an execution gets a unique `datum://` link that points to an immutable version. Use these links as inputs in future jobs, and Valohai guarantees you'll always get the exact same file.

**Example:**

```yaml
inputs:
  - name: training-data
    default: datum://01234567-89ab-cdef-0123-456789abcdef
```

This single link captures:

* The exact file content
* When it was created
* Which execution produced it
* What code and parameters were used

### Cloud Storage Without the Complexity

Connect your [S3](/data/configure-data-stores/amazon-s3.md), [Azure Blob Storage](/data/configure-data-stores/azure-blob-storage.md), [GCS](/data/configure-data-stores/google-bucket.md), [OVH Object Storage](/data/configure-data-stores/ovh-object-storage.md), or [Oracle](/data/configure-data-stores/oracle-bucket-storage.md) bucket once. After that, files appear as local paths in your code: no boto3, no authentication logic, just `pd.read_csv('/valohai/inputs/dataset/data.csv')`.

Valohai handles:

* Authentication and credential rotation
* Cross-region, cross-cloud transfers and caching
* Access control between projects
* Metadata and lineage tracking

You control:

* Where data lives (your cloud account)
* Bucket policies and compliance
* Cost and retention

> :warning: **Platform setup required:** Data scientists can start using data immediately, but platform teams need to [configure data stores](/data/configure-data-stores.md) first.

### From Files to Datasets

Individual files work great for single models or CSVs. But when you're managing hundreds of images or train/validation splits, datasets keep everything organized.

**Datasets group related files into versioned collections:**

```yaml
# Instead of manually adding 1000 individual files
inputs:
  - name: training-images
    default: dataset://imagenet/train-v3  # One reference, 1000 files
```

Update all files together, track changes between versions, and use aliases like `production` or `staging` to promote datasets through your workflow.

***

## Working with Data in Valohai

### Save Files from Your Code

Write files to `/valohai/outputs/` and Valohai uploads them automatically:

```python
# Save any file to outputs (assumes `df` and `model` were created earlier in your script)
import pandas as pd

df.to_csv("/valohai/outputs/processed_data.csv")
model.save("/valohai/outputs/model.pkl")
```

That's it. No upload logic, no authentication. Files appear in your project's Data tab with full lineage.

### Add Context with Metadata

Attach tags, aliases, and custom properties to files for search, filtering, and audit trails:

```python
import json

metadata = {
    "valohai.tags": ["validated", "production"],
    "valohai.alias": "latest-model",
    "accuracy": 0.95,
    "training_date": "2024-01-15",
}

# Save metadata alongside your output
with open("/valohai/outputs/model.pkl.metadata.json", "w") as f:
    json.dump(metadata, f)
```

Now search by tag, reference by alias, or query custom properties through the API.

### Load Files in Your Code

Reference data in your YAML, and Valohai downloads it before your code runs:

```yaml
- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: dataset
        default: s3://mybucket/mydata/project-a/data.csv
```

Access it like a local file:

```python
import pandas as pd

df = pd.read_csv("/valohai/inputs/dataset/data.csv")
```

Inputs can be S3, Azure, GCS, OVH, or Oracle object storage URLs, public URLs, `datum://` links from previous executions, and Valohai datasets and models (`dataset://` and `model://` links).
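
When an input resolves to several files, it can help to enumerate what was actually downloaded. The following is a minimal sketch, assuming the `/valohai/inputs/<input-name>/` layout shown above; `list_input_files` is a hypothetical helper, not part of any Valohai library:

```python
from pathlib import Path


def list_input_files(input_name: str, inputs_root: str = "/valohai/inputs") -> list[Path]:
    """Return every file under the directory for one named input, sorted by path."""
    input_dir = Path(inputs_root) / input_name
    if not input_dir.is_dir():
        # Outside a Valohai execution the directory will usually not exist.
        return []
    # rglob walks subdirectories too, in case the input keeps a folder structure.
    return sorted(p for p in input_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    for path in list_input_files("dataset"):
        print(path)
```

The `inputs_root` parameter exists only so the helper can be pointed at a local folder when testing outside the platform.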

### Browse data

#### Directory tree

Valohai automatically recognizes the directory structure in `/valohai/outputs` and lets you browse the produced data in supported directories.

> :bulb: To achieve the structure shown in the screenshot below, execution outputs would have to be saved this way:
>
> ```
> /valohai/outputs/logs/errors/log_part.1.bin
> /valohai/outputs/logs/errors/log_part.2.bin
> ...
> /valohai/outputs/logs/errors/log_part.12.bin
> ```

A collapsible directory tree lets you select the directory to inspect.

<figure><img src="/files/PR2JJFiaUgnlSZ9D0nub" alt=""><figcaption></figcaption></figure>

In the example above, per-directory filtering is applied to execution outputs. The same filter is also available for a [dataset version](/data/datasets.md) and when browsing data within a project.
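
The nested layout from the tip above comes from ordinary file writes into subdirectories of `/valohai/outputs`. Here is a minimal sketch, assuming your code is responsible for creating the intermediate directories; `write_log_parts` and its placeholder contents are hypothetical:

```python
from pathlib import Path


def write_log_parts(outputs_root: str, count: int) -> list[Path]:
    """Write log_part.1.bin .. log_part.<count>.bin under logs/errors/."""
    target_dir = Path(outputs_root) / "logs" / "errors"
    # Create intermediate directories in case they do not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(1, count + 1):
        part = target_dir / f"log_part.{i}.bin"
        part.write_bytes(b"placeholder")  # Stand-in for real log data.
        written.append(part)
    return written
```

Inside an execution, calling `write_log_parts("/valohai/outputs", 12)` would produce the twelve paths listed in the tip.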

#### Searching

Using the search bar, filter datums where the search term matches:

* Datum name
* One of the tags assigned to the datum
* Title of the execution that produced the datum
* One of the tags assigned to the execution that produced the datum
* Aliases referencing the datum
* Datum URI or ID

By default, a title matches when it starts with the search term. Use an asterisk (\*) or an underscore (\_) to turn the term into a wildcard expression:

* Asterisk (\*) - matches any sequence of characters (including none)
* Underscore (\_) - matches exactly one character
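
The wildcard rules above can be illustrated by translating a search term into a regular expression. This is only a sketch of the described semantics, not Valohai's implementation. `matches` is a hypothetical helper, and it assumes that a term containing wildcards must match the whole name and that matching is case-sensitive:

```python
import re


def matches(term: str, name: str) -> bool:
    """Prefix match by default; wildcard match when the term has '*' or '_'."""
    if "*" not in term and "_" not in term:
        # No wildcard: the name must start with the search term.
        return name.startswith(term)
    # Translate wildcards: '*' -> any sequence, '_' -> exactly one character.
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Assumption: a wildcard expression has to match the entire name.
    return re.fullmatch("".join(parts), name, flags=re.DOTALL) is not None


print(matches("model", "model.pkl"))                  # True: prefix match
print(matches("pkl", "model.pkl"))                    # False: not a prefix
print(matches("*.pkl", "model.pkl"))                  # True: '*' covers "model"
print(matches("log_part._.bin", "log_part.1.bin"))    # True: '_' covers "1"
print(matches("log_part._.bin", "log_part.12.bin"))   # False: "12" is two characters
```

One consequence of these rules is that a literal underscore in a search term also behaves as a single-character wildcard.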

Using the **datum alias.**

<figure><img src="/files/BZZ707j5A6PX5dpa0X68" alt=""><figcaption></figcaption></figure>

Using the **datum URI.**

<figure><img src="/files/WsefEwLKhMlsyJYFQViM" alt=""><figcaption></figcaption></figure>

Using the **datum name wildcard.**

<figure><img src="/files/NUAjM8rOY8bZL9sdwIj0" alt=""><figcaption></figcaption></figure>

#### Comparing dataset versions

While previewing a dataset, select two versions and click the **Compare** button to see the differences between them.

<figure><img src="/files/dd6sS88WipcTgYXATZd3" alt=""><figcaption></figcaption></figure>

By choosing the comparison mode, you control which datums are shown.

<figure><img src="/files/T2da7HK4aPJn3ErvBDiJ" alt=""><figcaption></figcaption></figure>

In the mode selected in the screenshot above, all datums from both versions are shown.\
Datums removed in the later version are highlighted in red, and datums added in the later version are highlighted in green (there are none in this example).
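
Conceptually, the comparison is a set difference over the datums in each version. The sketch below illustrates that idea with hypothetical file names; it is not how Valohai computes the view:

```python
def compare_versions(earlier: set[str], later: set[str]) -> dict[str, list[str]]:
    """Classify datums as removed, added, or unchanged between two versions."""
    return {
        "removed": sorted(earlier - later),    # highlighted in red
        "added": sorted(later - earlier),      # highlighted in green
        "unchanged": sorted(earlier & later),  # present in both versions
    }


v1 = {"cat_001.jpg", "cat_002.jpg", "dog_001.jpg"}
v2 = {"cat_001.jpg", "dog_001.jpg"}
print(compare_versions(v1, v2))
# {'removed': ['cat_002.jpg'], 'added': [], 'unchanged': ['cat_001.jpg', 'dog_001.jpg']}
```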

***

## Data Storage Options

### Cloud Object Storage (Recommended)

Configure [S3](/data/configure-data-stores/amazon-s3.md), [Azure Blob](/data/configure-data-stores/azure-blob-storage.md), [GCS](/data/configure-data-stores/google-bucket.md), [OVH](/data/configure-data-stores/ovh-object-storage.md), or [Oracle Bucket Storage](/data/configure-data-stores/oracle-bucket-storage.md) as your primary data store. Best for:

* Versioned inputs and outputs
* Reproducible pipelines
* Audit trails and compliance
* Multi-region access

[Set up your data store →](/data/configure-data-stores.md)

### Network Storage (For Shared Data)

Mount AWS EFS, Google Filestore, or on-premises NFS when you need:

* Shared scratch space across executions
* Access to existing on-prem datasets
* Shared cache layer for large files or a large number of files (tens or hundreds of thousands)

**Trade-off:** Network mounts are fast but can contain data that Valohai does not track. Using such data can break reproducibility.

[Mount network storage →](/data/data-nfs.md)

### Databases (For Structured Queries)

Query BigQuery, Redshift, or Snowflake directly from executions. Best for:

* Pulling training data from data warehouses
* Running feature engineering on SQL tables
* Joining external datasets

[Query databases →](/data/data-databases.md)

***

## Data Organization Patterns

### For ML Experiments

Use [datasets with aliases](/data/datasets/creating-datasets.md#dataset-aliases) for environment promotion:

```yaml
inputs:
  - name: training-data
    default: dataset://customer-churn/production
  - name: validation-data
    default: dataset://customer-churn/staging
```

Update aliases when promoting datasets—no code changes needed.

### For Production Pipelines

Use `datum://` links with [aliases](/data/data-versioning/metadata-overview/aliases.md) for immutable references:

```yaml
inputs:
  - name: model-weights
    default: datum://abc123...  # Exact version, always reproducible
```

### For Large-Scale Data

Package files into tar archives before creating datasets:

```python
import tarfile

with tarfile.open("/valohai/outputs/images.tar", "w") as tar:
    tar.add("/valohai/outputs/images/", arcname="images")
```

Downloading one 10 GB tar is much faster than downloading 100,000 individual files.
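
On the consuming side, the archive has to be unpacked before the files can be used. The following is a minimal sketch using the standard library. It assumes the tar arrives through an input named `images` (a hypothetical name) and that the destination directory is writable; `extract_archive` is an illustrative helper:

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[str]:
    """Unpack a tar archive into dest_dir and return the member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        names = tar.getnames()
        # Only extract archives you trust, such as ones your own pipeline produced.
        tar.extractall(dest_dir)
    return names
```

Under those assumptions, a training script might call `extract_archive("/valohai/inputs/images/images.tar", "/tmp/images")` before reading the files.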

> :bulb: If packaging is not an option and you need a large amount of data, check out the available [caching strategies](/data/data-nfs.md) that can speed up access to your data.

***

### When to Use What

| Need                         | Solution                        | Why                             |
| ---------------------------- | ------------------------------- | ------------------------------- |
| Single model file            | `datum://` link                 | Immutable, versioned, traceable |
| Training/validation split    | Dataset with versions           | Files versioned together        |
| Image classification folders | Dataset with `keep-directories` | Preserves folder structure      |
| Existing on-prem data        | Network mount                   | Data stays on premises          |
| Query data warehouse         | Database connector              | No data movement needed         |
| Environment promotion        | Dataset aliases                 | Update alias, not code          |

***

## Next Steps

**For Data Scientists:**

1. [Save files from your code](/data/data-versioning/save-files-from-jobs.md) to outputs
2. [Load files in your code](/data/data-versioning/load-files-in-jobs.md) using inputs
3. [Create datasets](/data/datasets/creating-datasets.md) for multi-file workflows
4. [Add metadata](/data/data-versioning/metadata-overview.md) for searchability

**For Platform Teams:**

1. [Configure cloud storage](/data/configure-data-stores.md) as your data store
2. [Connect databases](/data/data-databases.md) for SQL access
3. [Set up network mounts](/data/data-nfs.md) if needed

> 💡 **Platform setup required:** Before data scientists can save files, platform teams must [configure at least one data store](/data/configure-data-stores.md).

