# Data

Valohai treats data as a first-class citizen. Every file you load or save gets versioned, tracked, and linked to the code that created it — automatically.

No more "which dataset did I use for this model?" or "where did this file come from?" Your data has a complete audit trail from raw input to final output.

***

### How Valohai Handles Data

#### Everything is Versioned

Every file saved from an execution gets a unique `datum://` link that points to an immutable version. Use these links as inputs in future jobs, and Valohai guarantees you'll always get the exact same file.

**Example:**

```yaml
inputs:
  - name: training-data
    default: datum://01234567-89ab-cdef-0123-456789abcdef
```

This single link captures:

* The exact file content
* When it was created
* Which execution produced it
* What code and parameters were used

#### Cloud Storage Without the Complexity

Connect your [S3](/data/configure-data-stores/amazon-s3.md), [Azure Blob Storage](/data/configure-data-stores/azure-blob-storage.md), [GCS](/data/configure-data-stores/google-bucket.md), [OVH Object Storage](/data/configure-data-stores/ovh-object-storage.md), or [Oracle](/data/configure-data-stores/oracle-bucket-storage.md) bucket once. After that, files appear as local paths in your code: no boto3, no authentication logic, just `pd.read_csv('/valohai/inputs/dataset/data.csv')`.

Valohai handles:

* Authentication and credential rotation
* Cross-region, cross-cloud transfers and caching
* Access control between projects
* Metadata and lineage tracking

You control:

* Where data lives (your cloud account)
* Bucket policies and compliance
* Cost and retention

> :warning: **Platform setup required:** Data scientists can start using data immediately, but platform teams need to [configure data stores](/data/configure-data-stores.md) first.

#### From Files to Datasets

Individual files work great for single models or CSVs. But when you're managing hundreds of images or train/validation splits, datasets keep everything organized.

**Datasets group related files into versioned collections:**

```yaml
# Instead of manually adding 1000 individual files
inputs:
  - name: training-images
    default: dataset://imagenet/train-v3  # One reference, 1000 files
```

Update all files together, track changes between versions, and use aliases like `production` or `staging` to promote datasets through your workflow.

***

### Working with Data in Valohai

#### Save Files from Your Code

Write files to `/valohai/outputs/` and Valohai uploads them automatically:

```python
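# Save any file to outputs
import pandas as pd

# Assumes `df` (a pandas DataFrame) and `model` (an object with a save()
# method) were created earlier in your script.
df.to_csv("/valohai/outputs/processed_data.csv")
model.save("/valohai/outputs/model.pkl")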
```

That's it. No upload logic, no authentication. Files appear in your project's Data tab with full lineage.
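If you write outputs into subfolders, the parent directories have to exist before the file is written. A minimal sketch of one way to handle that follows. `output_path` is a hypothetical helper, not part of any Valohai library, and the `outputs_root` parameter exists only so the same code can be pointed at another folder when run outside Valohai.

```python
from pathlib import Path


# Hypothetical helper for illustration; not a Valohai API.
def output_path(filename: str, outputs_root: str = "/valohai/outputs") -> Path:
    """Return a path under the outputs directory, creating parent folders."""
    path = Path(outputs_root) / filename
    # Create any missing subfolders, e.g. "reports/" in "reports/summary.csv".
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `df.to_csv(output_path("reports/summary.csv"))` would write the file under `/valohai/outputs/reports/`.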

#### Add Context with Metadata

Attach tags, aliases, and custom properties to files for search, filtering, and audit trails:

```python
import json

metadata = {
    "valohai.tags": ["validated", "production"],
    "valohai.alias": "latest-model",
    "accuracy": 0.95,
    "training_date": "2024-01-15",
}

# Save metadata alongside your output
with open("/valohai/outputs/model.pkl.metadata.json", "w") as f:
    json.dump(metadata, f)
```

Now search by tag, reference by alias, or query custom properties through the API.
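When several outputs need metadata, the sidecar pattern above can be wrapped in a small function. This is a sketch under the assumption that the sidecar is always named `<output file>.metadata.json`, as in the example above. `write_metadata` is a hypothetical helper, not a Valohai API.

```python
import json
from pathlib import Path


# Hypothetical helper for illustration; not a Valohai API.
def write_metadata(output_file: str, metadata: dict) -> Path:
    """Write `<output_file>.metadata.json` next to an output file."""
    sidecar = Path(f"{output_file}.metadata.json")
    # Make sure the target folder exists before writing.
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    with sidecar.open("w") as f:
        json.dump(metadata, f)
    return sidecar
```

A call such as `write_metadata("/valohai/outputs/model.pkl", {"valohai.tags": ["validated"]})` would produce `/valohai/outputs/model.pkl.metadata.json`.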

#### Load Files in Your Code

Reference data in your YAML, and Valohai downloads it before your code runs:

```yaml
- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: dataset
        default: s3://mybucket/mydata/project-a/data.csv
```

Access it like a local file:

```python
import pandas as pd

df = pd.read_csv("/valohai/inputs/dataset/data.csv")
```
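An input can also resolve to many files, for example when it points at a dataset. The sketch below lists whatever was downloaded for a given input, assuming files land under `/valohai/inputs/<input name>/` as in the example above. `input_files` is a hypothetical helper, not a Valohai API.

```python
from pathlib import Path


# Hypothetical helper for illustration; not a Valohai API.
def input_files(input_name: str, inputs_root: str = "/valohai/inputs") -> list[Path]:
    """Return all files under the input's directory, sorted by path."""
    input_dir = Path(inputs_root) / input_name
    # rglob("*") also walks subfolders, so nested files are included.
    return sorted(p for p in input_dir.rglob("*") if p.is_file())
```

With the step above, `input_files("dataset")` would return the path to `data.csv`.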

Inputs can be S3, Azure, GCS, OVH, or Oracle storage URLs, public URLs, `datum://` links from previous executions, or Valohai datasets and models (`dataset://` and `model://` links).

***

### Data Storage Options

#### Cloud Object Storage (Recommended)

Configure [S3](/data/configure-data-stores/amazon-s3.md), [Azure Blob](/data/configure-data-stores/azure-blob-storage.md), [GCS](/data/configure-data-stores/google-bucket.md), [OVH](/data/configure-data-stores/ovh-object-storage.md), or [Oracle Bucket Storage](/data/configure-data-stores/oracle-bucket-storage.md) as your primary data store. Best for:

* Versioned inputs and outputs
* Reproducible pipelines
* Audit trails and compliance
* Multi-region access

[Set up your data store →](/data/configure-data-stores.md)

#### Network Storage (For Shared Data)

Mount AWS EFS, Google Filestore, or on-premises NFS when you need:

* Shared scratch space across executions
* Access to existing on-prem datasets
* A shared cache layer for large files or large numbers of files (tens or hundreds of thousands)

**Trade-off:** Network mounts are fast, but they can contain data that Valohai does not track. Using such data in an execution can break reproducibility.
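One way to keep a record of untracked data is to hash the files you read from the mount and save the result as a regular output. The sketch below is an illustration only. `sha256_of` and `write_manifest` are hypothetical helpers, not Valohai APIs, and the manifest does not make the mounted data versioned. It only records what the execution saw.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical helper for illustration; not a Valohai API.
def write_manifest(mount_dir: str, manifest_path: str) -> dict:
    """Map each file under mount_dir (relative path) to its SHA-256 digest."""
    root = Path(mount_dir)
    manifest = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Writing the manifest to `/valohai/outputs/` stores it with the execution. Hashing reads every file, so for very large mounts it may be more practical to hash only the files the job actually uses.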

[Mount network storage →](/data/data-nfs.md)

#### Databases (For Structured Queries)

Query BigQuery, Redshift, or Snowflake directly from executions. Best for:

* Pulling training data from data warehouses
* Running feature engineering on SQL tables
* Joining external datasets

[Query databases →](/data/data-databases.md)

***

### Data Organization Patterns

#### For ML Experiments

Use [datasets with aliases](/data/datasets/creating-datasets.md#dataset-aliases) for environment promotion:

```yaml
inputs:
  - name: training-data
    default: dataset://customer-churn/production
  - name: validation-data
    default: dataset://customer-churn/staging
```

Update aliases when promoting datasets—no code changes needed.

#### For Production Pipelines

Use `datum://` links with [aliases](/data/data-versioning/metadata-overview/aliases.md) for immutable references:

```yaml
inputs:
  - name: model-weights
    default: datum://abc123...  # Exact version, always reproducible
```

#### For Large-Scale Data

Package files into tar archives before creating datasets:

```python
import tarfile

with tarfile.open("/valohai/outputs/images.tar", "w") as tar:
    tar.add("/valohai/outputs/images/", arcname="images")
```

Downloading one 10 GB tar archive is much faster than downloading 100,000 individual files.
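The execution that consumes the archive has to unpack it before use. A minimal sketch using only the standard library follows. `extract_archive` is a hypothetical helper, and the input path in the usage note is an assumption about how you name your input.

```python
import tarfile
from pathlib import Path


# Hypothetical helper for illustration; not a Valohai API.
def extract_archive(tar_path: str, dest_dir: str) -> list[Path]:
    """Extract a tar archive into dest_dir and return the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe archive members.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For example, with an input named `training-images`, you might call `extract_archive("/valohai/inputs/training-images/images.tar", "/tmp/images")` at the start of the job.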

> :bulb: If packaging is not an option and you need a large amount of data, check out the available [caching strategies](/data/data-nfs.md) that can speed up access to your data.

***

### When to Use What

| Need                         | Solution                        | Why                             |
| ---------------------------- | ------------------------------- | ------------------------------- |
| Single model file            | `datum://` link                 | Immutable, versioned, traceable |
| Training/validation split    | Dataset with versions           | Files versioned together        |
| Image classification folders | Dataset with `keep-directories` | Preserves folder structure      |
| Existing on-prem data        | Network mount                   | Data stays on premises          |
| Query data warehouse         | Database connector              | No data movement needed         |
| Environment promotion        | Dataset aliases                 | Update alias, not code          |

***

### Next Steps

**For Data Scientists:**

1. [Save files from your code](/data/data-versioning/save-files-from-jobs.md) to outputs
2. [Load files in your code](/data/data-versioning/load-files-in-jobs.md) using inputs
3. [Create datasets](/data/datasets/creating-datasets.md) for multi-file workflows
4. [Add metadata](/data/data-versioning/metadata-overview.md) for searchability

**For Platform Teams:**

1. [Configure cloud storage](/data/configure-data-stores.md) as your data store
2. [Connect databases](/data/data-databases.md) for SQL access
3. [Set up network mounts](/data/data-nfs.md) if needed

> 💡 **Platform setup required:** Before data scientists can save files, platform teams must [configure at least one data store](/data/configure-data-stores.md).

