# Network Storage

Mount network file systems (NFS) to access large datasets and shared storage directly from executions without downloading files.

***

### When to Use Network Storage

Network storage serves different purposes depending on your deployment:

#### Cloud Deployments

**Use case:** Large dataset caching and preprocessing

Mount shared network storage (AWS EFS, GCP Filestore) to:

* **Access huge datasets** without downloading to each execution
* **Share preprocessed data** across multiple executions
* **Cache intermediate results** on fast shared storage
* **Preprocess once, use many times** — Mount raw data, process it, save versioned outputs

**Example workflow:**

1. Mount network storage containing 1M files to the execution
2. Filter the files down to the 100k you need
3. Move them to `/valohai/outputs/` so they are uploaded to the cloud store and tracked as datums
4. Create a [packaged dataset](/data/datasets/package-datasets.md) version for faster access in subsequent executions
5. Use this [dataset](/data/data-versioning/load-files-in-jobs.md) as an input in later executions
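
Steps 1 to 3 of this workflow can be sketched in Python as follows. This is a minimal sketch under assumptions: `select_and_copy` is a hypothetical helper (not part of any Valohai API), "filtering" is reduced to matching a file extension, and files are copied rather than moved so that the mount can stay read-only.

```python
import shutil
from pathlib import Path


def select_and_copy(mount_dir, output_dir, suffix=".jpg", limit=None):
    """Copy files with a matching extension from a mount into an output directory.

    Hypothetical helper for illustration. Returns the copied file names.
    """
    source = Path(mount_dir)
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    # Sort for a deterministic selection order between runs
    for path in sorted(source.iterdir()):
        if limit is not None and len(copied) >= limit:
            break
        if path.is_file() and path.suffix.lower() == suffix:
            # Copy (not move) so a read-only mount is left untouched
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Inside an execution this might be called as `select_and_copy("/mnt/shared-data", "/valohai/outputs/filtered", limit=100_000)`, where `/mnt/shared-data` stands for whatever `destination` your mount uses.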

***

#### On-Premises Deployments

**Use case:** Access existing data infrastructure

Mount on-premises network drives where:

* **Data already exists** on company NFS servers
* **Legacy systems** store data on network shares
* **Compliance requires** data stays on-premises
* **Moving data to cloud is impractical** due to size or regulations

**Example workflow:**

1. Medical imaging data is stored on a hospital NFS server
2. Mount the volume to your execution
3. Process the data while meeting compliance requirements
4. Save results to outputs to start tracking them as datums
5. Everything is versioned and tracked for audit

***

### Critical Trade-Off: Speed vs. Versioning

> ⚠️ **Important:** Valohai does NOT version or track files on mounted network storage.

**What this means:**

* Files read from mounts: **Not versioned**
* Files written to mounts: **Not versioned**
* Files saved to `/valohai/outputs/`: **Versioned ✅**

#### Decision Tree: Should I Use NFS Mounts?

```
Do you need to access very large datasets (>100GB)?
├─ No → Use Valohai inputs (datum:// or dataset://)
│        ✅ Versioned, reproducible, cached
│
└─ Yes → Does the data change frequently?
    ├─ Yes → Use Valohai inputs
    │         ✅ Version every snapshot
    │
    └─ No → Is the data already on network storage?
        ├─ Yes (on-prem) → Use NFS mount
        │                   ⚠️ But save processed outputs
        │
        └─ No (cloud) → Consider these options:
            ├─ Download once, cache, use many times?
            │  → Use Valohai inputs (cached automatically)
            │
            └─ Need shared preprocessing workspace?
               → Use NFS mount as scratch space
                  ⚠️ But save final results to outputs
```
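
The decision tree above can also be written as a small function, which may help when reviewing many datasets at once. This is only an illustrative sketch: `recommend_storage`, its parameters, and its return labels are hypothetical names, and the 100 GB threshold is copied from the tree rather than being a hard limit.

```python
def recommend_storage(size_gb, changes_frequently, already_on_network_storage,
                      needs_shared_workspace=False):
    """Return a storage recommendation that follows the decision tree.

    Hypothetical helper for illustration; the labels are not Valohai terms.
    """
    if size_gb <= 100:
        # Small enough to download: versioned, reproducible, cached
        return "valohai-inputs"
    if changes_frequently:
        # Version every snapshot instead of reading a moving target
        return "valohai-inputs"
    if already_on_network_storage:
        # Typical on-prem case: mount it, but save processed outputs
        return "nfs-mount"
    if needs_shared_workspace:
        # Cloud case: mount as scratch space, save final results to outputs
        return "nfs-mount-as-scratch"
    # Cloud case: download once and rely on the input cache
    return "valohai-inputs"
```

For example, `recommend_storage(500, False, True)` returns `"nfs-mount"`, matching the on-prem branch of the tree.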

***

### NFS vs. Valohai Inputs

| Feature                      | NFS Mount                          | Valohai Inputs         |
| ---------------------------- | ---------------------------------- | ---------------------- |
| **Versioning**               | ❌ No tracking                      | ✅ Full versioning      |
| **Reproducibility**          | ❌ Data can change                  | ✅ Immutable references |
| **Speed (first run)**        | ✅ Fast (no download)               | ❌ Download required    |
| **Speed (reruns)**           | ✅ Always fast                      | ✅ Fast (cached)        |
| **Setup complexity**         | ⚠️ Network config required         | ✅ Simple               |
| **Best for**                 | Huge stable datasets, on-prem data | All other cases        |
| **Lineage tracking**         | ❌ No                               | ✅ Yes                  |
| **Shared across executions** | ✅ Yes                              | ✅ Yes (via cache)      |

***

### Recommended Pattern: Mount → Process → Save

**Always save processed results to `/valohai/outputs/` for versioning:**

```python
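import os
import pandas as pd


def preprocess_image(filepath):
    # Placeholder for illustration: replace with your own preprocessing logic
    return {"filename": os.path.basename(filepath), "size_bytes": os.path.getsize(filepath)}

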
# 1. Read from mounted network storage (NOT versioned)
raw_data_path = "/mnt/shared-data/raw_images/"
files = os.listdir(raw_data_path)

print(f"Found {len(files)} files on network mount")

# 2. Process the data
processed_data = []
for filename in files:
    filepath = os.path.join(raw_data_path, filename)
    # Your preprocessing logic
    data = preprocess_image(filepath)
    processed_data.append(data)

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
output_path = "/valohai/outputs/preprocessed_dataset.csv"
df = pd.DataFrame(processed_data)
df.to_csv(output_path, index=False)

print(f"Saved versioned dataset to {output_path}")
```

**Why this matters:**

❌ Mount → Process → Write back to mount\
☝️ Nothing is versioned or tracked, so results are not reproducible

✅ Mount → Process → Save to `/valohai/outputs/`\
☝️ Processed data is tracked and versioned

***

### Mount Configuration

Network mounts are defined in `valohai.yaml`:

```yaml
- step:
    name: process-large-dataset
    image: python:3.9
    command:
      - python preprocess.py
    mounts:
      - destination: /mnt/raw-data      # Path inside container
        source: <nfs-source>            # Cloud-specific format
        type: nfs
        readonly: true                  # Recommended for input data
```

**Parameters:**

* `destination` — Mount point inside execution container
* `source` — NFS server address (format varies by cloud)
* `type` — Always `nfs` for network file systems
* `readonly` — `true` (safe) or `false` (allows writes)

#### Environment Variable Interpolation

Mount configurations support environment variable interpolation, enabling dynamic configuration with secrets and runtime variables. Variables are resolved from the execution's environment variable configuration (both step-level and organization-level variables).

```yaml
mounts:
  - destination: /mnt/data
    source: ${SERVER}:/exports/${PROJECT_NAME}
    type: nfs
```

**Syntax:**

* `$VARIABLE` — Simple substitution (silent if variable doesn't exist)
* `${VARIABLE}` — Braced substitution (raises error if variable doesn't exist)
* `${VARIABLE:-default}` — Substitution with default value
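
To make the three forms concrete, here is a small Python emulation of the rules listed above. It is not Valohai's implementation, only a sketch for reasoning about the syntax. Two points are assumptions not stated above: a missing `$VARIABLE` is treated as an empty string, and the default in `${VARIABLE:-default}` applies only when the variable is unset. The `interpolate` function and the example values are hypothetical.

```python
import re

# Matches ${NAME}, ${NAME:-default}, or $NAME
_PATTERN = re.compile(
    r"\$\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
    r"|\$(?P<simple>[A-Za-z_][A-Za-z0-9_]*)"
)


def interpolate(template, env):
    """Expand $VAR, ${VAR}, and ${VAR:-default} using the mapping `env`."""

    def replace(match):
        simple = match.group("simple")
        if simple is not None:
            # Assumption: a missing $VAR silently becomes an empty string
            return env.get(simple, "")
        name = match.group("braced")
        default = match.group("default")
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Braced form without a default: missing variables are an error
        raise KeyError(f"Environment variable {name!r} is not set")

    return _PATTERN.sub(replace, template)
```

With `env = {"SERVER": "nfs.example.internal", "PROJECT_NAME": "demo"}`, calling `interpolate("${SERVER}:/exports/${PROJECT_NAME}", env)` gives `nfs.example.internal:/exports/demo`, while `interpolate("${MISSING}", env)` raises `KeyError`.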

**Common use cases:**

* **Credentials** — Store mount passwords as secrets (CIFS/SMB)
* **Dynamic paths** — Different mount sources per project/environment
* **Multi-cloud** — Environment-specific NFS endpoints

See the on-premises guide for a detailed example:

* [On-Premises NFS](/data/data-nfs/onprem-nfs.md#use-environment-variables-in-mount-configuration): CIFS credentials

***

### Readonly vs. Writeable Mounts

#### Readonly Mounts (Recommended)

```yaml
mounts:
  - destination: /mnt/input-data
    source: <nfs-source>
    readonly: true  # ✅ Safe
```

**Use when:**

* Accessing shared reference data
* Reading large datasets for processing
* Multiple executions need same data
* Want to prevent accidental modifications

**Benefits:**

* ✅ Prevents accidental data corruption
* ✅ Safe for parallel executions
* ✅ Clear intent (read-only access)

***

#### Writeable Mounts (Use Carefully)

```yaml
mounts:
  - destination: /mnt/scratch
    source: <nfs-source>
    readonly: false  # ⚠️ Use with caution
```

**Use when:**

* Need shared scratch space for intermediate results
* Writing temporary files shared across parallel workers
* Caching expensive computations

**Risks:**

* ⚠️ Files written here are NOT versioned
* ⚠️ Parallel executions can conflict
* ⚠️ No automatic cleanup

**Best practice:** Use writeable mounts for temporary data only. Always save final results to `/valohai/outputs/`.
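
One common way to reduce conflicts between parallel executions on a writeable mount is to give each run its own scratch directory and to publish files with a write-then-rename step. The sketch below illustrates that general technique; `unique_scratch_dir` and `write_atomically` are hypothetical helpers, and while `os.replace` is atomic on local POSIX file systems, the exact behaviour on a given NFS server is an assumption you should verify.

```python
import os
import tempfile
import uuid


def unique_scratch_dir(base_dir):
    """Create and return a per-run subdirectory under a shared scratch mount."""
    path = os.path.join(base_dir, f"run-{uuid.uuid4().hex}")
    os.makedirs(path)
    return path


def write_atomically(directory, filename, data):
    """Write bytes to a temporary file, then rename it into place.

    Readers see either the old file or the complete new one,
    never a partially written file.
    """
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, filename)
    # Create the temp file in the same directory so the rename stays on one file system
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Because nothing on the mount is cleaned up automatically, per-run directories created this way still need to be removed by your own code or by a separate housekeeping job.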

***

### Complete Workflow Example

#### Mount → Preprocess → Save Pattern

**valohai.yaml:**

```yaml
- step:
    name: preprocess-images
    image: python:3.9
    command:
      - pip install pillow pandas
      - python preprocess_images.py
    mounts:
      - destination: /mnt/raw-images
        source: <cloud-specific-nfs-source>
        type: nfs
        readonly: true
```

**preprocess\_images.py:**

```python
import os
from PIL import Image
import pandas as pd
import json

# 1. Access raw data from network mount (NOT versioned)
mount_path = "/mnt/raw-images/"
image_files = [f for f in os.listdir(mount_path) if f.endswith(".jpg")]

print(f"Found {len(image_files)} images on network mount")

# 2. Process images
processed_data = []
output_dir = "/valohai/outputs/processed_images/"
os.makedirs(output_dir, exist_ok=True)

for filename in image_files:
    # Read from mount
    input_path = os.path.join(mount_path, filename)
    img = Image.open(input_path)

    # Process (resize, augment, etc.)
    img_resized = img.resize((224, 224))

    # Save processed image to Valohai outputs (VERSIONED ✅)
    output_path = os.path.join(output_dir, filename)
    img_resized.save(output_path)

    # Track metadata
    processed_data.append(
        {
            "filename": filename,
            "original_size": img.size,
            "processed_size": img_resized.size,
        },
    )

# 3. Save metadata
df = pd.DataFrame(processed_data)
df.to_csv("/valohai/outputs/processing_metadata.csv", index=False)

# 4. Create dataset version from processed images
metadata = {
    f"processed_images/{filename}": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://preprocessed-images/v1",
            },
        ],
    }
    for filename in image_files
}

# Add metadata file itself
metadata["processing_metadata.csv"] = {
    "valohai.dataset-versions": [{"uri": "dataset://preprocessed-images/v1"}],
}

# Save metadata for dataset creation
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as f:
    for fname, fmeta in metadata.items():
        json.dump({"file": fname, "metadata": fmeta}, f)
        f.write("\n")

print(f"Processed {len(image_files)} images")
print("Created versioned dataset: dataset://preprocessed-images/v1")
```

**Result:**

* ✅ Raw images accessed from fast network mount (no download time)
* ✅ Processed images saved to `/valohai/outputs/` (versioned)
* ✅ Dataset created for reproducible training
* ✅ Can train on `dataset://preprocessed-images/v1` anytime

***

### Cloud-Specific Setup

Each cloud provider has specific requirements for network storage:

#### [AWS Elastic File System (EFS)](/data/data-nfs/aws-efs.md)

* Managed NFS service in AWS
* Setup: VPC configuration, security groups
* Format: `fs-1234abcd.efs.region.amazonaws.com:/`
* Performance modes: General Purpose, Max I/O

#### [Google Cloud Filestore](/data/data-nfs/google-filestore.md)

* Managed NFS service in GCP
* Setup: VPC configuration, IP-based access
* Format: `10.0.0.2:/share-name`
* Tiers: Basic, High Scale SSD

#### [On-Premises NFS](/data/data-nfs/onprem-nfs.md)

* Access existing network file shares
* Setup: Network connectivity, VPN/direct connect
* Format: `/mnt/network-share` or `nfs-server.internal:/share`

***

### Best Practices

#### Always Version Final Results

```python
# ❌ Bad: Only use mount, nothing versioned
data = load_from_mount("/mnt/data/")
model = train(data)
model.save("/mnt/models/model.pkl")  # NOT versioned

# ✅ Good: Mount for input, outputs for results
data = load_from_mount("/mnt/data/")
model = train(data)
model.save("/valohai/outputs/model.pkl")  # VERSIONED
```

***

#### Use Readonly When Possible

```yaml
# ✅ Good: Readonly for safety
mounts:
  - destination: /mnt/input-data
    readonly: true
```

```yaml
# ⚠️ Use carefully: Writeable only when needed
mounts:
  - destination: /mnt/scratch-space
    readonly: false
```

***

#### Document Mount Requirements

```yaml
# ✅ Good: Clear documentation
- step:
    name: process-data
    # Requires: EFS fs-abc123 mounted with medical imaging data
    # Data format: DICOM files in /scans/ subdirectory
    # Access: Readonly recommended
    mounts:
      - destination: /mnt/medical-data
        source: fs-abc123.efs.us-east-1.amazonaws.com:/
        readonly: true
```

***

#### Test with Small Subsets First

```python
# Test with subset before full run
import os

data_path = "/mnt/large-dataset/"
files = os.listdir(data_path)

# Use environment variable to control subset size
test_mode = os.getenv("TEST_MODE", "false") == "true"
files_to_process = files[:100] if test_mode else files

print(f"Processing {len(files_to_process)} files...")
```

***

#### Handle Mount Failures Gracefully

```python
import os
import sys

mount_path = "/mnt/network-storage/"

# Verify mount is accessible
if not os.path.exists(mount_path):
    print(f"ERROR: Mount path {mount_path} not accessible")
    print("Check network connectivity and mount configuration")
    sys.exit(1)

if not os.path.ismount(mount_path):
    print(f"WARNING: {mount_path} exists but may not be mounted")

# Continue with processing...
```

***

### Related Pages

* [AWS Elastic File System](/data/data-nfs/aws-efs.md) — Mount AWS EFS in executions
* [Google Cloud Filestore](/data/data-nfs/google-filestore.md) — Mount GCP Filestore in executions
* [On-Premises NFS](/data/data-nfs/onprem-nfs.md) — Mount on-prem network storage
* [Load Data in Jobs](/data/data-versioning/load-files-in-jobs.md) — Alternative: Use Valohai's versioned inputs

***

### Next Steps

* Evaluate whether NFS or [Valohai inputs](/data/data-versioning/load-files-in-jobs.md) fit your use case better
* Set up cloud network storage (EFS or Filestore) if needed
* Create a test execution that mounts network storage
* Build a [pipeline](/pipelines.md): mount → process → [save to outputs](/data/data-versioning/save-files-from-jobs.md)
* Version processed [datasets](/data/datasets/update-datasets.md) for reproducible training


