# On-Premises NFS

Mount on-premises network file systems to access existing data infrastructure directly from Valohai executions.

***

### When to Use On-Premises NFS

On-premises NFS mounting serves a different purpose than cloud network storage:

#### Data Already Exists on Network Shares

**Use when:**

* Large datasets already on corporate NFS servers
* Legacy systems produce data on network shares
* Multiple departments share data on existing file servers
* Migrating terabytes of data is impractical

**Example workflow:**

1. Medical imaging data lives on hospital NFS
2. Mount the volume to your execution
3. Process the data while meeting compliance requirements
4. Save results to outputs so they are tracked as datums
5. Everything is versioned and tracked for audit

***

#### Data Compliance Requirements

**Use when:**

* Healthcare data must stay in the hospital network (HIPAA)
* Financial data has regulatory restrictions (PCI DSS, GDPR)
* Government data cannot leave a controlled environment
* Corporate policy prohibits cloud data storage

***

#### Hybrid Cloud Strategy

**Use when:**

* Transitioning gradually to cloud
* Need access to both on-prem and cloud data
* Want to keep sensitive data on-prem while using cloud compute
* Cost optimization (avoid cloud storage costs for large static datasets)

***

### Critical Trade-Off: Speed vs. Versioning

> ⚠️ **Important:** Valohai does NOT version or track files on mounted network storage.

**What this means:**

* Files read from mounts: **Not versioned**
* Files written to mounts: **Not versioned**
* Files saved to `/valohai/outputs/`: **Versioned ✅**

#### Decision Tree: Should I Use NFS Mounts?

```
Is your data already on on-premises network storage?
├─ Yes → Does it need to stay on-prem (compliance)?
│   ├─ Yes → Use on-prem NFS mount
│   │         ⚠️ But save processed outputs to /valohai/outputs/
│   │
│   └─ No → Can you move it to cloud storage?
│       ├─ Yes → Use Valohai inputs (versioned, cached)
│       └─ No (too large) → Use on-prem NFS mount
│
└─ No → Use Valohai inputs (datum:// or dataset://)
          ✅ Versioned, reproducible, cached
```
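
The same decision tree can be written as a small function, which is handy for encoding the rule in project tooling or review checklists. This is a minimal sketch: the function name, parameters, and return labels are hypothetical and are not part of any Valohai API.

```python
def choose_data_access(
    on_prem: bool,
    must_stay_on_prem: bool = False,
    can_move_to_cloud: bool = True,
) -> str:
    """Return the access method the decision tree above recommends."""
    if not on_prem:
        # Data is not on on-prem network storage: use versioned inputs.
        return "valohai-inputs"
    if must_stay_on_prem:
        # Compliance keeps the data on-prem: mount it, save outputs separately.
        return "nfs-mount"
    if can_move_to_cloud:
        return "valohai-inputs"
    # Too large to move: fall back to the mount.
    return "nfs-mount"


print(choose_data_access(on_prem=True, must_stay_on_prem=True))
```

Whichever branch returns `nfs-mount`, processed results should still go to `/valohai/outputs/`.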

***

### On-Prem NFS vs. Valohai Inputs

| Feature              | On-Prem NFS Mount                 | Valohai Inputs             |
| -------------------- | --------------------------------- | -------------------------- |
| **Versioning**       | ❌ No tracking                     | ✅ Full versioning          |
| **Reproducibility**  | ❌ Data can change                 | ✅ Immutable references     |
| **Data location**    | ✅ Stays on-premises               | ❌ Must be in cloud storage |
| **Setup complexity** | ⚠️ Network + VPN config           | ✅ Simple                   |
| **Speed**            | ⚠️ Depends on network             | ✅ Fast (cloud-native)      |
| **Best for**         | Existing on-prem data, compliance | All other cases            |
| **Compliance**       | ✅ Data never leaves premises      | ❌ Data moves to cloud      |

***

### Recommended Pattern: Mount → Process → Save

**Always save processed results to `/valohai/outputs/` for versioning:**

```python
import os
import pandas as pd


def preprocess_data(filepath):
    """Placeholder: swap in your own preprocessing logic."""
    return {"file": os.path.basename(filepath), "size_bytes": os.path.getsize(filepath)}


# 1. Read from on-prem network mount (NOT versioned)
onprem_path = "/mnt/company-data/raw_datasets/"
files = os.listdir(onprem_path)

print(f"Found {len(files)} files on on-prem storage")

# 2. Process the data
processed_data = []
for filename in files:
    filepath = os.path.join(onprem_path, filename)
    data = preprocess_data(filepath)
    processed_data.append(data)

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
output_path = "/valohai/outputs/preprocessed_dataset.csv"
df = pd.DataFrame(processed_data)
df.to_csv(output_path, index=False)

print(f"Saved versioned dataset to {output_path}")
```

**Why this matters:**

```
❌ Bad: Mount → Process → Write back to mount
   (Nothing versioned, can't reproduce, no audit trail)

✅ Good: Mount → Process → Save to /valohai/outputs/
   (Processed data versioned, reproducible, compliant)
```

***

### Prerequisites

Before mounting on-premises NFS in Valohai:

1. **Network connectivity** — Valohai execution environments must reach your on-prem NFS server
2. **VPN or Direct Connect** — Secure connection between cloud and on-premises network
3. **NFS server accessible** — NFS service running and accessible from Valohai worker IPs
4. **Firewall rules** — Allow NFS traffic from Valohai workers
5. **Mount permissions** — NFS export configured to allow access from Valohai workers
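
A quick way to check the first few prerequisites is to test TCP reachability of the NFS server from inside a small execution. The sketch below assumes the server listens on the standard NFS port 2049 (NFSv3 setups may also need the rpcbind and mountd ports open). The helper name `can_reach` is hypothetical, and a successful connection only shows that the network path and firewall allow traffic. It does not prove that the export permits mounting.

```python
import socket


def can_reach(host: str, port: int = 2049, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_reach("nfs-server.company.internal")` from a test step, with your own server name substituted, helps distinguish network problems from NFS export problems before you add the mount to `valohai.yaml`.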

***

### Mount On-Premises NFS in Execution

#### Basic Mount Configuration

**valohai.yaml:**

```yaml
- step:
    name: process-onprem-data
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/company-data
        source: /mnt/data/ml-datasets
        readonly: true
```

**For a networked NFS server:**

```yaml
mounts:
  - destination: /mnt/medical-scans
    source: nfs-server.company.internal:/exports/medical_imaging
    type: nfs
    readonly: true
```

**Parameters:**

* `destination` — Mount point inside container (e.g., `/mnt/company-data`)
* `source` — NFS path (format: `<server>:<export-path>` or local mount path)
* `type` — `nfs` when specifying remote server
* `readonly` — `true` (recommended) or `false`
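
To make the two `source` forms concrete, here is a small sketch that splits a source string into server and export path. The `MountSource` type and `parse_mount_source` helper are hypothetical names used for illustration only. They are not how Valohai parses the value, and the sketch covers only the two NFS forms listed above.

```python
from typing import NamedTuple, Optional


class MountSource(NamedTuple):
    server: Optional[str]  # None when the source is a local path
    path: str


def parse_mount_source(source: str) -> MountSource:
    """Split '<server>:<export-path>'; absolute local paths have no server."""
    if source.startswith("//"):
        raise ValueError("CIFS-style sources are out of scope for this sketch")
    if source.startswith("/"):
        return MountSource(server=None, path=source)
    server, sep, path = source.partition(":")
    if not sep or not server or not path.startswith("/"):
        raise ValueError(
            f"expected '<server>:<export-path>' or an absolute path, got {source!r}"
        )
    return MountSource(server=server, path=path)


print(parse_mount_source("nfs-server.company.internal:/exports/medical_imaging"))
print(parse_mount_source("/mnt/data/ml-datasets"))
```

Validating the string this way before committing `valohai.yaml` catches a common mistake: forgetting the leading `/` on the export path.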

***

#### Mount Specific NFS Directory

```yaml
mounts:
  - destination: /mnt/raw-images
    source: nas-server.internal:/data/ml-datasets/images/raw
    type: nfs
    readonly: true
```

Mounts only a specific subdirectory from your NFS server.

***

#### Use Environment Variables in Mount Configuration

Mount configurations support environment variable interpolation, letting you use secrets and runtime variables in your mount paths and credentials.

**Example with CIFS mount using secrets:**

```yaml
- step:
    name: process-with-network-share
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/company-data
        source: //fileserver.company.internal/${PROJECT_NAME}/data
        type: cifs
        options:
          username: ${MOUNT_USERNAME}
          password: ${MOUNT_PASSWORD}
    environment-variables:
      - name: PROJECT_NAME
        default: ml-project
```

> **Security tip:** Always mark mount credentials as secrets in the environment variable settings. This keeps them out of the UI while still making them available to executions.

See [Environment Variables & Secrets](/user-and-organization-management/getting-started/environment-variables.md) for more details on managing credentials.
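
To preview what an interpolated `source` resolves to, you can reproduce the `${NAME}` substitution locally. This is a sketch using Python's `string.Template`, which happens to use the same placeholder syntax. It illustrates the idea only and is not Valohai's implementation, and the `expand` helper is a hypothetical name.

```python
from string import Template


def expand(value: str, env: dict) -> str:
    """Expand ${NAME} placeholders; raises KeyError if a variable is missing."""
    return Template(value).substitute(env)


env = {"PROJECT_NAME": "ml-project"}
print(expand("//fileserver.company.internal/${PROJECT_NAME}/data", env))
```

Failing loudly on a missing variable is deliberate here: a silently empty path segment would point at a different share.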

***

### Complete Workflow Example

#### Mount → Process → Save Pattern

**Scenario:** Process medical imaging from hospital NFS, extract features, save to Valohai outputs for compliance tracking.

**valohai.yaml:**

```yaml
- step:
    name: process-medical-scans
    image: python:3.9
    command:
      - pip install pydicom numpy pandas
      - python process_scans.py
    mounts:
      - destination: /mnt/medical-imaging
        source: hospital-nas.internal:/medical_scans/radiology
        type: nfs
        readonly: true  # Protect source data
    environment-variables:
      - name: PATIENT_BATCH
        default: "2024-Q1"
```

**process\_scans.py:**

```python
import os
import pydicom
import numpy as np
import pandas as pd
import json
from datetime import datetime

# Configuration
NFS_PATH = "/mnt/medical-imaging/"
OUTPUT_PATH = "/valohai/outputs/"
BATCH_ID = os.getenv("PATIENT_BATCH", "2024-Q1")

# 1. Scan on-prem NFS for DICOM files (NOT versioned)
print(f"Accessing on-premises medical imaging: {NFS_PATH}")
dicom_files = []
for root, dirs, files in os.walk(NFS_PATH):
    for file in files:
        if file.endswith(".dcm"):
            dicom_files.append(os.path.join(root, file))

print(f"Found {len(dicom_files)} DICOM files on on-prem storage")

# 2. Process medical scans
scan_metadata = []
features_list = []

for i, filepath in enumerate(dicom_files):
    try:
        # Read DICOM from on-prem NFS
        ds = pydicom.dcmread(filepath)

        # Extract metadata (de-identified)
        metadata = {
            "scan_id": f"{BATCH_ID}_{i:05d}",
            "modality": str(ds.get("Modality", "Unknown")),
            "body_part": str(ds.get("BodyPartExamined", "Unknown")),
            "pixel_spacing": str(ds.get("PixelSpacing", "Unknown")),
            "slice_thickness": str(ds.get("SliceThickness", "Unknown")),
            "acquisition_date": str(ds.get("AcquisitionDate", "Unknown")),
        }

        # Extract features (e.g., intensity statistics); skip files without pixel data
        if "PixelData" in ds:
            pixel_array = ds.pixel_array
            features = {
                "scan_id": metadata["scan_id"],
                "mean_intensity": float(np.mean(pixel_array)),
                "std_intensity": float(np.std(pixel_array)),
                "min_intensity": float(np.min(pixel_array)),
                "max_intensity": float(np.max(pixel_array)),
                "shape": pixel_array.shape,
            }
            features_list.append(features)

        scan_metadata.append(metadata)

        if (i + 1) % 100 == 0:
            print(f"Processed {i + 1}/{len(dicom_files)} scans...")

    except Exception as e:
        print(f"Error processing {filepath}: {e}")
        continue

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
print(f"\nSaving results for {len(scan_metadata)} scans...")

# Save de-identified metadata
metadata_df = pd.DataFrame(scan_metadata)
metadata_df.to_csv(os.path.join(OUTPUT_PATH, "scan_metadata.csv"), index=False)

# Save extracted features
features_df = pd.DataFrame(features_list)
features_df.to_csv(os.path.join(OUTPUT_PATH, "scan_features.csv"), index=False)

# 4. Create dataset version for audit trail
dataset_metadata = {
    "scan_metadata.csv": {
        "valohai.dataset-versions": [
            {
                "uri": f"dataset://medical-scans/{BATCH_ID}",
            },
        ],
        "valohai.tags": ["medical-imaging", "de-identified", BATCH_ID],
        "batch_id": BATCH_ID,
        "processing_date": datetime.now().isoformat(),
        "scan_count": len(scan_metadata),
        "source": "hospital-nas.internal",
    },
    "scan_features.csv": {
        "valohai.dataset-versions": [
            {
                "uri": f"dataset://medical-scans/{BATCH_ID}",
            },
        ],
    },
}

metadata_path = os.path.join(OUTPUT_PATH, "valohai.metadata.jsonl")
with open(metadata_path, "w") as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write("\n")

print("\nProcessing complete:")
print(f"  - Processed {len(scan_metadata)} scans")
print(f"  - Extracted features: {len(features_list)} scans")
print(f"  - Created dataset: dataset://medical-scans/{BATCH_ID}")
print("  - Source data remains on-premises (compliance maintained)")
```

**Result:**

* ✅ Medical scans accessed from on-prem NFS (data never leaves hospital network)
* ✅ De-identified metadata and features saved to `/valohai/outputs/` (versioned, compliant)
* ✅ Dataset created for reproducible analysis
* ✅ Audit trail maintained with source tracking

***

### Readonly vs. Writeable Mounts

#### Readonly Mounts (Recommended)

```yaml
mounts:
  - destination: /mnt/input-data
    source: nfs-server.internal:/data
    type: nfs
    readonly: true  # ✅ Safe
```

**Use when:**

* Accessing shared reference data
* Reading large datasets for processing
* Multiple executions need same data
* Want to prevent accidental modifications

**Benefits:**

* ✅ Prevents accidental data corruption
* ✅ Safe for parallel executions
* ✅ Clear intent (read-only access)

***

#### Writeable Mounts (Use Carefully)

```yaml
mounts:
  - destination: /mnt/scratch
    source: nfs-server.internal:/scratch
    type: nfs
    readonly: false  # ⚠️ Use with caution
```

**Use when:**

* Need shared scratch space for intermediate results
* Writing temporary files shared across parallel workers
* Caching expensive computations

**Risks:**

* ⚠️ Files written here are NOT versioned
* ⚠️ Parallel executions can conflict
* ⚠️ No automatic cleanup

**Best practice:** Use writeable mounts for temporary data only. Always save final results to `/valohai/outputs/`.
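
One way to reduce conflicts between parallel executions is to give each run its own subdirectory on the scratch mount and to write files through a temporary name followed by a rename. The sketch below assumes a POSIX-style file system, where renaming within one directory is atomic. NFS behavior can vary with server and client caching, so treat this as a starting point. The helper name and the use of a random `run_id` are assumptions for illustration.

```python
import os
import tempfile
import uuid


def write_scratch_file(scratch_root: str, run_id: str, name: str, data: bytes) -> str:
    """Write data to scratch_root/run_id/name via a temp file and rename."""
    run_dir = os.path.join(scratch_root, run_id)
    os.makedirs(run_dir, exist_ok=True)
    final_path = os.path.join(run_dir, name)
    # Temp file lives in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=run_dir, prefix=f".{name}.")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, final_path)  # readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


run_id = uuid.uuid4().hex          # hypothetical per-run identifier
scratch_root = tempfile.mkdtemp()  # stand-in for /mnt/scratch in this sketch
print(write_scratch_file(scratch_root, run_id, "features.bin", b"\x00\x01"))
```

Because nothing cleans the mount automatically, removing each run's directory when the run finishes is still your responsibility.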

***

### Best Practices

#### Use Readonly for Sensitive Data

```yaml
# ✅ Good: Readonly protects source data
mounts:
  - destination: /mnt/protected-data
    readonly: true
```

```yaml
# ❌ Avoid: Writeable for sensitive data
mounts:
  - destination: /mnt/protected-data
    readonly: false  # Risk of data corruption
```

***

#### Always Version Processed Results

```python
# ❌ Bad: Process on-prem data, write back to on-prem
data = process("/mnt/onprem-data/")
data.save("/mnt/onprem-output/")  # NOT versioned, not auditable

# ✅ Good: Process on-prem data, save to Valohai outputs
data = process("/mnt/onprem-data/")
data.save("/valohai/outputs/results.csv")  # VERSIONED, auditable
```

***

### Maintaining Reproducibility

> ⚠️ **Critical:** On-premises data can change. Always save processed results to `/valohai/outputs/` for versioning and audit trails.

**The problem:**

```python
# Today: Process on-prem data
data = load("/mnt/onprem-data/")
model = train(data)

# Next month: On-prem data updated
# Can't reproduce original model
# No audit trail of what data was used
```

**The solution:**

```python
import json
from datetime import datetime

# Load from on-prem (current state); load() stands in for your own loader
# and is assumed to return a pandas DataFrame
data = load("/mnt/onprem-data/")

# Save snapshot to versioned outputs
data.to_csv("/valohai/outputs/training_snapshot.csv")

# Document source
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://medical-scans/2024-Q1",
            },
        ],
        "valohai.tags": ["on-premises", "hospital-data"],
        "source": "hospital-nas.internal:/medical_scans",
        "access_date": datetime.now().isoformat(),
        "file_count": len(data),
    },
}

# Save metadata

with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for filename, file_meta in metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write("\n")

# Train on versioned snapshot in future executions
# Full audit trail maintained
```
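
When a full snapshot is too large to copy, a checksum manifest of the source files can still record exactly which data an execution read. The following is a sketch under stated assumptions: the helper names and the `source_manifest.json` filename are hypothetical, and hashing reads every file over the network, so it adds I/O cost proportional to the dataset size.

```python
import hashlib
import json
import os


def build_manifest(source_dir: str) -> list:
    """Return one {path, size, sha256} record per file under source_dir."""
    records = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                # Hash in 1 MiB chunks to keep memory use flat for large files.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            records.append({
                "path": os.path.relpath(full, source_dir),
                "size": os.path.getsize(full),
                "sha256": digest.hexdigest(),
            })
    # Sort so the manifest is stable across runs and easy to diff.
    return sorted(records, key=lambda r: r["path"])


def write_manifest(source_dir: str, output_dir: str) -> str:
    """Write the manifest as JSON into output_dir and return its path."""
    out = os.path.join(output_dir, "source_manifest.json")
    with open(out, "w") as f:
        json.dump(build_manifest(source_dir), f, indent=2)
    return out
```

Calling `write_manifest("/mnt/onprem-data/", "/valohai/outputs/")` stores the manifest as a versioned output, so two executions can later be compared by diffing their manifests.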

***

### Related Pages

* [AWS Elastic File System](/data/data-nfs/aws-efs.md) — Cloud NFS for AWS
* [Google Cloud Filestore](/data/data-nfs/google-filestore.md) — Cloud NFS for GCP
* [Load Data in Jobs](/data/data-versioning/load-files-in-jobs.md) — Alternative: Valohai's versioned inputs
* [Databases](/data/data-databases.md) — Access on-prem databases

***

### Next Steps

* Set up VPN or Direct Connect between cloud and on-premises
* Configure NFS exports and firewall rules
* Test connectivity with a small execution
* Build a [pipeline](/pipelines.md): mount → process → [save to outputs](/data/data-versioning/save-files-from-jobs.md)
* Document compliance and data handling procedures
* Monitor network performance and optimize access patterns


