On-Premises NFS
Mount on-premises network file systems to access existing data infrastructure directly from Valohai executions.
When to Use On-Premises NFS
On-premises NFS mounting serves a different purpose than cloud network storage:
Data Already Exists on Network Shares
Use when:
Large datasets already on corporate NFS servers
Legacy systems produce data on network shares
Multiple departments share data on existing file servers
Migrating terabytes of data is impractical
Example workflow:
Medical imaging is stored on hospital NFS
Mount the volume to your execution
Process the data while meeting compliance requirements
Save results to outputs to start tracking them as datums
Everything is versioned and tracked for audit
Data Compliance Requirements
Use when:
Healthcare data must stay in hospital network (HIPAA)
Financial data has regulatory restrictions (PCI DSS, GDPR)
Government data cannot leave controlled environment
Corporate policy prohibits cloud data storage
Hybrid Cloud Strategy
Use when:
Transitioning gradually to cloud
Need access to both on-prem and cloud data
Want to keep sensitive data on-prem while using cloud compute
Cost optimization (avoid cloud storage costs for large static datasets)
Critical Trade-Off: Speed vs. Versioning
⚠️ Important: Valohai does NOT version or track files on mounted network storage.
What this means:
Files read from mounts: Not versioned
Files written to mounts: Not versioned
Files saved to /valohai/outputs/: Versioned ✅
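For example, both access patterns can appear in the same execution, and only the second leaves a tracked artifact. A minimal sketch, with an illustrative mount path:

```python
import shutil

# Read from the network mount: works, but Valohai records nothing about it
raw = '/mnt/company-data/train.csv'  # NOT versioned

# Write under /valohai/outputs/: uploaded and versioned when the execution ends
shutil.copyfile(raw, '/valohai/outputs/train_snapshot.csv')  # Versioned ✅
```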
Decision Tree: Should I Use NFS Mounts?
```
Is your data already on on-premises network storage?
├─ Yes → Does it need to stay on-prem (compliance)?
│   ├─ Yes → Use on-prem NFS mount
│   │        ⚠️ But save processed outputs to /valohai/outputs/
│   │
│   └─ No → Can you move it to cloud storage?
│            ├─ Yes → Use Valohai inputs (versioned, cached)
│            └─ No (too large) → Use on-prem NFS mount
│
└─ No → Use Valohai inputs (datum:// or dataset://)
         ✅ Versioned, reproducible, cached
```

On-Prem NFS vs. Valohai Inputs
| Aspect | On-Prem NFS Mount | Valohai Inputs |
| --- | --- | --- |
| Versioning | ❌ No tracking | ✅ Full versioning |
| Reproducibility | ❌ Data can change | ✅ Immutable references |
| Data location | ✅ Stays on-premises | ❌ Must be in cloud storage |
| Setup complexity | ⚠️ Network + VPN config | ✅ Simple |
| Speed | ⚠️ Depends on network | ✅ Fast (cloud-native) |
| Best for | Existing on-prem data, compliance | All other cases |
| Compliance | ✅ Data never leaves premises | ❌ Data moves to cloud |
Recommended Pattern: Mount → Process → Save
Always save processed results to /valohai/outputs/ for versioning:
```python
import os
import pandas as pd

# 1. Read from on-prem network mount (NOT versioned)
onprem_path = '/mnt/company-data/raw_datasets/'
files = os.listdir(onprem_path)
print(f"Found {len(files)} files on on-prem storage")

# 2. Process the data
processed_data = []
for filename in files:
    filepath = os.path.join(onprem_path, filename)
    # Your preprocessing logic
    data = preprocess_data(filepath)
    processed_data.append(data)

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
output_path = '/valohai/outputs/preprocessed_dataset.csv'
df = pd.DataFrame(processed_data)
df.to_csv(output_path, index=False)
print(f"Saved versioned dataset to {output_path}")
```

Why this matters:
❌ Bad: Mount → Process → Write back to mount
(Nothing versioned, can't reproduce, no audit trail)
✅ Good: Mount → Process → Save to /valohai/outputs/
(Processed data versioned, reproducible, compliant)

Prerequisites
Before mounting on-premises NFS in Valohai:
Network connectivity — Valohai execution environments must reach your on-prem NFS server (see the reachability check below)
VPN or Direct Connect — Secure connection between cloud and on-premises network
NFS server accessible — NFS service running and accessible from Valohai worker IPs
Firewall rules — Allow NFS traffic from Valohai workers
Mount permissions — NFS export configured to allow access from Valohai workers
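To sanity-check these prerequisites before building a full pipeline, a quick reachability test can help. A minimal sketch, assuming a hypothetical server name nfs-server.company.internal; NFSv4 typically listens on TCP port 2049:

```python
import socket

NFS_SERVER = 'nfs-server.company.internal'  # hypothetical hostname
NFS_PORT = 2049                             # standard NFS port

try:
    with socket.create_connection((NFS_SERVER, NFS_PORT), timeout=5):
        print(f"Reached {NFS_SERVER}:{NFS_PORT}; NFS port is open")
except OSError as e:
    print(f"Cannot reach {NFS_SERVER}:{NFS_PORT}: {e}")
    print("Check VPN/Direct Connect, firewall rules, and NFS exports")
```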
Mount On-Premises NFS in Execution
Basic Mount Configuration
valohai.yaml:

```yaml
- step:
    name: process-onprem-data
    image: python:3.9
    command:
      - python process_data.py
    mounts:
      - destination: /mnt/company-data
        source: /mnt/data/ml-datasets
        readonly: true
```

For a networked NFS server:

```yaml
mounts:
  - destination: /mnt/medical-scans
    source: nfs-server.company.internal:/exports/medical_imaging
    type: nfs
    readonly: true
```

Parameters:
destination — Mount point inside the container (e.g., /mnt/company-data)
source — NFS path (format: <server>:<export-path>) or a local mount path
type — nfs when specifying a remote server
readonly — true (recommended) or false
Mount Specific NFS Directory
```yaml
mounts:
  - destination: /mnt/raw-images
    source: nas-server.internal:/data/ml-datasets/images/raw
    type: nfs
    readonly: true
```

Mounts only a specific subdirectory from your NFS server.
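Before processing, it can be worth verifying inside the execution that the export is actually attached and populated. A minimal sketch using the destination path from the example above; depending on how the platform attaches the mount in the container, os.path.ismount may not report it, so the directory listing is the more robust check:

```python
import os

MOUNT = '/mnt/raw-images'

# A kernel-level mount point shows up as True; bind-mounted paths may not
print(f"ismount: {os.path.ismount(MOUNT)}")

# The reliable check: can we actually list the export's contents?
entries = os.listdir(MOUNT)
if not entries:
    raise RuntimeError(f"{MOUNT} is empty; check the NFS export path")
print(f"Mount OK: {len(entries)} entries under {MOUNT}")
```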
Complete Workflow Example
Mount → Process → Save Pattern
Scenario: Process medical imaging from hospital NFS, extract features, save to Valohai outputs for compliance tracking.
valohai.yaml:

```yaml
- step:
    name: process-medical-scans
    image: python:3.9
    command:
      - pip install pydicom numpy pandas
      - python process_scans.py
    mounts:
      - destination: /mnt/medical-imaging
        source: hospital-nas.internal:/medical_scans/radiology
        type: nfs
        readonly: true  # Protect source data
    environment-variables:
      - name: PATIENT_BATCH
        default: "2024-Q1"
```

process_scans.py:
```python
import os
import json
from datetime import datetime

import numpy as np
import pandas as pd
import pydicom

# Configuration
NFS_PATH = '/mnt/medical-imaging/'
OUTPUT_PATH = '/valohai/outputs/'
BATCH_ID = os.getenv('PATIENT_BATCH', '2024-Q1')

# 1. Scan on-prem NFS for DICOM files (NOT versioned)
print(f"Accessing on-premises medical imaging: {NFS_PATH}")
dicom_files = []
for root, dirs, files in os.walk(NFS_PATH):
    for file in files:
        if file.endswith('.dcm'):
            dicom_files.append(os.path.join(root, file))
print(f"Found {len(dicom_files)} DICOM files on on-prem storage")

# 2. Process medical scans
scan_metadata = []
features_list = []
for i, filepath in enumerate(dicom_files):
    try:
        # Read DICOM from on-prem NFS
        ds = pydicom.dcmread(filepath)

        # Extract metadata (de-identified)
        metadata = {
            'scan_id': f"{BATCH_ID}_{i:05d}",
            'modality': str(ds.get('Modality', 'Unknown')),
            'body_part': str(ds.get('BodyPartExamined', 'Unknown')),
            'pixel_spacing': str(ds.get('PixelSpacing', 'Unknown')),
            'slice_thickness': str(ds.get('SliceThickness', 'Unknown')),
            'acquisition_date': str(ds.get('AcquisitionDate', 'Unknown'))
        }

        # Extract features (e.g., intensity statistics)
        if hasattr(ds, 'pixel_array'):
            pixel_array = ds.pixel_array
            features = {
                'scan_id': metadata['scan_id'],
                'mean_intensity': float(np.mean(pixel_array)),
                'std_intensity': float(np.std(pixel_array)),
                'min_intensity': float(np.min(pixel_array)),
                'max_intensity': float(np.max(pixel_array)),
                'shape': pixel_array.shape
            }
            features_list.append(features)

        scan_metadata.append(metadata)

        if (i + 1) % 100 == 0:
            print(f"Processed {i + 1}/{len(dicom_files)} scans...")
    except Exception as e:
        print(f"Error processing {filepath}: {e}")
        continue

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
print(f"\nSaving results for {len(scan_metadata)} scans...")

# Save de-identified metadata
metadata_df = pd.DataFrame(scan_metadata)
metadata_df.to_csv(os.path.join(OUTPUT_PATH, 'scan_metadata.csv'), index=False)

# Save extracted features
features_df = pd.DataFrame(features_list)
features_df.to_csv(os.path.join(OUTPUT_PATH, 'scan_features.csv'), index=False)

# 4. Create dataset version for audit trail
dataset_metadata = {
    "scan_metadata.csv": {
        "valohai.dataset-versions": [{
            "uri": f"dataset://medical-scans/{BATCH_ID}"
        }],
        "valohai.tags": ["medical-imaging", "de-identified", BATCH_ID],
        "batch_id": BATCH_ID,
        "processing_date": datetime.now().isoformat(),
        "scan_count": len(scan_metadata),
        "source": "hospital-nas.internal"
    },
    "scan_features.csv": {
        "valohai.dataset-versions": [{
            "uri": f"dataset://medical-scans/{BATCH_ID}"
        }]
    }
}

metadata_path = os.path.join(OUTPUT_PATH, 'valohai.metadata.jsonl')
with open(metadata_path, 'w') as f:
    for filename, file_meta in dataset_metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

print("\nProcessing complete:")
print(f"  - Processed {len(scan_metadata)} scans")
print(f"  - Extracted features: {len(features_list)} scans")
print(f"  - Created dataset: dataset://medical-scans/{BATCH_ID}")
print("  - Source data remains on-premises (compliance maintained)")
```

Result:
✅ Medical scans accessed from on-prem NFS (data never leaves hospital network)
✅ De-identified metadata and features saved to /valohai/outputs/ (versioned, compliant)
✅ Dataset created for reproducible analysis
✅ Audit trail maintained with source tracking
Readonly vs. Writeable Mounts
Readonly Mounts (Recommended)
```yaml
mounts:
  - destination: /mnt/input-data
    source: nfs-server.internal:/data
    readonly: true  # ✅ Safe
```

Use when:
Accessing shared reference data
Reading large datasets for processing
Multiple executions need same data
Want to prevent accidental modifications
Benefits:
✅ Prevents accidental data corruption (demonstrated in the sketch below)
✅ Safe for parallel executions
✅ Clear intent (read-only access)
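That protection is enforced by the operating system: writes to a readonly mount fail immediately with EROFS. A minimal sketch demonstrating it, using the destination path from the example above:

```python
import errno

try:
    # Any write attempt against the readonly mount should fail fast
    with open('/mnt/input-data/should_fail.txt', 'w') as f:
        f.write('this is never written')
except OSError as e:
    if e.errno == errno.EROFS:  # errno 30: read-only file system
        print("Mount is read-only; source data is protected")
    else:
        raise
```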
Writeable Mounts (Use Carefully)
```yaml
mounts:
  - destination: /mnt/scratch
    source: nfs-server.internal:/scratch
    readonly: false  # ⚠️ Use with caution
```

Use when:
Need shared scratch space for intermediate results
Writing temporary files shared across parallel workers
Caching expensive computations
Risks:
⚠️ Files written here are NOT versioned
⚠️ Parallel executions can conflict
⚠️ No automatic cleanup
Best practice: Use writeable mounts for temporary data only. Always save final results to /valohai/outputs/.
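A minimal sketch of that best practice, assuming a writeable mount at /mnt/scratch as in the example above. A unique subdirectory keeps parallel executions from trampling each other, and the final artifact still goes to /valohai/outputs/:

```python
import os
import shutil
import uuid

# Unique scratch area per execution to avoid conflicts between parallel jobs
scratch_dir = os.path.join('/mnt/scratch', f"job-{uuid.uuid4().hex}")
os.makedirs(scratch_dir)

# Intermediate artifacts: fast shared scratch, NOT versioned
intermediate = os.path.join(scratch_dir, 'partial_results.bin')
with open(intermediate, 'wb') as f:
    f.write(b'intermediate bytes')

# Final result: copied to outputs, where Valohai versions it
shutil.copyfile(intermediate, '/valohai/outputs/results.bin')

# The mount has no automatic cleanup, so remove scratch explicitly
shutil.rmtree(scratch_dir, ignore_errors=True)
```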
Best Practices
Use Readonly for Sensitive Data
```yaml
# ✅ Good: Readonly protects source data
mounts:
  - destination: /mnt/protected-data
    readonly: true

# ❌ Avoid: Writeable for sensitive data
mounts:
  - destination: /mnt/protected-data
    readonly: false  # Risk of data corruption
```

Always Version Processed Results
```python
# ❌ Bad: Process on-prem data, write back to on-prem
data = process('/mnt/onprem-data/')
data.save('/mnt/onprem-output/')  # NOT versioned, not auditable

# ✅ Good: Process on-prem data, save to Valohai outputs
data = process('/mnt/onprem-data/')
data.save('/valohai/outputs/results.csv')  # VERSIONED, auditable
```

Maintaining Reproducibility
⚠️ Critical: On-premises data can change. Always save processed results to /valohai/outputs/ for versioning and audit trails.
The problem:
```python
# Today: Process on-prem data
data = load('/mnt/onprem-data/')
model = train(data)

# Next month: On-prem data updated
# Can't reproduce original model
# No audit trail of what data was used
```

The solution:
```python
import json
from datetime import datetime

# Load from on-prem (current state)
data = load('/mnt/onprem-data/')

# Save snapshot to versioned outputs
data.to_csv('/valohai/outputs/training_snapshot.csv')

# Document source
metadata = {
    "training_snapshot.csv": {
        "valohai.dataset-versions": [{
            "uri": "dataset://medical-scans/2024-Q1"
        }],
        "valohai.tags": ["on-premises", "hospital-data"],
        "source": "hospital-nas.internal:/medical_scans",
        "access_date": datetime.now().isoformat(),
        "file_count": len(data)
    }
}

# Save metadata
with open('/valohai/outputs/valohai.metadata.jsonl', 'w') as f:
    for filename, file_meta in metadata.items():
        json.dump({"file": filename, "metadata": file_meta}, f)
        f.write('\n')

# Train on the versioned snapshot in future executions
# Full audit trail maintained
```

Related Pages
AWS Elastic File System — Cloud NFS for AWS
Google Cloud Filestore — Cloud NFS for GCP
Load Data in Jobs — Alternative: Valohai's versioned inputs
Databases — Access on-prem databases
Next Steps
Set up VPN or Direct Connect between cloud and on-premises
Configure NFS exports and firewall rules
Test connectivity with a small execution (see the smoke-test sketch below)
Build pipeline: mount → process → save to outputs
Document compliance and data handling procedures
Monitor network performance and optimize access patterns
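Pulling these steps together, step 3 can be a single small execution. A sketch of a smoke-test script, under the same illustrative mount path as earlier; it lists the mount and writes a versioned marker so the test itself leaves an audit trail:

```python
import os
from datetime import datetime

MOUNT = '/mnt/company-data'  # illustrative mount destination

# 1. Confirm the export is attached and populated
entries = os.listdir(MOUNT)
print(f"{MOUNT}: {len(entries)} entries visible")

# 2. Record the result in /valohai/outputs/ so the test run is versioned
with open('/valohai/outputs/mount_smoke_test.txt', 'w') as f:
    f.write(f"checked {MOUNT} at {datetime.now().isoformat()}\n")
    f.write(f"entries: {len(entries)}\n")
print("Smoke test complete; results saved to outputs")
```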