Network Storage
Mount network file systems (NFS) to access large datasets and shared storage directly from executions without downloading files.
When to Use Network Storage
Network storage serves different purposes depending on your deployment:
Cloud Deployments
Use case: Large dataset caching and preprocessing
Mount shared network storage (AWS EFS, GCP Filestore) to:
Access huge datasets without downloading to each execution
Share preprocessed data across multiple executions
Cache intermediate results on fast shared storage
Preprocess once, use many times — Mount raw data, process it, save versioned outputs
Example workflow:
Mount network storage containing 1M files to the execution
Filter down to the 100k files you actually need
Copy them to /valohai/outputs/ so they are uploaded to the cloud store and tracked as datums
Create a packaged dataset version for faster access in subsequent executions
Use this dataset in later executions
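The filter-and-export step above can be sketched as follows (the mount path, output path, and dataset name are illustrative; in a real execution the output directory would be /valohai/outputs/):

```python
import json
import os
import shutil

def filter_and_export(mount_dir, output_dir, keep):
    """Copy a filtered subset of mounted files into the outputs directory
    and write sidecar metadata assigning them to a dataset version."""
    os.makedirs(output_dir, exist_ok=True)
    selected = [name for name in sorted(os.listdir(mount_dir)) if keep(name)]
    with open(os.path.join(output_dir, "valohai.metadata.jsonl"), "w") as fh:
        for name in selected:
            # Copying into outputs is what makes the file a tracked datum
            shutil.copy2(os.path.join(mount_dir, name),
                         os.path.join(output_dir, name))
            record = {
                "file": name,
                "metadata": {
                    "valohai.dataset-versions": [
                        # hypothetical dataset name
                        {"uri": "dataset://filtered-subset/v1"}
                    ]
                },
            }
            fh.write(json.dumps(record) + "\n")
    return selected
```

For example, filter_and_export('/mnt/shared-data/', '/valohai/outputs/', keep=lambda n: n.endswith('.jpg')) would copy only the JPEG files and tag them into one dataset version.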
On-Premises Deployments
Use case: Access existing data infrastructure
Mount on-premises network drives where:
Data already exists on company NFS servers
Legacy systems store data on network shares
Compliance requires data stays on-premises
Moving data to cloud is impractical due to size or regulations
Example workflow:
Medical imaging data lives on a hospital NFS server
Mount the volume to your execution
Process the data while meeting compliance requirements
Save results to outputs to start tracking them as datums
Everything is versioned and tracked for audit
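A minimal sketch of the save-and-track step (the field names in the audit record are illustrative, not a Valohai requirement):

```python
import datetime
import json
import os

def save_results_with_audit(output_dir, results, source_description):
    """Write processed results plus a small audit record to the outputs
    directory, where they are versioned and tracked."""
    os.makedirs(output_dir, exist_ok=True)
    results_path = os.path.join(output_dir, "results.json")
    with open(results_path, "w") as fh:
        json.dump(results, fh, indent=2)
    audit = {
        "processed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source": source_description,  # e.g. which NFS share the data came from
        "record_count": len(results),
    }
    with open(os.path.join(output_dir, "audit_record.json"), "w") as fh:
        json.dump(audit, fh, indent=2)
    return results_path
```

In a real execution output_dir would be /valohai/outputs/, so both the results and the audit record become versioned files.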
Critical Trade-Off: Speed vs. Versioning
⚠️ Important: Valohai does NOT version or track files on mounted network storage.
What this means:
Files read from mounts: Not versioned
Files written to mounts: Not versioned
Files saved to /valohai/outputs/: Versioned ✅
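To make the boundary concrete, the rule reduces to a path-prefix check:

```python
def is_versioned(path):
    """Only files written under /valohai/outputs/ are versioned by Valohai;
    anything read from or written to a mount is not."""
    return path.startswith("/valohai/outputs/")
```

For example, is_versioned('/valohai/outputs/model.pkl') is True, while is_versioned('/mnt/shared-data/raw.csv') is False.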
Decision Tree: Should I Use NFS Mounts?
Do you need to access very large datasets (>100GB)?
├─ No → Use Valohai inputs (datum:// or dataset://)
│ ✅ Versioned, reproducible, cached
│
└─ Yes → Does the data change frequently?
├─ Yes → Use Valohai inputs
│ ✅ Version every snapshot
│
└─ No → Is the data already on network storage?
├─ Yes (on-prem) → Use NFS mount
│ ⚠️ But save processed outputs
│
└─ No (cloud) → Consider these options:
├─ Download once, cache, use many times?
│ → Use Valohai inputs (cached automatically)
│
└─ Need shared preprocessing workspace?
→ Use NFS mount as scratch space
⚠️ But save final results to outputs

NFS vs. Valohai Inputs
| | NFS Mount | Valohai Inputs |
|---|---|---|
| Versioning | ❌ No tracking | ✅ Full versioning |
| Reproducibility | ❌ Data can change | ✅ Immutable references |
| Speed (first run) | ✅ Fast (no download) | ❌ Download required |
| Speed (reruns) | ✅ Always fast | ✅ Fast (cached) |
| Setup complexity | ⚠️ Network config required | ✅ Simple |
| Best for | Huge stable datasets, on-prem data | All other cases |
| Lineage tracking | ❌ No | ✅ Yes |
| Shared across executions | ✅ Yes | ✅ Yes (via cache) |
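The decision tree above can also be sketched as a small helper (the 100 GB threshold comes from the tree; the return labels are illustrative):

```python
def storage_recommendation(dataset_gb, changes_frequently, already_on_nfs):
    """Mirror the decision tree: prefer versioned Valohai inputs unless the
    data is huge, stable, and already sitting on network storage."""
    if dataset_gb <= 100:
        return "valohai-inputs"         # versioned, reproducible, cached
    if changes_frequently:
        return "valohai-inputs"         # version every snapshot
    if already_on_nfs:
        return "nfs-mount"              # but save processed outputs
    return "nfs-scratch-or-inputs"      # shared scratch vs. cached download
```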
Recommended Pattern: Mount → Process → Save
Always save processed results to /valohai/outputs/ for versioning:
```python
import os
import pandas as pd

# 1. Read from mounted network storage (NOT versioned)
raw_data_path = '/mnt/shared-data/raw_images/'
files = os.listdir(raw_data_path)
print(f"Found {len(files)} files on network mount")

# 2. Process the data
processed_data = []
for filename in files:
    filepath = os.path.join(raw_data_path, filename)
    # Your preprocessing logic
    data = preprocess_image(filepath)
    processed_data.append(data)

# 3. Save processed results to Valohai outputs (VERSIONED ✅)
output_path = '/valohai/outputs/preprocessed_dataset.csv'
df = pd.DataFrame(processed_data)
df.to_csv(output_path, index=False)
print(f"Saved versioned dataset to {output_path}")
```

Why this matters:
❌ Mount -> Process -> Write back to mount
☝️ Nothing is versioned or tracked, so there is no reproducibility

✅ Mount -> Process -> Save to /valohai/outputs
☝️ Processed data is tracked and versioned
Mount Configuration
Network mounts are defined in valohai.yaml:
```yaml
- step:
    name: process-large-dataset
    image: python:3.9
    command:
      - python preprocess.py
    mounts:
      - destination: /mnt/raw-data  # Path inside container
        source: <nfs-source>        # Cloud-specific format
        type: nfs
        readonly: true              # Recommended for input data
```

Parameters:
destination — Mount point inside the execution container
source — NFS server address (format varies by cloud)
type — Always nfs for network file systems
readonly — true (safe) or false (allows writes)
Readonly vs. Writeable Mounts
Readonly Mounts (Recommended)
```yaml
mounts:
  - destination: /mnt/input-data
    source: <nfs-source>
    readonly: true  # ✅ Safe
```

Use when:
Accessing shared reference data
Reading large datasets for processing
Multiple executions need same data
Want to prevent accidental modifications
Benefits:
✅ Prevents accidental data corruption
✅ Safe for parallel executions
✅ Clear intent (read-only access)
Writeable Mounts (Use Carefully)
```yaml
mounts:
  - destination: /mnt/scratch
    source: <nfs-source>
    readonly: false  # ⚠️ Use with caution
```

Use when:
Need shared scratch space for intermediate results
Writing temporary files shared across parallel workers
Caching expensive computations
Risks:
⚠️ Files written here are NOT versioned
⚠️ Parallel executions can conflict
⚠️ No automatic cleanup
Best practice: Use writeable mounts for temporary data only. Always save final results to /valohai/outputs/.
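One way to reduce the parallel-conflict risk is to give each run its own subdirectory on the writeable mount (a sketch; the scratch root path is illustrative):

```python
import os
import uuid

def make_run_scratch_dir(scratch_root):
    """Create a unique per-run subdirectory on a writeable mount so that
    parallel executions never write to the same paths."""
    run_dir = os.path.join(scratch_root, f"run-{uuid.uuid4().hex}")
    os.makedirs(run_dir)
    return run_dir
```

For example, scratch = make_run_scratch_dir('/mnt/scratch') gives each execution an isolated workspace. Remember that nothing written there is versioned; final results still belong in /valohai/outputs/.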
Complete Workflow Example
Mount → Preprocess → Save Pattern
valohai.yaml:
```yaml
- step:
    name: preprocess-images
    image: python:3.9
    command:
      - pip install pillow pandas
      - python preprocess_images.py
    mounts:
      - destination: /mnt/raw-images
        source: <cloud-specific-nfs-source>
        type: nfs
        readonly: true
```

preprocess_images.py:
```python
import os
from PIL import Image
import pandas as pd
import json

# 1. Access raw data from network mount (NOT versioned)
mount_path = '/mnt/raw-images/'
image_files = [f for f in os.listdir(mount_path) if f.endswith('.jpg')]
print(f"Found {len(image_files)} images on network mount")

# 2. Process images
processed_data = []
output_dir = '/valohai/outputs/processed_images/'
os.makedirs(output_dir, exist_ok=True)

for filename in image_files:
    # Read from mount
    input_path = os.path.join(mount_path, filename)
    img = Image.open(input_path)

    # Process (resize, augment, etc.)
    img_resized = img.resize((224, 224))

    # Save processed image to Valohai outputs (VERSIONED ✅)
    output_path = os.path.join(output_dir, filename)
    img_resized.save(output_path)

    # Track metadata
    processed_data.append({
        'filename': filename,
        'original_size': img.size,
        'processed_size': img_resized.size
    })

# 3. Save metadata
df = pd.DataFrame(processed_data)
df.to_csv('/valohai/outputs/processing_metadata.csv', index=False)

# 4. Create dataset version from processed images
metadata = {
    f"processed_images/{filename}": {
        "valohai.dataset-versions": [{
            "uri": "dataset://preprocessed-images/v1"
        }]
    }
    for filename in image_files
}

# Add the metadata file itself
metadata["processing_metadata.csv"] = {
    "valohai.dataset-versions": [{"uri": "dataset://preprocessed-images/v1"}]
}

# Save metadata for dataset creation
metadata_path = '/valohai/outputs/valohai.metadata.jsonl'
with open(metadata_path, 'w') as f:
    for fname, fmeta in metadata.items():
        json.dump({"file": fname, "metadata": fmeta}, f)
        f.write('\n')

print(f"Processed {len(image_files)} images")
print("Created versioned dataset: dataset://preprocessed-images/v1")
```

Result:
✅ Raw images accessed from fast network mount (no download time)
✅ Processed images saved to /valohai/outputs/ (versioned)
✅ Dataset created for reproducible training
✅ Can train on dataset://preprocessed-images/v1 anytime
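In a later training step you would declare dataset://preprocessed-images/v1 as an input in valohai.yaml, and its files appear under /valohai/inputs/<input-name>/. A small helper to gather them (the input name is whatever you chose in the YAML):

```python
import os

def collect_images(input_dir, extensions=(".jpg", ".jpeg", ".png")):
    """List image files from a Valohai input directory (or any local dir)."""
    return sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.lower().endswith(extensions)
    )

# In a real execution, for example:
# images = collect_images("/valohai/inputs/preprocessed-images/")
```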
Cloud-Specific Setup
Each cloud provider has specific requirements for network storage:
AWS EFS — Managed NFS service in AWS
Setup: VPC configuration, security groups
Format: fs-1234abcd.efs.region.amazonaws.com:/
Performance modes: General Purpose, Max I/O

GCP Filestore — Managed NFS service in GCP
Setup: VPC configuration, IP-based access
Format: 10.0.0.2:/share-name
Tiers: Basic, High Scale SSD

On-Premises NFS — Access existing network file shares
Setup: Network connectivity, VPN/direct connect
Format: /mnt/network-share or nfs-server.internal:/share
Best Practices
Always Version Final Results
```python
# ❌ Bad: Only use mount, nothing versioned
data = load_from_mount('/mnt/data/')
model = train(data)
model.save('/mnt/models/model.pkl')  # NOT versioned

# ✅ Good: Mount for input, outputs for results
data = load_from_mount('/mnt/data/')
model = train(data)
model.save('/valohai/outputs/model.pkl')  # VERSIONED
```

Use Readonly When Possible
```yaml
# ✅ Good: Readonly for safety
mounts:
  - destination: /mnt/input-data
    readonly: true

# ⚠️ Use carefully: Writeable only when needed
mounts:
  - destination: /mnt/scratch-space
    readonly: false
```

Document Mount Requirements
```yaml
# ✅ Good: Clear documentation
- step:
    name: process-data
    # Requires: EFS fs-abc123 mounted with medical imaging data
    # Data format: DICOM files in /scans/ subdirectory
    # Access: Readonly recommended
    mounts:
      - destination: /mnt/medical-data
        source: fs-abc123.efs.us-east-1.amazonaws.com:/
        readonly: true
```

Test with Small Subsets First
```python
# Test with a subset before the full run
import os

data_path = '/mnt/large-dataset/'
files = os.listdir(data_path)

# Use an environment variable to control subset size
test_mode = os.getenv('TEST_MODE', 'false') == 'true'
files_to_process = files[:100] if test_mode else files

print(f"Processing {len(files_to_process)} files...")
```

Handle Mount Failures Gracefully
```python
import os
import sys

mount_path = '/mnt/network-storage/'

# Verify the mount is accessible
if not os.path.exists(mount_path):
    print(f"ERROR: Mount path {mount_path} not accessible")
    print("Check network connectivity and mount configuration")
    sys.exit(1)

if not os.path.ismount(mount_path):
    print(f"WARNING: {mount_path} exists but may not be mounted")

# Continue with processing...
```

Related Pages
AWS Elastic File System — Mount AWS EFS in executions
Google Cloud Filestore — Mount GCP Filestore in executions
On-Premises NFS — Mount on-prem network storage
Load Data in Jobs — Alternative: Use Valohai's versioned inputs
Next Steps
Evaluate whether NFS or Valohai inputs fit your use case better
Set up cloud network storage (EFS or Filestore) if needed
Create test execution mounting network storage
Build pipeline: mount → process → save to outputs
Version processed datasets for reproducible training
