Network Storage

Mount network file systems (NFS) to access large datasets and shared storage directly from executions without downloading files.


When to Use Network Storage

Network storage serves different purposes depending on your deployment:

Cloud Deployments

Use case: Large dataset caching and preprocessing

Mount shared network storage (AWS EFS, GCP Filestore) to:

  • Access huge datasets without downloading to each execution

  • Share preprocessed data across multiple executions

  • Cache intermediate results on fast shared storage

  • Preprocess once, use many times: mount raw data, process it, and save versioned outputs

Example workflow:

  1. Mount network storage containing 1M files to the execution

  2. Filter the files down to the 100k you need

  3. Save them to /valohai/outputs/ so they are uploaded to the cloud store and tracked as datums

  4. Create a packaged dataset version for faster access in subsequent executions

  5. Use this dataset version as the input for later executions
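
Steps 1 to 3 of the workflow above can be sketched in a few lines of Python. This is only an illustration: `select_and_stage` and its `keep` predicate are hypothetical names, and the function copies rather than moves files so that it also works on a read-only mount.

```python
import shutil
from pathlib import Path


def select_and_stage(mount_dir, output_dir, keep, limit=None):
    """Copy files under mount_dir that satisfy keep() into output_dir.

    Returns the list of destination paths that were written.
    """
    mount_dir, output_dir = Path(mount_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in mount_dir.rglob("*"):
        if not src.is_file() or not keep(src):
            continue
        # Keep the relative layout so files with equal names cannot collide.
        dst = output_dir / src.relative_to(mount_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy: the mounted data is left untouched
        staged.append(dst)
        if limit is not None and len(staged) >= limit:
            break
    return staged
```

Inside an execution this might be called as `select_and_stage("/mnt/raw-data", "/valohai/outputs", keep=lambda p: p.suffix.lower() == ".jpg")`, where `/mnt/raw-data` stands in for whatever destination the mount actually uses.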


On-Premises Deployments

Use case: Access existing data infrastructure

Mount on-premises network drives where:

  • Data already exists on company NFS servers

  • Legacy systems store data on network shares

  • Compliance requires data stays on-premises

  • Moving data to cloud is impractical due to size or regulations

Example workflow:

  1. Medical imaging data is stored on a hospital NFS server

  2. Mount the volume to your execution

  3. Process the data while meeting compliance requirements

  4. Save results to /valohai/outputs/ so they are tracked as datums

  5. The saved outputs are versioned and tracked for audit


Critical Trade-Off: Speed vs. Versioning

⚠️ Important: Valohai does NOT version or track files on mounted network storage.

What this means:

  • Files read from mounts: Not versioned

  • Files written to mounts: Not versioned

  • Files saved to /valohai/outputs/: Versioned ✅

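Decision Tree: Should I Use NFS Mounts?

Use an NFS mount for huge, stable datasets or for data that must stay on-premises. Use Valohai inputs in all other cases.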


NFS vs. Valohai Inputs

| Feature | NFS Mount | Valohai Inputs |
|---|---|---|
| Versioning | ❌ No tracking | ✅ Full versioning |
| Reproducibility | ❌ Data can change | ✅ Immutable references |
| Speed (first run) | ✅ Fast (no download) | ❌ Download required |
| Speed (reruns) | ✅ Always fast | ✅ Fast (cached) |
| Setup complexity | ⚠️ Network config required | ✅ Simple |
| Best for | Huge stable datasets, on-prem data | All other cases |
| Lineage tracking | ❌ No | ✅ Yes |
| Shared across executions | ✅ Yes | ✅ Yes (via cache) |


Always save processed results to /valohai/outputs/ for versioning.

Why this matters:

  • Mount -> Process -> Write back to mount: nothing is versioned or tracked, so the run cannot be reproduced.

  • Mount -> Process -> Save to /valohai/outputs/: processed data is versioned and tracked.


Mount Configuration

Network mounts are defined in valohai.yaml.

Parameters:

  • destination: Mount point inside the execution container

  • source: NFS server address (format varies by cloud)

  • type: Always nfs for network file systems

  • readonly: true (safe) or false (allows writes)
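
To make the four parameters concrete, here is a minimal Python sketch that assembles one mount entry and applies a few sanity checks. `make_nfs_mount` is a hypothetical helper, not part of any Valohai tooling, and `/mnt/datasets` is an illustrative destination. The real configuration lives in valohai.yaml; the dictionary only mirrors its keys.

```python
import json


def make_nfs_mount(source, destination, readonly=True):
    """Build one mount entry from the parameters listed above."""
    if not destination.startswith("/"):
        raise ValueError("destination must be an absolute path inside the container")
    # The sources shown in this guide are either "host:/export" or a local path.
    if ":/" not in source and not source.startswith("/"):
        raise ValueError(f"unrecognised NFS source: {source!r}")
    return {
        "destination": destination,
        "source": source,
        "type": "nfs",               # always nfs for network file systems
        "readonly": bool(readonly),  # default to the safe setting
    }


mount = make_nfs_mount("fs-1234abcd.efs.region.amazonaws.com:/", "/mnt/datasets")
print(json.dumps(mount, indent=2))
```

Defaulting `readonly` to true reflects the guidance in this document: writes to a mount should be an explicit choice.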


Readonly vs. Writeable Mounts

Readonly Mounts

Use when:

  • Accessing shared reference data

  • Reading large datasets for processing

  • Multiple executions need same data

  • Want to prevent accidental modifications

Benefits:

  • ✅ Prevents accidental data corruption

  • ✅ Safe for parallel executions

  • ✅ Clear intent (read-only access)


Writeable Mounts (Use Carefully)

Use when:

  • Need shared scratch space for intermediate results

  • Writing temporary files shared across parallel workers

  • Caching expensive computations

Risks:

  • ⚠️ Files written here are NOT versioned

  • ⚠️ Parallel executions can conflict

  • ⚠️ No automatic cleanup

Best practice: Use writeable mounts for temporary data only. Always save final results to /valohai/outputs/.
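
The risks listed above (conflicting parallel writers, no automatic cleanup) can be reduced with two small habits: give each worker its own scratch directory, and publish shared files with a rename instead of writing them in place. The sketch below is one possible approach. `scratch_space` and `publish_atomically` are hypothetical helper names, and rename atomicity should be confirmed for your particular NFS server.

```python
import os
import shutil
import tempfile
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scratch_space(mount_root, worker_id=None):
    """Yield a private directory on the writeable mount and delete it afterwards."""
    worker_id = worker_id or uuid.uuid4().hex
    workdir = Path(mount_root) / "scratch" / worker_id
    workdir.mkdir(parents=True, exist_ok=False)  # fail loudly if the ID is reused
    try:
        yield workdir
    finally:
        # Nothing cleans the mount up automatically, so do it here.
        shutil.rmtree(workdir, ignore_errors=True)


def publish_atomically(data, target):
    """Write bytes to a temp file, then rename it over the target path."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # A rename within one file system replaces the target in a single step
        # on POSIX systems, so readers should not observe a half-written file.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Final results still belong in /valohai/outputs/; these helpers only cover temporary data.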


Complete Workflow Example

Mount → Preprocess → Save Pattern

valohai.yaml: defines a step that mounts the raw image share (readonly) and runs preprocess_images.py.

preprocess_images.py:
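
A minimal sketch of what such a script could look like, using only the standard library. Several details are assumptions rather than facts from this page: the mount destination `/mnt/raw-images`, the placeholder `preprocess` function (a real script would resize or normalise the image), and the `.metadata.json` sidecar with a `valohai.dataset-versions` key as the way to attach outputs to a dataset version. Check the sidecar format against the Valohai documentation before relying on it.

```python
"""preprocess_images.py: hypothetical sketch of the mount, preprocess, save pattern."""
import json
import shutil
from pathlib import Path

MOUNT_DIR = Path("/mnt/raw-images")    # assumed destination of the NFS mount
OUTPUT_DIR = Path("/valohai/outputs")  # files saved here are versioned
DATASET_VERSION = "dataset://preprocessed-images/v1"


def preprocess(src, dst):
    """Placeholder transform: copies the file unchanged."""
    shutil.copyfile(src, dst)


def run(mount_dir=MOUNT_DIR, output_dir=OUTPUT_DIR, dataset_version=DATASET_VERSION):
    """Process every .jpg on the mount and save the results as outputs."""
    mount_dir, output_dir = Path(mount_dir), Path(output_dir)
    if not mount_dir.is_dir():
        raise FileNotFoundError(f"mount not available: {mount_dir}")
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(mount_dir.glob("*.jpg")):
        dst = output_dir / src.name
        preprocess(src, dst)
        # Assumption: a "<file>.metadata.json" sidecar with this key adds the
        # output to the named dataset version.
        sidecar = dst.with_name(dst.name + ".metadata.json")
        sidecar.write_text(json.dumps({"valohai.dataset-versions": [dataset_version]}))
        written.append(dst)
    return written
```

When used as the step's script, add a call to `run()` at the bottom of the file. The script only reads from the mount and writes only to the outputs directory, so the mount can stay readonly.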

Result:

  • ✅ Raw images accessed from fast network mount (no download time)

  • ✅ Processed images saved to /valohai/outputs/ (versioned)

  • ✅ Dataset created for reproducible training

  • ✅ Can train on dataset://preprocessed-images/v1 anytime


Cloud-Specific Setup

Each cloud provider has specific requirements for network storage:

AWS EFS

  • Managed NFS service in AWS

  • Setup: VPC configuration, security groups

  • Format: fs-1234abcd.efs.region.amazonaws.com:/

  • Performance modes: General Purpose, Max I/O

GCP Filestore

  • Managed NFS service in GCP

  • Setup: VPC configuration, IP-based access

  • Format: 10.0.0.2:/share-name

  • Tiers: Basic, High Scale SSD

On-Premises NFS

  • Access existing network file shares

  • Setup: Network connectivity, VPN/direct connect

  • Format: /mnt/network-share or nfs-server.internal:/share


Best Practices

Always Version Final Results

Save final results to /valohai/outputs/. Files left on a mount are neither versioned nor tracked.


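Use Readonly When Possible

Set readonly to true unless the execution genuinely needs to write to the mount. This prevents accidental modification and is safe for parallel executions.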


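Document Mount Requirements

Record each mount's source, destination, and network prerequisites (VPC, security groups, VPN) alongside the step that uses it.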


Test with Small Subsets First
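
Before running against the whole share, it helps to process only a handful of files. The sketch below, with the hypothetical helpers `iter_files` and `take_subset`, stops walking the mount as soon as enough files have been found, so a test run does not have to enumerate millions of entries.

```python
import itertools
import os


def iter_files(root):
    """Yield file paths under root, one directory at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def take_subset(root, max_files):
    """Return at most max_files paths; the walk stops once enough are found."""
    return list(itertools.islice(iter_files(root), max_files))
```

A call such as `take_subset("/mnt/raw-data", 100)` (the path is illustrative) returns a small working set for a trial execution.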


Handle Mount Failures Gracefully
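
A script can check the mount before it starts work and fail with a clear message if the share is unreachable. The following is a sketch under stated assumptions: `wait_for_mount` and `MountUnavailable` are hypothetical names, the retry count and delay are arbitrary, and an empty directory is treated as "not mounted", which may not hold for every share.

```python
import os
import time


class MountUnavailable(RuntimeError):
    """Raised when the expected network mount cannot be used."""


def wait_for_mount(path, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Return path once it is a readable, non-empty directory, else raise."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Reading one entry forces a real directory read without
            # listing the whole share.
            with os.scandir(path) as entries:
                if next(entries, None) is not None:
                    return path
            last_error = "directory is empty"
        except OSError as exc:  # missing path, permission problem, and so on
            last_error = str(exc)
        if attempt < attempts:
            sleep(delay_seconds)
    raise MountUnavailable(f"{path}: {last_error} (after {attempts} attempts)")
```

Calling `wait_for_mount("/mnt/raw-data")` at the top of a script (the path is illustrative) turns a confusing mid-run file error into an immediate, descriptive failure.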



Next Steps

  • Evaluate whether NFS or Valohai inputs fit your use case better

  • Set up cloud network storage (EFS or Filestore) if needed

  • Create a test execution that mounts network storage

  • Build pipeline: mount → process → save to outputs

  • Version processed datasets for reproducible training
