Shared Cache
Configure a shared network cache between Valohai workers to optimize access to large datasets and reduce download times from cloud storage.
How Valohai Cache Works
Most Valohai machine learning jobs download their input files from cloud object storage such as AWS S3, Azure Blob Storage, or Google Cloud Storage.
Default Behavior (On-Worker Cache)
By default, each Valohai worker (virtual machine) has its own local cache:
How it works:
Worker downloads input data from cloud storage
Data is cached locally on the worker's disk
If the same data is needed again on the same worker, it's read from cache
When the machine is no longer used (after a configurable grace period), it scales down
The local cache is removed with the machine
The next time a machine scales up, it downloads input files to its own cache again
Limitations:
Each new worker downloads data independently
Cache is lost when workers scale down
No sharing of cached data between workers
Shared Cache Behavior
With a shared network cache, input data is stored on an NFS or SMB network mount accessible by multiple workers.
How it works:
Worker needs input data
Worker checks shared network cache
If data exists in cache, worker reads from network mount
If data doesn't exist, worker downloads from cloud storage to the cache
Other workers can immediately access the cached data
Cache persists even when workers scale down
Benefits:
Workers can share cached data
Faster job startup for frequently-used datasets
Reduced data transfer costs from cloud storage
Cache persists across worker lifecycles
Note: Users still define Valohai inputs by providing the URL to the file. Valohai handles authenticating with object storage, downloading the dataset to the shared cache, and versioning that input file with the execution, just as in standard executions.
When to Use Shared Cache
Shared cache is beneficial when:
Large datasets (100GB+):
You have large datasets that you access often from different workers
Download time from cloud storage is significant
Multiple workers need the same data
Parallel workloads:
You're running Valohai Tasks with multiple parallel GPU instances
All instances download the same dataset from cloud object storage
Startup time is critical
Very large datasets (TBs):
You have terabytes of data that takes a long time to download from object storage
The cost of repeated downloads is significant
Not recommended when:
Datasets are small (<10GB)
Each job uses unique datasets
Network mount would be slower than local disk
Workers stay up long enough (a sufficiently long scale-down grace period) that the default on-worker cache already covers repeat access
Architecture

Components:
Network Mount (NFS/SMB):
Shared file system accessible by all workers
Stores cached input files
Persists independently of worker lifecycle
Workers:
Connect to network mount
Read cached files when available
Download to cache when files are missing
Valohai:
Manages cache versioning
Tracks which files are cached
Handles authentication with cloud storage
Set Up a Shared Cache
Step 1: Configure Network Mount
You'll need to set up an NFS or SMB network mount in your cloud or on-premises environment.
Key requirement: Verify that workers can access the network mount.
Cloud-specific guides below:
Configure AWS EFS
You can use an existing EFS or create a new one.
Requirements:
Create EFS in the same VPC where all Valohai resources are located
OR, if the EFS lives in a different VPC, set up VPC peering between the two VPCs
Use the same region where your workers are located
Create EFS:
Navigate to Amazon EFS in AWS Console
Click Create file system
Configure:
VPC: Select your Valohai VPC
Availability and durability: Regional (recommended)
Performance mode: General Purpose (or Max I/O for high throughput)
Configure network access:
Mount targets in each availability zone
Security group must allow NFS traffic (port 2049) from worker security group
Create the file system
Note the EFS DNS name (e.g., fs-1234aa12.efs.eu-west-1.amazonaws.com)
Mount target format:
fs-1234aa12.efs.eu-west-1.amazonaws.com:/
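Before handing the mount over to Valohai, you can optionally confirm that an instance in the same VPC can reach it. The commands below are a minimal sanity check; the file system ID and the /mnt/valohai-cache path are placeholder values.
# Optional connectivity check from an instance in the Valohai VPC
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/valohai-cache
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-1234aa12.efs.eu-west-1.amazonaws.com:/ /mnt/valohai-cache
df -h /mnt/valohai-cache   # confirm the file system is mounted and shows free space
sudo umount /mnt/valohai-cache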
Configure GCP Filestore
You can use an existing Filestore or create a new one.
Requirements:
Create Filestore in the same VPC where all Valohai resources are located
Grant access to all clients on the VPC network
Create Filestore:
Navigate to Filestore in GCP Console
Click Create instance
Configure:
Instance ID: valohai-cache
Instance type: Basic (or High Scale for large workloads)
Storage capacity: Based on your dataset sizes
Region: Same as your workers
VPC network: Your Valohai VPC
Configure access:
File share name: valohai_cache
Grant access to all clients on the network
Create the instance
Note the IP address and file share name (e.g., 10.123.12.123:/valohai_cache)
Mount target format:
10.123.12.123:/valohai_cache
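As with EFS, you can optionally verify the export from a VM in the same VPC before sending the details to Valohai; the IP address and mount path are the example values used above.
# Optional connectivity check from a VM in the Valohai VPC
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/valohai-cache
sudo mount -t nfs 10.123.12.123:/valohai_cache /mnt/valohai-cache
df -h /mnt/valohai-cache   # confirm the share is mounted
sudo umount /mnt/valohai-cache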
Configure Azure Files
You can use an existing Azure Files share or create a new one.
Requirements:
Storage account in the same region as workers
Premium file share recommended for performance
Private endpoint or service endpoint for secure access
Create Azure Files:
Navigate to Storage accounts in Azure Portal
Create or select a storage account
Navigate to File shares
Click + File share
Configure:
Name: valohai-cache
Tier: Premium (recommended) or Transaction optimized
Provisioned capacity: Based on your needs
Create the file share
Configure network access:
Set up private endpoint or service endpoint
Ensure worker subnet can access the storage account
Get connection details:
Storage account name: mystorageaccount
File share name: valohai-cache
Access key: From storage account keys
Mount target format:
//mystorageaccount.file.core.windows.net/valohai-cache
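To optionally verify the share before configuration, you can test-mount it over SMB from a VM in the worker subnet. The storage account name is the example value above and <storage-account-key> is a placeholder; Valohai handles the actual mount on workers.
# Optional connectivity check from a VM in the worker subnet
sudo apt-get install -y cifs-utils
sudo mkdir -p /mnt/valohai-cache
sudo mount -t cifs //mystorageaccount.file.core.windows.net/valohai-cache /mnt/valohai-cache \
  -o vers=3.0,username=mystorageaccount,password='<storage-account-key>',serverino
df -h /mnt/valohai-cache   # confirm the share is mounted
sudo umount /mnt/valohai-cache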
Configure On-Premises NFS
For on-premises environments, set up an NFS server accessible to workers.
Requirements:
NFS server accessible from worker network
Sufficient storage capacity
Performance appropriate for your workload
Basic NFS server setup (Ubuntu):
# Install NFS server
sudo apt-get update
sudo apt-get install nfs-kernel-server
# Create shared directory
sudo mkdir -p /mnt/valohai-cache
sudo chown nobody:nogroup /mnt/valohai-cache
sudo chmod 777 /mnt/valohai-cache
# Configure exports
sudo nano /etc/exports
# Add line:
/mnt/valohai-cache 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
# Apply configuration
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server
Mount target format:
nfs-server.example.com:/mnt/valohai-cache
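You can optionally confirm the export is reachable from a machine on the worker network; the hostname and paths are the example values above.
# Optional connectivity check from a machine on the worker network
sudo apt-get install -y nfs-common
showmount -e nfs-server.example.com          # export list should include /mnt/valohai-cache
sudo mkdir -p /mnt/valohai-cache
sudo mount -t nfs nfs-server.example.com:/mnt/valohai-cache /mnt/valohai-cache
df -h /mnt/valohai-cache
sudo umount /mnt/valohai-cache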
Step 2: Send Details to Valohai
After configuring your network mount, send the following information to your Valohai contact at [email protected]:
Required information:
Network mount address:
AWS: fs-1234aa12.efs.eu-west-1.amazonaws.com:/
GCP: 10.123.12.123:/valohai_cache
Azure: //mystorageaccount.file.core.windows.net/valohai-cache
On-premises: nfs-server.example.com:/mnt/valohai-cache
Environments to configure:
List specific environments you want to use the shared cache
OR configure all environments to use the shared cache
Copy behavior (optional):
By default, workers access files directly from the network mount
Optionally, configure workers to copy data from NFS to local directory before starting a job
Specify per environment if needed
Configuration Options
Direct Access (Default)
Workers read files directly from the network mount.
Pros:
No additional copy time
Lower local disk usage
Immediate access to cached data
Cons:
Network performance affects job performance
All I/O goes over network
Best for:
Sequential read workloads
Large files that don't fit on local disk
Fast network connections
Copy to Local Disk
Workers copy files from network mount to local disk before job starts.
Pros:
Job runs at local disk speed
No network I/O during job execution
Better for random access patterns
Cons:
Additional copy time at job start
Requires sufficient local disk space
Data copied multiple times if reused
Best for:
Random access workloads
Jobs with many small file operations
When local disk is faster than network
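Valohai performs the copy for you when copy-to-local is enabled for an environment, but the sketch below approximates what happens so you can reason about the extra startup time. The paths are illustrative, not Valohai's internal paths.
# Rough approximation of copy-to-local at job start (illustrative paths)
SRC=/mnt/valohai-cache/datasets/my-dataset   # shared network cache
DST=/data/local-cache/my-dataset             # fast local disk on the worker
mkdir -p "$DST"
rsync -a --info=progress2 "$SRC/" "$DST/"    # copy once, then the job reads from local disk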
Performance Considerations
Network Mount Performance
AWS EFS:
Throughput scales with storage size (Bursting mode)
Or use Provisioned Throughput mode for consistent performance
Max I/O mode for high parallel access
GCP Filestore:
Performance based on capacity tier
Basic: Up to 100 MB/s per TB
High Scale: Up to 480 MB/s per TB
Azure Files:
Premium tier recommended for ML workloads
Performance scales with provisioned capacity
Up to 100,000 IOPS for premium
Worker Configuration
Local disk for temp files:
Configure jobs to write temporary files to local disk (see the sketch after this list)
Only use network mount for cached inputs
Reduces network I/O
Parallel access:
NFS performs well with many parallel readers
Avoid many workers writing to the same files
Consider sharding large datasets
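One simple way to keep temporary I/O on local disk, assuming your tools honor the standard TMPDIR variable, is to point it at a local path inside the job; the path below is illustrative and not a Valohai default.
# Keep scratch files on the worker's local disk instead of the network mount
export TMPDIR=/tmp/job-scratch   # illustrative local-disk path
mkdir -p "$TMPDIR"
# libraries and CLIs that honor TMPDIR will now write temporary files locally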
Monitoring and Maintenance
Monitor Cache Usage
Storage space:
Monitor available space on network mount
Set up alerts for high usage
Plan for growth
Performance:
Monitor throughput and IOPS
Check for network bottlenecks
Review job startup times
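Each cloud provider exposes detailed metrics in its own console (CloudWatch for EFS, Cloud Monitoring for Filestore, Azure Monitor for Azure Files). For a quick view from a worker, standard tools are usually enough; the mount path below is illustrative.
# Quick checks from a worker
df -h /mnt/valohai-cache   # free space on the network mount
nfsiostat 5 3              # NFS operations and throughput, 3 samples at 5-second intervals (nfs-common package)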
Cache Cleanup
Valohai automatically manages cached files:
Tracks which files are accessed
Removes least-recently-used files when space is low
Maintains cache versioning
Manual cleanup (if needed):
Coordinate with Valohai support
Don't delete files manually without consulting Valohai
Valohai tracks cache state internally
Troubleshooting
Workers can't access network mount
Check network connectivity:
AWS EFS:
# From worker
telnet fs-1234aa12.efs.eu-west-1.amazonaws.com 2049
GCP Filestore:
# From worker
ping 10.123.12.123
showmount -e 10.123.12.123
Check security groups/firewall:
AWS: Security group allows NFS (port 2049)
GCP: Firewall rules allow NFS traffic
Azure: Network security groups allow SMB (port 445)
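A quick way to separate firewall problems from mount configuration problems is a raw TCP check against the endpoint from a worker; the hostnames below are the example values used earlier.
# TCP reachability checks from a worker (example endpoints)
nc -zv fs-1234aa12.efs.eu-west-1.amazonaws.com 2049   # NFS (EFS, Filestore, on-premises NFS)
nc -zv mystorageaccount.file.core.windows.net 445     # SMB (Azure Files)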
Slow job startup
Check network performance:
Test mount performance from the worker (see the dd sketch at the end of this section)
Consider copy-to-local configuration
Review network mount performance tier
Check file sizes:
Large files take time to download/cache
Consider splitting datasets
Use compression if appropriate
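For the mount-performance test mentioned above, a simple dd run gives a rough sequential throughput figure. The mount path is illustrative; remember to remove the test file afterwards.
# Rough sequential write/read test against the mount (illustrative path)
dd if=/dev/zero of=/mnt/valohai-cache/.speedtest bs=1M count=1024 conv=fdatasync status=progress
dd if=/mnt/valohai-cache/.speedtest of=/dev/null bs=1M status=progress
rm /mnt/valohai-cache/.speedtest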
Cache fills up
Monitor storage:
Check available space
Review cache cleanup settings
Consider increasing storage capacity
Optimize usage:
Remove unused datasets from workflows
Compress large files
Use versioning to avoid duplicates
Cost Optimization
Storage Costs
AWS EFS:
Billed per GB-month stored
Infrequent Access storage class available
Lifecycle policies to move old data
GCP Filestore:
Billed for provisioned capacity
Right-size based on actual usage
Consider Basic tier vs. High Scale
Azure Files:
Premium: Billed for provisioned capacity
Standard: Billed for actual usage
Consider transaction optimized tier
Data Transfer Costs
Within same region:
Usually no data transfer costs
Verify with cloud provider pricing
Cross-region:
Avoid if possible
Significant costs for large transfers
Keep cache in same region as workers
Getting Help
Valohai Support: [email protected]
Include in support requests:
Cloud provider and region
Network mount type (EFS, Filestore, Azure Files, NFS)
Mount address and configuration
Worker environment names
Performance issues or errors
Storage usage and capacity