Configure Resources Per Pipeline Node
Optimize your pipeline's performance and cost by allocating appropriate compute resources to each step. Data preprocessing might need high memory, training requires GPUs, and inference runs efficiently on CPUs.
Why customize resources per node?
Consider a typical ML pipeline:
Data preprocessing: Needs 32GB RAM, multiple CPUs, no GPU
Model training: Requires 4 GPUs, moderate memory
Model evaluation: Runs fine on 2 CPUs, minimal memory
Running everything on GPU instances wastes money. Running everything on CPU instances makes training impossibly slow. The solution: configure each node's environment independently.
Resource configuration methods
1. Configure in valohai.yaml (recommended)
Define environments at the step level for consistent, version-controlled configuration:
- step:
    name: preprocess-dataset
    image: python:3.9
    environment: aws-eu-west-1-m5-4xlarge  # 16 vCPUs, 64GB RAM, no GPU
    command:
      - pip install pandas numpy valohai-utils
      - python preprocess.py

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-8xlarge  # 4x V100 GPUs
    command:
      - python train.py {parameters}

- step:
    name: evaluate-model
    image: python:3.9
    environment: aws-eu-west-1-t3-medium  # 2 vCPUs, 4GB RAM, cost-efficient
    command:
      - python evaluate.py

💡 Find the environment slugs available in your project with vh environments in the CLI.
2. Override in pipeline definition
When the same step needs different resources in different contexts:
- pipeline:
    name: efficient-training
    nodes:
      - name: prepare-small
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-t3-large  # Small dataset = small instance
          parameters:
            - name: dataset_size
              default: "sample"
      - name: prepare-full
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-m5-4xlarge  # Full dataset = more memory
          parameters:
            - name: dataset_size
              default: "full"

3. Web interface selection
For ad-hoc adjustments:
Create your pipeline
Click on any node in the graph
Select environment from the "Runtime" dropdown

Best practices
1. Profile before optimizing
Run steps individually to understand resource needs:
# Test with different instance types
vh exec run preprocess-dataset --environment aws-eu-west-1-m5-large
vh exec run preprocess-dataset --environment aws-eu-west-1-m5-2xlarge

Monitor resource usage in the execution details view. The graphs at the top show CPU, memory, and GPU utilization, helping you identify the optimal resource requirements.
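The execution graphs are the primary signal, but if you also want a rough number in the execution log itself, here is a minimal sketch using only Python's standard library; the helper name and where you call it are illustrative assumptions, not part of the Valohai SDK:

# profile_hint.py -- hypothetical helper for logging peak memory from inside a step
import resource
import sys

def log_peak_memory(label: str) -> None:
    # ru_maxrss is reported in kilobytes on Linux and bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    print(f"{label}: peak memory ~{peak / divisor:.1f} MB")

if __name__ == "__main__":
    data = [0] * 10_000_000  # stand-in for the real preprocessing work
    log_peak_memory("preprocess-dataset")

Anything printed this way lands in the execution log, so you can compare peak memory across the instance types you test.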
2. Consider spot/preemptible instances
For non-critical steps, use cheaper spot instances:
environment: aws-eu-west-1-p3-2xlarge-spot  # Up to 70% cheaper
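Spot capacity can be reclaimed mid-run, so a step running on a spot instance should be able to resume rather than start from scratch. A minimal checkpointing sketch, assuming the checkpoint is written to Valohai's /valohai/outputs directory; in a real pipeline you would also wire the saved checkpoint back in (for example as an input to the retried execution), so treat the paths and resume logic as illustrative:

import os
import pickle

CHECKPOINT = "/valohai/outputs/checkpoint.pkl"  # assumed location; adjust to your setup

def load_state():
    # Resume from the last checkpoint if one exists, otherwise start fresh
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0}

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

state = load_state()
for epoch in range(state["epoch"], 100):
    # ... one epoch of training would run here ...
    state["epoch"] = epoch + 1
    save_state(state)  # cheap insurance against a reclaimed spot instance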
3. Document resource requirements
- step:
    name: train-model
    # Requirements: 4x V100 GPUs, 32GB+ system RAM, CUDA 11.2+
    environment: aws-eu-west-1-p3-8xlarge

Cost optimization strategies
Right-size your resources
Data prep: High CPU/memory, no GPU
Training: GPU instances only for actual training
Evaluation: Minimal resources
Deployment prep: Standard instances
Example cost comparison
Pipeline with uniform p3.8xlarge (4x V100): ~$12.24/hour
Optimized pipeline:
- Preprocess on m5.xlarge: ~$0.19/hour
- Train on p3.8xlarge: ~$12.24/hour
- Evaluate on t3.medium: ~$0.04/hour
Total savings: ~40% for a typical 3-hour pipeline
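A back-of-the-envelope check of the numbers above; the per-stage durations are assumptions chosen to illustrate how the ~40% figure arises, not measurements:

# Hourly prices from the comparison above; stage durations are assumed for illustration.
HOURLY = {"p3.8xlarge": 12.24, "m5.xlarge": 0.19, "t3.medium": 0.04}
stages = [
    ("m5.xlarge", 1.00),   # preprocess
    ("p3.8xlarge", 1.75),  # train
    ("t3.medium", 0.25),   # evaluate
]

total_hours = sum(hours for _, hours in stages)
uniform = HOURLY["p3.8xlarge"] * total_hours                     # everything on the GPU instance
optimized = sum(HOURLY[inst] * hours for inst, hours in stages)

print(f"uniform:   ${uniform:.2f}")    # ~$36.72 for 3 hours
print(f"optimized: ${optimized:.2f}")  # ~$21.62
print(f"savings:   {100 * (1 - optimized / uniform):.0f}%")      # ~41%, in line with the ~40% above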
Troubleshooting
Node fails with SIGKILL (9) or "out of memory"
Upgrade to an instance with more RAM:
# Before
environment: aws-eu-west-1-m5-large   # 8GB RAM

# After
environment: aws-eu-west-1-m5-xlarge  # 16GB RAM

GPU not detected
Ensure that:
The environment has GPUs: check with vh environments --details
The Docker image supports GPU: use an image built with GPU support (for example, a -gpu tagged framework image)
Your code checks for a GPU: for example, torch.cuda.is_available()
Run nvidia-smi as part of your execution to confirm the job has access to a GPU; a combined check is sketched below.
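A combined diagnostic you could run at the start of an execution; the framework-level check assumes PyTorch is installed in the image, so swap in your framework's equivalent if it is not:

# gpu_check.py -- print GPU visibility at the start of an execution
import shutil
import subprocess

# Driver-level check: nvidia-smi is typically available inside the container
# only when a GPU is exposed to it
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found -- the job likely has no GPU attached")

# Framework-level check (assumes PyTorch is installed in the image)
try:
    import torch
    print("torch.cuda.is_available():", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed; skipping framework-level check")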
Environment not found
# List available environments
vh environments
# Use exact slug from the list
environment: "azure-north-europe-standard-nc6"Last updated