Configure Resources Per Pipeline Node

Optimize your pipeline's performance and cost by allocating appropriate compute resources to each step. Data preprocessing might need high memory, training requires GPUs, and inference runs efficiently on CPUs.

Why customize resources per node?

Consider a typical ML pipeline:

  1. Data preprocessing: Needs 32GB RAM, multiple CPUs, no GPU

  2. Model training: Requires 4 GPUs, moderate memory

  3. Model evaluation: Runs fine on 2 CPUs, minimal memory

Running everything on GPU instances wastes money. Running everything on CPU instances makes training impossibly slow. The solution: configure each node's environment independently.

Resource configuration methods

1. Define in valohai.yaml

Define environments at the step level for consistent, version-controlled configuration:

- step:
    name: preprocess-dataset
    image: python:3.9
    environment: aws-eu-west-1-m5-4xlarge  # 16 vCPUs, 64GB RAM, no GPU
    command:
      - pip install pandas numpy valohai-utils
      - python preprocess.py

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-8xlarge  # 4x V100 GPUs
    command:
      - python train.py {parameters}

- step:
    name: evaluate-model
    image: python:3.9
    environment: aws-eu-west-1-t3-medium   # 2 vCPUs, 4GB RAM, cost-efficient
    command:
      - python evaluate.py

💡 Find the environment slugs available to your project by running vh environments in the CLI.

2. Override in pipeline definition

When the same step needs different resources in different contexts:

- pipeline:
    name: efficient-training
    nodes:
      - name: prepare-small
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-t3-large  # Small dataset = small instance
          parameters:
            - name: dataset_size
              default: "sample"
      
      - name: prepare-full
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-m5-4xlarge  # Full dataset = more memory
          parameters:
            - name: dataset_size
              default: "full"

3. Web interface selection

For ad-hoc adjustments:

  1. Create your pipeline

  2. Click on any node in the graph

  3. Select an environment from the "Runtime" dropdown

Best practices

1. Profile before optimizing

Run steps individually to understand resource needs:

# Test with different instance types
vh exec run preprocess-dataset --environment aws-eu-west-1-m5-large
vh exec run preprocess-dataset --environment aws-eu-west-1-m5-2xlarge

Monitor resource usage in the execution details view. The graphs at the top show CPU, memory, and GPU utilization, helping you identify the optimal resource requirements.
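
A pipeline-based variant of the same idea is to reuse one step as two nodes with different environment overrides, mirroring the prepare-small/prepare-full pattern above. A minimal sketch, with illustrative pipeline and node names (in practice these nodes would sit alongside the rest of your pipeline's nodes and edges):

- pipeline:
    name: profile-preprocess
    nodes:
      - name: preprocess-m5-large
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-m5-large
      - name: preprocess-m5-2xlarge
        type: execution
        step: preprocess-dataset
        override:
          environment: aws-eu-west-1-m5-2xlarge

Compare the CPU and memory graphs of the two executions and pick the smallest instance type that isn't a bottleneck.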

2. Consider spot/preemptible instances

For non-critical steps, use cheaper spot instances:

environment: aws-eu-west-1-p3-2xlarge-spot  # Up to 70% cheaper
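
As a sketch, assigning a spot environment to a step looks the same as any other environment assignment; the step name below is hypothetical and the slug is the illustrative one above:

- step:
    name: hyperparameter-sweep
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-2xlarge-spot  # interruptible capacity
    command:
      - python train.py {parameters}

Reserve spot capacity for work you can safely rerun, since the instance may be reclaimed mid-execution.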

3. Document resource requirements

- step:
    name: train-model
    # Requirements: 4x V100 GPUs, 32GB+ system RAM, CUDA 11.2+
    environment: aws-eu-west-1-p3-8xlarge

Cost optimization strategies

Right-size your resources

  • Data prep: High CPU/memory, no GPU

  • Training: GPU instances only for actual training

  • Evaluation: Minimal resources

  • Deployment prep: Standard instances

Example cost comparison

Running every node on p3.8xlarge (4x V100) costs ~$12.24/hour of pipeline time. The optimized pipeline instead uses:
- Preprocess on m5.xlarge: ~$0.19/hour
- Train on p3.8xlarge: ~$12.24/hour
- Evaluate on t3.medium: ~$0.04/hour
Total savings: ~40% for a typical 3-hour pipeline. The exact figure depends on how the hours split across nodes; at these rates, ~40% roughly corresponds to training taking about 60% of the wall-clock time, and the shorter the GPU step, the larger the saving.

Troubleshooting

Node fails with SIGKILL (9) or "out of memory"

Upgrade to an instance with more RAM:

# Before
environment: aws-eu-west-1-m5-large    # 8GB RAM
# After  
environment: aws-eu-west-1-m5-xlarge   # 16GB RAM

GPU not detected

Ensure:

  1. Environment has GPUs: Check with vh environments --details

  2. Docker image supports GPU: Use a CUDA-enabled image, for example the -gpu variants of framework images such as tensorflow/tensorflow:2.6.0-gpu

  3. Code checks for GPU: torch.cuda.is_available()

  4. Runtime sees the GPU: Run nvidia-smi as part of your execution to confirm the job has access to a GPU (see the sketch below)
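
As a quick diagnostic, you can prepend these checks to the step's command list. A sketch based on the train-model step above (swap the TensorFlow check for torch.cuda.is_available() if your image ships PyTorch instead):

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-8xlarge
    command:
      - nvidia-smi  # fails or lists no devices if the job cannot see a GPU
      - python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
      - python train.py {parameters}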

Environment not found

# List available environments
vh environments

# Use exact slug from the list
environment: "azure-north-europe-standard-nc6"
