Configure Resources Per Pipeline Node

Optimize your pipeline's performance and cost by allocating appropriate compute resources to each step. Data preprocessing might need high memory, training requires GPUs, and inference runs efficiently on CPUs.

Why customize resources per node?

Consider a typical ML pipeline:

  1. Data preprocessing: Needs 32GB RAM, multiple CPUs, no GPU

  2. Model training: Requires 4 GPUs, moderate memory

  3. Model evaluation: Runs fine on 2 CPUs, minimal memory

Running everything on GPU instances wastes money. Running everything on CPU instances makes training impossibly slow. The solution: configure each node's environment independently.

Resource configuration methods

1. Define in valohai.yaml

Define environments at the step level in your valohai.yaml for consistent, version-controlled configuration:

- step:
    name: preprocess-dataset
    image: python:3.9
    environment: aws-eu-west-1-m5-4xlarge  # 16 vCPUs, 64GB RAM, no GPU
    command:
      - pip install pandas numpy valohai-utils
      - python preprocess.py

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-8xlarge  # 4x V100 GPUs
    command:
      - python train.py {parameters}

- step:
    name: evaluate-model
    image: python:3.9
    environment: aws-eu-west-1-t3-medium   # 2 vCPUs, 4GB RAM, cost-efficient
    command:
      - python evaluate.py

💡 Find environment slugs available in your project with vh environments in the CLI.

2. Override in pipeline definition

When the same step needs different resources in different contexts, override the environment on the pipeline node instead of editing the step itself.
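Here is a minimal sketch of what that can look like, assuming the pipeline node accepts an override block with an environment key; the pipeline name, node names, edge, and the larger aws-eu-west-1-p3-16xlarge slug are illustrative, so adapt them to your own valohai.yaml:

- pipeline:
    name: train-with-bigger-gpu
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: execution
        step: train-model
        override:
          # Larger GPU environment for this pipeline only;
          # the step's own default environment stays unchanged.
          environment: aws-eu-west-1-p3-16xlarge
    edges:
      # Illustrative edge; the steps above would also need matching
      # inputs defined for it to resolve.
      - [preprocess.output.*, train.input.dataset]

The step keeps its default environment for standalone runs, and only this pipeline pays for the bigger machine.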

3. Web interface selection

For ad-hoc adjustments:

  1. Create your pipeline

  2. Click on any node in the graph

  3. Select an environment from the "Runtime" dropdown

Best practices

1. Profile before optimizing

Run each step individually (for example with vh execution run in the CLI) before wiring the full pipeline together, so you can see what each one actually needs.

Monitor resource usage in the execution details view. The graphs at the top show CPU, memory, and GPU utilization, helping you identify the optimal resource requirements.

2. Consider spot/preemptible instances

For non-critical steps that can tolerate interruption, use cheaper spot instances.
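As a sketch, the only change from the step definitions above is the environment slug. The spot slug below is made up for illustration; list the spot-capable environments actually enabled for your project with vh environments:

- step:
    name: preprocess-dataset
    image: python:3.9
    # Hypothetical spot environment slug, shown for illustration only.
    environment: aws-eu-west-1-m5-4xlarge-spot
    command:
      - pip install pandas numpy valohai-utils
      - python preprocess.py

Keep spot for steps you can cheaply re-run if the machine is reclaimed mid-execution.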

3. Document resource requirements
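One lightweight way to do this, shown here as an illustrative sketch, is a short comment next to each environment key explaining why the node needs that machine:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    # Why p3-8xlarge: training is distributed across 4x V100 GPUs;
    # smaller environments should not be swapped in without re-profiling.
    environment: aws-eu-west-1-p3-8xlarge
    command:
      - python train.py {parameters}

Because the note lives in valohai.yaml, it is versioned together with the configuration it explains.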

Cost optimization strategies

Right-size your resources

  • Data prep: High CPU/memory, no GPU

  • Training: GPU instances only for actual training

  • Evaluation: Minimal resources

  • Deployment prep: Standard instances

Example cost comparison

Exact prices depend on your cloud and region, but a multi-GPU instance typically costs an order of magnitude more per hour than a general-purpose instance, which in turn costs far more than a small burstable one. Keeping CPU-bound preprocessing and evaluation off GPU hardware is therefore usually the single biggest saving in a pipeline.

Troubleshooting

Node fails with SIGKILL (9) or "out of memory"

Upgrade the node to an instance with more RAM.
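For example, if the preprocessing step from earlier exhausts the 64GB of its m5-4xlarge environment, the sketch of a fix is to point the step at a larger-memory slug. The m5-8xlarge slug below is illustrative; check vh environments for the slugs enabled for your project:

- step:
    name: preprocess-dataset
    image: python:3.9
    environment: aws-eu-west-1-m5-8xlarge  # 32 vCPUs, 128GB RAM
    command:
      - pip install pandas numpy valohai-utils
      - python preprocess.py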

GPU not detected

Ensure:

  1. Environment has GPUs: Check with vh environments --details

  2. Docker image supports GPU: use a CUDA-enabled image, such as the tensorflow/tensorflow:2.6.0-gpu image used in the training step above

  3. Code checks for GPU: torch.cuda.is_available()

  4. Run nvidia-smi as part of your execution (see the sketch after this list) to check whether the job can see a GPU
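A quick way to wire in that last check, sketched against the training step from earlier:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0-gpu
    environment: aws-eu-west-1-p3-8xlarge
    command:
      - nvidia-smi  # logs the GPUs visible to the job, or fails fast if there are none
      - python train.py {parameters}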

Environment not found

The environment slug in your valohai.yaml is misspelled or not enabled for your project. List the slugs you can actually use with vh environments and copy the value exactly.
