Debug Pipeline Failures

When pipelines fail, quickly identify whether the issue is at the node level (execution failure) or pipeline level (configuration error). This guide covers both types of failures and debugging strategies.

Understanding pipeline logs

Pipelines have two types of logs:

Node logs

Individual execution logs for each step:

  1. Click on any node in the pipeline graph

  2. View the execution details and logs

  3. Check Logs tab for error messages

Pipeline logs

System-level logs for the pipeline orchestration:

  1. Navigate to the pipeline view

  2. Click the Logs tab

  3. Look for configuration or dependency errors

Common failure patterns

Node execution failures

Symptom: Node shows as "Failed" in red

How to debug in the web UI:

  1. Click the failed node

  2. Select "View execution"

  3. Check the Logs tab

Common causes:

  • Code errors (Python exceptions, import failures)

  • Out of memory or disk space

  • Missing dependencies in Docker image

  • Incorrect file paths

Pipeline configuration errors

Symptom: Pipeline fails to start or shows "Crashed"

Pipeline log messages and solutions:

Node "train" transitioned to crashed

Cause: Execution failed within the node Fix: Check node's execution logs for the actual error

Stopping due to 1 incompletable edges

Cause: Required inputs missing Fix: Verify all edges are correctly defined and source nodes produce expected outputs

No valid environment found for node

Cause: Specified environment doesn't exist or user lacks access Fix: Check environment slug with vh environments and verify permissions

Step-by-step debugging process

1. Examine pipeline logs first

Look for orchestration issues:

  • Missing edges

  • Invalid node references

  • Parameter mismatches

  • Environment problems

2. Check individual node logs

For execution failures:

Via web interface:

  1. Click on the failed node (red) in the pipeline graph

  2. Click "View execution" in the popup

  3. Navigate to the "Logs" tab

  4. Use the log filters to show/hide stdout, stderr, or system logs

Via CLI:

# View specific node logs
vh execution logs EXECUTION_ID

# Or download full logs
vh execution logs EXECUTION_ID > debug_logs.txt

3. Verify data flow

Ensure outputs exist and match expected names:

# In your code, add debug output
import os
print("Files in output:", os.listdir('/valohai/outputs'))

Preventing silent failures

Problem: A command in a step fails, but the execution still shows "Completed"

By default, Valohai runs all of a step's commands in sequence even if one fails:

# Problematic configuration
command:
  - python preprocess.py     # Fails
  - python train.py          # Still runs!
  - python evaluate.py       # Also runs

Solution: Add error handling

# Fail fast on any error
command:
  - set -e  # Exit on first error
  - python preprocess.py
  - python train.py
  - python evaluate.py

Or make each command exit explicitly on failure (the -u flag disables Python's output buffering so logs appear immediately):

command:
  - python -u preprocess.py || exit 1
  - python -u train.py || exit 1
  - python -u evaluate.py || exit 1

YAML configuration debugging

Lint before committing

Always validate your valohai.yaml:

vh lint

# Example error:
# error: PipelineParameter.__init__() missing 'targets'

Common YAML issues

Missing targets:

# Wrong
parameters:
  - name: batch_size
    default: 32
# Correct  
parameters:
  - name: batch_size
    targets:
      - train.parameters.batch_size
    default: 32

Incorrect indentation:

# Wrong (3 spaces)
nodes:
   - name: train
# Correct (2 spaces)
nodes:
  - name: train

Advanced debugging techniques

1. Add debug nodes

Insert lightweight debug nodes between steps:

- name: debug-features
  type: execution
  step: debug-step
  command:
    - ls -la /valohai/inputs/
    - head -n 5 /valohai/inputs/features/*
    - echo "File count: $(ls /valohai/inputs/features | wc -l)"

2. Use conditional debugging

Add debug output based on parameters:

import valohai

debug_mode = valohai.parameters('debug').value
if debug_mode:
    print("=== DEBUG: Input shapes ===")
    print(f"Training data: {X_train.shape}")
    print(f"First 5 samples:\n{X_train[:5]}")

3. Implement checkpoint logging

Log progress at key points:

import json
from datetime import datetime

def log_checkpoint(stage, metrics):
    """Print a JSON progress marker that is easy to grep for in the logs."""
    checkpoint = {
        "stage": stage,
        "timestamp": datetime.now().isoformat(),
        "metrics": metrics,
    }
    print(json.dumps(checkpoint))

# Usage
log_checkpoint("preprocessing_complete", {"samples": len(data)})
log_checkpoint("training_started", {"epochs": epochs})

Quick reference: Error messages

| Error | Location | Likely Cause | Solution |
| --- | --- | --- | --- |
| "No such file or directory" | Node logs | Missing input file | Check edge definitions and output names |
| "Out of memory" | Node logs | Insufficient resources | Use a larger environment |
| "incompletable edges" | Pipeline logs | Missing node outputs | Verify the source node completed successfully |
| "Module not found" | Node logs | Missing dependency | Add it to the Docker image or pip install |
| "Permission denied" | Node logs | File access issue | Check file permissions in outputs |

Best practices

  1. Always use set -e in multi-command steps

  2. Validate YAML before committing with vh lint

  3. Log liberally during development

  4. Name outputs clearly to avoid edge mismatches

  5. Test nodes individually before pipeline integration

  6. Use version control for configurations
