Debug Pipeline Failures

When pipelines fail, quickly identify whether the issue is at the node level (execution failure) or pipeline level (configuration error). This guide covers both types of failures and debugging strategies.

Understanding pipeline logs

Pipelines have two types of logs:

Node logs

Individual execution logs for each step:

  1. Click on any node in the pipeline graph

  2. View the execution details and logs

  3. Check Logs tab for error messages

Pipeline logs

System-level logs for the pipeline orchestration:

  1. Navigate to the pipeline view

  2. Click the Logs tab

  3. Look for configuration or dependency errors

Common failure patterns

Node execution failures

Symptom: Node shows as "Failed" in red

How to debug in the web UI:

  1. Click the failed node

  2. Select "View execution"

  3. Check the Logs tab

Common causes:

  • Code errors (Python exceptions, import failures)

  • Out of memory or disk space

  • Missing dependencies in Docker image

  • Incorrect file paths

Pipeline configuration errors

Symptom: Pipeline fails to start or shows "Crashed"

Pipeline log messages and solutions:

Node "train" transitioned to crashed

Cause: Execution failed within the node Fix: Check node's execution logs for the actual error

Stopping due to 1 incompletable edges

Cause: Required inputs missing Fix: Verify all edges are correctly defined and source nodes produce expected outputs

No valid environment found for node

Cause: Specified environment doesn't exist or user lacks access Fix: Check environment slug with vh environments and verify permissions

Step-by-step debugging process

1. Examine pipeline logs first

Look for orchestration issues:

  • Missing edges

  • Invalid node references

  • Parameter mismatches

  • Environment problems

2. Check individual node logs

For execution failures:

Via web interface:

  1. Click on the failed node (red) in the pipeline graph

  2. Click "View execution" in the popup

  3. Navigate to the "Logs" tab

  4. Use the log filters to show/hide stdout, stderr, or system logs

Via CLI:

# View specific node logs
vh execution logs EXECUTION_ID

# Or download full logs
vh execution logs EXECUTION_ID > debug_logs.txt

3. Verify data flow

Ensure outputs exist and match expected names:

# In your code, add debug output
import os
print("Files in output:", os.listdir('/valohai/outputs'))

Preventing silent failures

Problem: A command in a step fails, but the execution still shows "Completed"

By default, Valohai runs all of a step's commands in sequence even if one fails:

# Problematic configuration
command:
  - python preprocess.py     # Fails
  - python train.py          # Still runs!
  - python evaluate.py       # Also runs

Solution: Add error handling

# Fail fast on any error
command:
  - set -e  # Exit on first error
  - python preprocess.py
  - python train.py
  - python evaluate.py

Or make each command exit explicitly on failure (the -u flag disables Python's output buffering so logs appear immediately):

command:
  - python -u preprocess.py || exit 1
  - python -u train.py || exit 1
  - python -u evaluate.py || exit 1

YAML configuration debugging

Lint before committing

Always validate your valohai.yaml:

vh lint

# Example error:
# error: PipelineParameter.__init__() missing 'targets'

Common YAML issues

Missing targets:

# Wrong
parameters:
  - name: batch_size
    default: 32
# Correct  
parameters:
  - name: batch_size
    targets:
      - train.parameters.batch_size
    default: 32

Incorrect indentation:

# Wrong (3 spaces)
nodes:
   - name: train
# Correct (2 spaces)
nodes:
  - name: train

Advanced debugging techniques

1. Add debug nodes

Insert lightweight debug nodes between steps:

- name: debug-features
  type: execution
  step: debug-step
  command:
    - ls -la /valohai/inputs/
    - head -n 5 /valohai/inputs/features/*
    - echo "File count: $(ls /valohai/inputs/features | wc -l)"

2. Use conditional debugging

Add debug output based on parameters:

import valohai

debug_mode = valohai.parameters('debug').value
if debug_mode:
    print("=== DEBUG: Input shapes ===")
    print(f"Training data: {X_train.shape}")
    print(f"First 5 samples:\n{X_train[:5]}")

3. Implement checkpoint logging

Log progress at key points:

import json
from datetime import datetime

def log_checkpoint(stage, metrics):
    """Print a JSON progress marker that is easy to grep for in the logs."""
    checkpoint = {
        "stage": stage,
        "timestamp": datetime.now().isoformat(),
        "metrics": metrics,
    }
    print(json.dumps(checkpoint))

# Usage
log_checkpoint("preprocessing_complete", {"samples": len(data)})
log_checkpoint("training_started", {"epochs": epochs})

Quick reference: Error messages

| Error | Location | Likely Cause | Solution |
| --- | --- | --- | --- |
| "No such file or directory" | Node logs | Missing input file | Check edge definitions and output names |
| "Out of memory" | Node logs | Insufficient resources | Use a larger environment |
| "incompletable edges" | Pipeline logs | Missing node outputs | Verify the source node completed successfully |
| "Module not found" | Node logs | Missing dependency | Add it to the Docker image or pip install |
| "Permission denied" | Node logs | File access issue | Check file permissions in outputs |

Best practices

  1. Always use set -e in multi-command steps

  2. Validate YAML before committing with vh lint

  3. Log liberally during development

  4. Name outputs clearly to avoid edge mismatches

  5. Test nodes individually before pipeline integration

  6. Use version control for configurations
