# Debug Pipeline Failures
When a pipeline fails, first identify whether the issue is at the node level (an execution failure) or the pipeline level (a configuration error). This guide covers both types of failure and strategies for debugging them.
## Understanding pipeline logs
Pipelines have two types of logs:
### Node logs

Individual execution logs for each step:

1. Click on any node in the pipeline graph
2. View the execution details and logs
3. Check the Logs tab for error messages
### Pipeline logs

System-level logs for the pipeline orchestration:

1. Navigate to the pipeline view
2. Click the Logs tab
3. Look for configuration or dependency errors

## Common failure patterns
### Node execution failures
**Symptom:** Node shows as "Failed" in red.
**How to debug** (in the UI):

1. Click the failed node
2. Select "View execution"
3. Check the Logs tab
**Common causes:**

- Code errors (Python exceptions, import failures)
- Out of memory or disk space
- Missing dependencies in the Docker image
- Incorrect file paths
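Many of these causes can be ruled out in seconds with a few defensive checks at the top of the node's script. A minimal sketch, assuming Valohai's default `/valohai/inputs` layout (the input name `dataset` is a placeholder for your own):

```python
import os
import sys

# Placeholder input name; replace with the input defined in your step.
input_dir = "/valohai/inputs/dataset"

# Fail early with a clear message instead of a cryptic traceback later on.
if not os.path.isdir(input_dir) or not os.listdir(input_dir):
    sys.exit(f"No files found in {input_dir}; check the edge feeding this node")

print("Python version:", sys.version)
print("Input files:", os.listdir(input_dir))
```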
### Pipeline configuration errors
**Symptom:** Pipeline fails to start or shows "Crashed".
Pipeline log messages and solutions:

| Pipeline log message | Cause | Fix |
| --- | --- | --- |
| `Node "train" transitioned to crashed` | Execution failed within the node | Check the node's execution logs for the actual error |
| `Stopping due to 1 incompletable edges` | Required inputs missing | Verify all edges are correctly defined and that source nodes produce the expected outputs |
| `No valid environment found for node` | Specified environment doesn't exist or user lacks access | Check the environment slug with `vh environments` and verify permissions |
## Step-by-step debugging process
### 1. Examine pipeline logs first
Look for orchestration issues:

- Missing edges
- Invalid node references
- Parameter mismatches
- Environment problems
### 2. Check individual node logs
For execution failures:

**Via the web interface:**

1. Click on the failed node (red) in the pipeline graph
2. Click "View execution" in the popup
3. Navigate to the "Logs" tab
4. Use the log filters to show/hide stdout, stderr, or system logs
**Via the CLI:**
```bash
# View specific node logs
vh execution logs EXECUTION_ID

# Or download full logs
vh execution logs EXECUTION_ID > debug_logs.txt
```
### 3. Verify data flow

Ensure outputs exist and match expected names:
```python
# In your code, add debug output
import os
print("Files in output:", os.listdir('/valohai/outputs'))
```
## Preventing silent failures

**Problem:** A step fails, but the execution shows "Completed".

By default, Valohai runs all the commands in a step even if one of them fails:
```yaml
# Problematic configuration
command:
  - python preprocess.py  # Fails
  - python train.py       # Still runs!
  - python evaluate.py    # Also runs
```

**Solution:** Add error handling:
```yaml
# Fail fast on any error
command:
  - set -e  # Exit on first error
  - python preprocess.py
  - python train.py
  - python evaluate.py
```

Or guard each command individually:
```yaml
command:
  - python -u preprocess.py || exit 1
  - python -u train.py || exit 1
  - python -u evaluate.py || exit 1
```
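The same principle applies inside the scripts themselves: a step is only marked as failed if its process exits non-zero, so if you catch exceptions for logging purposes, re-raise them or exit explicitly. A sketch of the pattern:

```python
import sys
import traceback

def main():
    ...  # your actual preprocessing or training logic

if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
        sys.exit(1)  # non-zero exit code marks the step as failed
```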
## YAML configuration debugging

### Lint before committing

Always validate your `valohai.yaml`:
```bash
vh lint

# Example error:
# error: PipelineParameter.__init__() missing 'targets'
```
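If the CLI isn't at hand, a quick local sanity check is simply parsing the file with PyYAML. This catches indentation and syntax errors, though not Valohai-specific schema problems the way `vh lint` does:

```python
import yaml  # pip install pyyaml

with open("valohai.yaml") as f:
    config = yaml.safe_load(f)  # raises yaml.YAMLError on syntax/indentation mistakes

print(f"Parsed OK: {len(config)} top-level entries")
```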
### Common YAML issues

**Missing targets:**
```yaml
# Wrong
parameters:
  - name: batch_size
    default: 32
```

```yaml
# Correct
parameters:
  - name: batch_size
    targets:
      - train.parameters.batch_size
    default: 32
```

**Incorrect indentation:**
```yaml
# Wrong (3 spaces)
nodes:
   - name: train
```

```yaml
# Correct (2 spaces)
nodes:
  - name: train
```
## Advanced debugging techniques

### 1. Add debug nodes
Insert lightweight debug nodes between steps:
```yaml
- name: debug-features
  type: execution
  step: debug-step
  command:
    - ls -la /valohai/inputs/
    - head -n 5 /valohai/inputs/features/*
    - echo "File count: $(ls /valohai/inputs/features | wc -l)"
```
### 2. Use conditional debugging

Add debug output based on parameters:
```python
import valohai

debug_mode = valohai.parameters('debug').value
if debug_mode:
    print("=== DEBUG: Input shapes ===")
    print(f"Training data: {X_train.shape}")
    print(f"First 5 samples:\n{X_train[:5]}")
```
### 3. Implement checkpoint logging

Log progress at key points:
```python
import json
from datetime import datetime

def log_checkpoint(stage, metrics):
    checkpoint = {
        "stage": stage,
        "timestamp": datetime.now().isoformat(),
        "metrics": metrics,
    }
    print(json.dumps(checkpoint))

# Usage
log_checkpoint("preprocessing_complete", {"samples": len(data)})
log_checkpoint("training_started", {"epochs": epochs})
```
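Because Valohai collects JSON lines printed to stdout as execution metadata, these checkpoints also show up in the execution's metadata view. valohai-utils wraps the same mechanism in a logger that batches everything logged inside one `with` block into a single JSON line:

```python
import valohai

with valohai.logger() as logger:
    logger.log("stage", "training_started")
    logger.log("epochs", 10)
```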
"No such file or directory"
Node logs
Missing input file
Check edge definitions and output names
"Out of memory"
Node logs
Insufficient resources
Use larger environment
"incompletable edges"
Pipeline logs
Missing node outputs
Verify source node completed successfully
"Module not found"
Node logs
Missing dependency
Add to Docker image or pip install
"Permission denied"
Node logs
File access issue
Check file permissions in outputs
## Best practices
- Always use `set -e` in multi-command steps
- Validate YAML before committing with `vh lint`
- Log liberally during development
- Name outputs clearly to avoid edge mismatches
- Test nodes individually before pipeline integration
- Use version control for configurations