Pipeline Error Handling
Control how your pipeline responds to failures. By default, any node failure stops the entire pipeline, but you can customize this behavior to build more resilient workflows.
Why customize error handling?
Default behavior works for critical paths where every step must succeed. But consider these scenarios:
- Parallel model training: if 9 out of 10 hyperparameter combinations succeed, you want the best model, not a failed pipeline.
- Data quality checks: optional validation shouldn't block core processing.
- A/B testing: one model variant failing shouldn't prevent evaluating others.
- Batch processing: a few failed items shouldn't stop processing thousands of others.
Error handling strategies
stop-all (default)
Any failure stops the entire pipeline immediately.

```yaml
- pipeline:
    name: critical-pipeline
    nodes:
      - name: validate-data
        type: execution
        step: data-validation
        # on-error: stop-all  # implicit
```

Use when: every step is critical to the final output.
continue
The node completes regardless of failures. Downstream nodes still run.

```yaml
- name: hyperparameter-search
  type: task
  on-error: continue
  step: train-model
```

Use when: you expect some failures and want to collect all successful results.
stop-next
A failed node blocks its dependents but allows parallel branches to continue.

```yaml
- name: optional-preprocessing
  type: execution
  on-error: stop-next
  step: enhance-data
```

Use when: the branch is optional, but if it runs, subsequent steps need its output.
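The three strategies differ only in how a failure propagates through the node graph. As a minimal sketch of those semantics (this is not the platform's actual engine; the node names, run outcomes, and the `run_pipeline` helper are all invented for illustration):

```python
# Illustrative simulation of the three on-error policies over a tiny DAG.
# nodes: {name: on-error policy}, in topological order.
# edges: (upstream, downstream) pairs.
# outcomes: {name: True if the node's work succeeds, False if it fails}.
def run_pipeline(nodes, edges, outcomes):
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)

    status = {}
    skipped = set()
    for name, policy in nodes.items():
        if name in skipped:
            status[name] = "skipped"
            continue
        if outcomes[name]:
            status[name] = "succeeded"
            continue
        status[name] = "failed"
        if policy == "stop-all":
            # Abort every node that hasn't finished yet.
            for other in nodes:
                status.setdefault(other, "aborted")
            break
        if policy == "stop-next":
            # Block only this node's (transitive) dependents.
            stack = list(downstream.get(name, []))
            while stack:
                n = stack.pop()
                if n not in skipped:
                    skipped.add(n)
                    stack.extend(downstream.get(n, []))
        # "continue": do nothing — downstream nodes still run.
    return status

status = run_pipeline(
    {"prepare": "stop-all", "optional": "stop-next",
     "enhance": "continue", "report": "continue"},
    [("prepare", "optional"), ("optional", "enhance"), ("prepare", "report")],
    {"prepare": True, "optional": False, "enhance": True, "report": True},
)
print(status)
# → optional fails, enhance (its dependent) is skipped, report still succeeds
```

Note how `stop-next` penalizes only the failed branch: `report`, which depends on `prepare` but not on `optional`, runs to completion.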
Task node considerations
Error handling is especially important for task nodes running parallel executions:
```yaml
- pipeline:
    name: parallel-training
    parameters:
      - name: learning_rates
        target: train-models.parameters.lr
        default: [0.001, 0.01, 0.1, 1.0]  # 1.0 will likely fail
    nodes:
      - name: train-models
        type: task
        on-error: continue  # don't let one bad LR stop everything
        step: train-model
      - name: select-best
        type: execution
        step: compare-models
    edges:
      - [train-models.output.model*, select-best.input.models]
```

With on-error: continue:

- 3 models train successfully with reasonable learning rates.
- 1 fails with lr=1.0.
- select-best receives 3 models and picks the best.
- The pipeline succeeds overall.
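The outcome above can be mimicked in plain Python: with on-error: continue, the task behaves like "run every variation, keep whatever succeeds." Everything in this sketch (the `train` stand-in and its fake loss metric) is invented for illustration and is not real training code:

```python
# Stand-in for a training run: high learning rates "diverge" and raise.
def train(lr):
    if lr >= 1.0:
        raise ValueError(f"loss diverged at lr={lr}")
    return {"lr": lr, "loss": 1.0 / (1.0 + lr)}  # fake metric: lower is better

# on-error: continue — record failures and keep going.
def run_task(learning_rates):
    models, failures = [], []
    for lr in learning_rates:
        try:
            models.append(train(lr))
        except ValueError as err:
            failures.append(str(err))
    return models, failures

models, failures = run_task([0.001, 0.01, 0.1, 1.0])
best = min(models, key=lambda m: m["loss"])  # what select-best would do
print(len(models), len(failures), best["lr"])  # → 3 1 0.1
```

With stop-all semantics, the same loop would re-raise on the first failure and `select-best` would never run; continue trades completeness for resilience.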
Practical example
Here's a pipeline that handles failures gracefully:
```yaml
- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py {parameters}
    parameters:
      - name: architecture
        type: string

- step:
    name: evaluate-model
    image: tensorflow/tensorflow:2.6.0
    command: python evaluate.py
    inputs:
      - name: model

- pipeline:
    name: model-comparison
    parameters:
      - name: architectures
        target: experimental-models.parameters.architecture
        default: ["resnet", "efficientnet", "experimental-v1"]
    nodes:
      # Prepare data - critical step
      - name: prepare-data
        type: execution
        step: preprocess-dataset
        # default on-error: stop-all - data is required

      # Try proven models - expect success
      - name: proven-models
        type: task
        on-error: stop-next  # if these fail, something's wrong
        step: train-model
        override:
          parameters:
            - name: architecture
              default: ["resnet", "efficientnet"]

      # Try experimental model - might fail
      - name: experimental-models
        type: task
        on-error: continue  # don't block the pipeline if experimental fails
        step: train-model

      # Evaluate all successful models
      - name: compare-all
        type: execution
        step: compare-models
    edges:
      - [prepare-data.output.*, proven-models.input.dataset]
      - [prepare-data.output.*, experimental-models.input.dataset]
      - [proven-models.output.model*, compare-all.input.proven-models]
      - [experimental-models.output.model*, compare-all.input.experimental-models]
```

Debugging failed pipelines
View execution logs
Check individual execution logs to understand failures:
1. Click the failed node in the pipeline graph.
2. Select the failed execution.
3. Review the logs for error messages.