By default, a Valohai pipeline stops as soon as any of its nodes encounters an error. This behavior can be customized per node with the on-error setting, which is particularly useful for Task nodes.
Handling Errors in Task Nodes
stop-all (Default)
If any execution within a Task node fails, the entire node will be marked as errored, and the pipeline will stop.
continue
The Task node keeps running even if individual executions fail, and the nodes that follow it are still executed. This is useful when you expect at least one execution to succeed.
stop-next
Only the nodes downstream of the errored node are stopped; the rest of the pipeline is not stopped.
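As a minimal sketch, the setting is written as an on-error key on the node itself inside the pipeline definition. The node and step names below are reused from the example on this page and are only placeholders; the comment lists the three values described above.

```yaml
# Sketch of a single Task node; names are placeholders for illustration.
- name: train
  type: task
  step: train-model
  # One of: stop-all (default), continue, stop-next
  on-error: continue
```

Leaving the key out gives the default stop-all behavior.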
Example
Consider a pipeline with two parallel task nodes, “train” and “train2,” each running two executions. In each node, one execution fails.
- “train” is defined with on-error: stop-next.
- “train2” is defined with on-error: continue.
Due to these on-error rules:
- The pipeline will not execute the “evaluate” node because “train” had one failed execution.
- However, “evaluate2” will be executed because the “on-error” setting for “train2” is “continue.”
YAML example:
- pipeline:
    name: Training Pipeline
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: task
        on-error: stop-next
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate
        type: execution
        step: batch-inference
      - name: train2
        type: task
        on-error: continue
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate2
        type: execution
        step: batch-inference
    edges:
      - [preprocess.output.preprocessed_mnist.npz, train.input.dataset]
      - [preprocess.output.preprocessed_mnist.npz, train2.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [train2.output.model*, evaluate2.input.model]