By default, a Valohai pipeline stops if any of its nodes encounters an error. However, this behavior can be customized with the on-error property, which is particularly useful for Task nodes.
Handling Errors in Task Nodes
The on-error property of a Task node accepts the following values:
- stop-all (default): If any execution within the Task node fails, the entire node is marked as errored and the pipeline stops.
- continue: The Task node keeps running regardless of errors in individual executions. This is useful when you expect at least one execution to succeed.
- stop-next: Only the nodes following the errored node are stopped.
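As a minimal sketch of where the setting goes, the fragment below shows a single Task node with on-error set explicitly. The node and step names are borrowed from the example further down, and the spelling stop-all for the default value is an assumption here; treat this as an illustration rather than a complete pipeline definition.

```yaml
- pipeline:
    name: Training Pipeline
    nodes:
      - name: train
        type: task
        step: train-model
        # on-error is set per node. Assumed values:
        #   stop-all  -> any failed execution stops the whole pipeline (default)
        #   continue  -> the node keeps running; downstream nodes can still run
        #   stop-next -> only the nodes after this one are stopped
        on-error: continue
```

Omitting the on-error line should leave the node on the default behavior.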
Consider a pipeline with two parallel task nodes, “train” and “train2,” each running two executions. In each node, one execution fails.
- “train” is defined with on-error: stop-next.
- “train2” is defined with on-error: continue.
Due to these on-error rules:
- The pipeline will not execute the “evaluate” node because “train” had one failed execution.
- However, “evaluate2” will be executed because the “on-error” setting for “train2” is “continue.”
- pipeline:
    name: Training Pipeline
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: task
        on-error: stop-next
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate
        type: execution
        step: batch-inference
      - name: train2
        type: task
        on-error: continue
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate2
        type: execution
        step: batch-inference
    edges:
      - [preprocess.output.preprocessed_mnist.npz, train.input.dataset]
      - [preprocess.output.preprocessed_mnist.npz, train2.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [train2.output.model*, evaluate2.input.model]