# Pipeline Error Handling

Control how your pipeline responds to failures. By default, any node failure stops the entire pipeline, but you can change this per node with `on-error` to build more resilient workflows.

### Why customize error handling?

The default behavior suits critical paths where every step must succeed. It is too strict for scenarios like these:

* **Parallel model training**: If 9 out of 10 hyperparameter combinations succeed, you want the best model, not a failed pipeline
* **Data quality checks**: Optional validation that shouldn't block core processing
* **A/B testing**: One model variant failing shouldn't prevent evaluating others
* **Batch processing**: A few failed items shouldn't stop processing thousands of others

### Error handling strategies

#### `stop-all` (default)

Any failure stops the entire pipeline immediately.

```yaml
- pipeline:
    name: critical-pipeline
    nodes:
      - name: validate-data
        type: execution
        step: data-validation
        # on-error: stop-all # implicit
```

Use when: Every step is critical to the final output.

#### `continue`

The node counts as complete even if it fails, so downstream nodes still run.

```yaml
- name: hyperparameter-search
  type: task
  on-error: continue
  step: train-model
```

Use when: You expect some failures and want to collect all successful results.

#### `stop-next`

A failed node blocks its dependents, but parallel branches that don't depend on it keep running.

```yaml
- name: optional-preprocessing
  type: execution
  on-error: stop-next
  step: enhance-data
```

Use when: This branch is optional, but if it runs, subsequent steps need its output.
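
To make the difference from `stop-all` concrete, here is a minimal sketch of a pipeline with two independent branches. The pipeline name, the `load-data`, `train-on-enhanced`, and `train-baseline` nodes, and the `load-data` step are hypothetical, and the edges assume the steps declare matching inputs and outputs. Going by the description above, if `optional-preprocessing` fails, `train-on-enhanced` should not start because it depends on the failed node, while `train-baseline` has no edge from it and can still run.

```yaml
- pipeline:
    name: optional-branch-sketch  # hypothetical name
    nodes:
      - name: load-data
        type: execution
        step: load-data  # hypothetical step
      - name: optional-preprocessing
        type: execution
        on-error: stop-next  # a failure here blocks only its dependents
        step: enhance-data
      - name: train-on-enhanced  # depends on optional-preprocessing
        type: execution
        step: train-model
      - name: train-baseline  # parallel branch with no edge from the optional node
        type: execution
        step: train-model
    edges:
      - [load-data.output.*, optional-preprocessing.input.dataset]
      - [optional-preprocessing.output.*, train-on-enhanced.input.dataset]
      - [load-data.output.*, train-baseline.input.dataset]
```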

### Task node considerations

Error handling is especially important for task nodes running parallel executions:

```yaml
- pipeline:
    name: parallel-training
    parameters:
      - name: learning_rates
        target: train-models.parameters.lr
        default: [0.001, 0.01, 0.1, 1.0]  # 1.0 will likely fail
    nodes:
      - name: train-models
        type: task
        on-error: continue  # Don't let one bad LR stop everything
        step: train-model
      - name: select-best
        type: execution
        step: compare-models
    edges:
      - [train-models.output.model*, select-best.input.models]
```

With `on-error: continue`, and assuming only the `lr=1.0` execution fails:

* 3 models train successfully with reasonable learning rates
* 1 execution fails with `lr=1.0`
* `select-best` receives 3 models and picks the best
* The pipeline succeeds overall
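
For contrast, here is a sketch of the same two nodes with the default strategy spelled out. Based on the `stop-all` description earlier on this page, the failing `lr=1.0` execution would be expected to stop the pipeline, so `select-best` would not run even though three models trained successfully. Only the `on-error` value differs from the `parallel-training` pipeline.

```yaml
nodes:
  - name: train-models
    type: task
    on-error: stop-all  # default; a failure in this node stops the whole pipeline
    step: train-model
  - name: select-best
    type: execution
    step: compare-models  # not expected to run if train-models fails
```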

### Practical example

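The following pipeline combines all three strategies: data preparation is critical (`stop-all`), a failure in the proven models blocks only their dependents (`stop-next`), and the experimental model is allowed to fail (`continue`).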

```yaml
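# The preprocess-dataset step used by prepare-data is assumed to be
# defined elsewhere in the same valohai.yaml.
- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py {parameters}
    parameters:
      - name: architecture
        type: string
    inputs:
      - name: dataset  # filled by the prepare-data edges below

- step:
    name: compare-models
    image: tensorflow/tensorflow:2.6.0
    command: python evaluate.py
    inputs:
      # One input per upstream task node, matching the edges below
      - name: proven-models
      - name: experimental-models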
- pipeline:
    name: model-comparison
    parameters:
      - name: architectures
        target: experimental-models.parameters.architecture
        default: ["experimental-v1"]  # proven architectures are set on proven-models below
    nodes:
      # Prepare data - critical step
      - name: prepare-data
        type: execution
        step: preprocess-dataset
        # Default on-error: stop-all - data is required

      # Try proven models - expect success
      - name: proven-models
        type: task
        on-error: stop-next  # If these fail, something's wrong
        step: train-model
        override:
          parameters:
            - name: architecture
              default: ["resnet", "efficientnet"]

      # Try experimental model - might fail
      - name: experimental-models
        type: task
        on-error: continue  # Don't block pipeline if experimental fails
        step: train-model

      # Evaluate all successful models
      - name: compare-all
        type: execution
        step: compare-models

    edges:
      - [prepare-data.output.*, proven-models.input.dataset]
      - [prepare-data.output.*, experimental-models.input.dataset]
      - [proven-models.output.model*, compare-all.input.proven-models]
      - [experimental-models.output.model*, compare-all.input.experimental-models]
```

### Debugging failed pipelines

#### View execution logs

Check individual execution logs to understand failures:

1. Click on the failed node in the pipeline graph
2. Select the failed execution
3. Review logs for error messages

