# Retry Executions After Connection Errors

Build resilient pipelines that recover from temporary errors automatically. When a database connection times out, an API rate-limits your requests, or cloud storage becomes temporarily unavailable, this pattern retries your execution after a configurable delay.

This suits production ML workflows where transient failures shouldn't require manual intervention.

### How It Works <a href="#how-it-works" id="how-it-works"></a>

When your execution encounters a connection error, it:

1. Catches the exception
2. Waits for a configurable delay period
3. Uses the Valohai API to start a fresh retry execution
4. Tags the new execution with the retry count to prevent infinite loops
5. Exits with an error code so the failed attempt is visible as errored

If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.

### Prerequisites <a href="#prerequisites" id="prerequisites"></a>

* A Valohai project with a step that might encounter transient errors
* Python 3.8+ with `valohai-utils` installed
* Understanding of exception handling in Python

### Generate a Valohai API Token <a href="#generate-a-valohai-api-token" id="generate-a-valohai-api-token"></a>

Create an API token to allow your executions to start new retries:

1. Click **"Hi, \<username>!"** in the top-right corner
2. Go to **My Profile → Authentication**
3. Click **Manage Tokens** and scroll to the bottom
4. Click **Generate New Token**
5. Copy the token immediately — it's shown only once

> ⚠️ **Keep your token safe.** Never commit tokens to version control. Store it as a secret environment variable in Valohai.

#### Store Token as Environment Variable <a href="#store-token-as-environment-variable" id="store-token-as-environment-variable"></a>

Add your token as a secret environment variable so your executions can access it:

**Option 1: Project-wide (recommended)**

1. Go to **Project Settings → Environment Variables**
2. Add a new variable:
   * **Name:** `VH_API_TOKEN`
   * **Value:** Your token
   * **Check:** "Secret" checkbox
3. This variable is available to all executions in the project

**Option 2: Per-execution**

1. When creating an execution, go to the **Environment Variables** section
2. Add `VH_API_TOKEN` with your token value
3. This variable is only available to that specific execution

Access environment variables in your code:

```python
import os

auth_token = os.environ["VH_API_TOKEN"]
```
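A bare `os.environ[...]` lookup raises a `KeyError` when the variable is missing, which can be hard to spot in execution logs. The sketch below shows one way to fail with a more descriptive message. The helper name `get_api_token` is hypothetical and not part of `valohai-utils`.

```python
import os


def get_api_token(env=os.environ, name="VH_API_TOKEN"):
    """Return the API token, or raise a descriptive error if it is missing.

    Hypothetical helper for illustration; `env` is injectable for testing.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set. Add it as a secret environment variable "
            "in the project settings or on the execution."
        )
    return token
```

Calling `get_api_token()` early in the script surfaces a missing token before any work is done, rather than at the moment a retry is needed.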

### Configure Retry Parameters <a href="#configure-retry-parameters" id="configure-retry-parameters"></a>

Add retry configuration parameters to your `valohai.yaml`:

```yaml
- step:
    name: train-model
    image: python:3.9
    command:
      - pip install valohai-utils requests
      - python train.py
    parameters:
      - name: retries
        type: integer
        default: 3
        description: Maximum number of retry attempts
      - name: delay
        type: integer
        default: 60
        description: Seconds to wait before retrying
```

These parameters control:

* `retries`: How many times to retry after the initial failure
* `delay`: How long to wait between retry attempts (in seconds)
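To see what a given configuration costs in the worst case, the arithmetic can be sketched as follows. This assumes every attempt fails and counts only the waiting time, not the run time of each attempt. The function name `retry_budget` is illustrative.

```python
def retry_budget(retries, delay):
    """Worst case when every attempt fails.

    Returns (total executions, total seconds spent waiting between attempts).
    """
    total_executions = retries + 1  # the initial run plus each retry
    total_wait = retries * delay    # one wait before each retry is started
    return total_executions, total_wait


print(retry_budget(3, 60))  # (4, 180)
```

With the defaults above, a persistent failure produces four executions and three minutes of waiting before the pipeline gives up.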

### Implement Retry Logic <a href="#implement-retry-logic" id="implement-retry-logic"></a>

Add automatic retry logic to your training script:

```python
import valohai
import time
import json
import requests
import os
import sys
import random

# Get retry configuration from parameters
retries = valohai.parameters("retries").value
delay = valohai.parameters("delay").value

# Read execution metadata from the Valohai config file
# This file is automatically available in all executions
with open("/valohai/config/execution.json") as f:
    data = json.load(f)

# Extract current execution details
tags = data["valohai.execution-tags"]
project_id = data["valohai.project-id"]
commit = data["valohai.commit-identifier"]
step_name = data["valohai.execution-step"]

# Check if this execution is already a retry
# Expected tag format: "retry: n" where n is the retry count
retry_count = 0
for tag in tags or []:
    # Match only the exact "retry: n" prefix so unrelated tags are ignored
    if tag.startswith("retry: "):
        retry_count = int(tag.split(" ")[1])

print(f"Starting execution (retry count: {retry_count})...")

# Your main execution logic wrapped in exception handling
try:
    # === YOUR CODE GOES HERE ===
    # Replace this section with your actual training/processing code
    # Example: database connections, API calls, cloud storage access

    # EXAMPLE ONLY: Simulating transient connection errors
    # Remove this in your actual implementation
    k = random.randint(0, 3)

    if k == 0:
        # Simulate successful execution
        print("Connection successful, executing your code...")
        # Your actual work happens here
        print("Execution completed successfully!")
    else:
        # Simulate a connection failure
        raise Exception("Database connection interrupted.")

    # === END OF YOUR CODE ===

except Exception as err:
    # Connection or execution failed
    print(f"Error encountered: {err}")
    print(f"Error type: {type(err)}")

    # Check if we have retries remaining
    if retry_count < retries:
        # Update execution status with failure details
        if retry_count == 0:
            valohai.set_status_detail(
                f"Failed due to: {err}. Will retry up to {retries} times.",
            )
        else:
            valohai.set_status_detail(
                f"Retry {retry_count}/{retries} failed due to: {err}",
            )

        print("Connection error detected.")
        print(f"Waiting {delay} seconds before retrying...")
        time.sleep(delay)

        # Prepare to start a new retry execution via API
        auth_token = os.environ["VH_API_TOKEN"]
        headers = {"Authorization": f"Token {auth_token}"}

        # Tag the new execution with the updated retry count
        new_retry_tag = f"retry: {retry_count + 1}"

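        # Build the payload from the current project, commit, and step.
        # Parameters and inputs are not included here, so the retry is
        # expected to run with the step's defaults.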
        new_execution_json = {
            "project": project_id,
            "commit": commit,
            "step": step_name,
            "tags": [new_retry_tag],
        }

        # Trigger the retry execution (the timeout keeps a hung request
        # from blocking this execution indefinitely)
        resp = requests.post(
            "https://app.valohai.com/api/v0/executions/",
            headers=headers,
            json=new_execution_json,
            timeout=30,
        )
        resp.raise_for_status()

        # Log the retry execution URL
        display_url = resp.json()["urls"]["display"]
        print(f"Retry execution started: {display_url}")

        # Exit with error code to mark this execution as failed
        # This ensures the failure is tracked in Valohai's UI
        sys.exit(1)

    else:
        # No retries remaining
        valohai.set_status_detail(
            f"Failed due to: {err}. Maximum retries ({retries}) exhausted.",
        )
        print("Connection error detected.")
        print(f"Maximum number of retries ({retries}) exhausted.")

        # Exit with error code to mark this execution as failed
        sys.exit(1)
```

### How It Works in Practice <a href="#how-it-works-in-practice" id="how-it-works-in-practice"></a>

**First execution fails:**

* Catches the exception
* Waits 60 seconds (default delay)
* Creates a new execution tagged `retry: 1`
* Exits with error status

**Retry execution fails:**

* Detects `retry: 1` tag
* Increments to `retry: 2`
* Repeats the process

**Retry succeeds:**

* Completes normally without triggering another retry
* No error status, execution marked as complete

**All retries exhausted:**

* Final execution exits with a clear error message
* Status detail shows retry count and reason
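The chain described above can be traced without running any executions. The sketch below isolates the tag bookkeeping into two pure functions and walks through the sequence of tags that would be produced if every attempt failed. The function names are illustrative and not part of the script or of `valohai-utils`.

```python
def parse_retry_count(tags):
    """Return the retry count from a 'retry: n' tag, or 0 if none is present."""
    count = 0
    for tag in tags or []:
        if tag.startswith("retry: "):
            try:
                count = int(tag[len("retry: "):])
            except ValueError:
                continue  # ignore malformed tags such as "retry: abc"
    return count


def next_retry_tag(tags, retries):
    """Return the tag for the next retry, or None when retries are exhausted."""
    count = parse_retry_count(tags)
    if count < retries:
        return f"retry: {count + 1}"
    return None


# Trace the chain when every attempt fails and retries=3
tags = []
chain = []
while (tag := next_retry_tag(tags, retries=3)) is not None:
    chain.append(tag)
    tags = [tag]  # the next execution carries only the new retry tag

print(chain)  # ['retry: 1', 'retry: 2', 'retry: 3']
```

The loop ends once an execution tagged `retry: 3` fails, which is the point where the script reports that the maximum number of retries is exhausted.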

### Verify It Works <a href="#verify-it-works" id="verify-it-works"></a>

Test your retry logic:

1. Create an execution with intentional failures (or keep the random simulation)
2. Watch the execution fail and automatically retry after the delay
3. Check execution tags in the UI to see `retry: n` labels
4. Review status details for failure reasons and retry counts

You can also query retry executions programmatically:

* [Fetch failed executions via API](/automation-overview/rest-api/examples/fetch-failed-executions.md)

### Best Practices <a href="#best-practices" id="best-practices"></a>

**Set appropriate delays:**

* Too short: May retry before the transient issue resolves
* Too long: Wastes time on permanent failures
* Start with 60-120 seconds for network issues

**Limit retry attempts:**

* 3-5 retries work well for most transient failures
* More retries risk masking real problems

**Monitor retry patterns:**

* Frequent retries may indicate systemic issues
* Review failed executions regularly

**Use specific exceptions:**

* Catch only connection-related exceptions when possible
* Let code errors fail immediately without retries
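One way to apply this is to list the exception types you consider transient and let everything else propagate. The sketch below uses only Python's built-in `ConnectionError` and `TimeoutError`; which exceptions your own code raises is an assumption you need to check, and the names `TRANSIENT_ERRORS` and `is_transient` are illustrative.

```python
# Built-in exception types treated as transient in this sketch.
# Library-specific classes (for example requests.exceptions.ConnectionError,
# which does not inherit from the built-in ConnectionError) would need to be
# added to this tuple explicitly.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def is_transient(err):
    """Return True if the error is worth retrying."""
    return isinstance(err, TRANSIENT_ERRORS)


def classify(work):
    """Run `work` and report whether a failure should trigger a retry.

    Non-transient errors are not caught, so code bugs fail immediately.
    """
    try:
        work()
        return "ok"
    except TRANSIENT_ERRORS as err:
        return f"retry ({type(err).__name__})"
```

In the retry script, this would mean replacing `except Exception as err:` with `except TRANSIENT_ERRORS as err:` so that a `ValueError` or `KeyError` in your own code ends the execution without starting a retry.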

### Troubleshooting <a href="#troubleshooting" id="troubleshooting"></a>

**Retries don't start:**

* Verify that the `VH_API_TOKEN` environment variable is set correctly
* Check that the token has permissions to create executions
* Review execution logs for API error responses

**Infinite retry loops:**

* Ensure retry count is properly tracked in tags
* Verify the retry count comparison logic

**Self-hosted Valohai:** Replace `https://app.valohai.com` with your installation URL in the retry code.
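To avoid hard-coding the host, the endpoint can be built from an environment variable. The sketch below assumes a variable named `VH_BASE_URL`, which is a hypothetical name chosen for this example and not something Valohai sets for you; the `/api/v0/executions/` path is the one used in the retry script.

```python
import os

DEFAULT_BASE_URL = "https://app.valohai.com"


def executions_endpoint(env=os.environ):
    """Build the executions endpoint from a configurable base URL.

    VH_BASE_URL is a hypothetical variable name used for illustration.
    """
    base_url = env.get("VH_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return f"{base_url}/api/v0/executions/"
```

The result can then be passed to `requests.post(...)` in place of the literal URL, so the same script works on both hosted and self-hosted installations.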

