Retry Executions After Connection Errors

Build resilient pipelines that automatically recover from temporary errors. When database connections timeout, APIs rate-limit, or cloud storage becomes temporarily unavailable, this pattern automatically retries your execution after a configurable delay.

Perfect for production ML workflows where transient failures shouldn't require manual intervention.

How It Works

When your execution encounters a connection error:

The execution catches the exception
Waits for a configurable delay period
Uses the Valohai API to start a fresh retry execution
Tracks retry count with tags to prevent infinite loops
Marks the failed attempt as errored for visibility

If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.

Prerequisites

A Valohai project with a step that might encounter transient errors
Python 3.8+ with valohai-utils installed
Understanding of exception handling in Python

Generate a Valohai API Token

Create an API token to allow your executions to start new retries:

Click "Hi, <username>!" in the top-right corner
Go to My Profile → Authentication
Click Manage Tokens and scroll to the bottom
Click Generate New Token
Copy the token immediately — it's shown only once

⚠️ Keep your token safe. Never commit tokens to version control. Store it as a secret environment variable in Valohai.

Store Token as Environment Variable

Add your token as a secret environment variable so your executions can access it:

Option 1: Project-wide (recommended)

Go to Project Settings → Environment Variables
Add a new variable:
- Name: VH_API_TOKEN
- Value: Your token
- Check: "Secret" checkbox
This variable is available to all executions in the project

Option 2: Per-execution

When creating an execution, go to Environment Variables section
Add VH_API_TOKEN with your token value
This variable is only available to that specific execution

Access environment variables in your code:

import os

auth_token = os.environ["VH_API_TOKEN"]

Configure Retry Parameters

Add retry configuration parameters to your valohai.yaml:

- step:
    name: train-model
    image: python:3.9
    command:
      - pip install valohai-utils requests
      - python train.py
    parameters:
      - name: retries
        type: integer
        default: 3
        description: Maximum number of retry attempts
      - name: delay
        type: integer
        default: 60
        description: Seconds to wait before retrying

These parameters control:

retries: How many times to retry after the initial failure
delay: How long to wait between retry attempts (in seconds)

Implement Retry Logic

Add automatic retry logic to your training script:

import valohai
import time
import json
import requests
import os
import sys
import random

# Get retry configuration from parameters
retries = valohai.parameters("retries").value
delay = valohai.parameters("delay").value

# Read execution metadata from the Valohai config file
# This file is automatically available in all executions
f = open("/valohai/config/execution.json")
data = json.load(f)
f.close()

# Extract current execution details
tags = data["valohai.execution-tags"]
project_id = data["valohai.project-id"]
commit = data["valohai.commit-identifier"]
step_name = data["valohai.execution-step"]

# Check if this execution is already a retry
# Expected tag format: "retry: n" where n is the retry count
retry_count = 0
if tags:
    for tag in tags:
        if "retry" in tag:
            retry_count = int(tag.split(" ")[1])

print(f"Starting execution (retry count: {retry_count})...")

# Your main execution logic wrapped in exception handling
try:
    # === YOUR CODE GOES HERE ===
    # Replace this section with your actual training/processing code
    # Example: database connections, API calls, cloud storage access

    # EXAMPLE ONLY: Simulating transient connection errors
    # Remove this in your actual implementation
    k = random.randint(0, 3)

    if k == 0:
        # Simulate successful execution
        print("Connection successful, executing your code...")
        # Your actual work happens here
        print("Execution completed successfully!")
    else:
        # Simulate a connection failure
        raise Exception("Database connection interrupted.")

    # === END OF YOUR CODE ===

except Exception as err:
    # Connection or execution failed
    print(f"Error encountered: {err}")
    print(f"Error type: {type(err)}")

    # Check if we have retries remaining
    if retry_count < retries:
        # Update execution status with failure details
        if retry_count == 0:
            valohai.set_status_detail(
                f"Failed due to: {err}. Will retry up to {retries} times.",
            )
        else:
            valohai.set_status_detail(
                f"Retry {retry_count}/{retries} failed due to: {err}",
            )

        print("Connection error detected.")
        print(f"Waiting {delay} seconds before retrying...")
        time.sleep(delay)

        # Prepare to start a new retry execution via API
        auth_token = os.environ["VH_API_TOKEN"]
        headers = {"Authorization": f"Token {auth_token}"}

        # Tag the new execution with the updated retry count
        new_retry_tag = f"retry: {retry_count + 1}"

        # Create execution payload matching the current execution
        new_execution_json = {
            "project": project_id,
            "commit": commit,
            "step": step_name,
            "tags": [new_retry_tag],
        }

        # Trigger the retry execution
        resp = requests.post(
            "https://app.valohai.com/api/v0/executions/",
            headers=headers,
            json=new_execution_json,
        )
        resp.raise_for_status()

        # Log the retry execution URL
        display_url = resp.json()["urls"]["display"]
        print(f"Retry execution started: {display_url}")

        # Exit with error code to mark this execution as failed
        # This ensures the failure is tracked in Valohai's UI
        sys.exit(1)

    else:
        # No retries remaining
        valohai.set_status_detail(
            f"Failed due to: {err}. Maximum retries ({retries}) exhausted.",
        )
        print("Connection error detected.")
        print(f"Maximum number of retries ({retries}) exhausted.")

        # Exit with error code to mark this execution as failed
        sys.exit(1)

How It Works in Practice

First execution fails:

Catches the exception
Waits 60 seconds (default delay)
Creates a new execution tagged retry: 1
Exits with error status

Retry execution fails:

Detects retry: 1 tag
Increments to retry: 2
Repeats the process

Retry succeeds:

Completes normally without triggering another retry
No error status, execution marked as complete

All retries exhausted:

Final execution exits with clear error message
Status detail shows retry count and reason

Verify It Works

Test your retry logic:

Create an execution with intentional failures (or keep the random simulation)
Watch the execution fail and automatically retry after the delay
Check execution tags in the UI to see retry: n labels
Review status details for failure reasons and retry counts

You can also query retry executions programmatically:

Fetch failed executions via API

Best Practices

Set appropriate delays:

Too short: May retry before transient issue resolves
Too long: Wastes time on permanent failures
Start with 60-120 seconds for network issues

Limit retry attempts:

3-5 retries work well for most transient failures
More retries risk masking real problems

Monitor retry patterns:

Frequent retries may indicate systemic issues
Review failed executions regularly

Use specific exceptions:

Catch only connection-related exceptions when possible
Let code errors fail immediately without retries

Troubleshooting

Retries don't start:

Verify VH_API_TOKEN environment variable is set correctly
Check that the token has permissions to create executions
Review execution logs for API error responses

Infinite retry loops:

Ensure retry count is properly tracked in tags
Verify the retry count comparison logic

Self-hosted Valohai: Replace https://app.valohai.com with your installation URL in the retry code

PreviousFetch Failed Executions NextObservability & Analytics

Last updated 1 month ago

Was this helpful?