Retry Executions After Connection Errors

Build resilient pipelines that automatically recover from temporary errors. When database connections timeout, APIs rate-limit, or cloud storage becomes temporarily unavailable, this pattern automatically retries your execution after a configurable delay.

Perfect for production ML workflows where transient failures shouldn't require manual intervention.

How It Works

When your execution encounters a connection error:

  1. The execution catches the exception

  2. Waits for a configurable delay period

  3. Uses the Valohai API to start a fresh retry execution

  4. Tracks retry count with tags to prevent infinite loops

  5. Marks the failed attempt as errored for visibility

If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.

Prerequisites

  • A Valohai project with a step that might encounter transient errors

  • Python 3.8+ with valohai-utils installed

  • Understanding of exception handling in Python

Generate a Valohai API Token

Create an API token to allow your executions to start new retries:

  1. Click "Hi, <username>!" in the top-right corner

  2. Go to My Profile → Authentication

  3. Click Manage Tokens and scroll to the bottom

  4. Click Generate New Token

  5. Copy the token immediately — it's shown only once

⚠️ Keep your token safe. Never commit tokens to version control. Store it as a secret environment variable in Valohai.

Store Token as Environment Variable

Add your token as a secret environment variable so your executions can access it:

Option 1: Project-wide (recommended)

  1. Go to Project Settings → Environment Variables

  2. Add a new variable:

    • Name: VH_API_TOKEN

    • Value: Your token

    • Check: "Secret" checkbox

  3. This variable is available to all executions in the project

Option 2: Per-execution

  1. When creating an execution, go to Environment Variables section

  2. Add VH_API_TOKEN with your token value

  3. This variable is only available to that specific execution

Access environment variables in your code:

import os

auth_token = os.environ['VH_API_TOKEN']

Configure Retry Parameters

Add retry configuration parameters to your valohai.yaml:

- step:
    name: train-model
    image: python:3.9
    command:
      - pip install valohai-utils requests
      - python train.py
    parameters:
      - name: retries
        type: integer
        default: 3
        description: Maximum number of retry attempts
      - name: delay
        type: integer
        default: 60
        description: Seconds to wait before retrying

These parameters control:

  • retries: How many times to retry after the initial failure

  • delay: How long to wait between retry attempts (in seconds)

Implement Retry Logic

Add automatic retry logic to your training script:

import valohai
import time
import json
import requests
import os
import sys
import random

# Get retry configuration from parameters
retries = valohai.parameters("retries").value
delay = valohai.parameters("delay").value

# Read execution metadata from the Valohai config file
# This file is automatically available in all executions
f = open("/valohai/config/execution.json")
data = json.load(f)
f.close()

# Extract current execution details
tags = data["valohai.execution-tags"]
project_id = data["valohai.project-id"]
commit = data["valohai.commit-identifier"]
step_name = data["valohai.execution-step"]

# Check if this execution is already a retry
# Expected tag format: "retry: n" where n is the retry count
retry_count = 0
if tags:
    for tag in tags:
        if "retry" in tag:
            retry_count = int(tag.split(" ")[1])

print(f"Starting execution (retry count: {retry_count})...")

# Your main execution logic wrapped in exception handling
try:
    # === YOUR CODE GOES HERE ===
    # Replace this section with your actual training/processing code
    # Example: database connections, API calls, cloud storage access
    
    # EXAMPLE ONLY: Simulating transient connection errors
    # Remove this in your actual implementation
    k = random.randint(0, 3)
    
    if k == 0:
        # Simulate successful execution
        print("Connection successful, executing your code...")
        # Your actual work happens here
        print("Execution completed successfully!")
    else:
        # Simulate a connection failure
        raise Exception("Database connection interrupted.")
    
    # === END OF YOUR CODE ===

except Exception as err:
    # Connection or execution failed
    print(f"Error encountered: {err}")
    print(f"Error type: {type(err)}")
    
    # Check if we have retries remaining
    if retry_count < retries:
        # Update execution status with failure details
        if retry_count == 0:
            valohai.set_status_detail(
                f"Failed due to: {err}. "
                f"Will retry up to {retries} times."
            )
        else:
            valohai.set_status_detail(
                f"Retry {retry_count}/{retries} failed due to: {err}"
            )
        
        print("Connection error detected.")
        print(f"Waiting {delay} seconds before retrying...")
        time.sleep(delay)
        
        # Prepare to start a new retry execution via API
        auth_token = os.environ["VH_API_TOKEN"]
        headers = {"Authorization": f"Token {auth_token}"}
        
        # Tag the new execution with the updated retry count
        new_retry_tag = f"retry: {retry_count + 1}"
        
        # Create execution payload matching the current execution
        new_execution_json = {
            "project": project_id,
            "commit": commit,
            "step": step_name,
            "tags": [new_retry_tag],
        }
        
        # Trigger the retry execution
        resp = requests.post(
            "https://app.valohai.com/api/v0/executions/",
            headers=headers,
            json=new_execution_json,
        )
        resp.raise_for_status()
        
        # Log the retry execution URL
        display_url = resp.json()["urls"]["display"]
        print(f"Retry execution started: {display_url}")
        
        # Exit with error code to mark this execution as failed
        # This ensures the failure is tracked in Valohai's UI
        sys.exit(1)
    
    else:
        # No retries remaining
        valohai.set_status_detail(
            f"Failed due to: {err}. "
            f"Maximum retries ({retries}) exhausted."
        )
        print("Connection error detected.")
        print(f"Maximum number of retries ({retries}) exhausted.")
        
        # Exit with error code to mark this execution as failed
        sys.exit(1)

How It Works in Practice

First execution fails:

  • Catches the exception

  • Waits 60 seconds (default delay)

  • Creates a new execution tagged retry: 1

  • Exits with error status

Retry execution fails:

  • Detects retry: 1 tag

  • Increments to retry: 2

  • Repeats the process

Retry succeeds:

  • Completes normally without triggering another retry

  • No error status, execution marked as complete

All retries exhausted:

  • Final execution exits with clear error message

  • Status detail shows retry count and reason

Verify It Works

Test your retry logic:

  1. Create an execution with intentional failures (or keep the random simulation)

  2. Watch the execution fail and automatically retry after the delay

  3. Check execution tags in the UI to see retry: n labels

  4. Review status details for failure reasons and retry counts

You can also query retry executions programmatically:

Best Practices

Set appropriate delays:

  • Too short: May retry before transient issue resolves

  • Too long: Wastes time on permanent failures

  • Start with 60-120 seconds for network issues

Limit retry attempts:

  • 3-5 retries work well for most transient failures

  • More retries risk masking real problems

Monitor retry patterns:

  • Frequent retries may indicate systemic issues

  • Review failed executions regularly

Use specific exceptions:

  • Catch only connection-related exceptions when possible

  • Let code errors fail immediately without retries

Troubleshooting

Retries don't start:

  • Verify VH_API_TOKEN environment variable is set correctly

  • Check that the token has permissions to create executions

  • Review execution logs for API error responses

Infinite retry loops:

  • Ensure retry count is properly tracked in tags

  • Verify the retry count comparison logic

Self-hosted Valohai: Replace https://app.valohai.com with your installation URL in the retry code

Last updated

Was this helpful?