Retry Executions After Connection Errors
Build resilient pipelines that recover from temporary errors on their own. When database connections time out, APIs rate-limit requests, or cloud storage becomes temporarily unavailable, this pattern retries your execution after a configurable delay.
This suits production ML workflows where transient failures shouldn't require manual intervention.
How It Works
When your execution encounters a connection error:
The execution catches the exception
Waits for a configurable delay period
Uses the Valohai API to start a fresh retry execution
Tracks retry count with tags to prevent infinite loops
Marks the failed attempt as errored for visibility
If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.
Prerequisites
A Valohai project with a step that might encounter transient errors
Python 3.8+ with valohai-utils installed
Understanding of exception handling in Python
Generate a Valohai API Token
Create an API token to allow your executions to start new retries:
Click "Hi, <username>!" in the top-right corner
Go to My Profile → Authentication
Click Manage Tokens and scroll to the bottom
Click Generate New Token
Copy the token immediately — it's shown only once
⚠️ Keep your token safe. Never commit tokens to version control. Store it as a secret environment variable in Valohai.
Store Token as Environment Variable
Add your token as a secret environment variable so your executions can access it:
Option 1: Project-wide (recommended)
Go to Project Settings → Environment Variables
Add a new variable:
Name: VH_API_TOKEN
Value: Your token
Check: "Secret" checkbox
This variable is available to all executions in the project
Option 2: Per-execution
When creating an execution, go to the Environment Variables section
Add VH_API_TOKEN with your token value
This variable is only available to that specific execution
Access environment variables in your code:
import os
auth_token = os.environ['VH_API_TOKEN']
Configure Retry Parameters
Add retry configuration parameters to your valohai.yaml:
- step:
    name: train-model
    image: python:3.9
    command:
      - pip install valohai-utils requests
      - python train.py
    parameters:
      - name: retries
        type: integer
        default: 3
        description: Maximum number of retry attempts
      - name: delay
        type: integer
        default: 60
        description: Seconds to wait before retrying
These parameters control:
retries: How many times to retry after the initial failure
delay: How long to wait between retry attempts (in seconds)
Implement Retry Logic
Add automatic retry logic to your training script:
import valohai
import time
import json
import requests
import os
import sys
import random
# Get retry configuration from parameters
retries = valohai.parameters("retries").value
delay = valohai.parameters("delay").value
# Read execution metadata from the Valohai config file
# This file is automatically available in all executions
with open("/valohai/config/execution.json") as f:
    data = json.load(f)
# Extract current execution details
tags = data["valohai.execution-tags"]
project_id = data["valohai.project-id"]
commit = data["valohai.commit-identifier"]
step_name = data["valohai.execution-step"]
# Check if this execution is already a retry
# Expected tag format: "retry: n" where n is the retry count
retry_count = 0
if tags:
    for tag in tags:
        # Match only the exact "retry: n" prefix so unrelated tags
        # that merely contain the word "retry" are ignored
        if tag.startswith("retry: "):
            retry_count = int(tag.split(" ")[1])
print(f"Starting execution (retry count: {retry_count})...")
# Your main execution logic wrapped in exception handling
try:
    # === YOUR CODE GOES HERE ===
    # Replace this section with your actual training/processing code
    # Example: database connections, API calls, cloud storage access
    # EXAMPLE ONLY: Simulating transient connection errors
    # Remove this in your actual implementation
    k = random.randint(0, 3)
    if k == 0:
        # Simulate successful execution
        print("Connection successful, executing your code...")
        # Your actual work happens here
        print("Execution completed successfully!")
    else:
        # Simulate a connection failure
        raise Exception("Database connection interrupted.")
    # === END OF YOUR CODE ===
except Exception as err:
    # Connection or execution failed
    print(f"Error encountered: {err}")
    print(f"Error type: {type(err)}")
    # Check if we have retries remaining
    if retry_count < retries:
        # Update execution status with failure details
        if retry_count == 0:
            valohai.set_status_detail(
                f"Failed due to: {err}. "
                f"Will retry up to {retries} times."
            )
        else:
            valohai.set_status_detail(
                f"Retry {retry_count}/{retries} failed due to: {err}"
            )
        print("Connection error detected.")
        print(f"Waiting {delay} seconds before retrying...")
        time.sleep(delay)
        # Prepare to start a new retry execution via API
        auth_token = os.environ["VH_API_TOKEN"]
        headers = {"Authorization": f"Token {auth_token}"}
        # Tag the new execution with the updated retry count
        new_retry_tag = f"retry: {retry_count + 1}"
        # Create execution payload matching the current execution
        new_execution_json = {
            "project": project_id,
            "commit": commit,
            "step": step_name,
            "tags": [new_retry_tag],
        }
        # Trigger the retry execution
        resp = requests.post(
            "https://app.valohai.com/api/v0/executions/",
            headers=headers,
            json=new_execution_json,
        )
        resp.raise_for_status()
        # Log the retry execution URL
        display_url = resp.json()["urls"]["display"]
        print(f"Retry execution started: {display_url}")
        # Exit with error code to mark this execution as failed
        # This ensures the failure is tracked in Valohai's UI
        sys.exit(1)
    else:
        # No retries remaining
        valohai.set_status_detail(
            f"Failed due to: {err}. "
            f"Maximum retries ({retries}) exhausted."
        )
        print("Connection error detected.")
        print(f"Maximum number of retries ({retries}) exhausted.")
        # Exit with error code to mark this execution as failed
        sys.exit(1)
How It Works in Practice
First execution fails:
Catches the exception
Waits 60 seconds (default delay)
Creates a new execution tagged retry: 1
Exits with error status
Retry execution fails:
Detects the retry: 1 tag
Increments to retry: 2
Repeats the process
Retry succeeds:
Completes normally without triggering another retry
No error status, execution marked as complete
All retries exhausted:
Final execution exits with clear error message
Status detail shows retry count and reason
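The tag progression described above depends on reading the current retry count reliably. The following is a minimal sketch of one way to do that, not part of the original script: the helper names parse_retry_count and next_retry_tag are hypothetical, and the regular expression assumes tags follow the exact "retry: n" format used in this guide.

```python
import re

# Matches tags in the "retry: n" format used in this guide (assumed format).
RETRY_TAG = re.compile(r"^retry:\s*(\d+)$")


def parse_retry_count(tags):
    """Return the highest retry count found in tags, or 0 if there is none."""
    count = 0
    for tag in tags or []:
        match = RETRY_TAG.match(tag.strip())
        if match:
            count = max(count, int(match.group(1)))
    return count


def next_retry_tag(tags):
    """Build the tag for the next retry execution."""
    return f"retry: {parse_retry_count(tags) + 1}"
```

Anchoring the pattern to the whole tag means an unrelated tag that only contains the word "retry" is skipped instead of raising an error during parsing.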
Verify It Works
Test your retry logic:
Create an execution with intentional failures (or keep the random simulation)
Watch the execution fail and automatically retry after the delay
Check execution tags in the UI to see retry: n labels
Review status details for failure reasons and retry counts
You can also query retry executions programmatically through the Valohai API.
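One possible way to do this is sketched below. It rests on several assumptions that should be checked against the Valohai API documentation: that the executions endpoint accepts a project query parameter, that the response is paginated under a results key, and that each execution object includes a tags list. The function names fetch_executions and only_retries are hypothetical, and the sketch uses the standard library instead of requests to stay self-contained.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://app.valohai.com/api/v0"


def fetch_executions(project_id, token):
    """Fetch one page of executions for a project.

    Assumes a `project` filter and a paginated `results` key.
    """
    query = urllib.parse.urlencode({"project": project_id})
    request = urllib.request.Request(
        f"{API_BASE}/executions/?{query}",
        headers={"Authorization": f"Token {token}"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp).get("results", [])


def only_retries(executions):
    """Keep executions that carry a "retry: n" tag (assumes a `tags` field)."""
    return [
        execution
        for execution in executions
        if any(tag.startswith("retry: ") for tag in execution.get("tags") or [])
    ]
```

A call such as only_retries(fetch_executions(project_id, os.environ["VH_API_TOKEN"])) would then list the retry attempts on the first page of results. Filtering on the client side avoids relying on a server-side tag filter, which this sketch does not assume exists.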
Best Practices
Set appropriate delays:
Too short: May retry before transient issue resolves
Too long: Wastes time on permanent failures
Start with 60-120 seconds for network issues
Limit retry attempts:
3-5 retries work well for most transient failures
More retries risk masking real problems
Monitor retry patterns:
Frequent retries may indicate systemic issues
Review failed executions regularly
Use specific exceptions:
Catch only connection-related exceptions when possible
Let code errors fail immediately without retries
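The script in this guide catches Exception, which also retries genuine code errors. A minimal sketch of the narrower approach follows. The TRANSIENT_ERRORS tuple and the run_with_classification helper are hypothetical names, and which exception classes count as transient is an assumption that depends on the libraries you use.

```python
# Exceptions treated as transient. This tuple is an assumption: extend it
# with library-specific classes such as requests.exceptions.ConnectionError.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_with_classification(work):
    """Run work() and report whether a failure is worth retrying.

    Returns ("ok", None) on success or ("retry", err) on a transient error.
    Any other exception propagates, so code errors fail immediately.
    """
    try:
        work()
    except TRANSIENT_ERRORS as err:
        return "retry", err
    return "ok", None
```

In the retry script, the "retry" outcome would lead into the existing delay-and-resubmit branch, while a ValueError or KeyError would fail the execution without consuming a retry.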
Troubleshooting
Retries don't start:
Verify the VH_API_TOKEN environment variable is set correctly
Check that the token has permissions to create executions
Review execution logs for API error responses
Infinite retry loops:
Ensure retry count is properly tracked in tags
Verify the retry count comparison logic
Self-hosted Valohai: Replace https://app.valohai.com with your installation URL in the retry code
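To avoid hardcoding the host, the endpoint can be built from an environment variable. This is a minimal sketch: VH_API_HOST is a hypothetical variable name chosen for illustration, not one that Valohai defines, and the executions_endpoint helper is likewise hypothetical.

```python
import os

DEFAULT_HOST = "https://app.valohai.com"


def executions_endpoint(environ=os.environ):
    """Return the executions API URL for the configured host.

    VH_API_HOST is a hypothetical variable name; the default is the
    hosted Valohai URL used elsewhere in this guide.
    """
    host = environ.get("VH_API_HOST", DEFAULT_HOST).rstrip("/")
    return f"{host}/api/v0/executions/"
```

The retry script could then pass executions_endpoint() to requests.post in place of the literal URL.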
