Retry Executions After Connection Errors
Build resilient pipelines that automatically recover from temporary errors. When database connections timeout, APIs rate-limit, or cloud storage becomes temporarily unavailable, this pattern automatically retries your execution after a configurable delay.
Perfect for production ML workflows where transient failures shouldn't require manual intervention.
How It Works
When your execution encounters a connection error:
The execution catches the exception
Waits for a configurable delay period
Uses the Valohai API to start a fresh retry execution
Tracks retry count with tags to prevent infinite loops
Marks the failed attempt as errored for visibility
If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.
Prerequisites
A Valohai project with a step that might encounter transient errors
Python 3.8+ with
valohai-utilsinstalledUnderstanding of exception handling in Python
Generate a Valohai API Token
Create an API token to allow your executions to start new retries:
Click "Hi, <username>!" in the top-right corner
Go to My Profile → Authentication
Click Manage Tokens and scroll to the bottom
Click Generate New Token
Copy the token immediately — it's shown only once
⚠️ Keep your token safe. Never commit tokens to version control. Store it as a secret environment variable in Valohai.
Store Token as Environment Variable
Add your token as a secret environment variable so your executions can access it:
Option 1: Project-wide (recommended)
Go to Project Settings → Environment Variables
Add a new variable:
Name:
VH_API_TOKENValue: Your token
Check: "Secret" checkbox
This variable is available to all executions in the project
Option 2: Per-execution
When creating an execution, go to Environment Variables section
Add
VH_API_TOKENwith your token valueThis variable is only available to that specific execution
Access environment variables in your code:
Configure Retry Parameters
Add retry configuration parameters to your valohai.yaml:
These parameters control:
retries: How many times to retry after the initial failuredelay: How long to wait between retry attempts (in seconds)
Implement Retry Logic
Add automatic retry logic to your training script:
How It Works in Practice
First execution fails:
Catches the exception
Waits 60 seconds (default delay)
Creates a new execution tagged
retry: 1Exits with error status
Retry execution fails:
Detects
retry: 1tagIncrements to
retry: 2Repeats the process
Retry succeeds:
Completes normally without triggering another retry
No error status, execution marked as complete
All retries exhausted:
Final execution exits with clear error message
Status detail shows retry count and reason
Verify It Works
Test your retry logic:
Create an execution with intentional failures (or keep the random simulation)
Watch the execution fail and automatically retry after the delay
Check execution tags in the UI to see
retry: nlabelsReview status details for failure reasons and retry counts
You can also query retry executions programmatically:
Best Practices
Set appropriate delays:
Too short: May retry before transient issue resolves
Too long: Wastes time on permanent failures
Start with 60-120 seconds for network issues
Limit retry attempts:
3-5 retries work well for most transient failures
More retries risk masking real problems
Monitor retry patterns:
Frequent retries may indicate systemic issues
Review failed executions regularly
Use specific exceptions:
Catch only connection-related exceptions when possible
Let code errors fail immediately without retries
Troubleshooting
Retries don't start:
Verify
VH_API_TOKENenvironment variable is set correctlyCheck that the token has permissions to create executions
Review execution logs for API error responses
Infinite retry loops:
Ensure retry count is properly tracked in tags
Verify the retry count comparison logic
Self-hosted Valohai: Replace https://app.valohai.com with your installation URL in the retry code
Last updated
Was this helpful?
