Retry Executions After Connection Errors

Build resilient pipelines that automatically recover from temporary errors. When a database connection times out, an API rate-limits your requests, or cloud storage becomes temporarily unavailable, this pattern retries your execution after a configurable delay.

It is well suited to production ML workflows where transient failures shouldn't require manual intervention.

How It Works

When your execution encounters a connection error:

  1. The execution catches the exception

  2. Waits for a configurable delay period

  3. Uses the Valohai API to start a fresh retry execution

  4. Tracks retry count with tags to prevent infinite loops

  5. Marks the failed attempt as errored for visibility

If a retry succeeds, it completes normally. If all retries are exhausted, the final execution fails with a clear status message.

Prerequisites

  • A Valohai project with a step that might encounter transient errors

  • Python 3.8+ with valohai-utils installed

  • Understanding of exception handling in Python

Generate a Valohai API Token

Create an API token to allow your executions to start new retries:

  1. Click "Hi, <username>!" in the top-right corner

  2. Go to My Profile → Authentication

  3. Click Manage Tokens and scroll to the bottom

  4. Click Generate New Token

  5. Copy the token immediately — it's shown only once

⚠️ Keep your token safe. Never commit tokens to version control. Store it as a secret environment variable in Valohai.

Store Token as Environment Variable

Add your token as a secret environment variable so your executions can access it:

Option 1: Project-wide (recommended)

  1. Go to Project Settings → Environment Variables

  2. Add a new variable:

    • Name: VH_API_TOKEN

    • Value: Your token

    • Secret: Checked

  3. This variable is available to all executions in the project

Option 2: Per-execution

  1. When creating an execution, go to the Environment Variables section

  2. Add VH_API_TOKEN with your token value

  3. This variable is only available to that specific execution

Access environment variables in your code:
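As a minimal sketch, the token can be read from the process environment with the standard library. The helper names get_api_token and auth_headers are illustrative, and the "Token" authorization scheme is an assumption to confirm against the Valohai API documentation.

```python
import os


def get_api_token(env=None):
    """Return the Valohai API token, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("VH_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VH_API_TOKEN is not set; add it as a secret environment variable."
        )
    return token


def auth_headers(token):
    """Build the request headers (the 'Token' scheme is an assumption)."""
    return {"Authorization": f"Token {token}"}
```

Failing early with an explicit message makes a missing or misnamed variable easier to spot in the execution logs than a later authentication error would be.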

Configure Retry Parameters

Add retry configuration parameters to your valohai.yaml:

These parameters control:

  • retries: How many times to retry after the initial failure

  • delay: How long to wait between retry attempts (in seconds)
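Inside the script, these two values could be read as follows. This is a sketch that assumes the step command passes parameters as --retries=... and --delay=... command-line flags. The function name parse_retry_args and the default of 3 retries are illustrative choices; 60 seconds matches the default delay used in this guide.

```python
import argparse


def parse_retry_args(argv=None):
    """Parse the retry parameters; unknown flags are ignored, not rejected."""
    parser = argparse.ArgumentParser()
    # Defaults are assumptions for illustration; match them to your valohai.yaml.
    parser.add_argument("--retries", type=int, default=3,
                        help="retries allowed after the initial failure")
    parser.add_argument("--delay", type=int, default=60,
                        help="seconds to wait before starting a retry")
    args, _unknown = parser.parse_known_args(argv)
    if args.retries < 0 or args.delay < 0:
        raise ValueError("retries and delay must be non-negative")
    return args
```

Using parse_known_args lets the same script accept its other training parameters without this helper needing to know about them.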

Implement Retry Logic

Add automatic retry logic to your training script:
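The original listing is not reproduced here, so the following is a minimal sketch of the pattern rather than a definitive implementation. All function names are hypothetical. The /api/v0/executions/ path, the "Token" authorization scheme, and the contents of the payload are assumptions to check against the Valohai API reference. flaky_task stands in for real work by failing at random.

```python
import json
import random
import sys
import time
import urllib.request


def flaky_task(failure_rate=0.5):
    """Stand-in for real work: randomly raises a connection error."""
    if random.random() < failure_rate:
        raise ConnectionError("simulated connection failure")
    return "ok"


def build_retry_request(payload, token, api_base="https://app.valohai.com"):
    """Prepare (but do not send) the POST that would start a new execution.

    Assumptions: the /api/v0/executions/ path, the 'Token' auth scheme, and
    whatever fields the caller puts in `payload`.
    """
    return urllib.request.Request(
        f"{api_base}/api/v0/executions/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_with_retry(task, retry_count, max_retries, delay, start_retry,
                   sleep=time.sleep):
    """Run `task` once; on a connection error, wait and start a retry execution.

    `start_retry` is a caller-supplied callable that receives the next retry
    number. Returns a process exit code: 0 on success, 1 on failure.
    """
    try:
        task()
    except (ConnectionError, TimeoutError) as exc:
        if retry_count >= max_retries:
            print(f"Giving up after {retry_count} retries: {exc}", file=sys.stderr)
            return 1
        sleep(delay)  # give the transient issue time to clear
        start_retry(retry_count + 1)
        print(f"Started retry {retry_count + 1} of {max_retries}: {exc}",
              file=sys.stderr)
        return 1  # this attempt still failed, so report it as errored
    return 0
```

In a real script, start_retry would send the prepared request with urllib.request.urlopen and tag the new execution retry: n, and the script would finish with sys.exit(run_with_retry(...)) so that the failed attempt is marked as errored. Passing sleep and start_retry as arguments keeps the logic testable without waiting or calling the API.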

How It Works in Practice

First execution fails:

  • Catches the exception

  • Waits 60 seconds (default delay)

  • Creates a new execution tagged retry: 1

  • Exits with error status

Retry execution fails:

  • Detects retry: 1 tag

  • Increments to retry: 2

  • Repeats the process

Retry succeeds:

  • Completes normally without triggering another retry

  • No error status; the execution is marked as complete

All retries exhausted:

  • The final execution exits with a clear error message

  • Status detail shows retry count and reason
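The tag bookkeeping in the scenarios above can be sketched as two small helpers. The function names are hypothetical, and the code assumes tags are available to the script as plain strings in the form retry: n.

```python
import re

# Matches tags such as "retry: 1" or "retry:2".
RETRY_TAG = re.compile(r"^retry:\s*(\d+)$")


def current_retry_count(tags):
    """Return the highest n among 'retry: n' tags, or 0 for a first attempt."""
    counts = []
    for tag in tags:
        match = RETRY_TAG.match(tag.strip())
        if match:
            counts.append(int(match.group(1)))
    return max(counts, default=0)


def next_retry_tag(tags, max_retries):
    """Return the tag for the next retry, or None when retries are exhausted."""
    count = current_retry_count(tags)
    if count >= max_retries:
        return None
    return f"retry: {count + 1}"
```

With max_retries set to 3, an untagged first execution yields retry: 1, and an execution already tagged retry: 3 yields None, which is the signal to stop and exit with an error.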

Verify It Works

Test your retry logic:

  1. Create an execution with intentional failures (or keep the random simulation)

  2. Watch the execution fail and automatically retry after the delay

  3. Check execution tags in the UI to see retry: n labels

  4. Review status details for failure reasons and retry counts

You can also query retry executions programmatically:
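The original query example is not reproduced here. As a hedged sketch, the helpers below build a listing URL and filter execution records by their retry tag. The /api/v0/executions/ path, the project query parameter, and the record shape (a dict with a tags list of strings) are assumptions to verify against the Valohai API reference; all function names are illustrative.

```python
import re
from urllib.parse import urlencode

RETRY_TAG = re.compile(r"^retry:\s*(\d+)$")


def executions_url(project_id, api_base="https://app.valohai.com"):
    """URL for listing a project's executions (path and query are assumptions)."""
    return f"{api_base}/api/v0/executions/?{urlencode({'project': project_id})}"


def retry_number(execution):
    """Return n from a 'retry: n' tag on the record, or None if untagged."""
    for tag in execution.get("tags", []):
        match = RETRY_TAG.match(tag.strip())
        if match:
            return int(match.group(1))
    return None


def retry_executions(executions):
    """Keep only retry executions, ordered by retry number."""
    tagged = [e for e in executions if retry_number(e) is not None]
    return sorted(tagged, key=retry_number)
```

The records passed to retry_executions would come from the JSON returned by a GET request to executions_url, sent with the same token header used to start retries.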

Best Practices

Set appropriate delays:

  • Too short: May retry before the transient issue resolves

  • Too long: Wastes time on permanent failures

  • Start with 60-120 seconds for network issues

Limit retry attempts:

  • 3-5 retries work well for most transient failures

  • More retries risk masking real problems

Monitor retry patterns:

  • Frequent retries may indicate systemic issues

  • Review failed executions regularly

Use specific exceptions:

  • Catch only connection-related exceptions when possible

  • Let code errors fail immediately without retries
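One way to separate the two cases is to catch a fixed tuple of connection-related exception types and let everything else propagate. This is a sketch using built-in exceptions only; the names TRANSIENT_ERRORS and classify are illustrative, and the tuple would need extending with the exception types raised by your own database or storage client.

```python
import socket

# Exceptions treated as transient; extend this tuple for your client libraries.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, socket.timeout)


def classify(task):
    """Run `task`; return ("ok", result) or ("retry", exc) for transient errors.

    Any other exception (for example a ValueError from a code bug) propagates,
    so the execution fails immediately without a retry.
    """
    try:
        return "ok", task()
    except TRANSIENT_ERRORS as exc:
        return "retry", exc
```

Subclasses such as ConnectionResetError and ConnectionRefusedError are covered because they derive from ConnectionError.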

Troubleshooting

Retries don't start:

  • Verify that the VH_API_TOKEN environment variable is set correctly

  • Check that the token has permissions to create executions

  • Review execution logs for API error responses

Infinite retry loops:

  • Ensure retry count is properly tracked in tags

  • Verify the retry count comparison logic

Self-hosted Valohai:

  • Replace https://app.valohai.com with your installation URL in the retry code
