Time Limits

Time limits help you control costs and prevent runaway executions. Valohai provides two timeout mechanisms:

Time Limit — Maximum total execution duration
No Output Timeout — Stop if execution produces no logs for a period

Both can be configured in the web UI or in your valohai.yaml.

Time Limit

Set a maximum duration for your execution. When the time limit is reached, Valohai terminates the execution.

Use cases:

Prevent forgotten executions from running indefinitely
Enforce budget constraints on expensive GPU instances
Ensure batch jobs complete within a scheduling window

Set in Web UI

Create a new execution
Scroll to the Runtime section
Check Set a Time Limit
Enter the maximum duration in hours and minutes

Not setting a time limit allows the execution to run indefinitely. This is the default behavior.

Set in valohai.yaml

Define a default time limit for a step:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.13.0-gpu
    time-limit: 1h  # 1 hour
    command:
      - python train.py {parameters}

The time-limit value supports human-readable formats like 1h 30m 5s, or you can specify seconds as an integer (e.g., 3600).
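
To make the two accepted forms concrete, the sketch below converts either form into seconds. It is an illustration only and not Valohai's own parser: the helper name duration_to_seconds is hypothetical, and the sketch assumes that only the h, m, and s units shown above are used.

```python
import re

# Hypothetical helper for illustration; Valohai's real parsing may differ.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)\s*([hms])")


def duration_to_seconds(value):
    """Convert an integer or a string such as '1h 30m 5s' into seconds."""
    if isinstance(value, int):
        return value  # already expressed in seconds
    text = value.strip().lower()
    if text.isdigit():
        return int(text)  # a bare number is treated as seconds
    tokens = _TOKEN.findall(text)
    # Reject input that has characters the pattern did not consume.
    if not tokens or _TOKEN.sub("", text).strip():
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in tokens)


print(duration_to_seconds("1h 30m 5s"))
print(duration_to_seconds(3600))
```

A helper like this can be handy for checking that a value such as 1h and its integer equivalent describe the same limit before committing the YAML.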

No Output Timeout

Stop executions that become unresponsive. If your execution produces no logs or output for the specified duration, Valohai terminates it.

Use cases:

Detect and stop hung processes
Catch infinite loops that produce no output
Identify network or I/O blocking issues

Set in Web UI

Create a new execution
Scroll to the Runtime section
Check Set a No Output Timeout
Enter the timeout duration in hours and minutes

If you do not set a no-output timeout, it defaults to about 8 hours.

Set in valohai.yaml

Define a default no-output timeout for a step:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.13.0-gpu
    no-output-timeout: 30m  # 30 minutes
    command:
      - python train.py {parameters}

The no-output-timeout value supports human-readable formats like 1h 30m 5s, or you can specify seconds as an integer (e.g., 1800).

Example: Complete Step Configuration

Combine time limits with other step settings:

- step:
    name: train-model
    image: tensorflow/tensorflow:2.13.0-gpu
    time-limit: 4h              # 4 hours max
    no-output-timeout: 30m      # 30 min no-output timeout
    command:
      - pip install -r requirements.txt
      - python train.py {parameters}
    parameters:
      - name: epochs
        default: 100
        type: integer

Best Practices

Set reasonable defaults in YAML. Define time limits in your valohai.yaml so all team members use consistent settings. Override in the UI when needed.

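Use no-output timeout to catch hangs. Long-running jobs should log progress periodically so that a silent period reliably signals a hang. A training loop that runs for hours without any output may be stuck, and it will also be terminated once the no-output timeout elapses.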
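
One way to log progress periodically is sketched below. The function train, its parameters, and the train_step callback are hypothetical names chosen for this example; the only assumption about Valohai is the behavior described above, that anything printed to the log counts as output.

```python
import time


def train(num_steps, train_step, log_every_seconds=60.0, clock=time.monotonic):
    """Run train_step for each step and print progress at a regular interval.

    Returns the number of progress lines that were printed.
    """
    last_log = clock()
    logged = 0
    for step in range(1, num_steps + 1):
        train_step(step)  # hypothetical callback that does the real work
        now = clock()
        # Log when the interval has elapsed, and always on the final step.
        if now - last_log >= log_every_seconds or step == num_steps:
            print(f"step {step}/{num_steps}", flush=True)
            last_log = now
            logged += 1
    return logged


if __name__ == "__main__":
    train(3, lambda step: time.sleep(0.01), log_every_seconds=0.0)
```

Progress is printed only between steps, so a step that genuinely hangs still produces silence and can be caught by the no-output timeout. Choose log_every_seconds well below the configured no-output-timeout.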

Account for setup time. Time limits include dependency installation, data download, and model initialization. Leave enough buffer for these steps.
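
A script can also keep track of its own budget and wrap up before the limit is reached. The sketch below is a minimal illustration under stated assumptions: the TimeBudget class is hypothetical, the script is assumed to know the limit it was given (for example, because you pass it in yourself), and the clock starts when the script starts, so the safety margin must also cover any setup that ran before it.

```python
import time


class TimeBudget:
    """Track a wall-clock budget measured from the moment it is created."""

    def __init__(self, limit_seconds, safety_margin_seconds=300.0, clock=time.monotonic):
        self._clock = clock
        # Stop early by the safety margin to leave room for setup and cleanup.
        self._deadline = clock() + limit_seconds - safety_margin_seconds

    def remaining(self):
        """Seconds left before the budget runs out, never negative."""
        return max(0.0, self._deadline - self._clock())

    def exhausted(self):
        """True once the budget, minus the safety margin, has been used up."""
        return self.remaining() <= 0.0


if __name__ == "__main__":
    budget = TimeBudget(limit_seconds=4 * 3600)  # assumed to match a 4h limit
    print(f"{budget.remaining():.0f} seconds available for training")
```

A training loop could check budget.exhausted() once per epoch and save its results before stopping, rather than being terminated mid-epoch.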

Combine with early stopping. For training jobs, consider using early stopping based on metrics in addition to time limits.
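
As a rough illustration of metric-based early stopping inside a training script, the sketch below stops after the validation loss has failed to improve for a number of consecutive epochs. The EarlyStopper class and its parameters are hypothetical, and this in-script approach is separate from any early stopping feature the platform itself provides.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        """Record this epoch's loss and report whether training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=2)
    for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.92], start=1):
        if stopper.should_stop(loss):
            print(f"stopping after epoch {epoch}", flush=True)
            break
```

The time limit then acts as a hard ceiling, while the metric check ends runs that have stopped improving well before that ceiling.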

Troubleshooting

Execution stopped unexpectedly

Check the execution logs for timeout messages. Common causes:

Time limit reached — Increase the limit or optimize your code
No output timeout — Add periodic logging to your training loop

Related

Early Stopping — Stop based on metadata conditions
Spot Instances — Handle interruptions for cost savings
Run Basic Execution — Creating and running executions
