Time Limits

Time limits help you control costs and prevent runaway executions. Valohai provides two timeout mechanisms:

  • Time Limit — Maximum total execution duration

  • No Output Timeout — Stop if execution produces no logs for a period

Both can be configured in the web UI or in your valohai.yaml.

Time Limit

Set a maximum duration for your execution. When the time limit is reached, Valohai terminates the execution.

Use cases:

  • Prevent forgotten executions from running indefinitely

  • Enforce budget constraints on expensive GPU instances

  • Ensure batch jobs complete within a scheduling window

Set in Web UI

  1. Create a new execution

  2. Scroll to the Runtime section

  3. Check Set a Time Limit

  4. Enter the maximum duration in hours and minutes

By default, no time limit is set and the execution can run indefinitely.

Set in valohai.yaml

Define a default time limit for a step:
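A minimal sketch of such a step follows. The step name, image, and command are hypothetical placeholders; only the time-limit key is the point of the example.

```yaml
- step:
    name: train-model        # hypothetical step name
    image: python:3.11       # illustrative Docker image
    command:
      - python train.py      # hypothetical training script
    time-limit: 1h 30m       # terminate the execution after 90 minutes
```

Written as integer seconds, the same limit would be time-limit: 5400.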

The time-limit value supports human-readable formats like 1h 30m 5s, or you can specify seconds as an integer (e.g., 3600).


No Output Timeout

Stop executions that become unresponsive. If your execution produces no logs or output for the specified duration, Valohai terminates it.

Use cases:

  • Detect and stop hung processes

  • Catch infinite loops that produce no output

  • Identify network or I/O blocking issues

Set in Web UI

  1. Create a new execution

  2. Scroll to the Runtime section

  3. Check Set a No Output Timeout

  4. Enter the timeout duration in hours and minutes

If you do not set a no-output timeout, it defaults to about 8 hours.

Set in valohai.yaml

Define a default no-output timeout for a step:
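A minimal sketch, again using a hypothetical step name, image, and command; only the no-output-timeout key is the point of the example.

```yaml
- step:
    name: train-model          # hypothetical step name
    image: python:3.11         # illustrative Docker image
    command:
      - python train.py        # hypothetical training script
    no-output-timeout: 30m     # terminate after 30 minutes without any log output
```

Written as integer seconds, the same timeout would be no-output-timeout: 1800.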

The no-output-timeout value supports human-readable formats like 1h 30m 5s, or you can specify seconds as an integer (e.g., 1800).


Example: Complete Step Configuration

Combine time limits with other step settings:
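The sketch below shows one way a fuller step might look. The step name, image, script, parameter, and input are all assumptions made for illustration, and the durations are example values rather than recommendations.

```yaml
- step:
    name: train-model                       # hypothetical step name
    image: python:3.11                      # illustrative Docker image
    command:
      - pip install -r requirements.txt     # setup time counts toward the time limit
      - python train.py {parameters}        # hypothetical training script
    parameters:
      - name: epochs                        # hypothetical parameter
        type: integer
        default: 10
    inputs:
      - name: dataset                       # hypothetical input
    time-limit: 2h                          # hard cap on total execution duration
    no-output-timeout: 30m                  # stop if nothing is logged for 30 minutes
```

Here the no-output timeout is much shorter than the time limit, so a hung run is stopped well before the overall cap is reached.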


Best Practices

Set reasonable defaults in YAML. Define time limits in your valohai.yaml so all team members use consistent settings. Override in the UI when needed.

Use no-output timeout to catch hangs. Long-running jobs should log progress periodically so that a healthy run is not mistaken for a hung one. If your training loop runs for hours without output, it may be stuck.

Account for setup time. Time limits include dependency installation, data download, and model initialization, so leave enough buffer for these steps on top of the expected run time.

Combine with early stopping. For training jobs, consider using early stopping based on metrics in addition to time limits.


Troubleshooting

Execution stopped unexpectedly

Check the execution logs for timeout messages. Common causes:

  • Time limit reached — Increase the limit or optimize your code

  • No output timeout — Add periodic logging to your training loop

