Spot Instances
Spot instances are unused virtual machines that cloud providers offer at steep discounts, often 60-90% cheaper than standard on-demand instances. The tradeoff? Your job can be interrupted when the provider needs that capacity back.
This makes spot instances perfect for fault-tolerant ML workloads like training experiments, hyperparameter sweeps, and batch inference.
Why Use Spot Instances?
Cost savings without infrastructure complexity. Valohai handles interruptions gracefully, so you can focus on your models instead of managing cloud infrastructure.
Spot instances work identically to standard environments in Valohai. Select a spot machine type from the environment dropdown when launching your execution.
How Spot Interruptions Work
When a cloud provider reclaims a spot instance:
Your code receives a KeyboardInterrupt signal
You have 2-3 minutes to save checkpoints and wrap up
Valohai uploads files from /valohai/outputs/
The machine terminates and removes its disk
Critical: Use Live Outputs to save checkpoints continuously during training. Waiting until the end risks losing large files when interruptions happen.
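For example, a training loop can catch the interrupt, write a final checkpoint to /valohai/outputs/, and exit cleanly. This is a minimal sketch assuming PyTorch; the placeholder model and train_one_epoch function stand in for your own training code, and marking checkpoints as Live Outputs is covered under Managing Outputs During Interruption below.
import torch
import torch.nn as nn

# Placeholder model and training step; substitute your own
model = nn.Linear(10, 1)

def train_one_epoch(model):
    pass  # your actual training work goes here

try:
    for epoch in range(100):
        train_one_epoch(model)
        # Checkpoint every epoch so an interruption loses at most one epoch of work
        torch.save(model.state_dict(), f"/valohai/outputs/checkpoint_epoch_{epoch}.pt")
except KeyboardInterrupt:
    # Spot reclamation: roughly 2-3 minutes remain to save state and exit cleanly
    torch.save(model.state_dict(), "/valohai/outputs/checkpoint_interrupted.pt")
    print("Interrupted by spot reclamation, final checkpoint saved")
    raise SystemExit(0)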
Automatic Restart
Enable auto-restart to requeue interrupted jobs automatically. Your new execution will include a special _restart input containing all outputs from the interrupted run.
Your code should check for checkpoints in the _restart input and resume from the latest one:
import os
import torch

# Check if this is a restarted execution
restart_path = "/valohai/inputs/_restart"
if os.path.exists(restart_path) and os.listdir(restart_path):
    # Load the latest checkpoint (assumes filenames sort chronologically)
    checkpoint_files = sorted(os.listdir(restart_path))
    latest_checkpoint = os.path.join(restart_path, checkpoint_files[-1])
    # "model" is the torch.nn.Module you constructed earlier in the script
    model.load_state_dict(torch.load(latest_checkpoint))
    print(f"Resuming from checkpoint: {latest_checkpoint}")
The disk is removed when a spot instance terminates, so restarted executions begin from a clean slate. Your code must explicitly load checkpoints from the _restart input to continue training.
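If your checkpoint filenames encode the epoch number (as in the examples on this page), you can recover the epoch counter as well. A small sketch under that naming assumption; in practice latest_checkpoint comes from the snippet above:
import re

# Example value; in practice this comes from the restart-handling snippet above
latest_checkpoint = "/valohai/inputs/_restart/checkpoint_epoch_12.pt"

match = re.search(r"checkpoint_epoch_(\d+)\.pt$", latest_checkpoint)
start_epoch = int(match.group(1)) + 1 if match else 0
print(f"Continuing training from epoch {start_epoch}")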
Selecting Spot Environments
Choose spot environments the same way you select any other machine type:
Show only spot types using the filter in the environment dropdown
Note the environment slug (e.g., aws-eu-west-1-g4dn-xlarge-spot) for CLI/API usage
Enable auto-restart in the execution settings if you want automatic requeuing
CLI Example
vh exec run train \
  --adhoc \
  --environment aws-eu-west-1-g4dn-xlarge-spot \
  --auto-restart
API Example
{
"execution": {
"step": "train",
"environment": "aws-eu-west-1-g4dn-xlarge-spot",
"auto_restart": true
}
}
Managing Outputs During Interruption
When Valohai receives a shutdown signal, it immediately starts uploading everything in /valohai/outputs/.
For small files (< 1GB): Most files upload successfully before termination.
For large files (models, datasets): Upload continuously with Live Outputs instead of waiting until the end.
import os
import torch
from stat import S_IREAD, S_IRGRP, S_IROTH

# Save checkpoint to outputs immediately
checkpoint_path = f"/valohai/outputs/checkpoint_epoch_{epoch}.pt"
torch.save(model.state_dict(), checkpoint_path)
os.chmod(checkpoint_path, S_IREAD | S_IRGRP | S_IROTH)
# The read-only flag tells Valohai to upload this file immediately as a Live Output
You cannot overwrite or delete files in /valohai/outputs. Save files with unique names (e.g., timestamped checkpoints).
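For example, a UTC timestamp in the filename keeps every checkpoint distinct. A minimal sketch; the exact naming scheme is up to you:
from datetime import datetime, timezone

# Timestamped names stay unique because files in /valohai/outputs cannot be overwritten
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
checkpoint_path = f"/valohai/outputs/checkpoint_{stamp}.pt"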
Testing Interruption Handling
Verify your code handles interruptions correctly using cloud provider tools:
AWS: Fault Injection Simulator
Use AWS FIS to create an experiment that terminates your spot instance mid-execution.
GCP: Simulate Maintenance Event
gcloud auth login
gcloud compute instances simulate-maintenance-event <MACHINE-ID> --zone <ZONE>
Your code should catch the KeyboardInterrupt, save critical state, and exit cleanly.
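You can also rehearse the same code path locally before reaching for cloud tooling: Python turns SIGINT into KeyboardInterrupt, so sending SIGINT to your own training process exercises the same handler. This is a rough local approximation, not an exact reproduction of a spot reclamation:
import os
import signal
import threading
import time

# Deliver SIGINT to this process after 5 seconds to mimic an interruption locally;
# Python's default handler raises KeyboardInterrupt in the main thread
threading.Timer(5.0, os.kill, args=(os.getpid(), signal.SIGINT)).start()

try:
    while True:
        time.sleep(1)  # stand-in for your training loop
except KeyboardInterrupt:
    print("Caught KeyboardInterrupt: save checkpoints and exit cleanly here")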
Pricing and Quotas
Spot pricing varies by provider. Understanding these differences helps you choose the right cloud for your workload.
AWS Spot Instances
Pricing adjusts dynamically based on supply and demand. Each AWS environment in Valohai has a "max price" setting (default: on-demand price).
AWS limits the number of running and requested spot instances per region. Request quota increases if you hit these limits during large sweeps.
Reference: AWS Spot Instance documentation
Google Cloud Spot VMs
GCP uses fixed pricing that changes at most once per month. When planning capacity, consider CPU, disk, and GPU quotas.
Request preemptible quotas separately from standard quotas to prevent spot jobs from consuming your regular allocation.
Reference: GCP Spot VM documentation
Azure Spot Virtual Machines
Pricing varies by region and machine type. Azure distinguishes between vCPU quotas for spot and standard VMs.
Reference: Azure Spot Virtual Machines documentation