Spot Instances

Spot instances are unused virtual machines that cloud providers offer at steep discounts, often 60-90% cheaper than standard on-demand instances. The tradeoff? Your job can be interrupted when the provider needs that capacity back.

This makes spot instances perfect for fault-tolerant ML workloads like training experiments, hyperparameter sweeps, and batch inference.

Why Use Spot Instances?

Cost savings without infrastructure complexity. Valohai handles interruptions gracefully, so you can focus on your models instead of managing cloud infrastructure.

Spot instances work identically to standard environments in Valohai. Select a spot machine type from the environment dropdown when launching your execution.

How Spot Interruptions Work

When a cloud provider reclaims a spot instance:

Your code receives an interrupt signal (in Python, this surfaces as a KeyboardInterrupt exception)
You have 2-3 minutes to save checkpoints and wrap up
Valohai uploads files from /valohai/outputs/
The machine terminates and removes its disk

Critical: Use Live Outputs to save checkpoints continuously during training. Waiting until the end risks losing large files when interruptions happen.
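
The sequence above can be sketched as a training loop that catches the interrupt and records where it stopped. This is a minimal illustration, not part of Valohai's API: run_training, train_one_epoch, and the JSON state file are hypothetical placeholders, and the output directory is a parameter so the sketch can also run outside Valohai. It complements continuous checkpointing and does not replace it.

```python
import json
import os


def run_training(train_one_epoch, num_epochs, output_dir="/valohai/outputs"):
    """Run epochs until finished or interrupted; return the last completed epoch.

    `train_one_epoch` is a hypothetical callable that takes the epoch number.
    """
    os.makedirs(output_dir, exist_ok=True)
    last_completed = -1
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            last_completed = epoch
    except KeyboardInterrupt:
        # Spot interruption: keep this handler short, since only a few
        # minutes remain before the machine terminates.
        path = os.path.join(
            output_dir, f"interrupted_after_epoch_{last_completed}.json"
        )
        with open(path, "w") as f:
            json.dump({"last_completed_epoch": last_completed}, f)
    return last_completed
```

Writing only a small state file in the handler keeps the final upload fast. Large artifacts such as model weights are better saved during training, before any interruption arrives.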

Automatic Restart

Enable auto-restart to requeue interrupted jobs automatically. Your new execution will include a special _restart input containing all outputs from the interrupted run.

Your code should check for checkpoints in the _restart input and resume from the latest one:

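import os
import re

import torch

# Check if this is a restarted execution
restart_path = "/valohai/inputs/_restart"
if os.path.exists(restart_path) and os.listdir(restart_path):
    # Pick the checkpoint with the highest epoch number. A plain sorted()
    # is lexicographic and would rank "epoch_9" after "epoch_10".
    # Assumes names contain the epoch, e.g. checkpoint_epoch_12.pt
    def epoch_of(name):
        match = re.search(r"(\d+)", name)
        return int(match.group(1)) if match else -1

    latest = max(os.listdir(restart_path), key=epoch_of)
    latest_checkpoint = os.path.join(restart_path, latest)
    # `model` is your already-constructed torch.nn.Module
    model.load_state_dict(torch.load(latest_checkpoint))
    print(f"Resuming from checkpoint: {latest_checkpoint}")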

The disk is removed when a spot instance terminates. Restarted executions begin from a clean slate. Your code must explicitly load checkpoints from the _restart input to continue training.

Selecting Spot Environments

Choose spot environments the same way you select any other machine type:

Show only spot types using the filter in the environment dropdown
Note the environment slug (e.g., aws-eu-west-1-g4dn-xlarge-spot) for CLI/API usage
Enable auto-restart in the execution settings if you want automatic requeuing

CLI Example

vh exec run train \
  --adhoc \
  --environment aws-eu-west-1-g4dn-xlarge-spot \
  --auto-restart

API Example

{
  "execution": {
    "step": "train",
    "environment": "aws-eu-west-1-g4dn-xlarge-spot",
    "auto_restart": true
  }
}

Managing Outputs During Interruption

When Valohai receives a shutdown signal, it immediately starts uploading everything in /valohai/outputs/.

For small files (< 1GB): Most files upload successfully before termination.

For large files (models, datasets): Upload continuously with Live Outputs instead of waiting until the end.

import os
import stat

import torch

# Save checkpoint to outputs immediately
checkpoint_path = f"/valohai/outputs/checkpoint_epoch_{epoch}.pt"
torch.save(model.state_dict(), checkpoint_path)

# Mark the checkpoint read-only
os.chmod(checkpoint_path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
# Valohai uploads this file immediately as a Live Output

You cannot overwrite or delete files in /valohai/outputs. Save files with unique names (e.g., timestamped checkpoints).
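
One way to respect this constraint is to give every checkpoint a unique name and mark it read-only as soon as it is written. The helper below is a minimal sketch under stated assumptions: save_unique_checkpoint and its parameters are hypothetical names, the payload is raw bytes rather than a real model, and the read-only step mirrors the os.chmod call shown earlier in this section.

```python
import os
import stat
import time


def save_unique_checkpoint(payload, epoch, output_dir="/valohai/outputs"):
    """Write `payload` (bytes) to a uniquely named file and mark it read-only."""
    os.makedirs(output_dir, exist_ok=True)
    # Zero-padded epoch plus a nanosecond timestamp keeps names unique
    # and makes them sort in training order.
    name = f"checkpoint_epoch_{epoch:04d}_{time.time_ns()}.bin"
    path = os.path.join(output_dir, name)
    # Mode "xb" raises FileExistsError instead of overwriting a file.
    with open(path, "xb") as f:
        f.write(payload)
    # Read-only for owner, group, and others.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Opening the file in exclusive mode turns an accidental name collision into an immediate error rather than a silent overwrite attempt.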

Testing Interruption Handling

Verify your code handles interruptions correctly using cloud provider tools:

AWS: Fault Injection Simulator

Use AWS FIS to create an experiment that terminates your spot instance mid-execution.

GCP: Simulate Maintenance Event

gcloud auth login
gcloud compute instances simulate-maintenance-event <MACHINE-ID> --zone <ZONE>

Your code should catch the KeyboardInterrupt, save critical state, and exit cleanly.

Pricing and Quotas

Spot pricing varies by provider. Understanding these differences helps you choose the right cloud for your workload.

AWS Spot Instances

Pricing adjusts dynamically based on supply and demand. Each AWS environment in Valohai has a "max price" setting (default: on-demand price).

AWS limits the number of running and requested spot instances per region. Request quota increases if you hit these limits during large sweeps.

Reference: AWS Spot Instance documentation

Google Cloud Spot VMs

GCP uses fixed pricing that changes at most once per month. When planning capacity, consider CPU, disk, and GPU quotas.

Request preemptible quotas separately from standard quotas to prevent spot jobs from consuming your regular allocation.

Reference: GCP Spot VM documentation

Azure Spot Virtual Machines

Pricing varies by region and machine type. Azure distinguishes between vCPU quotas for spot and standard VMs.

Reference: Azure Spot Virtual Machines documentation
