# Spot Instances

Spot instances are unused virtual machines that cloud providers offer at steep discounts, often 60-90% cheaper than standard on-demand instances. The tradeoff? Your job can be interrupted when the provider needs that capacity back.

This makes spot instances perfect for fault-tolerant ML workloads like training experiments, hyperparameter sweeps, and batch inference.

## Why Use Spot Instances?

**Cost savings without infrastructure complexity.** Valohai handles interruptions gracefully, so you can focus on your models instead of managing cloud infrastructure.

Spot instances work identically to standard environments in Valohai. Select a spot machine type from the environment dropdown when launching your execution.

## How Spot Interruptions Work

When a cloud provider reclaims a spot instance:

1. Your code receives an interrupt, which Python raises as a `KeyboardInterrupt` exception
2. You have 2-3 minutes to save checkpoints and wrap up
3. Valohai uploads files from `/valohai/outputs/`
4. The machine terminates and removes its disk

**Critical:** Use [Live Outputs](/data/data-versioning/save-files-from-jobs.md#live-uploads) to save checkpoints continuously during training. Waiting until the end risks losing large files when interruptions happen.
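
The sequence above can be sketched as a training loop that checkpoints after every epoch and keeps its interruption path short. This is a minimal illustration, not Valohai's own implementation: `run_training`, `train_one_epoch`, and `save_checkpoint` are hypothetical names standing in for your own training code, and the checkpoint filenames are assumptions.

```python
def run_training(num_epochs, train_one_epoch, save_checkpoint):
    """Run epochs, checkpointing after each one.

    train_one_epoch(epoch) and save_checkpoint(filename) are hypothetical
    callables supplied by your own code. Returns (completed_epochs, interrupted).
    """
    completed = 0
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            completed = epoch + 1
            # Checkpoint continuously so little work is lost on interruption.
            save_checkpoint(f"checkpoint_epoch_{completed}.pt")
    except KeyboardInterrupt:
        # Only a short window remains: save once under a new name and stop.
        save_checkpoint(f"interrupted_after_epoch_{completed}.pt")
        return completed, True
    return completed, False
```

Because the regular checkpoints are already written during training, the `except` branch only has to save the most recent state, which keeps it well within the shutdown window.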

### Automatic Restart

Enable auto-restart to requeue interrupted jobs automatically. Your new execution will include a special `_restart` input containing all outputs from the interrupted run.

Your code should check for checkpoints in the `_restart` input and resume from the latest one:

```python
import os

import torch

# Assumes `model` is the torch.nn.Module created earlier in your script.
# Check if this is a restarted execution
restart_path = "/valohai/inputs/_restart"
if os.path.exists(restart_path) and os.listdir(restart_path):
    # Load latest checkpoint. Filenames are sorted alphabetically, so
    # zero-pad epoch numbers (e.g. epoch_007) to keep them in training order.
    checkpoint_files = sorted(os.listdir(restart_path))
    latest_checkpoint = os.path.join(restart_path, checkpoint_files[-1])
    model.load_state_dict(torch.load(latest_checkpoint))
    print(f"Resuming from checkpoint: {latest_checkpoint}")
```

> **The disk is removed when a spot instance terminates.** Restarted executions begin from a clean slate. Your code must explicitly load checkpoints from the `_restart` input to continue training.
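
When checkpoint filenames carry an epoch number, picking the latest one by number can be more robust than an alphabetical sort, which places `checkpoint_epoch_10.pt` before `checkpoint_epoch_2.pt`. The sketch below assumes a `checkpoint_epoch_<N>` naming scheme; the helper name `find_latest_checkpoint` is hypothetical, so adapt the pattern to your own filenames.

```python
import os
import re

# Assumed naming scheme: checkpoint_epoch_<N>, optionally followed by a suffix.
CHECKPOINT_PATTERN = re.compile(r"checkpoint_epoch_(\d+)")


def find_latest_checkpoint(restart_dir="/valohai/inputs/_restart"):
    """Return the path of the checkpoint with the highest epoch, or None."""
    if not os.path.isdir(restart_dir):
        return None
    candidates = []
    for name in os.listdir(restart_dir):
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            # Compare by numeric epoch first, then by filename as a tie-breaker.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    _, latest_name = max(candidates)
    return os.path.join(restart_dir, latest_name)
```

A `None` result means the execution is a fresh start, so training begins from epoch zero.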

## Selecting Spot Environments

Choose spot environments the same way you select any other machine type:

1. **Show only spot types** using the filter in the environment dropdown
2. **Note the environment slug** (e.g., `aws-eu-west-1-g4dn-xlarge-spot`) for CLI/API usage
3. **Enable auto-restart** in the execution settings if you want automatic requeuing

### CLI Example

```shell
vh exec run train \
  --adhoc \
  --environment aws-eu-west-1-g4dn-xlarge-spot \
  --autorestart
```

### API Example

To enable auto-restart through the API, add the following to your API call payload. Make sure the environment you have defined is a spot instance.

```json
    "runtime_config": {
        "autorestart": true
    },
```
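
If you build the payload in Python, the boolean is written as `True` in code and serialized as `true` in JSON. The sketch below only shows how the `runtime_config` fragment could be merged into an existing payload; the helper name `with_autorestart` and the `"step"` key are illustrative assumptions, not a description of the full API schema.

```python
import json


def with_autorestart(payload):
    """Return a copy of an execution payload with autorestart enabled."""
    updated = dict(payload)
    runtime_config = dict(updated.get("runtime_config", {}))
    runtime_config["autorestart"] = True  # serialized as JSON `true`
    updated["runtime_config"] = runtime_config
    return updated


# "step" is a hypothetical placeholder for the rest of your payload.
payload = with_autorestart({"step": "train"})
print(json.dumps(payload, indent=2))
```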

## Managing Outputs During Interruption

When Valohai receives a shutdown signal, it immediately starts uploading everything in `/valohai/outputs/`.

**For small files** (< 1 GB): Most files upload successfully before termination.

**For large files** (models, datasets): Upload continuously with Live Outputs instead of waiting until the end.

```python
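import os
from stat import S_IREAD, S_IRGRP, S_IROTH

import torch

# Assumes `model` and `epoch` come from your training loop.
# Save checkpoint to outputs immediately
checkpoint_path = f"/valohai/outputs/checkpoint_epoch_{epoch}.pt"
torch.save(model.state_dict(), checkpoint_path)

# Mark the file read-only to signal that it is complete;
# Valohai uploads this file immediately as a Live Output
os.chmod(checkpoint_path, S_IREAD | S_IRGRP | S_IROTH)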
```

> You cannot overwrite or delete files in `/valohai/outputs`. Save files with unique names (e.g., timestamped checkpoints).
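
One way to guarantee unique names is to combine the epoch with a UTC timestamp and fall back to a counter if a file with that name already exists. This is a minimal sketch; the helper name `unique_checkpoint_path` and the filename layout are assumptions, not a Valohai convention.

```python
import os
import time


def unique_checkpoint_path(epoch, outputs_dir="/valohai/outputs"):
    """Build a checkpoint path that does not collide with an existing file."""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    base = f"checkpoint_epoch_{epoch}_{stamp}"
    path = os.path.join(outputs_dir, f"{base}.pt")
    counter = 1
    # Two saves within the same second get a numeric suffix instead.
    while os.path.exists(path):
        path = os.path.join(outputs_dir, f"{base}_{counter}.pt")
        counter += 1
    return path
```

Pass the returned path to your save call (for example `torch.save`) in place of a fixed filename.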

## Testing Interruption Handling

Verify your code handles interruptions correctly using cloud provider tools:

### AWS: Fault Injection Simulator

Use [AWS FIS](https://aws.amazon.com/fis/) to create an experiment that terminates your spot instance mid-execution.

### GCP: Simulate Maintenance Event

```shell
gcloud auth login
gcloud compute instances simulate-maintenance-event <MACHINE-ID> --zone <ZONE>
```

Your code should catch the `KeyboardInterrupt`, save critical state, and exit cleanly.
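
Before running a provider-level test, you can exercise the `KeyboardInterrupt` path locally by sending `SIGINT` to your own process. The sketch below is an assumption-laden illustration: `run_with_simulated_interrupt`, `step`, and `on_interrupt` are hypothetical names, it requires Python 3.8+ for `signal.raise_signal`, and it must run in the main thread. It checks only your Python handler, not Valohai's upload behavior or the real shutdown window.

```python
import signal


def run_with_simulated_interrupt(step, on_interrupt, interrupt_after=3, max_steps=10):
    """Run step(i) repeatedly and raise SIGINT before step `interrupt_after`.

    on_interrupt(steps_done) is called if KeyboardInterrupt is caught.
    Returns the number of steps that finished.
    """
    # Make sure SIGINT maps to KeyboardInterrupt, then restore the old handler.
    previous = signal.signal(signal.SIGINT, signal.default_int_handler)
    steps_done = 0
    try:
        for i in range(max_steps):
            if i == interrupt_after:
                signal.raise_signal(signal.SIGINT)  # simulate the interruption
            step(i)
            steps_done += 1
    except KeyboardInterrupt:
        on_interrupt(steps_done)
    finally:
        signal.signal(signal.SIGINT, previous)
    return steps_done
```

In a real check, `step` would run one training iteration and `on_interrupt` would save state, after which you can confirm that the expected checkpoint files exist.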

## Pricing and Quotas

Spot pricing varies by provider. Understanding these differences helps you choose the right cloud for your workload.

### AWS Spot Instances

Pricing adjusts dynamically based on supply and demand. Each AWS environment in Valohai has a "max price" setting (default: on-demand price).

AWS limits the number of running and requested spot instances per region. [Request quota increases](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html) if you hit these limits during large sweeps.

**Reference:** [AWS Spot Instance documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html)

### Google Cloud Spot VMs

GCP uses fixed pricing that changes at most once per month. When planning capacity, consider CPU, disk, and GPU quotas.

Request **preemptible quotas** separately from standard quotas to prevent spot jobs from consuming your regular allocation.

**Reference:** [GCP Spot VM documentation](https://cloud.google.com/spot-vms)

### Azure Spot Virtual Machines

Pricing varies by region and machine type. Azure distinguishes between vCPU quotas for spot and standard VMs.

**Reference:**

* [Azure Spot VM documentation](https://docs.microsoft.com/en-us/azure/virtual-machines/spot-vms)
* [Azure vCPU quota management](https://docs.microsoft.com/en-us/azure/quotas/per-vm-quota-requests)


