Spot Instances/VMs operate similarly to other execution environments within Valohai. You can select a spot instance machine type from the dropdown menu.
What Are Spot Instances?
Spot instances refer to unused virtual machines offered by cloud providers at a reduced cost. For instance, AWS occasionally has unused capacity for certain instance types, such as p2.xlarge, and offers these instances at a discounted rate.
Spot instances are a cost-effective option for running workloads that can tolerate interruptions.
Potential Interruptions
Cloud providers can reclaim a spot instance at any time, which interrupts your work. When that happens, your code receives a KeyboardInterrupt, and Valohai halts the job while the machine is repurposed for other tasks.
Choosing a Spot Instance Environment
You can run any workload on a spot instance. Just select the right environment when launching your execution from the UI, API, or CLI.
- You can choose to show only spot instance types.
- The “slug” of the environment. This is used to specify the environment when running from the CLI (vh exec run train --adhoc --environment aws-eu-west-1-t3-medium-spot) or from the API ("environment": "aws-eu-west-1-t3-medium-spot").
- Auto restart will queue the job again after it’s interrupted. You’ll be able to access the outputs of the interrupted execution by using the _restart input that is added automatically to the new job.
Spot Instance Job Execution Workflow
When you initiate a job using a spot instance environment in Valohai, here’s what happens:
- Valohai schedules the job and attempts to acquire a spot instance. If an instance is not immediately available, Valohai will persistently retry until it secures one.
- Your job might either conclude as planned or be interrupted by the cloud provider, who may indicate that “this instance is no longer available at this price.”
- In the event of an interruption, your execution receives a notification in the form of a KeyboardInterrupt signal, which your code can respond to.
- You’ll have a brief window of a few minutes to gracefully conclude your processes before the cloud provider terminates the machine.
- In your code, anticipate a KeyboardInterrupt error as an indication that your job is ceasing.
- We strongly recommend utilizing Live Outputs to save your checkpoints and other files as soon as they are created. When a spot instance interruption occurs, there may not be ample time to upload a large volume of files to the cloud. It’s more prudent to upload them during normal execution.
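As a minimal sketch of the interruption handling described above, the following Python loop catches KeyboardInterrupt and writes only a small state file before stopping. The function name run_training, the train_step callback, and the state.json file name are hypothetical and chosen for illustration. In a Valohai execution, output_dir would be /valohai/outputs.

```python
import json
from pathlib import Path


def run_training(total_steps, output_dir, train_step):
    """Call train_step(step) for each step and stop cleanly on KeyboardInterrupt.

    Returns the number of steps that completed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    completed = 0
    interrupted = False
    try:
        for step in range(total_steps):
            train_step(step)
            completed = step + 1
    except KeyboardInterrupt:
        # The spot instance is being reclaimed; stop training right away.
        interrupted = True

    # Keep the final write small: only a short window remains after an
    # interruption, so large files should already have been saved earlier.
    state = {"completed_steps": completed, "interrupted": interrupted}
    (output_dir / "state.json").write_text(json.dumps(state))
    return completed
```

The handler deliberately does very little work. Anything large, such as model checkpoints, is better saved during normal execution than inside the interrupt handler.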
Automatic Restart on Spot Instance Interruption
You have the option to configure each execution to automatically restart if it encounters an interruption from the cloud provider. This enables you to requeue the job and continue once another spot instance becomes available.
- During your execution, ensure that you upload your checkpoints and files as Live Outputs.
- In the event of an interruption, you will receive a KeyboardInterrupt and a corresponding log message: “Spot instance interruption found. Wrapping up here!”
- Valohai will terminate the existing job and enqueue a new one.
- The new job will feature a special input called “_restart”, and all the outputs from the previous execution will be accessible in this input directory. You can access them just like any other inputs in Valohai, allowing you to select your latest checkpoint and resume your work.
The disk is removed when a spot instance terminates
When a spot instance is terminated, the associated disk is also removed. A restarted execution begins from a “clean slate,” so it’s your code’s responsibility to check for any checkpoints in the “_restart” inputs and pick up where it left off.
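The check for earlier checkpoints could look roughly like the sketch below. It assumes that Valohai mounts inputs under /valohai/inputs/<input-name>/, so the automatic input would be at /valohai/inputs/_restart, and it assumes a hypothetical checkpoint naming scheme of checkpoint-<step>.pt. Adjust both to match your project.

```python
import re
from pathlib import Path

# Assumption: inputs are mounted under /valohai/inputs/<input-name>/,
# so the automatically added "_restart" input would be found here.
RESTART_DIR = Path("/valohai/inputs/_restart")

# Hypothetical naming scheme for checkpoints written by the previous run.
CHECKPOINT_PATTERN = re.compile(r"checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(restart_dir=RESTART_DIR):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    restart_dir = Path(restart_dir)
    if not restart_dir.is_dir():
        # No restart input: this is a fresh execution, not a restarted one.
        return None

    found = []
    for path in restart_dir.iterdir():
        match = CHECKPOINT_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))

    # Pick the checkpoint with the largest step number, if any exist.
    return max(found, key=lambda item: item[0]) if found else None
```

A training script could call find_latest_checkpoint() at startup and begin from step 0 when it returns None, or load the returned file and continue from the following step otherwise.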
Testing Automatic Restart Functionality
To verify that your code functions correctly when a spot instance is interrupted, you can employ tools provided by your cloud provider. Below are examples for AWS and GCP:
For AWS: You can utilize the Fault Injection Simulator (FIS) to create an experiment that will halt your spot instance.
For GCP: You can simulate a host maintenance event using the following steps:
gcloud auth login
gcloud compute instances simulate-maintenance-event <MACHINE-ID> --zone <ZONE>
By conducting these tests, you can ensure that your code can effectively handle spot instance interruptions and take advantage of Valohai’s automatic restart capabilities.
Handling Outputs During Spot Instance Termination
When using spot instances with Valohai and a cloud provider sends a shutdown signal, it’s important to safeguard your data. Here’s how you can handle these situations effectively.
Capturing Shutdown Signals
As soon as your cloud provider decides to terminate your spot instance, Valohai sends a KeyboardInterrupt to your execution. This is your cue that the instance is about to shut down. Make sure your code is set up to catch this interrupt so you can quickly wrap up any important last tasks.
Uploading Outputs
After your code acknowledges the shutdown signal, Valohai starts uploading everything in the /valohai/outputs folder. The goal here is to save your work before the instance is completely shut down.
For small files: Valohai can usually upload all files before the shutdown is complete, which helps ensure that you don’t lose any important data.
For large files: If you have a lot of data, there’s a chance not everything will upload in time. To avoid losing important data, it’s a good idea to regularly upload large files during your run with Live Outputs, rather than all at once at the end.
Handling files in /valohai/outputs
Note that you cannot overwrite or delete files in the /valohai/outputs directory.
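Because files in /valohai/outputs cannot be overwritten, one approach is to write every checkpoint under a new, step-specific file name. The sketch below does that and then marks the file read-only. It assumes that a read-only file is what signals Valohai to upload it right away as a Live Output, so verify that against the Live Outputs documentation. The function name save_checkpoint and the checkpoint-<step>.pt naming are hypothetical.

```python
import os
import stat
from pathlib import Path


def save_checkpoint(data, step, output_dir="/valohai/outputs"):
    """Write checkpoint bytes under a step-specific name and mark it read-only."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # A new file name per step avoids overwriting an existing output.
    path = output_dir / f"checkpoint-{step:06d}.pt"
    if path.exists():
        raise FileExistsError(f"{path} already exists; outputs are not overwritten")

    path.write_bytes(data)

    # Assumption: marking the file read-only tells Valohai that it is
    # complete and can be uploaded immediately as a Live Output.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Calling save_checkpoint every few epochs keeps each upload small, so little remains to be uploaded when an interruption arrives.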
Spot Instance Pricing and Quotas
Spot instances offer cost-effective computing resources, but it’s important to understand pricing dynamics and quota limitations on different cloud platforms. Here’s a breakdown for AWS, Google Cloud, and Microsoft Azure:
AWS (Amazon Web Services)
Spot Instance pricing on AWS is dynamic, adjusting based on supply and demand for Spot Instance capacity. In Valohai, each AWS environment has a “max price” setting, representing the maximum hourly rate you’re willing to pay. By default, this is set to the on-demand instance price.
For detailed information on AWS Spot Instances, refer to the AWS documentation.
AWS also imposes limits on the number of running and requested spot instances per account in one region. To learn more and understand how to increase these limits if necessary, refer to the AWS documentation on spot limits.
Google Cloud
Google Cloud Platform offers Spot VMs with fixed pricing that changes at most once a month. When using Spot VMs in GCP, you need to consider CPU, disk, and GPU quota requirements. We recommend requesting preemptible quota for spot instances so that they do not consume your standard quotas.
For detailed information on Google Cloud Spot VMs, visit the Google Cloud documentation.
Microsoft Azure
Pricing for Azure Spot Virtual Machines varies based on the region and machine type. Microsoft provides pricing details on their website.
For detailed information on Azure Spot Virtual Machines, refer to the Microsoft Azure documentation.
Azure distinguishes between vCPU quotas for Spot and Standard Virtual Machines. These quotas determine the number of virtual CPUs you can request for your spot instances. To understand more about these quotas, consult the Microsoft Azure documentation on vCPU quotas.
Understanding spot instance pricing and quota limitations is essential for efficient utilization of cloud resources and budget management in your machine learning projects.