We encapsulate machine learning workloads in entities called executions. An execution is similar to a “job” or an “experiment” in other systems; the emphasis is that an execution is one small piece of a much larger data science process, whether that process is experimental or a well-defined production computation.
Simply put, an execution is one or more strictly defined commands run on a remote server.
Executions and steps
In Valohai, “steps” and “executions” are closely related. A “step” represents a distinct unit of work in a data science project, and “executions” implement these steps. You can use multiple executions to execute the same step with different parameters, inputs, hardware settings, or configurations.
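To illustrate how one step can back several executions, the following minimal sketch expands a grid of parameter values into one execution description per combination. The names `ExecutionSpec` and `expand_executions`, the step name `train-model`, and the parameter names are hypothetical and are not part of any Valohai API; the sketch only models the relationship described above.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Dict, List


@dataclass
class ExecutionSpec:
    """Hypothetical description of one execution of a step (not a Valohai type)."""
    step: str
    parameters: Dict[str, Any]


def expand_executions(step: str, grid: Dict[str, List[Any]]) -> List[ExecutionSpec]:
    """Return one ExecutionSpec per combination of parameter values in the grid."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [ExecutionSpec(step, dict(zip(names, combo))) for combo in combos]


# Same step, four different parameter combinations (all values are made up).
specs = expand_executions(
    "train-model",
    {"learning_rate": [0.01, 0.001], "epochs": [5, 10]},
)
for spec in specs:
    print(spec.step, spec.parameters)
```

Each resulting description refers to the same step and differs only in its parameters; inputs or hardware settings could be varied in the same way.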
The context in which the commands are run depends on three main things:
- The environment, meaning the machine type and cloud. For instance, you might want to run neural net training on a high-end AWS instance with 8 GPUs, but a feature extraction step might need a memory-heavy instance with no GPUs instead.
- The Docker image, which contains the main tools, libraries, and frameworks. You can use images from public or private Docker repositories.
- The contents of a commit in your linked repository, such as training scripts. The commit’s contents will be available at /valohai/repository, which is also the default working directory during executions.
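Because the working directory defaults to the repository root, scripts can usually refer to repository files with relative paths. As a minimal sketch, the hypothetical helper `repo_path` below resolves a file against `/valohai/repository` when that directory exists and falls back to the current directory otherwise, so the same script could also be run locally. The helper name, the fallback behavior, and the file name `train.py` are assumptions made for illustration.

```python
from pathlib import Path

# Location of the commit's contents during an execution, as stated above.
VALOHAI_REPOSITORY = Path("/valohai/repository")


def repo_path(relative: str, root: Path = VALOHAI_REPOSITORY) -> Path:
    """Resolve a repository-relative path.

    Hypothetical helper: if the repository directory does not exist
    (for example on a local machine), fall back to the current directory.
    """
    base = root if root.is_dir() else Path.cwd()
    return base / relative


# "train.py" is a made-up file name used only for this example.
print(repo_path("train.py"))
```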
Creating an execution
Executions implement a step that is defined in valohai.yaml. You can define a step either by writing it manually or by using valohai-utils.
Executions can be created from the web app, the command-line client, or the API.
An execution can be in one of seven color-coded states:
- created: The execution is not yet queued, most likely because you don’t have enough quota and the system is waiting for one of your older executions to finish.
- queued: The execution is waiting because no servers are free. Depending on the installation, either a new server is being launched or you’ll have to wait for another execution (your own or someone else’s) to finish.
- started: The execution is currently running on an instance. You should see real-time details through the web interface, command-line client, and API.
- stopping: A user manually canceled the execution through the web interface, command-line client, or API.
- stopped: The execution has been successfully stopped by the platform.
- error: The last of the execution commands failed; check the logs for more information.
- complete: The execution was run successfully and its results are available through the web interface and command-line client.
Each execution always starts as created and ends up as stopped, error, or complete.
An execution will only run user-defined code in the started state.
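The lifecycle rules above can be sketched as a small enumeration with helpers. This is an illustrative model only: the class and function names are hypothetical, and the assumption that status strings match the state names listed above should be checked against the actual API responses before relying on it.

```python
from enum import Enum


class ExecutionState(Enum):
    """States listed above; the string values are assumed to match the state names."""
    CREATED = "created"
    QUEUED = "queued"
    STARTED = "started"
    STOPPING = "stopping"
    STOPPED = "stopped"
    ERROR = "error"
    COMPLETE = "complete"


# Every execution starts in this state.
INITIAL_STATE = ExecutionState.CREATED

# Every execution ends in one of these states.
TERMINAL_STATES = frozenset(
    {ExecutionState.STOPPED, ExecutionState.ERROR, ExecutionState.COMPLETE}
)


def is_finished(state: ExecutionState) -> bool:
    """True once the execution has reached a final state."""
    return state in TERMINAL_STATES


def runs_user_code(state: ExecutionState) -> bool:
    """User-defined code only runs while the execution is started."""
    return state is ExecutionState.STARTED


for state in ExecutionState:
    print(state.value, is_finished(state), runs_user_code(state))
```

A client that polls an execution could, for example, keep waiting until `is_finished` returns true and then branch on whether the final state is complete, error, or stopped.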