Valohai runs machine learning workloads as executions. An execution is analogous to a “job” or “experiment” in other systems: a single unit of work within a broader data science process, whether exploratory or production-ready.
In practice, an execution is one or more precisely defined commands run on a remote server. Executions are created by running a step, and multiple executions can stem from the same step while differing in parameters, input files, hardware settings, or other configuration.
The context in which these commands run is influenced by three key factors:
Environment: This refers to the machine type and cloud infrastructure. For example, you might train a neural network on a high-end AWS instance with 8 GPUs, while a memory-intensive feature extraction step might call for a different instance with no GPUs at all.
Docker Image: The Docker image provides the tools, libraries, and frameworks available during the execution. You can use images from public or private Docker registries to tailor the runtime environment to your needs.
Repository Commit Contents: The contents of a commit in your linked repository, including training scripts, are available under /valohai/repository, which is also the default working directory during executions.
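The three factors above map directly onto fields of a step definition in valohai.yaml. A minimal sketch; the image tag and environment slug below are placeholders, and the environment names available to you depend on your own Valohai setup:

```yaml
- step:
    name: train-model
    # Docker image that provides the tools and frameworks
    image: tensorflow/tensorflow:2.6.0-gpu
    # Optional: machine type / cloud environment (slug is a placeholder)
    environment: aws-eu-west-1-p3-2xlarge
    # Commands run inside /valohai/repository by default
    command:
      - python train.py
```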
To create executions:
Executions correspond to steps defined in the valohai.yaml configuration. You can define your steps manually, or, if you work in Python, generate them with the valohai-utils helper library.
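As a sketch of the valohai-utils approach, the following assumes the valohai package is installed (pip install valohai-utils); the step name, image, and parameter values are illustrative:

```python
import valohai

# Declare the step, its Docker image, and default parameters.
# The valohai CLI can then generate the valohai.yaml entry from this file.
valohai.prepare(
    step="train-model",
    image="tensorflow/tensorflow:2.6.0-gpu",  # placeholder image
    default_parameters={"epochs": 10, "learning_rate": 0.001},
)

# At run time, the values come from the execution's configuration,
# so the same script works locally and on Valohai.
epochs = valohai.parameters("epochs").value
```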
Executions can be generated through the web app, command-line interface, or API calls, offering flexibility in how you initiate and manage them.
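For API-initiated executions, the general shape is a POST request to the executions endpoint, identifying the project, commit, and step, plus any overrides. A standard-library-only sketch; the token, project ID, and commit reference are placeholders, and you should verify field names against Valohai's API documentation:

```python
import json
import urllib.request

# Placeholders: substitute your own API token, project ID, and commit ref.
API_TOKEN = "YOUR_API_TOKEN"

payload = {
    "project": "YOUR_PROJECT_ID",
    "commit": "main",
    "step": "train-model",
    "parameters": {"epochs": 20},  # override the step's defaults
    "environment": None,           # None = use the step/project default
}

request = urllib.request.Request(
    "https://app.valohai.com/api/v0/executions/",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Token {API_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would submit the execution;
# it is left out so this sketch stays side-effect free.
```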