Architecture Overview

Valohai can be deployed in numerous ways; here are the four most common configurations:

Valohai Cloud Installation:
  • Workloads run under Valohai-owned AWS, Google Cloud and Azure accounts.
  • You're billed based on how many resources you use. You can also purchase credits in advance.
  • Works out of the box without any setup.
Private Cloud Worker Installation:
  • The virtual machines (worker nodes) that handle data processing, training and inference are all deployed within your own AWS, Google Cloud or Azure account.
  • No input or output data leaves your account perimeter.
On-premises Worker Installation:
  • The worker nodes are deployed to your persistent on-premises hardware.
  • This enables additional features, such as directory mounting, that are unavailable in the cloud installations.
Full Private Installation:
  • All components (including app.valohai.com) are deployed inside your own cloud provider account or data center.
  • This allows a self-contained installation that is managed and updated separately from the global Valohai installation.

Note

The Valohai technical team reviews the customer's requirements before each non-"Valohai Cloud" installation and sets everything up in collaboration with the customer's infrastructure team.

Valohai engineers spend between one hour and two days per installation, depending on the agreed configuration. After the installation, the Valohai team continues to maintain and update the software per the signed contract.

Valohai Cloud installations don't require this preparation, as they have no separate technical setup; they work out of the box.

Components of a private cloud worker installation

The private cloud worker installation is the most common installation method for Valohai.

[Architecture diagram of a private cloud worker installation, showing how the components described below connect to each other.]

Here are descriptions of the individual components:

Valohai Master:
The Valohai master node runs the web application and the API. The master is the core component that manages all the other resources, such as scheduling executions and scaling individual worker groups across cloud providers. Valohai users are managed inside Valohai and can be integrated with third-party identity managers (e.g. Azure Active Directory).
Valohai Database:
A relational database that contains user data and saves execution details, such as which worker type was used, what commands were run, what Docker image was used, which inputs were used and what the launch configuration was.
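As a toy illustration of the kind of execution metadata such a database records, here is a hypothetical schema sketched with Python's sqlite3 module. The table and column names are invented for this example, not Valohai's actual schema:

```python
import sqlite3

# Toy stand-in for the relational database; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE executions (
        id INTEGER PRIMARY KEY,
        worker_type TEXT,      -- e.g. 'g2.2xlarge'
        command TEXT,          -- e.g. 'python train.py'
        docker_image TEXT,     -- e.g. 'python:3.11'
        inputs TEXT,           -- input data references
        launch_config TEXT     -- parameters, environment, etc.
    )
""")
db.execute(
    "INSERT INTO executions (worker_type, command, docker_image, inputs, launch_config) "
    "VALUES (?, ?, ?, ?, ?)",
    ("g2.2xlarge", "python train.py", "python:3.11",
     "s3://example-bucket/train.csv", '{"epochs": 10}'),
)
```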
Git Repositories:
External code repositories for the data science projects. Usually a private GitHub repository but can be any Git repository such as GitLab, BitBucket or GitHub Enterprise as long as the Valohai Master can access it.
User Code Archive:
We store Git commit snapshots in binary storage (AWS S3, Azure Blob Storage, etc.) to maintain reproducibility. Worker machines load the user code archives from this storage.
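As an illustrative sketch of the idea (not Valohai's actual implementation), snapshotting a commit and uploading it to binary storage could look like this, assuming boto3 and a bucket and key layout of your own:

```python
import subprocess
import boto3

def archive_commit_to_s3(repo_dir: str, commit: str, bucket: str) -> str:
    """Snapshot a Git commit as a tarball and upload it to binary storage.

    Illustrative sketch only; the bucket name and key layout are
    assumptions, not Valohai's actual conventions.
    """
    archive_path = f"/tmp/{commit}.tar.gz"
    # `git archive` snapshots the tree at the given commit without the
    # .git history, which is all a worker needs to run the code.
    subprocess.run(
        ["git", "-C", repo_dir, "archive", "--format=tar.gz",
         "-o", archive_path, commit],
        check=True,
    )
    key = f"code-archives/{commit}.tar.gz"
    boto3.client("s3").upload_file(archive_path, bucket, key)
    return key
```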
Worker Binary Storage:
Worker machines have an executable that downloads inputs (e.g. training data), starts the configured Docker image, reports real-time logs and uploads outputs (e.g. trained models). Worker release binaries and configuration scripts are stored in this binary storage.
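Conceptually, the worker executable performs a download-run-upload loop. The following is a highly simplified sketch of that lifecycle; the helper names and mount paths are illustrative assumptions, not the real worker binary:

```python
import subprocess

def run_execution(inputs: dict[str, str], image: str, command: str,
                  download, upload, report_log) -> None:
    """Simplified sketch of the worker lifecycle described above.

    `download`, `upload` and `report_log` stand in for the storage and
    logging integrations; none of these names come from Valohai itself.
    """
    # 1. Download inputs (e.g. training data) to a local directory.
    for name, url in inputs.items():
        download(url, f"/valohai/inputs/{name}")

    # 2. Start the configured Docker image and stream logs in real time.
    proc = subprocess.Popen(
        ["docker", "run", "--rm",
         "-v", "/valohai/inputs:/valohai/inputs",
         "-v", "/valohai/outputs:/valohai/outputs",
         image, "sh", "-c", command],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    for line in proc.stdout:
        report_log(line)
    proc.wait()

    # 3. Upload outputs (e.g. trained models) to the artifact store.
    upload("/valohai/outputs")
```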
Log Storage:
Real-time logs are moved to persistent storage after the target execution finishes.
Queues and Cache:
In-memory database instance that hosts execution/build queues and acts as temporary storage for user logs so they can be shown on the Valohai web app and API in real-time.
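Assuming the in-memory database is Redis (a common choice for this pattern; the key names here are hypothetical), the queue-and-log flow could be sketched as:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Enqueue an execution; workers pop from the same list.
r.lpush("queue:executions", "execution-1234")

# While the execution runs, append log lines to a per-execution list so
# the web app and API can serve them in real time.
r.rpush("logs:execution-1234", "Epoch 1/10: loss=0.42")
recent = r.lrange("logs:execution-1234", 0, -1)

# After the execution finishes, the accumulated lines are flushed to
# persistent log storage and the temporary key is deleted.
r.delete("logs:execution-1234")
```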
Worker Groups:
Workers are the servers that execute user code. There is one worker group per instance type (e.g. g2.2xlarge on AWS) per region (e.g. AWS Ireland). The Valohai Master manages these auto-scaling groups. Workers can also be a non-scaling cluster of on-premises machines. Worker groups can be backed by local hardware, AWS, Azure, GCP or OpenStack.
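For AWS-backed worker groups, scaling a group amounts to adjusting an Auto Scaling group's desired capacity. A minimal boto3 sketch, where the group naming scheme is an assumption for illustration:

```python
import boto3

def scale_worker_group(instance_type: str, region: str, desired: int) -> None:
    """Set the desired size of one worker group (one instance type, one region)."""
    autoscaling = boto3.client("autoscaling", region_name=region)
    # Hypothetical naming scheme: one Auto Scaling group per instance type.
    group_name = f"valohai-workers-{instance_type}"
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )

# e.g. scale the g2.2xlarge group in Ireland to three machines
scale_worker_group("g2.2xlarge", "eu-west-1", 3)
```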
Artifact Stores:
Execution inputs are downloaded from and outputs are uploaded to a file storage. Valohai supports various storage backends but an AWS S3 bucket is the most commonly used artifact store.
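For an S3-backed artifact store, input downloads and output uploads are plain object transfers. A minimal boto3 sketch with placeholder bucket and key names:

```python
import os
import boto3

s3 = boto3.client("s3")

# Download an execution input (e.g. training data) before the run starts.
os.makedirs("/valohai/inputs/training-data", exist_ok=True)
s3.download_file("example-valohai-data", "datasets/train.csv",
                 "/valohai/inputs/training-data/train.csv")

# Upload an execution output (e.g. a trained model) after the run finishes.
s3.upload_file("/valohai/outputs/model.pkl",
               "example-valohai-data", "models/model.pkl")
```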
Docker Registries:
Each Valohai execution runs inside a Docker container. The configured Docker image is downloaded from a private or public Docker registry. Docker Hub is the most common one, but you can also host a Docker registry inside your cloud provider account.
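Pulling images with the Docker SDK for Python might look like the following; the private registry address and credentials are placeholders:

```python
import docker

client = docker.from_env()

# Public images (e.g. from Docker Hub) can be pulled directly.
client.images.pull("python:3.11")

# Private registries require a login first; address and credentials
# below are placeholders, not real values.
client.login(username="deploy-bot", password="example-token",
             registry="123456789.dkr.ecr.eu-west-1.amazonaws.com")
client.images.pull("123456789.dkr.ecr.eu-west-1.amazonaws.com/my-team/trainer:latest")
```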
Inference Builders:
Before hosting your model for inference, Valohai builds a Docker image to make deployments fast and reliable. The image bundles all the files required for the deployment so the endpoint can be scaled easily.
Inference Registry:
The inference Docker images are uploaded to a private Docker registry, usually hosted under the inference provider account (e.g. AWS ECR, Azure Container Registry or GCP).
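A sketch of the build-and-push step that the inference builder and registry together enable, using the Docker SDK for Python; the build path, tag and registry address are assumptions for illustration:

```python
import docker

client = docker.from_env()
registry = "123456789.dkr.ecr.eu-west-1.amazonaws.com"  # placeholder address
tag = f"{registry}/inference/model-server:v1"

# Build an image that bundles the model files and serving code so the
# endpoint can be scaled by simply starting more copies of the image.
image, _ = client.images.build(path="./deployment", tag=tag)

# Push the finished image to the private inference registry.
client.images.push(tag)
```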
Inference Cluster:
The Kubernetes cluster that hosts the inference request/response endpoints. It downloads the images from the private inference registry and exposes them to clients. The cluster is hosted either by Valohai or in your own cloud service (e.g. AWS EKS, Azure AKS or GKE). Once you've published your endpoint, you receive an HTTPS address that you can use for inference; access can be either public or limited to certain users.
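From a client's point of view, a published endpoint is just an HTTPS address. A minimal sketch with the requests library, using a made-up URL and field name:

```python
import requests

# Hypothetical endpoint address; the real HTTPS URL is issued when you
# publish the endpoint.
url = "https://deploy.example.com/my-project/predict-digits/predict"

with open("digit.png", "rb") as f:
    response = requests.post(url, files={"image": f})
response.raise_for_status()
print(response.json())
```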