The Compute and Data Layer of Valohai can be deployed to a Kubernetes cluster.
This enables you to:
- Run your machine learning workloads in an autoscaling Kubernetes cluster.
- Use an existing Kubernetes infrastructure to manage your data science efforts.
- Securely access private databases and warehouses directly from your cluster workers.
- Use cloud storage of your choosing to store training artifacts, like trained models, preprocessed datasets and visualizations.
Technical Overview
Valohai Workers on Kubernetes are implemented using Kubernetes jobs. Valohai schedules the jobs to the cluster and monitors their progress.
Each job runs a single pod with two containers: one for the machine learning workload and a smaller sidecar container that manages state, runtime logs and any file caching.
The two containers share a volume that stores working files and any cache. The sidecar container is responsible for populating and archiving the volume contents when inputs and outputs are used.
To clean up stale caches, Valohai uses a configurable Kubernetes cron job that periodically scans for and removes caches that haven’t been used for a while.
As Valohai relies on native Kubernetes features, autoscaling of resources happens automatically if that is already set up in the cluster, but we’ll also provide guidance on how to set up autoscaling further in the documentation.
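To make the pod layout described above concrete, the following sketch writes an illustrative Job manifest to a file. It is not the manifest Valohai generates: the job name, container names, images, mount path and the use of an `emptyDir` volume are all hypothetical placeholders chosen for this example.

```shell
# Illustrative only: NOT the manifest Valohai generates.
# All names, images and paths below are hypothetical placeholders.
cat > sketch-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-execution
  namespace: valohai-workers
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: workload            # the machine learning code
          image: example.com/ml-workload:latest
          volumeMounts:
            - name: workdir
              mountPath: /work
        - name: sidecar             # state, logs, input/output handling
          image: example.com/sidecar:latest
          volumeMounts:
            - name: workdir
              mountPath: /work
      volumes:
        - name: workdir             # shared between both containers
          emptyDir: {}
EOF
```

If you want to check that the sketch is well-formed without creating anything, `kubectl apply --dry-run=client -f sketch-job.yaml` should be sufficient.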
Requirements
- You have a Valohai subscription.
- You have admin access to the Kubernetes cluster using kubectl.
- If you don’t have an existing cluster, you can learn more at:
Installation
You can install Valohai Workers on Kubernetes using either a Helm chart or manually.
Tip
We recommend using the Helm chart.
Helm Install
Helm is a package manager for Kubernetes. It allows installing and upgrading Kubernetes applications with ease.
A Helm chart is available to install Valohai workers to Kubernetes clusters.
It installs the necessary service accounts, roles and other resource definitions for Valohai.
Please contact your Valohai representative to receive the required custom-values.yaml file. Various details can be configured, so remember to voice any needs and limitations.
helm repo add valohai --force-update https://dist.valohai.com/charts/
helm upgrade --install \
-n valohai-workers \
--create-namespace \
valohai-workers \
valohai/valohai-workers \
-f custom-values.yaml
Tip
These same commands can be used to upgrade Valohai workers.
Once the installation is complete, please supply the installer output to the Valohai team as well as connection information to your Kubernetes API (e.g., hostname, port) to complete the integration.
The Installer output looks incomplete?
The installer output might be incomplete with some fields having placeholders. This can happen if Helm reports back before resources are fully initialized in the cluster.
If fields seem to be missing, wait a moment for Kubernetes to finish creating the resources and rerun the helm upgrade --install command to get the complete output.
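Before rerunning the installer, it may help to look at what the release has created so far. The sketch below assumes the release and namespace names from the helm command above; the function name `show_valohai_install` is made up for this example, and the resource kinds listed are only the ones this page says the chart installs.

```shell
# Sketch: inspect the Helm release and the RBAC objects in its namespace.
# Defaults match the release and namespace used in the helm command above.
show_valohai_install() {
  namespace="${1:-valohai-workers}"
  release="${2:-valohai-workers}"
  # Release state as recorded by Helm
  helm status "$release" -n "$namespace"
  # Service accounts and RBAC objects in the namespace
  kubectl get serviceaccounts,roles,rolebindings -n "$namespace"
}
```

Call it as `show_valohai_install`, or pass a different namespace and release name as the first and second argument.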
Manual Install
Helm install is recommended instead of the manual install
Helm install is more reliable and easier to upgrade in the future.
Manual install is provided for those who cannot use Helm or have specific requirements that are not covered by the Helm chart.
Valohai requires access privileges for the Kubernetes cluster API to schedule workloads there. For these instructions, the standard Kubernetes RBAC (role-based access control) framework is used.
Valohai can be authenticated by either:
- a service account token, see Manual (3a), or
- a client certificate, see Manual (3b).
Manual (1) — Namespace
kubectl create namespace valohai-workers
We recommend naming the namespace valohai-workers for simplicity. These instructions assume the namespace and object names are set as given. Please make note of any differences.
Manual (2) — Roles and Permissions
The minimal permission policy is:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "update", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "update", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "update", "delete"]
EOF
Upcoming features might require additional Kubernetes permissions, in which case the minimal permission set needs to be updated accordingly.
If minimizing permissions is not a concern, full permissions over the valohai-workers namespace can be granted with an alternate role definition:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
EOF
Manual (3a) — Service Account
Create a valohai service account with an access token:
export ACCOUNT=valohai
kubectl create serviceaccount $ACCOUNT -n valohai-workers
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  namespace: valohai-workers
  name: $ACCOUNT-token
  annotations:
    kubernetes.io/service-account.name: $ACCOUNT
EOF
Save the secret access token to a file; it is needed later:
kubectl get secret $ACCOUNT-token -n valohai-workers -o jsonpath='{.data.token}' \
  | base64 --decode > $ACCOUNT-token.txt
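A quick sanity check on the saved file can catch an empty or truncated token before it is sent onward. Kubernetes service account tokens are JWTs, so the sketch below only checks that the file is non-empty and has three dot-separated segments; it does not verify the signature. The function name `looks_like_jwt` is made up for this example.

```shell
# Sketch: sanity-check a saved service account token file.
# A JWT consists of three dot-separated segments: header.payload.signature
looks_like_jwt() {
  token_file="$1"
  # The file must exist and be non-empty
  [ -s "$token_file" ] || return 1
  # Count the dot-separated fields on the first line
  fields=$(head -n 1 "$token_file" | awk -F. '{ print NF }')
  [ "$fields" -eq 3 ]
}
```

For example: `looks_like_jwt valohai-token.txt && echo "token looks plausible"`.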
Bind the valohai-workers-runner role to the new service account:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: ServiceAccount
    name: valohai
    namespace: valohai-workers
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF
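Once the binding exists, the effective permissions of the service account can be checked with `kubectl auth can-i`. This is a sketch under two assumptions: your own kubectl user is allowed to impersonate the service account (the `--as` flag), and the verb/resource pairs listed are a sample taken from the minimal Role above rather than the full set. The function name `check_valohai_permissions` is made up for this example.

```shell
# Sketch: check a sample of the minimal permissions for the service account.
# Requires that your own kubectl user may impersonate it (--as).
check_valohai_permissions() {
  namespace="${1:-valohai-workers}"
  account="${2:-valohai}"
  subject="system:serviceaccount:${namespace}:${account}"
  failed=0
  # verb:resource pairs sampled from the minimal Role
  for pair in create:jobs get:jobs delete:jobs \
              get:pods list:pods \
              create:persistentvolumeclaims create:secrets; do
    verb="${pair%%:*}"
    resource="${pair#*:}"
    # can-i prints "yes" or "no"; it exits non-zero on "no", hence "|| true"
    answer=$(kubectl auth can-i "$verb" "$resource" --as="$subject" -n "$namespace") || true
    echo "$verb $resource: $answer"
    [ "$answer" = "yes" ] || failed=1
  done
  return "$failed"
}
```

Running `check_valohai_permissions` prints one line per pair and returns a non-zero status if any answer is not `yes`.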
Manual (3b) — Client Certificate
This step is adapted from the Kubernetes documentation.
Ask the Valohai team to create a certificate signing request (CSR) for the valohai user and send it to you.
Inspect the CSR with openssl to verify that it has the user (CN) and group (O) fields set in the subject:
openssl req -in valohai.csr -noout -text
# The Subject should contain CN=valohai and O=valohai
Encode the CSR as Base64:
cat valohai.csr | base64 | tr -d "\n"
Substitute the Base64-encoded CSR for {base64 CSR} and submit it to the Kubernetes cluster:
kubectl apply -f - <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: valohai
spec:
  request: {base64 CSR}
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 31536000 # 365 days
  usages:
    - client auth
EOF
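As an alternative to pasting the Base64 value by hand, the manifest can be rendered with the value filled in. The sketch below mirrors the manifest above and assumes the CSR file path is passed as the first argument; the function name `render_csr_manifest` is made up for this example.

```shell
# Sketch: render the CertificateSigningRequest manifest with the
# Base64-encoded CSR filled in, instead of substituting it by hand.
render_csr_manifest() {
  csr_file="$1"
  # Single-line Base64, same as the encoding step above
  csr_b64=$(base64 < "$csr_file" | tr -d '\n')
  cat <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: valohai
spec:
  request: ${csr_b64}
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 31536000 # 365 days
  usages:
    - client auth
EOF
}
```

The output can be reviewed first and then piped to the cluster, for example `render_csr_manifest valohai.csr | kubectl apply -f -`.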
Tip
The certificate duration can be shorter. However, these steps must be repeated to rotate in a new certificate before each expiry, so avoid a duration so short that rotation becomes too frequent.
Sign the request:
kubectl certificate approve valohai
Export the newly signed client certificate:
kubectl get csr valohai -o jsonpath='{.status.certificate}' | base64 -d > valohai.crt
Put aside the generated valohai.crt certificate. It will be transferred to the Valohai team later in these instructions along with other needed information.
Note
Client certificates expire in one year (365 days) by default. Repeat this process to rotate the access credentials. Please use certificate management, or a calendar event with reminders, to rotate certificates a few weeks before expiry.
The certificate signing request (CSR) and the resulting signed client certificate do not contain secret material and as such do not require a high-security channel. However, make sure to authenticate that you are actually talking with Valohai representatives, as the client certificate combined with its associated secret private key does allow access to your Kubernetes system.
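The relation between `expirationSeconds` and the 365-day lifetime, and a way to check how long an issued certificate still has, can be sketched as follows. The cluster's signer may issue a different duration than the one requested, so checking the actual certificate is the safer option. The function name `cert_needs_rotation` and the 21-day default are arbitrary choices for this example.

```shell
# Sketch: relate expirationSeconds to days, and check a certificate's
# remaining lifetime with openssl.
expiration_seconds=31536000
expiration_days=$((expiration_seconds / 86400))   # 86400 seconds per day
echo "Requested lifetime: ${expiration_days} days"

# Succeeds (exit 0) when the certificate expires within the given number
# of days, or when openssl cannot read the file.
cert_needs_rotation() {
  cert_file="$1"
  days="${2:-21}"
  # -checkend exits 0 if the certificate is still valid after N seconds
  ! openssl x509 -in "$cert_file" -noout -checkend $((days * 86400)) >/dev/null
}
```

For example: `cert_needs_rotation valohai.crt 21 && echo "rotate the certificate soon"`.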
Bind the valohai-workers-runner role to the new user:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: User
    name: valohai
    apiGroup: rbac.authorization.k8s.io
    namespace: valohai-workers
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF
Manual (4) — Kubernetes API
If you limit access to the Kubernetes API by IP, please ask your Valohai contact which IPs to allow so that Valohai services can communicate with the API.
Manual (5) — Queue
We need a Redis instance to pass specifications and logs between the Valohai control services and the Valohai workers.
You can set up a Redis queue in your cluster or ask the Valohai team to set one up for you.
If you do so yourself, set aside this information to send to the Valohai team:
- Redis address that Valohai control services use to access it over the Internet
- Redis address for access from within the Kubernetes cluster
- Redis access credentials
These can be supplied as Redis connection strings including credentials and address information for both cluster-internal and Internet access.
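As an illustration of what such connection strings might look like, the sketch below assembles one from its parts using the common `redis://` URI layout (`rediss://` for TLS). The exact form Valohai expects is an assumption here, so please confirm it with the Valohai team. The function name `build_redis_url`, the hostnames and the password are hypothetical.

```shell
# Sketch: assemble a Redis connection string from its parts.
# The layout follows the common redis:// URI scheme; the exact form
# Valohai expects is an assumption.
build_redis_url() {
  host="$1"
  port="$2"
  password="$3"
  scheme="${4:-redis}"   # pass "rediss" for a TLS connection
  printf '%s://:%s@%s:%s/0\n' "$scheme" "$password" "$host" "$port"
}

# Hypothetical addresses, for illustration only
build_redis_url redis.example.com 6379 'change-me' rediss                  # Internet-facing
build_redis_url redis.valohai-workers.svc.cluster.local 6379 'change-me'   # cluster-internal
```

Passwords containing characters such as `@`, `:` or `/` would need to be URL-encoded before being placed in the string.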
Manual (6) — You are Done!
Congratulations! 🎉
Send the following information to the Valohai team to set up your Kubernetes worker environment in the Valohai service:
- The address, port and certificate authority data where the Kubernetes cluster control plane API is available over the Internet
  - The CA certificate is needed for Valohai to verify the cluster’s identity for secure communications. You get it by running:
    kubectl get cm kube-root-ca.crt -o jsonpath="{['data']['ca\.crt']}"
  - You can also copy this information from your kubeconfig file set up to access your Kubernetes cluster, under clusters[].cluster.
- The valohai-token.txt (from Manual (3a)) OR valohai.crt client certificate (from Manual (3b))
- (optional) The Redis access information
- (optional) The namespace name used if not valohai-workers
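The kubeconfig route mentioned in the list can be sketched as a small helper that prints the API server address and CA data for the current context. It assumes the current kubectl context points at the target cluster and that the kubeconfig embeds the CA as `certificate-authority-data`; if it references a CA file instead, the second value will be empty. The function name `collect_cluster_info` is made up for this example.

```shell
# Sketch: print the API server address and CA data for the current
# kubeconfig context. Assumes that context points at the target cluster.
collect_cluster_info() {
  echo "API server:"
  kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
  echo
  echo "CA data (Base64):"
  # --raw is needed, otherwise certificate data is redacted in the output
  kubectl config view --minify --raw \
    -o jsonpath='{.clusters[0].cluster.certificate-authority-data}'
  echo
}
```

Run `collect_cluster_info` and check that the printed address is reachable over the Internet, not only from inside your network.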
The Valohai team will get back to you once the environment has been set up.