Workers Installation
Install Valohai workers on your Kubernetes cluster to run ML workloads
Deploy Valohai workers to your Kubernetes cluster to run machine learning workloads as containerized jobs.
Overview
The Compute and Data Layer of Valohai can be deployed to a Kubernetes cluster.
This enables you to:
Encapsulate your machine learning workloads in an autoscaling Kubernetes cluster
Use existing Kubernetes infrastructure to manage your data science efforts
Securely access private databases and warehouses directly from cluster workers
Use cloud storage of your choosing to store training artifacts (trained models, preprocessed datasets, visualizations)
Technical Overview
Valohai Workers on Kubernetes are implemented using Kubernetes jobs. Valohai schedules jobs to the cluster and monitors their progress.
How it works:
These jobs run a single pod with two containers:
ML workload container - Runs your machine learning code
Sidecar container - Manages state, runtime logs, and file caching
The two containers share a volume that stores working files and any cached data. The sidecar container populates the volume when inputs are used and archives its contents when outputs are produced.
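To make the pod layout concrete, the sketch below prints a hypothetical Job manifest with two containers that mount the same volume. Every name in it (job name, container names, images, mount path) and the choice of an emptyDir volume are illustrative assumptions; the manifests Valohai actually generates are not shown in this document and will differ.

```shell
# Illustrative only: NOT the manifest Valohai generates.
# All names, images, and paths below are made-up placeholders.
render_example_job() {
  cat <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-two-container-job
  namespace: valohai-workers
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: workload              # runs the ML code
          image: example.invalid/ml-workload:latest
          volumeMounts:
            - name: shared
              mountPath: /work
        - name: sidecar               # handles state, logs, file caching
          image: example.invalid/sidecar:latest
          volumeMounts:
            - name: shared
              mountPath: /work
      volumes:
        - name: shared                # assumption: emptyDir for simplicity
          emptyDir: {}
EOF
}

render_example_job
```

Piping the output to kubectl apply --dry-run=client -f - can help validate the structure without creating anything in the cluster.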
Cleanup:
Valohai uses a configurable Kubernetes cron job to periodically scan and remove caches that haven't been used recently.
Autoscaling:
Because Valohai relies on native Kubernetes features, resources scale automatically if autoscaling is already set up in the cluster, for example with Karpenter.
Requirements
1. You have a Valohai subscription
Contact [email protected] if you need to set up an account.
2. You have admin access to the Kubernetes cluster using kubectl
If you don't have an existing cluster, learn more at:
Installation
You can install Valohai Workers on Kubernetes using either Helm (recommended) or manually.
Helm Install (Recommended)
Helm is a package manager for Kubernetes that allows installing and upgrading applications with ease.
Prerequisites:
Helm installed
kubectl configured to access your cluster
custom-values.yaml file from Valohai
Steps:
Contact Valohai support to receive the required custom-values.yaml file. Discuss any specific needs and limitations, as various details can be configured.
Install the Helm chart:
helm repo add valohai --force-update https://dist.valohai.com/charts/
helm upgrade --install \
  -n valohai-workers \
  --create-namespace \
  valohai-workers \
  valohai/valohai-workers \
  -f custom-values.yaml
Note: These same commands can be used to upgrade Valohai workers in the future.
Complete the installation:
Once installation completes, supply the installer output to the Valohai team along with connection information to your Kubernetes API (e.g., hostname, port) to complete the integration.
Installer output looks incomplete? The output might contain placeholders if Helm reports back before resources are fully initialized. Wait a moment for Kubernetes to finish creating them, then rerun the helm upgrade --install command to get complete output.
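Before rerunning the install, it can help to check whether the release's resources have finished initializing. The sketch below assumes the release and namespace are both named valohai-workers, as in the commands above; the HELM, KUBECTL, and NAMESPACE variables are hypothetical conveniences introduced here so the binaries and namespace can be swapped out.

```shell
# Sketch: inspect the Helm release and the resources in its namespace.
# Assumes the release name and namespace are both "valohai-workers".
HELM="${HELM:-helm}"
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-valohai-workers}"

show_release_state() {
  # Release status as recorded by Helm
  "$HELM" status valohai-workers -n "$NAMESPACE"
  # Resources currently present in the namespace
  "$KUBECTL" get all -n "$NAMESPACE"
}
```

Calling show_release_state after a short wait shows whether pods and other resources are still being created.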
Manual Install
Helm install is recommended. Helm is more reliable and easier to upgrade in the future. Manual install is provided for those who cannot use Helm or have specific requirements not covered by the Helm chart.
Manual installation requires setting up Kubernetes RBAC (role-based access control) to grant Valohai access privileges.
Valohai can be authenticated by either:
A service account token (Manual 3a)
A client certificate (Manual 3b)
Manual (1) - Create Namespace
kubectl create namespace valohai-workers
We recommend naming the namespace valohai-workers for simplicity. These instructions assume the namespace and object names are set as given. Make note of any differences.
Manual (2) - Create Roles and Permissions
The minimal permissions policies are:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "update", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "update", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "update", "delete"]
EOF
Upcoming features might require additional Kubernetes permissions, so the minimum permission set needs to be updated accordingly.
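Once the Role has been bound to a subject (Manual 3a or 3b), one way to confirm that the minimal permissions took effect is kubectl auth can-i with impersonation. This is a sketch rather than part of the official procedure: the function name and the KUBECTL and NAMESPACE variables are assumptions made here, and impersonation requires that your own account is allowed to use --as.

```shell
# Sketch: check each permission in the minimal Role for a given subject.
# The subject defaults to the service account from Manual 3a; for a
# certificate user (Manual 3b), pass "valohai" as the first argument.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-valohai-workers}"

check_runner_permissions() {
  subject="${1:-system:serviceaccount:$NAMESPACE:valohai}"
  # resource:verbs pairs copied from the minimal Role
  for entry in \
    "pods:get list update delete" \
    "jobs:create get update delete" \
    "persistentvolumeclaims:create get list update delete" \
    "secrets:create get update delete"
  do
    resource="${entry%%:*}"
    for verb in ${entry#*:}; do
      # can-i prints "yes" or "no"; it exits non-zero on "no"
      answer="$("$KUBECTL" auth can-i "$verb" "$resource" \
        -n "$NAMESPACE" --as "$subject" 2>/dev/null)" || true
      echo "$verb $resource: ${answer:-unknown}"
    done
  done
}
```

Any line that does not end in yes points at a missing rule or a binding that has not been applied yet.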
Alternative: Full permissions
If minimizing permissions is not a concern, full permissions over the valohai-workers namespace can be granted:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
EOF
Manual (3a) - Create Service Account
Create a valohai service account with an access token:
export ACCOUNT=valohai
kubectl create serviceaccount $ACCOUNT -n valohai-workers
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  namespace: valohai-workers
  name: $ACCOUNT-token
  annotations:
    kubernetes.io/service-account.name: $ACCOUNT
EOF
Save the secret access token:
kubectl get secret $ACCOUNT-token -n valohai-workers -o jsonpath='{.data.token}' \
  | base64 --decode > $ACCOUNT-token.txt
Bind the role to the service account:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: ServiceAccount
    name: valohai
    namespace: valohai-workers
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF
Manual (3b) - Create Client Certificate
This step is adapted from the Kubernetes documentation.
1. Request CSR from Valohai team
Ask the Valohai team to create a certificate signing request (CSR) for the valohai user and send it to you.
2. Inspect the CSR
Verify the user (CN) and group (O) fields in the subject:
openssl req -in valohai.csr -noout -text
# Expected subject: CN=valohai, O=valohai
3. Encode and submit the CSR
# Encode as Base64
cat valohai.csr | base64 | tr -d "\n"
Substitute the Base64-encoded CSR into {base64 CSR} and submit it to the cluster:
kubectl apply -f - <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: valohai
spec:
  request: {base64 CSR}
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 31536000 # 365 days
  usages:
    - client auth
EOF
Note: The certificate duration can be shorter. These steps are repeated to rotate a new certificate before expiry.
4. Sign the request
kubectl certificate approve valohai
5. Export the signed certificate
kubectl get csr valohai -o jsonpath='{.status.certificate}' | base64 -d > valohai.crt
Put aside the generated valohai.crt certificate. It will be transferred to the Valohai team later along with other needed information.
Certificate expiry: Client certificates expire in one year (365 days) by default. Repeat this process to rotate certificates a few weeks prior to expiry. Use certificate management or calendar reminders to avoid expiration.
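As one possible way to track expiry, the sketch below uses openssl to print the certificate's end date and to test whether it falls within a rotation window. The function name and the 21-day default are assumptions chosen here to stand in for "a few weeks"; adjust them to your own rotation policy.

```shell
# Sketch: report whether a certificate expires within a given number of days.
# Returns 0 (true) when rotation is due, 1 otherwise.
cert_needs_rotation() {
  cert_file="$1"
  days="${2:-21}"   # assumption: "a few weeks" taken as 21 days
  # -checkend exits 0 if the certificate is still valid after N seconds,
  # so the result is negated to mean "rotation is due"
  ! openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null
}

# Print the expiry date of the exported certificate, if it is present
if [ -f valohai.crt ]; then
  openssl x509 -in valohai.crt -noout -enddate
fi
```

A scheduled job could call cert_needs_rotation valohai.crt and send a reminder when it returns success.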
6. Bind the role to the user
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: User
    name: valohai
    apiGroup: rbac.authorization.k8s.io
    namespace: valohai-workers
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF
Manual (4) - Configure Kubernetes API Access
If you restrict access to the Kubernetes API by IP address, ask your Valohai contact which IPs to allow so that Valohai services can communicate with the API.
Manual (5) - Set Up Queue
A Redis instance is needed to pass specifications and logs between Valohai control services and workers.
Option 1: Valohai sets up the queue
Ask the Valohai team to set up a Redis queue for you.
Option 2: Self-managed queue
If you set up Redis yourself in your cluster, prepare this information to send to Valohai:
Redis address to access over the Internet (for Valohai control services)
Redis address to access from within the Kubernetes cluster
Redis access credentials
These can be supplied as Redis connection strings including credentials and address information for both cluster-internal and Internet access.
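For illustration, the sketch below composes connection strings in the widely used redis:// and rediss:// (TLS) URL form. Whether Valohai expects exactly this form is an assumption to confirm with the Valohai team, and the host names, port, and password shown are placeholders, not values from this document.

```shell
# Sketch: compose a Redis URL from its parts.
# Usage: redis_url HOST PORT PASSWORD [tls]
# Note: passwords containing special characters would need URL-encoding.
redis_url() {
  scheme="redis"
  if [ "${4:-}" = "tls" ]; then
    scheme="rediss"   # rediss:// denotes a TLS connection
  fi
  printf '%s://:%s@%s:%s\n' "$scheme" "$3" "$1" "$2"
}

# Placeholder values: one Internet-facing address, one cluster-internal address
redis_url redis.example.com 6379 'changeme' tls
redis_url redis.valohai-workers.svc.cluster.local 6379 'changeme'
```

The second address follows the Kubernetes service DNS pattern of service.namespace.svc.cluster.local; the service name redis is hypothetical.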
Manual (6) - Complete Setup
Send the following information to the Valohai team at [email protected]:
Cluster access:
Address where the Kubernetes cluster control plane API is available over the Internet
Port
Certificate authority data
Get the CA certificate by running:
kubectl get cm kube-root-ca.crt -o jsonpath="{['data']['ca\.crt']}"
You can also copy this information from your kubeconfig file under clusters[].cluster.
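The kubeconfig route can be sketched with kubectl config view, which reads the cluster entry of the currently active context. The function names and the KUBECTL variable are assumptions introduced here; note that the server address in your kubeconfig may be a private endpoint rather than the Internet-facing one Valohai needs.

```shell
# Sketch: read the API server URL and CA data from the active kubeconfig context.
KUBECTL="${KUBECTL:-kubectl}"

cluster_server() {
  "$KUBECTL" config view --raw --minify \
    -o jsonpath='{.clusters[0].cluster.server}'
}

cluster_ca_data() {
  # Base64-encoded CA bundle; empty if the kubeconfig references a CA file instead
  "$KUBECTL" config view --raw --minify \
    -o jsonpath='{.clusters[0].cluster.certificate-authority-data}'
}

# Split "https://host:port" into "host port"; the port defaults to 443.
# Does not handle IPv6 literal addresses.
split_host_port() {
  hostport="${1#*://}"
  hostport="${hostport%%/*}"
  case "$hostport" in
    *:*) echo "${hostport%:*} ${hostport##*:}" ;;
    *)   echo "$hostport 443" ;;
  esac
}
```

For example, split_host_port "$(cluster_server)" yields the host and port as two separate values to pass along.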
Authentication:
The valohai-token.txt file (from Manual 3a)
OR the valohai.crt client certificate (from Manual 3b)
Optional:
Redis access information (if you set up your own queue)
Namespace name (if not valohai-workers)
The Valohai team will complete the setup and confirm when your environment is ready.
Next Steps
After installation is complete:
1. Verify worker registration
Log in to app.valohai.com
Check that Kubernetes environments appear in your organization
Verify available instance types
2. Run test execution
Create a test project
Run a simple execution
Verify the job runs on your cluster
Check that outputs are saved correctly
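To confirm from the cluster side that the test execution ran there, listing the Jobs and Pods in the worker namespace is one option. This sketch assumes the default valohai-workers namespace; the function name and the KUBECTL and NAMESPACE variables are conveniences introduced here.

```shell
# Sketch: list the Jobs and Pods created in the worker namespace.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-valohai-workers}"

list_worker_jobs() {
  # -o wide adds extra columns, such as the node each pod was scheduled on
  "$KUBECTL" get jobs,pods -n "$NAMESPACE" -o wide
}
```

Running list_worker_jobs while the execution is in progress should show a Job and its pod in the namespace.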
3. Configure autoscaling
Set up cluster autoscaling, for example with Karpenter
Test scaling behavior with multiple jobs
Getting Help
Valohai Support: [email protected]
Include in support requests:
Kubernetes version
Cloud provider (if applicable)
Namespace used
kubectl version
Logs or error messages
Steps already attempted
