Workers Installation


Deploy Valohai workers to your Kubernetes cluster to run machine learning workloads as containerized jobs.

Overview

The Compute and Data Layer of Valohai can be deployed to a Kubernetes cluster.

This enables you to:

  • Encapsulate your machine learning workloads in an autoscaling Kubernetes cluster

  • Use existing Kubernetes infrastructure to manage your data science efforts

  • Securely access private databases and warehouses directly from cluster workers

  • Use cloud storage of your choosing to store training artifacts (trained models, preprocessed datasets, visualizations)

Technical Overview

Valohai Workers on Kubernetes are implemented using Kubernetes jobs. Valohai schedules jobs to the cluster and monitors their progress.

How it works:

Each job runs a single pod with two containers:

  1. ML workload container - Runs your machine learning code

  2. Sidecar container - Manages state, runtime logs, and file caching

The two containers share a volume that stores working files and any cached data. The sidecar container populates the volume when inputs are used and archives its contents when outputs are produced.
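
To see this structure on a running cluster, a sketch like the following lists each pod in the namespace together with the names of its containers. It assumes the valohai-workers namespace used later in this guide and that at least one execution is running. The helper name is hypothetical, and the actual container names are chosen by Valohai and not documented here.

```shell
# List each pod in the namespace with the names of its containers.
# The default namespace is the one used in this guide (an assumption
# if you installed into a different one).
list_worker_containers() {
    namespace="${1:-valohai-workers}"
    kubectl get pods -n "$namespace" \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
}

# Usage (needs cluster access):
#   list_worker_containers valohai-workers
```

A pod created for a Valohai job should show two container names on its line.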

Cleanup:

Valohai uses a configurable Kubernetes cron job to periodically scan and remove caches that haven't been used recently.

Autoscaling:

Because Valohai relies on native Kubernetes features, resources scale automatically if autoscaling is already set up in the cluster, for example with Karpenter.

Requirements

1. You have a Valohai subscription

Contact Valohai support if you need to set up an account.

2. You have admin access to the Kubernetes cluster using kubectl

If you don't have an existing cluster, learn more at:

Installation

You can install Valohai Workers on Kubernetes either with Helm (recommended) or manually.

Helm is a package manager for Kubernetes that allows installing and upgrading applications with ease.

Prerequisites:

  • Helm installed

  • kubectl configured to access your cluster

  • custom-values.yaml file from Valohai

Steps:

Contact Valohai support to receive the required custom-values.yaml file. Discuss any specific needs and limitations, as various details can be configured.

Install the Helm chart:

helm repo add valohai --force-update https://dist.valohai.com/charts/
helm upgrade --install \
    -n valohai-workers \
    --create-namespace \
    valohai-workers \
    valohai/valohai-workers \
    -f custom-values.yaml

Note: These same commands can be used to upgrade Valohai workers in the future.

Complete the installation:

Once installation completes, supply the installer output to the Valohai team along with connection information to your Kubernetes API (e.g., hostname, port) to complete the integration.

Installer output looks incomplete? The output might be incomplete with placeholders if Helm reports back before resources are fully initialized. Wait a moment for Kubernetes to complete creation, then rerun the helm upgrade --install command to get complete output.
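
Before rerunning the installer, it can help to confirm that the release exists and that Kubernetes has finished creating resources. The sketch below assumes the release and namespace names from the install command above. The helper name is hypothetical, and the exact resources created by the chart are not listed in this guide, so the resource types queried here are only a reasonable guess.

```shell
# Show the Helm release status and common workload resources in the
# namespace. Names default to those used in the install command.
show_worker_install() {
    release="${1:-valohai-workers}"
    namespace="${2:-valohai-workers}"
    helm status "$release" -n "$namespace"
    kubectl get pods,jobs,cronjobs -n "$namespace"
}

# Usage (needs cluster access):
#   show_worker_install
```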

Manual Install

Helm install is recommended. Helm is more reliable and easier to upgrade in the future. Manual install is provided for those who cannot use Helm or have specific requirements not covered by the Helm chart.

Manual installation requires setting up Kubernetes RBAC (role-based access control) to grant Valohai access privileges.

Valohai can be authenticated by either:

  • A service account token (Manual 3a)

  • A client certificate (Manual 3b)

Manual (1) - Create Namespace

kubectl create namespace valohai-workers

We recommend naming the namespace valohai-workers for simplicity. These instructions assume the namespace and object names are set as given. Make note of any differences.

Manual (2) - Create Roles and Permissions

The minimal permissions policies are:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "update", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "update", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "update", "delete"]
EOF

Upcoming features might require additional Kubernetes permissions, in which case this minimal permission set will need to be extended.

Alternative: Full permissions

If minimizing permissions is not a concern, full permissions over the valohai-workers namespace can be granted:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: valohai-workers
  name: valohai-workers-runner
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
EOF

Manual (3a) - Create Service Account

Create a valohai service account with an access token:

export ACCOUNT=valohai

kubectl create serviceaccount $ACCOUNT -n valohai-workers

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  namespace: valohai-workers
  name: $ACCOUNT-token
  annotations:
    kubernetes.io/service-account.name: $ACCOUNT
EOF

Save the secret access token:

kubectl get secret $ACCOUNT-token -n valohai-workers -o jsonpath='{.data.token}' \
    | base64 --decode > $ACCOUNT-token.txt
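
If the secret was read before Kubernetes populated it, the saved file can end up empty. Service account tokens are JSON Web Tokens, so a quick sanity check is to confirm the file holds three dot-separated segments. The helper below is a hypothetical sketch of that check; it does not verify the signature or the token's permissions.

```shell
# Return success if the file holds a non-empty token made of exactly
# three dot-separated segments (header.payload.signature).
check_token_file() {
    token="$(cat "$1" 2>/dev/null)" || return 1
    case "$token" in
        *.*.*.*)  return 1 ;;  # more than three segments
        ?*.?*.?*) return 0 ;;  # three non-empty segments
        *)        return 1 ;;  # empty or malformed
    esac
}

# Usage:
#   check_token_file valohai-token.txt || echo "Token file looks wrong"
```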

Bind the role to the service account:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: ServiceAccount
    name: valohai
    namespace: valohai-workers
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF
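
To confirm that the binding took effect, you can ask the API server what the service account is allowed to do. This sketch uses kubectl auth can-i with impersonation and assumes your own user is permitted to impersonate service accounts, which cluster admins normally are. The helper name is hypothetical; the namespace and account names follow this guide.

```shell
# Ask the API server whether the valohai service account may perform
# an action in the valohai-workers namespace.
can_valohai() {
    verb="$1"
    resource="$2"
    kubectl auth can-i "$verb" "$resource" \
        -n valohai-workers \
        --as=system:serviceaccount:valohai-workers:valohai
}

# Usage (each call prints "yes" or "no"):
#   can_valohai create jobs.batch
#   can_valohai list pods
#   can_valohai create deployments.apps   # "no" with the minimal role
```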

Manual (3b) - Create Client Certificate

This step is adapted from the Kubernetes documentation.

1. Request CSR from Valohai team

Ask the Valohai team to create a certificate signing request (CSR) for the valohai user and send it to you.

2. Inspect the CSR

Verify the user (CN) and group (O) fields in the subject:

openssl req -in valohai.csr -noout -subject
# Expected subject: CN=valohai, O=valohai (exact formatting varies by OpenSSL version)

3. Encode and submit the CSR

# Encode as Base64
cat valohai.csr | base64 | tr -d "\n"

Substitute the Base64 encoded CSR into {base64 CSR} and submit to the cluster:

kubectl apply -f - <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: valohai
spec:
  request: {base64 CSR}
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 31536000  # 365 days
  usages:
    - client auth
EOF
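
As an alternative to pasting the encoded CSR by hand, the encoding and substitution can be combined. The following is a sketch with a hypothetical helper name; it prints the same manifest as above with the request field filled in from the given file.

```shell
# Print a CertificateSigningRequest manifest with the CSR file embedded
# as single-line Base64. Field values mirror the manifest in this step.
render_csr_manifest() {
    csr_file="$1"
    encoded="$(base64 < "$csr_file" | tr -d '\n')"
    cat <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: valohai
spec:
  request: $encoded
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 31536000  # 365 days
  usages:
    - client auth
EOF
}

# Usage (needs cluster access):
#   render_csr_manifest valohai.csr | kubectl apply -f -
```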

Note: You can request a shorter certificate duration. Repeat these steps to issue a new certificate before the current one expires.

4. Sign the request

kubectl certificate approve valohai

5. Export the signed certificate

kubectl get csr valohai -o jsonpath='{.status.certificate}' | base64 --decode > valohai.crt

Put aside the generated valohai.crt certificate. It will be transferred to the Valohai team later along with other needed information.

Certificate expiry: Client certificates expire in one year (365 days) by default. Repeat this process to rotate certificates a few weeks prior to expiry. Use certificate management or calendar reminders to avoid expiration.
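
One way to automate the reminder is the -checkend option of openssl x509, which exits non-zero when a certificate expires within a given number of seconds. The sketch below wraps it in a hypothetical helper; the 21-day threshold in the usage example is only an illustration of "a few weeks".

```shell
# Succeed if the certificate stays valid for at least the given number
# of days; fail if it expires sooner or cannot be read.
cert_valid_for_days() {
    cert_file="$1"
    days="$2"
    openssl x509 -in "$cert_file" -noout \
        -checkend "$((days * 24 * 3600))" >/dev/null
}

# Usage:
#   openssl x509 -in valohai.crt -noout -subject -enddate
#   cert_valid_for_days valohai.crt 21 || echo "Rotate valohai.crt soon"
```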

6. Bind the role to the user

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: valohai-workers-runner-binding
  namespace: valohai-workers
subjects:
  - kind: User
    name: valohai
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: valohai-workers-runner
  apiGroup: rbac.authorization.k8s.io
EOF

Manual (4) - Configure Kubernetes API Access

If you restrict access to the Kubernetes API by IP address, ask your Valohai contact which IPs to allow so that Valohai services can communicate with the API.

Manual (5) - Set Up Queue

A Redis instance is needed to pass specifications and logs between Valohai control services and workers.

Option 1: Valohai sets up the queue

Ask the Valohai team to set up a Redis queue for you.

Option 2: Self-managed queue

If you set up Redis yourself in your cluster, prepare this information to send to Valohai:

  • Redis address to access over the Internet (for Valohai control services)

  • Redis address to access from within the Kubernetes cluster

  • Redis access credentials

These can be supplied as Redis connection strings including credentials and address information for both cluster-internal and Internet access.
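
As an illustration, connection strings in the common Redis URI form might look like the sketch below. Every hostname, port, and password here is a placeholder, and whether your setup needs TLS (the rediss:// scheme) is an assumption to confirm with Valohai. The hypothetical redis_ping helper uses redis-cli to check that a URL is reachable with the given credentials.

```shell
# Placeholder connection strings: replace host, port, and password.
# "rediss://" selects TLS; "redis://" is unencrypted.
REDIS_URL_INTERNET="rediss://:CHANGE_ME@redis.example.com:6379"
REDIS_URL_CLUSTER="redis://:CHANGE_ME@redis.valohai-workers.svc.cluster.local:6379"

# Send PING to a Redis URL; a reachable server with valid credentials
# answers PONG.
redis_ping() {
    redis-cli -u "$1" ping
}

# Usage (needs network access to the Redis instance):
#   redis_ping "$REDIS_URL_INTERNET"
```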

Manual (6) - Complete Setup

Send the following information to the Valohai team:

Cluster access:

  • Address where the Kubernetes cluster control plane API is available over the Internet

  • Port

  • Certificate authority data

Get the CA certificate by running:

kubectl get cm kube-root-ca.crt -o jsonpath="{['data']['ca\.crt']}"

You can also copy this information from your kubeconfig file under clusters[].cluster.
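
The address and port can be read from the same kubeconfig entry. The sketch below prints the API server URL of the current context and splits it into host and port; both helper names are hypothetical. Note that the kubeconfig may point at a private endpoint, so check that the address is the one reachable over the Internet.

```shell
# Print the API server URL of the current kubectl context.
api_server_url() {
    kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
}

# Split a URL such as https://203.0.113.10:6443 into "host port".
# Falls back to 443 when the URL has no explicit port.
# Bracketed IPv6 literals are not handled.
split_api_url() {
    hostport="${1#*://}"        # drop the scheme
    hostport="${hostport%%/*}"  # drop any path
    case "$hostport" in
        *:*) echo "${hostport%%:*} ${hostport##*:}" ;;
        *)   echo "$hostport 443" ;;
    esac
}

# Usage (needs a configured kubeconfig):
#   split_api_url "$(api_server_url)"
```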

Authentication:

  • The valohai-token.txt file (from Manual 3a)

  • OR the valohai.crt client certificate (from Manual 3b)

Optional:

  • Redis access information (if you set up your own queue)

  • Namespace name (if not valohai-workers)

The Valohai team will complete the setup and confirm when your environment is ready.

Next Steps

After installation is complete:

1. Verify worker registration

  • Log in to app.valohai.com

  • Check that Kubernetes environments appear in your organization

  • Verify available instance types

2. Run test execution

  • Create a test project

  • Run a simple execution

  • Verify the job runs on your cluster

  • Check that outputs are saved correctly

3. Configure autoscaling

  • Set up cluster autoscaler, for example with Karpenter

  • Test scaling behavior with multiple jobs

Getting Help

Contact Valohai support by email.

Include in support requests:

  • Kubernetes version

  • Cloud provider (if applicable)

  • Namespace used

  • kubectl version

  • Logs or error messages

  • Steps already attempted
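
Most of these details can be gathered in one pass. The following sketch writes version information, workload status, and recent events to a text file; the helper name and default output file name are hypothetical. Review the file for sensitive data before sending it.

```shell
# Gather basic diagnostics into one text file for a support request.
collect_support_info() {
    namespace="${1:-valohai-workers}"
    out_file="${2:-valohai-support-info.txt}"
    {
        echo "== kubectl and cluster version =="
        kubectl version
        echo "== namespace: $namespace =="
        kubectl get pods,jobs -n "$namespace"
        echo "== recent events =="
        kubectl get events -n "$namespace" --sort-by=.lastTimestamp
    } > "$out_file" 2>&1
}

# Usage (needs cluster access):
#   collect_support_info valohai-workers support-info.txt
```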
