Kubernetes Autoscaling
Configure Kubernetes autoscaling for Valohai workers using Karpenter or other autoscaling solutions
Configure autoscaling for your Kubernetes cluster to dynamically provision resources for Valohai ML workloads.
Overview
Valohai Workers on Kubernetes are implemented as Kubernetes jobs. When you have autoscaling configured, your cluster can automatically:
Scale up nodes when Valohai jobs are queued
Scale down nodes when jobs complete and resources are idle
Select appropriate instance types based on job requirements
Optimize costs by using spot/preemptible instances
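For example, a quick way to see this cycle in practice is to watch the worker pods and the cluster's nodes while executions are queued (the valohai-workers namespace below matches the troubleshooting commands later in this guide; adjust it if your installation uses a different one):
# Pending pods in the worker namespace indicate jobs waiting for capacity
kubectl get pods -n valohai-workers
# Watch nodes being added by the autoscaler and removed again when idle
kubectl get nodes -w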
Note: This guide uses AWS EKS with Karpenter as an example, but the concepts apply to any Kubernetes cluster. The same principles work with GKE, AKS, or on-premises Kubernetes using different autoscalers.
Autoscaling Options
You can use various autoscaling solutions with Valohai:
Karpenter (Recommended for AWS EKS)
Best for: AWS EKS clusters
Advantages:
Fast node provisioning (seconds vs. minutes)
Flexible instance selection
Bin-packing optimization
Direct EC2 API integration
Cloud support: AWS (native), Azure and GCP (experimental)
Cluster Autoscaler
Best for: Multi-cloud environments, stable workloads
Advantages:
Cloud-agnostic
Mature and widely used
Works with all major cloud providers
Simple configuration
Cloud support: AWS, GCP, Azure, and others
Cloud-Native Autoscalers
GKE Autopilot: Fully managed node provisioning on GKE
AKS Cluster Autoscaler: Azure's native autoscaling
Best for: Organizations standardized on one cloud provider
Example: Karpenter on AWS EKS
This section provides a complete example of setting up Karpenter on AWS EKS. If you're using a different cloud provider or autoscaler, adapt these concepts to your environment.
Requirements
Existing infrastructure:
An EKS cluster with at least one managed node group
An IAM OIDC provider associated with the cluster (verified in Step 1)
Permissions:
Admin access to your EKS cluster
IAM permissions to create roles and policies
Step 1: Set Up Environment Variables
Verify that the cluster has an IAM OIDC provider configured, then define common variables for reuse:
# Check if OIDC is configured
aws iam list-open-id-connect-providers
# Should show: oidc.eks.<region>.amazonaws.com/id/<ID>
export AWS_PROFILE=<aws-profile>
export AWS_REGION=<region>
export KUBECONFIG=~/.kube/<cluster-name>
export CLUSTER=<cluster-name>  # export so envsubst can substitute it in the node pool manifests later
KARPENTER_NAMESPACE=kube-system
AWS_PARTITION="aws"
OIDC_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER} --query "cluster.identity.oidc.issuer" --output text)"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
Step 2: Create IAM Roles
Create two IAM roles: one for nodes provisioned by Karpenter and one for the Karpenter controller.
Create node trust policy:
echo '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}' > node-trust-policy.json
Create node role:
aws iam create-role \
--role-name "KarpenterNodeRole-${CLUSTER}" \
--assume-role-policy-document file://node-trust-policy.json
aws iam attach-role-policy \
--role-name "KarpenterNodeRole-${CLUSTER}" \
--policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy \
--role-name "KarpenterNodeRole-${CLUSTER}" \
--policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy \
--role-name "KarpenterNodeRole-${CLUSTER}" \
--policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy \
--role-name "KarpenterNodeRole-${CLUSTER}" \
--policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonSSMManagedInstanceCore
Create controller trust policy:
cat << EOF > controller-trust-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
"${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:${KARPENTER_NAMESPACE}:karpenter"
}
}
}
]
}
EOF
Create controller role:
aws iam create-role \
--role-name KarpenterControllerRole-${CLUSTER} \
--assume-role-policy-document file://controller-trust-policy.json
Create controller policy:
cat << EOF > controller-policy.json
{
"Statement": [
{
"Action": [
"ssm:GetParameter",
"ec2:DescribeImages",
"ec2:RunInstances",
"ec2:DescribeSubnets",
"ec2:DescribeSecurityGroups",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:DescribeInstanceTypeOfferings",
"ec2:DescribeAvailabilityZones",
"ec2:DeleteLaunchTemplate",
"ec2:CreateTags",
"ec2:CreateLaunchTemplate",
"ec2:CreateFleet",
"ec2:DescribeSpotPriceHistory",
"pricing:GetProducts"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "Karpenter"
},
{
"Action": "ec2:TerminateInstances",
"Condition": {
"StringLike": {
"ec2:ResourceTag/karpenter.sh/nodepool": "*"
}
},
"Effect": "Allow",
"Resource": "*",
"Sid": "ConditionalEC2Termination"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER}",
"Sid": "PassNodeIAMRole"
},
{
"Effect": "Allow",
"Action": "eks:DescribeCluster",
"Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER}",
"Sid": "EKSClusterEndpointLookup"
},
{
"Sid": "AllowScopedInstanceProfileCreationActions",
"Effect": "Allow",
"Resource": "*",
"Action": [
"iam:CreateInstanceProfile"
],
"Condition": {
"StringEquals": {
"aws:RequestTag/kubernetes.io/cluster/${CLUSTER}": "owned",
"aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
},
"StringLike": {
"aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
}
}
},
{
"Sid": "AllowScopedInstanceProfileTagActions",
"Effect": "Allow",
"Resource": "*",
"Action": [
"iam:TagInstanceProfile"
],
"Condition": {
"StringEquals": {
"aws:ResourceTag/kubernetes.io/cluster/${CLUSTER}": "owned",
"aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}",
"aws:RequestTag/kubernetes.io/cluster/${CLUSTER}": "owned",
"aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
},
"StringLike": {
"aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*",
"aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
}
}
},
{
"Sid": "AllowScopedInstanceProfileActions",
"Effect": "Allow",
"Resource": "*",
"Action": [
"iam:AddRoleToInstanceProfile",
"iam:RemoveRoleFromInstanceProfile",
"iam:DeleteInstanceProfile"
],
"Condition": {
"StringEquals": {
"aws:ResourceTag/kubernetes.io/cluster/${CLUSTER}": "owned",
"aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}"
},
"StringLike": {
"aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
}
}
},
{
"Sid": "AllowInstanceProfileReadActions",
"Effect": "Allow",
"Resource": "*",
"Action": "iam:GetInstanceProfile"
}
],
"Version": "2012-10-17"
}
EOF
Attach policy to role:
aws iam put-role-policy \
--role-name KarpenterControllerRole-${CLUSTER} \
--policy-name KarpenterControllerPolicy-${CLUSTER} \
--policy-document file://controller-policy.json
Step 3: Tag Resources
Tag node group subnets and security groups so Karpenter knows which resources to use:
Tag subnets:
for NODEGROUP in $(aws eks list-nodegroups --cluster-name ${CLUSTER} \
--query 'nodegroups' --output text); do
aws ec2 create-tags \
--tags "Key=karpenter.sh/discovery,Value=${CLUSTER}" \
--resources $(aws eks describe-nodegroup --cluster-name ${CLUSTER} \
--nodegroup-name $NODEGROUP --query 'nodegroup.subnets' --output text)
done
Tag security group:
NODEGROUP=$(aws eks list-nodegroups --cluster-name ${CLUSTER} --query 'nodegroups[0]' --output text)
SECURITY_GROUPS=$(aws eks describe-cluster \
--name ${CLUSTER} \
--query "cluster.resourcesVpcConfig.clusterSecurityGroupId" \
--output text)
aws ec2 create-tags \
--tags "Key=karpenter.sh/discovery,Value=${CLUSTER}" \
--resources ${SECURITY_GROUPS}
Step 4: Update aws-auth ConfigMap
Allow nodes with the KarpenterNodeRole to join the cluster:
cat << EOF
  - groups:
      - system:bootstrappers
      - system:nodes
    rolearn: arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER}
    username: system:node:{{EC2PrivateDNSName}}
EOF
Add the output to the mapRoles in the aws-auth ConfigMap:
kubectl edit configmap aws-auth -n kube-system
Step 5: Deploy Karpenter
Set Karpenter version:
export KARPENTER_VERSION=v0.33.1
Generate Karpenter manifests:
helm template karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "${KARPENTER_VERSION}" \
--namespace "${KARPENTER_NAMESPACE}" \
--set "settings.clusterName=${CLUSTER}" \
--set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi > karpenter.yaml
Modify affinity rules:
Edit karpenter.yaml so the Karpenter controller pods are scheduled onto the existing node group nodes rather than onto nodes Karpenter itself provisions:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
        - matchExpressions:
            - key: eks.amazonaws.com/nodegroup
              operator: In
              values:
                - ${NODEGROUP}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
Deploy Karpenter CRDs:
kubectl create -f \
https://raw.githubusercontent.com/aws/karpenter-provider-aws/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodepools.yaml
kubectl create -f \
https://raw.githubusercontent.com/aws/karpenter-provider-aws/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml
kubectl create -f \
https://raw.githubusercontent.com/aws/karpenter-provider-aws/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodeclaims.yamlDeploy Karpenter:
kubectl apply -f karpenter.yaml
Step 6: Create Node Pools
Create node pools for different workload types.
CPU Node Pool:
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: default
  limits:
    cpu: 100
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER}"
EOF
GPU Node Pool (Optional):
If using GPUs, install the NVIDIA device plugin first:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Check version and use it in the command below
helm search repo nvdp --devel
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version <version>
Create GPU node pool:
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-gpu
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: default
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: 100
    memory: 1000Gi
    nvidia.com/gpu: 5
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
EOF
Step 7: Monitor Scaling
Follow Karpenter logs to see scaling activity:
kubectl logs -f -n ${KARPENTER_NAMESPACE} -c controller -l app.kubernetes.io/name=karpenter
Test scaling:
Create a Valohai execution and watch Karpenter provision nodes automatically.
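A convenient way to observe the provisioning itself is to watch Karpenter's NodeClaims and the resulting nodes while the execution is queued:
# Watch Karpenter create NodeClaims for the pending workload
kubectl get nodeclaims -w
# Watch the corresponding instances register as nodes, with their node pool and instance type
kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -w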
Adapting to Other Environments
The concepts above apply to other Kubernetes environments. Here's how to adapt:
Google Cloud (GKE)
Use GKE Cluster Autoscaler:
gcloud container clusters update CLUSTER_NAME \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--node-pool=default-pool
Or use GKE Autopilot for fully managed node provisioning.
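As a sketch, an Autopilot cluster can be created with a single command (CLUSTER_NAME and REGION are placeholders):
# Autopilot provisions and scales nodes automatically; there are no node pools to manage
gcloud container clusters create-auto CLUSTER_NAME \
--region=REGION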
Azure (AKS)
Use AKS Cluster Autoscaler:
az aks update \
--resource-group RESOURCE_GROUP \
--name CLUSTER_NAME \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
On-Premises or Custom Kubernetes
Use Kubernetes Cluster Autoscaler:
Install Cluster Autoscaler following Kubernetes documentation.
Configure it to work with your infrastructure provider (vSphere, OpenStack, etc.).
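As a rough sketch, the Cluster Autoscaler is commonly installed from its official Helm chart; the cloudProvider value and discovery settings below are placeholders that must be adapted to your platform:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
# Example only: set cloudProvider and credentials for your environment
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set cloudProvider=clusterapi \
--set autoDiscovery.clusterName=<cluster-name>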
Best Practices
Node Pool Configuration
Separate pools for different workloads:
CPU-intensive: c instance family
Memory-intensive: r instance family
GPU workloads: p or g instance family
Cost optimization:
Use spot/preemptible instances for interruptible workloads
Set appropriate limits to prevent runaway costs
Configure consolidation for efficient resource usage
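To see how these choices play out on a Karpenter-managed cluster, list nodes with their capacity type, instance type, and node pool labels:
# karpenter.sh/capacity-type shows spot vs. on-demand for Karpenter-provisioned nodes
kubectl get nodes \
-L karpenter.sh/capacity-type \
-L node.kubernetes.io/instance-type \
-L karpenter.sh/nodepool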
Resource Requests
Set accurate requests in Valohai:
CPU and memory requests help autoscaler make better decisions
Over-requesting wastes resources
Under-requesting causes scheduling failures
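To sanity-check what a worker pod actually requests versus what its node can offer, compare the pod's resources with the node's allocatable capacity (POD_NAME and NODE_NAME are placeholders):
# Show the CPU/memory requests and limits of a worker pod
kubectl get pod POD_NAME -n valohai-workers -o jsonpath='{.spec.containers[*].resources}{"\n"}'
# Compare against what the node can actually allocate
kubectl describe node NODE_NAME | grep -A 8 Allocatable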
Scaling Parameters
Balance speed and cost:
Fast scale-up for time-sensitive workloads
Gradual scale-down to avoid thrashing
Appropriate consolidation policies
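As one example of tuning scale-down on Karpenter, a NodePool can be switched to consolidate only empty nodes and to wait a few minutes before doing so (a sketch against the karpenter.sh/v1beta1 API used in this guide; adjust the values to your workloads):
# Only remove nodes once they are empty, and wait 5 minutes before consolidating
kubectl patch nodepool default --type merge -p \
'{"spec":{"disruption":{"consolidationPolicy":"WhenEmpty","consolidateAfter":"5m"}}}'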
Troubleshooting
Nodes not scaling up
Check Karpenter logs:
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter
Common issues:
IAM permissions insufficient
No matching node pool for job requirements
Instance type not available in region
Subnet or security group not tagged
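The checks below cover most of these cases (the resource names are the ones created earlier in this guide):
# Confirm the NodePool and EC2NodeClass exist and are not at their limits
kubectl get nodepools
kubectl describe nodepool default
# Inspect failed or stuck provisioning attempts
kubectl get nodeclaims
# Verify the discovery tags Karpenter relies on
aws ec2 describe-subnets \
--filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER}" \
--query 'Subnets[].SubnetId'
aws ec2 describe-security-groups \
--filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER}" \
--query 'SecurityGroups[].GroupId'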
Nodes not scaling down
Check disruption settings:
Verify consolidation policy
Check if nodes have workloads preventing disruption
Review expiration settings
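For example, pods carrying Karpenter's karpenter.sh/do-not-disrupt annotation block consolidation of their node; a quick check (this sketch assumes jq is installed):
# List pods that explicitly opt out of disruption
kubectl get pods -A -o json | jq -r \
'.items[] | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true") | "\(.metadata.namespace)/\(.metadata.name)"'
# Review the NodePool's disruption settings
kubectl get nodepool default -o yaml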
Force disruption (use with caution):
kubectl delete node NODE_NAME
Jobs stuck pending
Describe the pod:
kubectl describe pod POD_NAME -n valohai-workers
Check events:
kubectl get events -n valohai-workers --sort-by='.lastTimestamp'
Common issues:
Resource requests too large
No node pool matches requirements
Taints preventing scheduling
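To check whether a taint is the blocker, compare node taints against the stuck pod's tolerations (POD_NAME is a placeholder):
# List taints on all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Show the tolerations of the stuck pod
kubectl get pod POD_NAME -n valohai-workers -o jsonpath='{.spec.tolerations}{"\n"}'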
Getting Help
Valohai Support: [email protected]
Include in support requests:
Kubernetes version
Autoscaler type and version
Node pool configurations
Pod descriptions and events
Autoscaler logs
For Karpenter-specific issues:
Karpenter logs
NodePool and EC2NodeClass definitions
AWS IAM role configuration