Kubernetes Autoscaling

Configure autoscaling for your Kubernetes cluster to dynamically provision resources for Valohai ML workloads.

Overview

Valohai Workers on Kubernetes are implemented as Kubernetes Jobs. With autoscaling configured, your cluster can automatically:

  • Scale up nodes when Valohai jobs are queued

  • Scale down nodes when jobs complete and resources are idle

  • Select appropriate instance types based on job requirements

  • Optimize costs by using spot/preemptible instances

Note: This guide uses AWS EKS with Karpenter as an example, but the concepts apply to any Kubernetes cluster. The same principles work with GKE, AKS, or on-premises Kubernetes using different autoscalers.

Autoscaling Options

You can use various autoscaling solutions with Valohai:

Karpenter

Best for: AWS EKS clusters

Advantages:

  • Fast node provisioning (seconds vs. minutes)

  • Flexible instance selection

  • Bin-packing optimization

  • Direct EC2 API integration

Cloud support: AWS (native), Azure and GCP (experimental)

Cluster Autoscaler

Best for: Multi-cloud environments, stable workloads

Advantages:

  • Cloud-agnostic

  • Mature and widely used

  • Works with all major cloud providers

  • Simple configuration

Cloud support: AWS, GCP, Azure, and others

Cloud-Native Autoscalers

GKE Autopilot: Fully managed node provisioning on GKE

AKS Cluster Autoscaler: Azure's native autoscaling

Best for: Organizations standardized on one cloud provider

Example: Karpenter on AWS EKS

This section provides a complete example of setting up Karpenter on AWS EKS. If you're using a different cloud provider or autoscaler, adapt these concepts to your environment.

Requirements

Existing infrastructure:

  • EKS cluster with Valohai workers installed

  • AWS CLI installed

  • kubectl configured

Permissions:

  • Admin access to your EKS cluster

  • IAM permissions to create roles and policies

Step 1: Set Up Environment Variables

Define common variables for reuse:
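For example (a minimal sketch; the cluster name and region are placeholders to replace with your own):

    export CLUSTER_NAME="my-valohai-cluster"
    export AWS_REGION="us-east-1"
    export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
    export OIDC_ENDPOINT="$(aws eks describe-cluster --name "${CLUSTER_NAME}" \
        --query 'cluster.identity.oidc.issuer' --output text)"

The later steps assume these variables remain set in the same shell session.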

Step 2: Create IAM Roles

Create two IAM roles: one for nodes provisioned by Karpenter and one for the Karpenter controller.

Create node trust policy:
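A standard EC2 trust policy, written to a local file for the next command:

    cat > node-trust-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": "ec2.amazonaws.com" },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    EOF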

Create node role:
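The role name below follows the KarpenterNodeRole-<cluster> convention from Karpenter's own guides; the four managed policies are the usual EKS worker-node set:

    aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
        --assume-role-policy-document file://node-trust-policy.json

    # Attach the managed policies every EKS worker node needs
    for POLICY in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy \
        AmazonEC2ContainerRegistryReadOnly AmazonSSMManagedInstanceCore; do
      aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
          --policy-arn "arn:aws:iam::aws:policy/${POLICY}"
    done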

Create controller trust policy:
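An IRSA trust policy that lets the karpenter service account assume the controller role; this sketch assumes Karpenter will be installed into the kube-system namespace (as in Step 5) and that an IAM OIDC provider already exists for the cluster:

    cat > controller-trust-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
              "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:kube-system:karpenter"
            }
          }
        }
      ]
    }
    EOF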

Create controller role:
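Using the trust policy from the previous step:

    aws iam create-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
        --assume-role-policy-document file://controller-trust-policy.json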

Create controller policy:
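An abridged sketch of the permissions Karpenter needs to discover capacity and launch instances; for production, use the full scoped-down policy from the Karpenter documentation for your version:

    cat > controller-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:CreateFleet",
            "ec2:CreateLaunchTemplate",
            "ec2:CreateTags",
            "ec2:DeleteLaunchTemplate",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeImages",
            "ec2:DescribeInstances",
            "ec2:DescribeInstanceTypeOfferings",
            "ec2:DescribeInstanceTypes",
            "ec2:DescribeLaunchTemplates",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:RunInstances",
            "ec2:TerminateInstances",
            "eks:DescribeCluster",
            "iam:PassRole",
            "pricing:GetProducts",
            "ssm:GetParameter"
          ],
          "Resource": "*"
        }
      ]
    }
    EOF
    aws iam create-policy --policy-name "KarpenterControllerPolicy-${CLUSTER_NAME}" \
        --policy-document file://controller-policy.json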

Attach policy to role:
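For instance:

    aws iam attach-role-policy --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
        --policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}"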

Step 3: Tag Resources

Tag node group subnets and security groups so Karpenter knows which resources to use:

Tag subnets:
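One way to tag every subnet used by your existing node groups with the karpenter.sh/discovery tag that the node pools in Step 6 select on:

    for NODEGROUP in $(aws eks list-nodegroups --cluster-name "${CLUSTER_NAME}" \
        --query 'nodegroups' --output text); do
      aws ec2 create-tags \
          --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
          --resources $(aws eks describe-nodegroup --cluster-name "${CLUSTER_NAME}" \
              --nodegroup-name "${NODEGROUP}" --query 'nodegroup.subnets' --output text)
    done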

Tag security group:
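This sketch tags the cluster's shared security group; if your nodes use dedicated security groups, tag those instead:

    SECURITY_GROUP_ID="$(aws eks describe-cluster --name "${CLUSTER_NAME}" \
        --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)"
    aws ec2 create-tags \
        --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
        --resources "${SECURITY_GROUP_ID}"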

Step 4: Update aws-auth ConfigMap

Allow nodes with the KarpenterNodeRole to join the cluster:
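One way to print the mapping entry (the {{EC2PrivateDNSName}} template must stay literal):

    cat <<EOF
    - groups:
        - system:bootstrappers
        - system:nodes
      rolearn: arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}
      username: system:node:{{EC2PrivateDNSName}}
    EOF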

Add the output to the mapRoles section of the aws-auth ConfigMap:
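Open the ConfigMap for editing and paste the entry under mapRoles:

    kubectl edit configmap aws-auth -n kube-system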

Step 5: Deploy Karpenter

Set Karpenter version:
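The version below is only a placeholder; check the Karpenter releases page for the current one:

    export KARPENTER_VERSION="1.0.6"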

Generate Karpenter manifests:
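A sketch that renders the official Helm chart to a local file; the kube-system namespace and controller resource requests are common choices, not requirements:

    helm template karpenter oci://public.ecr.aws/karpenter/karpenter \
        --version "${KARPENTER_VERSION}" \
        --namespace kube-system \
        --set settings.clusterName="${CLUSTER_NAME}" \
        --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
        --set controller.resources.requests.cpu=1 \
        --set controller.resources.requests.memory=1Gi \
        > karpenter.yaml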

Modify affinity rules:

Edit karpenter.yaml to pin the Karpenter controller to your existing node group, so it never schedules onto nodes it provisions itself:
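For example, set this node affinity on the controller Deployment (NODEGROUP is a placeholder for your node group's name):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: karpenter.sh/nodepool
                  operator: DoesNotExist
                - key: eks.amazonaws.com/nodegroup
                  operator: In
                  values:
                    - "${NODEGROUP}"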

Deploy Karpenter CRDs:
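The raw-manifest paths below follow the layout of the aws/karpenter-provider-aws repository; verify them against your pinned version:

    kubectl create -f "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodepools.yaml"
    kubectl create -f "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodeclaims.yaml"
    kubectl create -f "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml"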

Deploy Karpenter:
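Apply the rendered manifests and confirm the controller pods come up:

    kubectl apply -f karpenter.yaml
    kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter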

Step 6: Create Node Pools

Create node pools for different workload types.

CPU Node Pool:
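A sketch targeting the Karpenter v1 API; the instance families, CPU limit, AMI alias, and consolidation window are illustrative values to tune for your workloads:

    cat <<EOF | kubectl apply -f -
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: cpu
    spec:
      template:
        spec:
          requirements:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["c5", "m5", "r5"]
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default
      limits:
        cpu: 100
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 1m
    ---
    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiSelectorTerms:
        - alias: al2023@latest
      role: "KarpenterNodeRole-${CLUSTER_NAME}"
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: "${CLUSTER_NAME}"
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: "${CLUSTER_NAME}"
    EOF

The EC2NodeClass discovers the subnets and security groups you tagged in Step 3.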

GPU Node Pool (Optional):

If using GPUs, install the NVIDIA device plugin first:
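One way is the NVIDIA Helm chart, which deploys the plugin that advertises nvidia.com/gpu to the scheduler:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm install nvidia-device-plugin nvdp/nvidia-device-plugin --namespace kube-system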

Create GPU node pool:
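A sketch reusing the default EC2NodeClass above; the instance families, taint, and GPU limit are illustrative:

    cat <<EOF | kubectl apply -f -
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: gpu
    spec:
      template:
        spec:
          requirements:
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["g5", "p3"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand"]
          taints:
            - key: nvidia.com/gpu
              value: "true"
              effect: NoSchedule
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default
      limits:
        nvidia.com/gpu: 8
    EOF

The taint keeps non-GPU workloads off these nodes; GPU jobs must tolerate it.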

Step 7: Monitor Scaling

Follow Karpenter logs to see scaling activity:
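Assuming the kube-system install from Step 5:

    kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller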

Test scaling:

Create a Valohai execution and watch Karpenter provision nodes automatically.
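From the cluster side, you can watch nodes and Karpenter's backing nodeclaims appear:

    kubectl get nodes -w
    kubectl get nodeclaims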

Adapting to Other Environments

The concepts above apply to other Kubernetes environments. Here's how to adapt:

Google Cloud (GKE)

Use GKE Cluster Autoscaler:
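A sketch enabling autoscaling on an existing node pool; the cluster, pool, zone, and bounds are placeholders:

    gcloud container clusters update my-cluster \
        --enable-autoscaling \
        --node-pool default-pool \
        --min-nodes 0 \
        --max-nodes 10 \
        --zone us-central1-a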

Or use GKE Autopilot for fully managed node provisioning.

Azure (AKS)

Use AKS Cluster Autoscaler:
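A sketch for an existing cluster; names and bounds are placeholders:

    az aks update \
        --resource-group my-resource-group \
        --name my-aks-cluster \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 10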

On-Premises or Custom Kubernetes

Use Kubernetes Cluster Autoscaler:

Install Cluster Autoscaler following Kubernetes documentation.

Configure it to work with your infrastructure provider (vSphere, OpenStack, etc.).
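A sketch using the community Helm chart; the cloudProvider value depends on your environment (clusterapi shown here, which covers vSphere and OpenStack via Cluster API):

    helm repo add autoscaler https://kubernetes.github.io/autoscaler
    helm install cluster-autoscaler autoscaler/cluster-autoscaler \
        --namespace kube-system \
        --set cloudProvider=clusterapi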

Best Practices

Node Pool Configuration

Separate pools for different workloads:

  • CPU-intensive: compute-optimized instances (the AWS c family)

  • Memory-intensive: memory-optimized instances (the r family)

  • GPU workloads: accelerated instances (the p or g families)

Cost optimization:

  • Use spot/preemptible instances for interruptible workloads

  • Set appropriate limits to prevent runaway costs

  • Configure consolidation for efficient resource usage

Resource Requests

Set accurate requests in Valohai:

  • Accurate CPU and memory requests help the autoscaler pick right-sized nodes (see the sketch after this list)

  • Over-requesting wastes resources and can leave jobs unschedulable

  • Under-requesting overcommits nodes and risks out-of-memory failures at runtime
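For reference, the autoscaler sees only the Kubernetes resource requests on the job's pod, for example:

    resources:
      requests:
        cpu: "4"
        memory: 16Gi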

Scaling Parameters

Balance speed and cost:

  • Fast scale-up for time-sensitive workloads

  • Gradual scale-down to avoid thrashing

  • Appropriate consolidation policies (see the Karpenter sketch below)
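In Karpenter terms, these trade-offs live in the NodePool disruption block; illustrative values:

    disruption:
      consolidationPolicy: WhenEmptyOrUnderutilized
      consolidateAfter: 5m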

Troubleshooting

Nodes not scaling up

Check Karpenter logs:
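Recent controller logs usually name the constraint that blocked provisioning:

    kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller --tail=200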

Common issues:

  • IAM permissions insufficient

  • No matching node pool for job requirements

  • Instance type not available in region

  • Subnet or security group not tagged

Nodes not scaling down

Check disruption settings:

  • Verify consolidation policy

  • Check if nodes have workloads preventing disruption

  • Review expiration settings

Force disruption (careful):
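Deleting a node (or its backing nodeclaim) drains it and terminates the instance, evicting any workloads still running on it:

    kubectl delete node <node-name>
    # or, Karpenter-natively:
    kubectl delete nodeclaim <nodeclaim-name>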

Jobs stuck pending

Describe the pod:
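The Events section at the bottom of the output explains why scheduling failed:

    kubectl describe pod <pod-name> -n <namespace>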

Check events:
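Or scan recent events across the namespace:

    kubectl get events -n <namespace> --sort-by=.lastTimestamp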

Common issues:

  • Resource requests too large

  • No node pool matches requirements

  • Taints preventing scheduling

Getting Help

Valohai Support: [email protected]

Include in support requests:

  • Kubernetes version

  • Autoscaler type and version

  • Node pool configurations

  • Pod descriptions and events

  • Autoscaler logs

For Karpenter-specific issues:

  • Karpenter logs

  • NodePool and EC2NodeClass definitions

  • AWS IAM role configuration
