# EKS for Real-Time Inference

Configure your Amazon EKS cluster to deploy Valohai models for real-time inference.

> **Note:** This guide is specifically for real-time inference deployments. It does not enable using Kubernetes for standard Valohai workers. For Kubernetes workers, see our [installation guide](https://docs.valohai.com/installation-and-setup/kubernetes/workers).

## Overview

Valohai can push deployments to an existing EKS cluster using standard Kubernetes APIs.

**Requirements:**

* `app.valohai.com` (`34.248.245.191`, `63.34.156.112`) must be able to reach your cluster's API server over HTTPS
* Deployment endpoints do not have to be public: your cluster can be configured to serve only private deployment endpoints

## Prerequisites

**Existing infrastructure:**

* EKS cluster
* kubectl configured to access your cluster

**Tools:**

* kubectl installed
* AWS CLI installed and configured

## Step 1: Install Gateway API CRDs and Gateway controller

* **Install Gateway API CRDs:** Follow the [Getting Started guide](https://gateway-api.sigs.k8s.io/guides/getting-started/#install-standard-channel) in the official Gateway API documentation.
* **Install a Gateway controller:** Choose and install a controller from the [official implementations list](https://gateway-api.sigs.k8s.io/implementations/#gateway-controller-implementation-status).
* **Create a `Gateway`:** Configure a Gateway for Valohai deployments to use. It serves as the base URL (domain) for all Valohai deployment endpoints and handles TLS termination for `https://`. Refer to your chosen controller's documentation for the exact configuration.
* **Grant permissions to manage HTTP routes:** The role created in Step 3 includes the `httproutes` resource for this.

Note down the Gateway **name**, **namespace**, and whether it uses `https://` or `http://`. You'll need these in Step 6.
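Before moving on, it can help to confirm that the Gateway API resource types are registered in the cluster. The following is a minimal sketch, not part of the official setup: `check_gateway_crds` is a hypothetical helper name, and it assumes the standard-channel CRD names `gatewayclasses`, `gateways`, and `httproutes` in the `gateway.networking.k8s.io` group.

```shell
# Report which Gateway API CRDs are installed.
# Returns non-zero if any of them is missing.
check_gateway_crds() {
  local crd missing=0
  for crd in gatewayclasses gateways httproutes; do
    if kubectl get crd "${crd}.gateway.networking.k8s.io" >/dev/null 2>&1; then
      echo "ok: ${crd}"
    else
      echo "missing: ${crd}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_gateway_crds && kubectl get gateway --all-namespaces` first checks the CRDs and then lists the Gateways, including the name and namespace you need to note down.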

## Step 2: Create Kubernetes Service Account

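Create a service account that Valohai will use to manage deployments. The command below creates it in the namespace of your current `kubectl` context (usually `default`). To use another namespace, add `-n <namespace>` and use that same namespace in every later step.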

```shell
kubectl create serviceaccount valohai-deployment
```

### Create Service Account Token

For Kubernetes 1.24 and higher, token Secrets are no longer created automatically for service accounts, so create one manually:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: valohai-deployment-token
  namespace: <NAMESPACE HERE>
  annotations:
    kubernetes.io/service-account.name: valohai-deployment
EOF
```

Replace `<NAMESPACE HERE>` with your namespace (or use `default`). It must be the namespace that contains the `valohai-deployment` service account.

### Get the Token

Retrieve the token:

```shell
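# Confirm that the service account exists
kubectl get serviceaccounts valohai-deployment -o json
# Print the decoded token (add -n <namespace> if you are not using default)
kubectl get secret valohai-deployment-token -o jsonpath='{.data.token}' | base64 --decode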
```

Save this token value to provide to Valohai.
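A service account token is a JWT, so you can check which account and namespace it belongs to before sending it. The sketch below is an optional sanity check under stated assumptions: `token_subject` is a hypothetical helper name, and it assumes the payload is compact JSON containing a `"sub"` claim of the form `system:serviceaccount:<namespace>:<name>`.

```shell
# Print the subject ("sub") claim of a service account token.
token_subject() {
  local payload
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url omits.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 --decode | grep -o '"sub":"[^"]*"' | cut -d'"' -f4
}
```

For a token created as described above in the `default` namespace, `token_subject "$TOKEN"` should print `system:serviceaccount:default:valohai-deployment`.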

## Step 3: Create Kubernetes Role

Create a role that defines the permissions Valohai needs.

Create a file `valohai-deployment-role.yml`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: valohai-deployment-role
rules:
  - apiGroups: [""]
    resources: ["events", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["apps", "extensions"]
    resources: ["deployments", "deployments/rollback", "deployments/scale"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"]
```

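A `Role` always applies to a single namespace. Without a `namespace` field it is created in the namespace of your current `kubectl` context. To target a specific namespace, add `namespace: <NAMESPACE>` under `metadata`, and use the same namespace as the service account.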

Apply the role:

```shell
kubectl apply -f valohai-deployment-role.yml
```

## Step 4: Create Role Binding

Bind the role to the service account.

Replace `<namespace>` with the namespace of the service account. Use `default` if you did not specify one when creating it:

```shell
kubectl create rolebinding valohai-deployment-binding \
    --role=valohai-deployment-role \
    --serviceaccount=<namespace>:valohai-deployment
```
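To confirm that the binding took effect, you can impersonate the service account with `kubectl auth can-i`. This is a minimal sketch: `check_valohai_permissions` is a hypothetical helper name, the four checks are only a sample of the rules in `valohai-deployment-role`, and your own user needs permission to impersonate service accounts.

```shell
# Check a sample of the permissions granted by valohai-deployment-role.
# $1: namespace of the service account (defaults to "default").
check_valohai_permissions() {
  local ns="${1:-default}"
  local sa="system:serviceaccount:${ns}:valohai-deployment"
  local check verb resource failed=0
  for check in "create deployments.apps" "create services" "list pods" \
               "create httproutes.gateway.networking.k8s.io"; do
    verb="${check%% *}"
    resource="${check#* }"
    if kubectl auth can-i "$verb" "$resource" --as="$sa" -n "$ns" >/dev/null 2>&1; then
      echo "allowed: $check"
    else
      echo "denied: $check"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it as `check_valohai_permissions default`. Any `denied` line points at a missing rule or a role and binding created in different namespaces.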

## Step 5: Configure Repository Access

Ensure your cluster's nodes can pull from the repository that Valohai pushes images to.

### AWS IAM User for ECR

Create an IAM user that Valohai can use to access the cluster and push to your ECR.

**1. Create IAM user**

Navigate to **IAM** in AWS Console and create a user named `valohai-eks-user`.

* Enable **Programmatic access** and **Console access**

**2. Attach policies**

Attach these existing policies:

* `AmazonEC2ContainerRegistryFullAccess`
* `AmazonEKSServicePolicy`

**3. Create custom policy**

Create a new policy named `VH_EKS_USER`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Action": "eks:ListClusters",
      "Resource": "*"
    }
  ]
}
```

**4. Attach custom policy**

Go back to the user, refresh, and attach the `VH_EKS_USER` policy.

**5. Save credentials**

Store the access key and secret in a safe place. You'll provide these to Valohai.
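If you prefer the AWS CLI over the console, the steps above can be sketched roughly as follows. This is an illustration, not the official procedure: `create_valohai_eks_user` is a hypothetical helper name, it assumes the custom policy JSON shown above is saved as `vh-eks-user-policy.json`, and it does not enable console access.

```shell
# Create the IAM user, attach the policies, and create an access key.
# $1: path to the VH_EKS_USER policy document (defaults to vh-eks-user-policy.json).
create_valohai_eks_user() {
  local user="valohai-eks-user"
  local policy_file="${1:-vh-eks-user-policy.json}"
  local managed policy_arn

  aws iam create-user --user-name "$user"

  # Attach the two AWS managed policies.
  for managed in AmazonEC2ContainerRegistryFullAccess AmazonEKSServicePolicy; do
    aws iam attach-user-policy --user-name "$user" \
      --policy-arn "arn:aws:iam::aws:policy/${managed}"
  done

  # Create the custom policy and attach it using the ARN that AWS returns.
  policy_arn=$(aws iam create-policy --policy-name VH_EKS_USER \
    --policy-document "file://${policy_file}" \
    --query 'Policy.Arn' --output text)
  aws iam attach-user-policy --user-name "$user" --policy-arn "$policy_arn"

  # Prints the access key ID and secret access key once; store them safely.
  aws iam create-access-key --user-name "$user"
}
```

The secret access key is only shown when the key is created, so capture the output of the last command right away.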

### Alternative: Other Container Registries

You can use standard Docker login (username/password) credentials for:

* Azure Container Registry
* GitLab Container Registry
* Artifactory
* Docker Hub
* Other registries

Create a separate account for Valohai to push to your repository.

## Step 6: Collect Information

Gather the following information to send to Valohai:

**Cluster details:**

Find these on your cluster's page in EKS:

* Cluster name: `____________`
* AWS region: `____________`
* API server endpoint: `____________`
* Cluster ARN: `____________`
* Certificate authority (cluster-certificate-data): `____________`

**Service account:**

* `valohai-deployment` service account token: `____________`

**Networking:**

* Gateway name: `____________`
* Gateway namespace: `____________`
* Base URL scheme (`https://` or `http://`): `____________`

**Container registry:**

ECR:

* ECR name/URL: `____________`
  * Example: `accountid.dkr.ecr.eu-west-1.amazonaws.com`
  * Find this when creating a new repository in ECR

IAM credentials:

* `valohai-eks-user` access key ID: `____________`
* `valohai-eks-user` secret access key: `____________`
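Most of the cluster and networking values in the checklist above can also be read from the command line. The sketch below assumes the AWS CLI and `kubectl` are configured for the cluster; `collect_cluster_details` and `gateway_protocols` are hypothetical helper names.

```shell
# Print cluster name, ARN, API server endpoint, and certificate authority data.
# $1: cluster name, $2: AWS region
collect_cluster_details() {
  local cluster="$1" region="$2"
  aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.{name: name, arn: arn, endpoint: endpoint, certificateAuthorityData: certificateAuthority.data}' \
    --output json
}

# Print the listener protocols (for example HTTP or HTTPS) of a Gateway.
# $1: Gateway name, $2: Gateway namespace
gateway_protocols() {
  local name="$1" ns="$2"
  kubectl get gateway "$name" -n "$ns" -o jsonpath='{.spec.listeners[*].protocol}'
}
```

For example, `collect_cluster_details my-cluster eu-west-1` prints the cluster fields as JSON, and an `HTTPS` listener in the output of `gateway_protocols` suggests the base URL scheme is `https://`.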

Send the collected information to your Valohai contact at **<support@valohai.com>** using your organization's secure communication method.

Your Valohai contact will complete the configuration on the platform side.

## Next Steps

After Valohai confirms the setup:

**1. Test deployments**

* Create a deployment in Valohai
* Verify it deploys to your EKS cluster
* Check that the endpoint is accessible

**2. Configure access controls**

* Review security groups for the load balancer
* Configure authentication for endpoints if needed
* Set up network policies in Kubernetes

**3. Monitor deployments**

* Set up CloudWatch monitoring for your deployments
* Configure alerts for deployment health
* Review resource usage and costs

## Troubleshooting

### Cannot access cluster API

**Verify API endpoint:**

```shell
aws eks describe-cluster --name CLUSTER_NAME --query "cluster.endpoint"
```

**Check security group:**

Ensure the cluster security group allows access from Valohai IPs (`34.248.245.191/32`, `63.34.156.112/32`).
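If the cluster uses the public API endpoint, access may also be restricted by the endpoint's CIDR allow list (`publicAccessCidrs`). The sketch below checks such a list for the Valohai IPs. It is deliberately simple: `valohai_ips_allowed` is a hypothetical helper name, and it only recognizes `0.0.0.0/0` or exact `/32` entries, not wider ranges that happen to contain the addresses.

```shell
# Check whether a list of CIDRs allows the Valohai IPs.
# $1: whitespace-separated CIDRs, for example the output of:
#   aws eks describe-cluster --name CLUSTER_NAME \
#     --query "cluster.resourcesVpcConfig.publicAccessCidrs" --output text
valohai_ips_allowed() {
  local cidrs ip
  cidrs=" $(echo "$1" | tr '\t\n' '  ') "
  # An open endpoint allows every address.
  case "$cidrs" in *" 0.0.0.0/0 "*) return 0 ;; esac
  for ip in 34.248.245.191/32 63.34.156.112/32; do
    case "$cidrs" in
      *" $ip "*) ;;
      *) echo "missing: $ip"; return 1 ;;
    esac
  done
}
```

A `missing:` line means the address has to be added to the allow list, or that it is covered by a wider range this sketch cannot detect.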

### Service account token invalid

**Regenerate token:**

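Delete and recreate the secret. The example uses the `default` namespace. If your service account lives elsewhere, change `namespace:` in the manifest and add `-n <namespace>` to the `kubectl` commands: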

```shell
kubectl delete secret valohai-deployment-token
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: valohai-deployment-token
  namespace: default
  annotations:
    kubernetes.io/service-account.name: valohai-deployment
EOF
```

**Get new token:**

```shell
kubectl get secret valohai-deployment-token -o jsonpath='{.data.token}' | base64 --decode
```

### HTTPRoutes not working

**Check that HTTPRoutes were created:**

```shell
kubectl get httproutes -n <valohai-namespace>
```

**Check Gateway status:**

```shell
kubectl describe gateway <gateway-name> -n <gateway-namespace>
```

**Verify the Gateway controller is running** by checking the controller's pods according to your chosen controller's documentation.
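Instead of reading the full `describe` output, you can query the status conditions directly. This is a sketch under stated assumptions: `gateway_programmed` and `httproute_accepted` are hypothetical helper names, and it assumes a Gateway API version that reports the `Programmed` condition on Gateways and the `Accepted` condition on HTTPRoute parents.

```shell
# Succeeds if the Gateway reports Programmed=True.
# $1: Gateway name, $2: Gateway namespace
gateway_programmed() {
  local name="$1" ns="$2" status
  status=$(kubectl get gateway "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}')
  [ "$status" = "True" ]
}

# Succeeds if at least one parent Gateway has accepted the HTTPRoute.
# $1: HTTPRoute name, $2: HTTPRoute namespace
httproute_accepted() {
  local name="$1" ns="$2"
  kubectl get httproute "$name" -n "$ns" \
    -o jsonpath='{.status.parents[*].conditions[?(@.type=="Accepted")].status}' \
    | grep -q "True"
}
```

If `gateway_programmed` fails, look at the controller first. If only `httproute_accepted` fails, the route's parent reference or the Gateway's allowed namespaces are the more likely cause.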

### Cannot push to ECR

**Test ECR access:**

```shell
aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.REGION.amazonaws.com
```

**Verify IAM permissions:**

Ensure `valohai-eks-user` has `AmazonEC2ContainerRegistryFullAccess`.
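One way to check the attachment from the command line is sketched below. `user_has_policy` is a hypothetical helper name, and it only looks at policies attached directly to the user, not those inherited through groups.

```shell
# Succeeds if the named managed policy is attached directly to the IAM user.
# $1: IAM user name, $2: policy name
user_has_policy() {
  local user="$1" policy="$2"
  aws iam list-attached-user-policies --user-name "$user" \
    --query 'AttachedPolicies[].PolicyName' --output text \
    | tr '\t' '\n' | grep -qx "$policy"
}
```

For example, `user_has_policy valohai-eks-user AmazonEC2ContainerRegistryFullAccess || echo "policy not attached"`.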

### Deployments fail

**Check deployment logs:**

```shell
kubectl logs -l app=your-deployment -n default
```

**Check pod events:**

```shell
kubectl describe pod POD_NAME -n default
```

**Common issues:**

* Image pull errors (check ECR permissions)
* Resource limits too low
* Network policies blocking traffic
* Service ports misconfigured
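Image pull errors are usually the quickest of these to rule out. The following sketch lists pods whose containers are waiting with `ImagePullBackOff` or `ErrImagePull`; `pods_with_pull_errors` is a hypothetical helper name.

```shell
# List pods in a namespace that are stuck pulling their image.
# $1: namespace (defaults to "default")
pods_with_pull_errors() {
  local ns="${1:-default}"
  # One line per pod: name, a tab, then the waiting reasons of its containers.
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /ImagePullBackOff|ErrImagePull/ {print $1}'
}
```

If it prints any pod names, `kubectl get events -n default --sort-by=.lastTimestamp` usually shows the registry error behind them.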

## Getting Help

**Valohai Support:** <support@valohai.com>

**Include in support requests:**

* EKS cluster name and region
* Kubernetes version
* Service account token status
* kubectl version
* Error messages from deployments or Kubernetes events
* Gateway and HTTPRoute configurations
