# Hybrid Deployment

Deploy Valohai workers and storage to your AWS account while Valohai manages the application layer at app.valohai.com.

### What Gets Deployed

<table><thead><tr><th width="132.70703125">Category</th><th width="258.609375">Resource</th><th>Purpose</th></tr></thead><tbody><tr><td><strong>Networking</strong></td><td>VPC with subnets</td><td>Spans multiple availability zones to keep workers highly available</td></tr><tr><td></td><td>Internet Gateway</td><td>Enables outbound internet access for workers</td></tr><tr><td></td><td>Route tables</td><td>Manages network routing within the VPC</td></tr><tr><td></td><td>Security groups</td><td>Two groups: one for the queue instance, one for workers</td></tr><tr><td></td><td>Elastic IP</td><td>Static IP address for the queue instance</td></tr><tr><td><strong>Compute</strong></td><td>EC2 instance (<code>valohai-queue</code>)</td><td>Manages the job queue and scheduling</td></tr><tr><td></td><td>Autoscaling groups</td><td>Dynamically scales worker instances based on job demand</td></tr><tr><td></td><td>Launch templates</td><td>Defines worker instance configuration</td></tr><tr><td><strong>Storage</strong></td><td>S3 bucket</td><td>Stores execution logs, artifacts, and code snapshots</td></tr><tr><td><strong>IAM &#x26; Security</strong></td><td><code>ValohaiMaster</code> role</td><td>Allows Valohai to manage autoscaling</td></tr><tr><td></td><td><code>ValohaiWorkerRole</code> role</td><td>Worker instance permissions</td></tr><tr><td></td><td><code>ValohaiQueueRole</code> role</td><td>Queue instance permissions</td></tr><tr><td></td><td><code>ValohaiS3MultipartRole</code> role</td><td>Handles large file uploads to S3</td></tr><tr><td></td><td>AWS Secrets Manager secret</td><td>Stores the Redis password securely</td></tr></tbody></table>

### What Valohai Accesses

<table><thead><tr><th width="205.50390625">Access Pattern</th><th>Details</th></tr></thead><tbody><tr><td><strong>Queue communication</strong></td><td>Valohai connects to your <code>valohai-queue</code> instance via Redis over TLS (port 63790)</td></tr><tr><td><strong>Worker isolation</strong></td><td>Valohai never directly accesses worker instances that download your data</td></tr><tr><td><strong>Job execution</strong></td><td>All ML jobs run on your EC2 instances within your VPC</td></tr><tr><td><strong>Data residency</strong></td><td>All training data and generated artifacts stay in your S3 bucket; nothing leaves your AWS account</td></tr></tbody></table>

## Prerequisites

Before starting, gather this information from Valohai:

**Required from Valohai:**

* `AssumeRoleARN` - ARN of the Valohai user that will manage your resources
* `QueueAddress` - DNS name for your queue (e.g., `yourcompany.vqueue.net`)

**Required from your AWS account:**

* AWS account with admin access
* EC2 key pair for SSH access to instances (create in AWS Console if needed)
* Region selected (consider GPU availability and data location)

**Contact <support@valohai.com>** to receive your `AssumeRoleARN` and `QueueAddress` before proceeding.
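Before deploying, it can help to sanity-check the two values you received. The sketch below assumes `AssumeRoleARN` is an IAM user ARN (as in the Terraform example in this guide) and that `QueueAddress` is a plain DNS name. The function names are illustrative and not part of any Valohai tooling.

```shell
# Hypothetical helpers: basic format checks for the values received from Valohai.

# Returns 0 if the argument looks like an IAM user ARN with a 12-digit account ID.
is_iam_user_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws:iam::[0-9]{12}:user/[A-Za-z0-9+=,.@_/-]+$'
}

# Returns 0 if the argument looks like a DNS name (no scheme, port, or path).
is_dns_name() {
  printf '%s\n' "$1" | grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}$'
}

# Example usage with placeholder values:
# is_iam_user_arn "arn:aws:iam::123456789012:user/valohai-customer-yourcompany" || echo "Unexpected ARN format"
# is_dns_name "yourcompany.vqueue.net" || echo "Unexpected queue address format"
```

These checks only catch copy-paste mistakes such as a stray `https://` prefix; they do not confirm that the values are the ones Valohai issued for your organization.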

## Installation Methods

Choose your preferred deployment method:

### Terraform (Recommended)

Use Terraform for infrastructure as code with version control and repeatability.

**Requirements:**

* Terraform installed
* AWS CLI configured with appropriate credentials
* `AssumeRoleARN` and `QueueAddress` from Valohai

**Steps:**

Clone the Valohai Terraform repository:

```shell
git clone https://github.com/valohai/aws-hybrid-workers-terraform.git
cd aws-hybrid-workers-terraform
```

Copy the `terraform.tfvars` file and fill in your values:

```hcl
# Required values from Valohai
valohai_assume_user = "arn:aws:iam::635691382966:user/valohai-customer-yourcompany"
queue_address       = "yourcompany.vqueue.net"

# Your AWS settings
aws_region          = "us-east-1"
key_pair_name       = "your-key-pair-name"

# Optional: Customize resource names
# resource_suffix   = "production"
```

Initialize and apply:

```shell
terraform init
terraform plan
terraform apply
```

Review the planned changes. Type `yes` to create the resources.

After deployment completes, Terraform will output the `ValohaiMaster` role ARN. Send this to your Valohai contact to complete setup.

**Repository:** [github.com/valohai/aws-hybrid-workers-terraform](https://github.com/valohai/aws-hybrid-workers-terraform)

### CloudFormation

Deploy using AWS CloudFormation templates for a quick setup.

**Requirements:**

* AWS CLI installed and configured
* `AssumeRoleARN` and `QueueAddress` from Valohai
* EC2 key pair created

**Steps:**

Deploy the IAM stack first:

```shell
# `deploy` reads the template from a local path, so download it first
curl -fsSLO https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/iam.yml

aws cloudformation deploy \
  --template-file iam.yml \
  --stack-name ValohaiIAM \
  --parameter-overrides AssumeRoleARN=<your-assume-role-arn> \
  --capabilities CAPABILITY_NAMED_IAM
```

Note the `ValohaiMasterRoleArn` from the stack outputs:

```shell
aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text
```

Deploy the main stack:

```shell
# `deploy` reads the template from a local path, so download it first
curl -fsSLO https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/aws-hybrid-workers.yml

aws cloudformation deploy \
  --template-file aws-hybrid-workers.yml \
  --stack-name Valohai \
  --parameter-overrides \
    KeyPair=<your-key-pair> \
    QueueAddress=<your-queue-address> \
    ValohaiMasterRoleArn=<role-arn-from-previous-step>
```

After deployment, send the `ValohaiMasterRoleArn` to your Valohai contact.

**Templates:** [github.com/valohai/aws-hybrid-workers-cloudformation](https://github.com/valohai/aws-hybrid-workers-cloudformation)

### Manual Setup

> **Need full control?** If you can't use CloudFormation or Terraform, or have specific customization requirements, follow the [Manual Setup Guide](/installation-and-setup/aws/hybrid-manual.md).

The manual guide walks through creating each resource individually via the AWS Console or CLI.

### Network Configuration <a href="#network-configuration" id="network-configuration"></a>

Here's what gets deployed with default settings.

#### VPC and Subnets <a href="#vpc-and-subnets" id="vpc-and-subnets"></a>

<table><thead><tr><th width="195.3125">Component</th><th>Configuration</th></tr></thead><tbody><tr><td><strong>VPC CIDR</strong></td><td>10.0.0.0/16</td></tr><tr><td><strong>Subnets</strong></td><td>One subnet per availability zone in your region</td></tr><tr><td><strong>Internet Gateway</strong></td><td>Enabled for outbound connectivity</td></tr></tbody></table>

#### Security Groups <a href="#security-groups" id="security-groups"></a>

| Security Group         | Direction | Port  | Source/Destination | Purpose                                                    |
| ---------------------- | --------- | ----- | ------------------ | ---------------------------------------------------------- |
| **valohai-sg-workers** | Inbound   | None  | -                  | No ports open by default (add SSH for debugging if needed) |
|                        | Outbound  | All   | 0.0.0.0/0          | Workers pull Docker images and access S3                   |
| **valohai-sg-queue**   | Inbound   | 80    | 0.0.0.0/0          | Let's Encrypt certificate renewal                          |
|                        | Inbound   | 63790 | 34.248.245.191/32  | app.valohai.com access to queue                            |
|                        | Inbound   | 63790 | valohai-sg-workers | Worker access to Redis queue                               |
|                        | Outbound  | All   | 0.0.0.0/0          | General outbound connectivity                              |
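
To compare a deployed environment against the rules in this table, you can read the inbound rules back with the AWS CLI. This is a minimal sketch, assuming the security groups keep their default names and that your CLI is pointed at the deployment region. `show_inbound_rules` is a hypothetical helper name.

```shell
# Print the inbound rules (as JSON) of a security group looked up by name.
show_inbound_rules() {
  aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=$1" \
    --query 'SecurityGroups[].IpPermissions[]' \
    --output json
}

# Example usage (requires configured AWS CLI credentials):
# show_inbound_rules valohai-sg-queue
# show_inbound_rules valohai-sg-workers
```

With default settings, the queue group should list ports 80 and 63790, and the workers group should return an empty list.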

#### Customizing the VPC <a href="#customizing-the-vpc" id="customizing-the-vpc"></a>

<table><thead><tr><th width="189.90234375">Method</th><th>Support</th></tr></thead><tbody><tr><td><strong>Terraform</strong></td><td>Set <code>use_existing_vpc = true</code> and provide VPC/subnet IDs in <code>terraform.tfvars</code></td></tr><tr><td><strong>CloudFormation</strong></td><td>Not supported; use manual setup</td></tr><tr><td><strong>Manual setup</strong></td><td>Follow the Manual Setup Guide with your existing VPC</td></tr></tbody></table>

### IAM Roles and Permissions <a href="#iam-roles-and-permissions" id="iam-roles-and-permissions"></a>

<table><thead><tr><th width="213.6953125">Role</th><th width="190.0390625">Attached To</th><th>Key Permissions</th></tr></thead><tbody><tr><td><strong>ValohaiMaster</strong></td><td>Valohai service</td><td>• Describe and create EC2 instances and launch templates<br>• Manage autoscaling groups<br>• Access Secrets Manager for Redis password<br>• Full access to Valohai S3 bucket</td></tr><tr><td><strong>ValohaiWorkerRole</strong></td><td>Worker EC2 instances</td><td>• Set instance protection on itself<br>• Describe its own instance metadata<br>• <strong>Customizable:</strong> Add policies for your resources (S3 buckets, databases, etc.)</td></tr><tr><td><strong>ValohaiQueueRole</strong></td><td>Queue instance</td><td>• Read secrets tagged with <code>valohai:1</code> from Secrets Manager</td></tr><tr><td><strong>ValohaiS3MultipartRole</strong></td><td>User uploads via UI</td><td>• Multipart upload operations to Valohai S3 bucket (for files >5GB)</td></tr></tbody></table>

### Queue Instance <a href="#queue-instance" id="queue-instance"></a>

The `valohai-queue` instance manages job scheduling and runs Redis for the job queue.

<table><thead><tr><th width="203.015625">Specification</th><th>Value</th></tr></thead><tbody><tr><td><strong>Instance type</strong></td><td>t3.medium (2 vCPUs, 4GB RAM)</td></tr><tr><td><strong>Operating system</strong></td><td>Ubuntu 20.04 LTS</td></tr><tr><td><strong>Networking</strong></td><td>Elastic IP attached for stable addressing</td></tr><tr><td><strong>Services</strong></td><td>Redis on port 63790 with TLS</td></tr></tbody></table>

#### What It Does <a href="#what-it-does" id="what-it-does"></a>

* Stores the job queue and short-term logs
* Receives job submissions from app.valohai.com
* Serves jobs to workers, which pull them from the queue
* Handles Let's Encrypt certificate renewal

> 💡 *SSH access using your EC2 key pair is available for debugging but not required for normal operation.*

### S3 Bucket <a href="#s3-bucket" id="s3-bucket"></a>

<table><thead><tr><th width="204.41015625">Property</th><th>Details</th></tr></thead><tbody><tr><td><strong>Naming convention</strong></td><td><code>valohai-data-&#x3C;AWS-ACCOUNT-ID></code></td></tr><tr><td><strong>CORS configuration</strong></td><td>Applied automatically to allow uploads from app.valohai.com</td></tr></tbody></table>

#### What It Stores <a href="#what-it-stores" id="what-it-stores"></a>

* Git repository snapshots (for reproducibility)
* Execution logs (moved from Redis after job completion)
* Input datasets (if uploaded via Valohai)
* Output artifacts (models, visualizations, processed data)

#### Access <a href="#access" id="access"></a>

<table><thead><tr><th width="205.84375">Principal</th><th>Permission Level</th></tr></thead><tbody><tr><td>Workers</td><td>Read/write via ValohaiWorkerRole</td></tr><tr><td>Valohai service</td><td>Full access via ValohaiMaster role</td></tr><tr><td>Your AWS account</td><td>Direct access with your AWS credentials</td></tr></tbody></table>
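
To locate the bucket from the command line, the naming convention above can be combined with your account ID. This is a sketch under the assumption that the default name was not customized during deployment. `valohai_bucket_name` is a hypothetical helper.

```shell
# Build the default bucket name from a 12-digit AWS account ID.
valohai_bucket_name() {
  printf 'valohai-data-%s\n' "$1"
}

# Example usage (requires configured AWS CLI credentials):
# account_id=$(aws sts get-caller-identity --query Account --output text)
# aws s3api head-bucket --bucket "$(valohai_bucket_name "$account_id")"
```

`head-bucket` exits with a non-zero status if the bucket does not exist or your credentials cannot access it.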

***

### Worker Autoscaling <a href="#worker-autoscaling" id="worker-autoscaling"></a>

Workers are EC2 instances that execute your ML jobs. They scale automatically based on demand.

#### How It Works <a href="#how-it-works" id="how-it-works"></a>

<table><thead><tr><th width="193.02734375">Step</th><th>Action</th></tr></thead><tbody><tr><td>1. Job submission</td><td>User creates execution in app.valohai.com</td></tr><tr><td>2. Queue</td><td>Valohai submits job to your Redis queue</td></tr><tr><td>3. Scale up</td><td>Valohai launches appropriate worker instance</td></tr><tr><td>4. Execution</td><td>Worker pulls job, executes code, uploads results to S3</td></tr><tr><td>5. Scale down</td><td>After a default 15-minute grace period, idle workers terminate</td></tr></tbody></table>

#### Default Configuration <a href="#default-configuration" id="default-configuration"></a>

<table><thead><tr><th width="213.0859375">Setting</th><th>Value</th></tr></thead><tbody><tr><td><strong>Launch method</strong></td><td>Autoscaling groups</td></tr><tr><td><strong>Instance types</strong></td><td>Configured per environment in Valohai</td></tr><tr><td><strong>Spot instances</strong></td><td>Supported for cost savings</td></tr><tr><td><strong>Grace period</strong></td><td>15 minutes (configurable)</td></tr></tbody></table>

> 💡 *Contact* [*support@valohai.com*](mailto:support@valohai.com) *to customize instance types, spot/on-demand mix, or scaling behavior.*
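
To watch scaling from your side, you can list the autoscaling groups in the deployment region. The names of the groups Valohai creates are not documented on this page, so this sketch lists every group in the region rather than filtering by name. `list_worker_groups` is a hypothetical helper.

```shell
# Show each autoscaling group with its desired capacity and current instance count.
list_worker_groups() {
  aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].[AutoScalingGroupName,DesiredCapacity,length(Instances)]' \
    --output table
}

# Example usage (requires configured AWS CLI credentials):
# list_worker_groups
```

A desired capacity above zero while no jobs are running usually means workers are still inside the grace period.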

## Next Steps

After deployment is complete:

**1. Send information to Valohai**

Share the `ValohaiMaster` role ARN with your Valohai contact:

```shell
# For Terraform
terraform output valohai_master_role_arn

# For CloudFormation
aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text
```

**2. Valohai configures your organization**

Your Valohai contact will:

* Link your AWS resources to your Valohai organization
* Create execution environments (e.g., `aws-eu-west-1-t3.medium`)
* Configure available instance types

**3. Verify the setup**

Once Valohai confirms setup is complete:

* Log in to app.valohai.com
* Create a test project
* Run a simple execution to verify workers launch correctly

**4. Configure additional resources**

Consider setting up:

* [Additional S3 buckets](https://docs.valohai.com/topic-guides/core-concepts/data-stores/) for your datasets
* [Private Docker registries](/docker-in-valohai/private-docker-registries.md) for custom images
* [Environment splitting](/installation-and-setup/advanced-topics/environment-splitting.md) for dev/prod separation
* [Shared cache](/installation-and-setup/advanced-topics/shared-cache.md) for large datasets

## Common Configurations

### Using Existing S3 Buckets

To access your existing S3 buckets from Valohai workers, add a policy to `ValohaiWorkerRole` granting access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Then add the bucket as a data store in Valohai's web UI.
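
One way to attach the policy is as an inline policy through the AWS CLI. The sketch below generates the same JSON for a given bucket name and attaches it. The function names and the policy name are arbitrary choices for this example.

```shell
# Emit the worker access policy for the bucket named in $1.
worker_bucket_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$1", "arn:aws:s3:::$1/*"]
    }
  ]
}
EOF
}

# Attach the policy to ValohaiWorkerRole as an inline policy.
attach_worker_bucket_policy() {
  aws iam put-role-policy \
    --role-name ValohaiWorkerRole \
    --policy-name "valohai-worker-s3-$1" \
    --policy-document "$(worker_bucket_policy "$1")"
}

# Example usage (requires IAM permissions in your account):
# attach_worker_bucket_policy your-bucket-name
```

Drop `s3:PutObject` from the action list if workers should only read from the bucket.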

### Using Private Docker Registries

To use Amazon ECR or other private registries:

**For ECR:**

1. Add `AmazonEC2ContainerRegistryReadOnly` policy to `ValohaiWorkerRole`
2. Configure the registry in Valohai's web UI with your ECR URL
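
Step 1 can be done from the CLI by attaching the AWS managed policy to the role. A minimal sketch, assuming the role kept its default name; `attach_ecr_read_only` is a hypothetical helper.

```shell
# Attach the AWS managed read-only ECR policy to the worker role.
attach_ecr_read_only() {
  aws iam attach-role-policy \
    --role-name ValohaiWorkerRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
}

# Example usage (requires IAM permissions in your account):
# attach_ecr_read_only
```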

**For other registries:**

1. Create credentials in AWS Secrets Manager
2. Grant `ValohaiWorkerRole` access to the secret
3. Contact <support@valohai.com> to configure authentication

### Accessing RDS Databases

To connect workers to RDS databases:

1. Add the workers' security group (`valohai-sg-workers`) as an allowed source in the RDS security group's inbound rules
2. Ensure workers are in the same VPC as RDS, or set up VPC peering
3. Use the database endpoint in your execution code
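
The inbound rule from step 1 can be added with the AWS CLI. This sketch assumes you already know both security group IDs and the database port (for example, 5432 for PostgreSQL or 3306 for MySQL). `allow_workers_to_rds` is a hypothetical helper.

```shell
# Allow TCP traffic from the workers' security group to the RDS security group.
# $1 = RDS security group ID, $2 = workers security group ID, $3 = database port
allow_workers_to_rds() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port "$3" \
    --source-group "$2"
}

# Example usage with placeholder IDs:
# allow_workers_to_rds sg-0123456789abcdef0 sg-0fedcba9876543210 5432
```

Referencing the workers' group as the source, rather than a CIDR range, keeps the rule valid as worker instances come and go.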

### GPU Instances

GPU instances work out of the box. Valohai will:

* Use instance types with GPUs (e.g., `p3.2xlarge`, `g4dn.xlarge`)
* Install NVIDIA drivers automatically
* Make GPUs available to your containers

Ensure you have sufficient GPU quota in your AWS account.
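
Current quotas can be read with the Service Quotas CLI. The sketch below filters on the text `On-Demand` in the quota name, which is an assumption about how AWS names these quotas; adjust the filter if your output looks different. `list_on_demand_quotas` is a hypothetical helper.

```shell
# List EC2 On-Demand quotas (name and current value) for the active region.
list_on_demand_quotas() {
  aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaName,Value]" \
    --output table
}

# Example usage (requires configured AWS CLI credentials):
# list_on_demand_quotas
```

Look for the rows covering the GPU instance families you plan to use, such as P or G instances, and request an increase through the Service Quotas console if the value is too low.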

## Troubleshooting

### Workers not launching

**Check the queue instance:**

```shell
ssh -i your-key.pem ubuntu@<queue-ip>
sudo systemctl status valohai-queue
```

**Verify network access:**

* Ensure port 63790 is open from app.valohai.com
* Check security group rules

### Jobs stuck in queue

**Check IAM permissions:**

* Verify Valohai can assume the `ValohaiMaster` role
* Check CloudTrail logs for permission errors
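
Recent role assumptions can be pulled from CloudTrail with the CLI. This is a sketch, assuming CloudTrail event history is available in the region you query. `recent_assume_role_events` is a hypothetical helper.

```shell
# Show the most recent AssumeRole events recorded by CloudTrail.
recent_assume_role_events() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
    --max-items 20 \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output table
}

# Example usage (requires configured AWS CLI credentials):
# recent_assume_role_events
```

If no events appear after a job was submitted, the trust relationship on `ValohaiMaster` is a likely place to look.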

**Check instance limits:**

* Verify your AWS account has sufficient EC2 instance quota
* Check for any AWS service limits

### Cannot upload to S3

**Verify CORS configuration:**

```shell
aws s3api get-bucket-cors --bucket valohai-data-<account-id>
```

**Check IAM roles:**

* Workers need write access to the bucket
* Valohai needs read access for the web UI

### Costs higher than expected

**Review instance types:**

* Use spot instances for non-critical workloads
* Adjust grace period to scale down faster
* Use smaller instance types for lighter workloads

**Monitor S3 usage:**

* Set lifecycle policies to archive old artifacts
* Delete unnecessary execution outputs
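
A lifecycle rule can be applied from the CLI. The sketch below moves objects to the `STANDARD_IA` storage class after a chosen number of days; the rule ID, storage class, and empty prefix (whole bucket) are assumptions for this example. Colder classes such as Glacier make objects unavailable for direct reads, so check with Valohai support before archiving data that executions may still need.

```shell
# Emit a lifecycle configuration that transitions all objects after $1 days.
lifecycle_rules() {
  cat <<EOF
{
  "Rules": [
    {
      "ID": "transition-old-artifacts",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": $1, "StorageClass": "STANDARD_IA" } ]
    }
  ]
}
EOF
}

# Apply the configuration to bucket $1 with a transition after $2 days.
apply_lifecycle() {
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$1" \
    --lifecycle-configuration "$(lifecycle_rules "$2")"
}

# Example usage (replaces any existing lifecycle configuration on the bucket):
# apply_lifecycle valohai-data-<account-id> 90
```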

## Getting Help

**Valohai Support:** <support@valohai.com>

**AWS Resources:**

* EC2 instance limits: Check AWS Service Quotas console
* CloudFormation issues: Check stack events in AWS Console
* Terraform issues: Review terraform.tfstate and run `terraform plan`
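
For CloudFormation, failed stack events can also be listed from the CLI instead of the console. A minimal sketch, assuming the stack names used in this guide (`ValohaiIAM` and `Valohai`); `failed_stack_events` is a hypothetical helper.

```shell
# List events whose status contains FAILED, with the reason reported by CloudFormation.
failed_stack_events() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[LogicalResourceId,ResourceStatus,ResourceStatusReason]" \
    --output table
}

# Example usage (requires configured AWS CLI credentials):
# failed_stack_events ValohaiIAM
# failed_stack_events Valohai
```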

