Hybrid Deployment

Deploy Valohai's compute and data layer to your AWS account using CloudFormation or Terraform

Deploy Valohai workers and storage to your AWS account while Valohai manages the application layer at app.valohai.com.

What Gets Deployed

| Category | Resource | Purpose |
| --- | --- | --- |
| Networking | VPC with subnets | Spans multiple availability zones for worker high availability |
| Networking | Internet Gateway | Enables outbound internet access for workers |
| Networking | Route tables | Manage network routing within the VPC |
| Networking | Security groups | Two groups: one for the queue instance, one for workers |
| Networking | Elastic IP | Static IP address for the queue instance |
| Compute | EC2 instance (valohai-queue) | Manages the job queue and scheduling |
| Compute | Autoscaling groups | Dynamically scale worker instances based on job demand |
| Compute | Launch templates | Define worker instance configuration |
| Storage | S3 bucket | Stores execution logs, artifacts, and code snapshots |
| IAM & Security | ValohaiMaster role | Allows Valohai to manage autoscaling |
| IAM & Security | ValohaiWorkerRole | Worker instance permissions |
| IAM & Security | ValohaiQueueRole | Queue instance permissions |
| IAM & Security | ValohaiS3MultipartRole | Handles large file uploads to S3 |
| IAM & Security | AWS Secrets Manager secret | Stores the Redis password securely |

What Valohai Accesses

| Access Pattern | Details |
| --- | --- |
| Queue communication | Valohai connects to your valohai-queue instance via Redis over TLS (port 63790) |
| Worker isolation | Valohai never directly accesses the worker instances that download your data |
| Job execution | All ML jobs run on your EC2 instances within your VPC |
| Data residency | All training data and generated artifacts stay in your S3 bucket; nothing leaves your AWS account |

Prerequisites

Before starting, gather the following information:

Required from Valohai:

  • AssumeRoleARN - ARN of the Valohai user that will manage your resources

  • QueueAddress - DNS name for your queue (e.g., yourcompany.vqueue.net)

Required from your AWS account:

  • AWS account with admin access

  • EC2 key pair for SSH access to instances (create one in the AWS Console, or with the CLI as sketched below, if needed)

  • Region selected (consider GPU availability and data location)

Contact [email protected] to receive your AssumeRoleARN and QueueAddress before proceeding.

Installation Methods

Choose your preferred deployment method:

Terraform

Use Terraform for infrastructure as code with version control and repeatability.

Requirements:

  • Terraform installed

  • AWS CLI configured with appropriate credentials

  • AssumeRoleARN and QueueAddress from Valohai

Steps:

Clone the Valohai Terraform repository:

git clone https://github.com/valohai/aws-hybrid-workers-terraform.git
cd aws-hybrid-workers-terraform

Copy the terraform.tfvars file and fill in your values:

# Required values from Valohai
valohai_assume_user = "arn:aws:iam::635691382966:user/valohai-customer-yourcompany"
queue_address       = "yourcompany.vqueue.net"

# Your AWS settings
aws_region          = "us-east-1"
key_pair_name       = "your-key-pair-name"

# Optional: Customize resource names
# resource_suffix   = "production"

Initialize and apply:

terraform init
terraform plan
terraform apply

Review the planned changes. Type yes to create the resources.

After deployment completes, Terraform will output the ValohaiMaster role ARN. Send this to your Valohai contact to complete setup.

Repository: github.com/valohai/aws-hybrid-workers-terraform

CloudFormation

Deploy using AWS CloudFormation templates for a quick setup.

Requirements:

  • AWS CLI installed and configured

  • AssumeRoleARN and QueueAddress from Valohai

  • EC2 key pair created

Steps:

Deploy the IAM stack first:

# aws cloudformation deploy expects a local template, so download it first
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/iam.yml

aws cloudformation deploy \
  --template-file iam.yml \
  --stack-name ValohaiIAM \
  --parameter-overrides AssumeRoleARN=<your-assume-role-arn> \
  --capabilities CAPABILITY_NAMED_IAM

Note the ValohaiMasterRoleArn from the stack outputs:

aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text

Deploy the main stack:

# Download the main template, then deploy it
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/aws-hybrid-workers.yml

aws cloudformation deploy \
  --template-file aws-hybrid-workers.yml \
  --stack-name Valohai \
  --parameter-overrides \
    KeyPair=<your-key-pair> \
    QueueAddress=<your-queue-address> \
    ValohaiMasterRoleArn=<role-arn-from-previous-step>

After deployment, send the ValohaiMasterRoleArn to your Valohai contact.

Templates: github.com/valohai/aws-hybrid-workers-cloudformation

Manual Setup

Need full control? If you can't use CloudFormation or Terraform, or have specific customization requirements, follow the Manual Setup Guide.

The manual guide walks through creating each resource individually via the AWS Console or CLI.

Network Configuration

Here's what gets deployed with default settings.

VPC and Subnets

| Component | Configuration |
| --- | --- |
| VPC CIDR | 10.0.0.0/16 |
| Subnets | One subnet per availability zone in your region |
| Internet Gateway | Enabled for outbound connectivity |

Security Groups

| Security Group | Direction | Port | Source/Destination | Purpose |
| --- | --- | --- | --- | --- |
| valohai-sg-workers | Inbound | None | - | No ports open by default (add SSH for debugging if needed; see the example below) |
| valohai-sg-workers | Outbound | All | 0.0.0.0/0 | Workers pull Docker images and access S3 |
| valohai-sg-queue | Inbound | 80 | 0.0.0.0/0 | Let's Encrypt certificate renewal |
| valohai-sg-queue | Inbound | 63790 | 34.248.245.191/32 | app.valohai.com access to the queue |
| valohai-sg-queue | Inbound | 63790 | valohai-sg-workers | Worker access to the Redis queue |
| valohai-sg-queue | Outbound | All | 0.0.0.0/0 | General outbound connectivity |
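If you do add SSH access to workers for debugging, one option is an inbound rule on valohai-sg-workers scoped to your own IP; the group ID and CIDR below are placeholders.

# Allow SSH to workers from a single IP (replace the group ID and CIDR)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.10/32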

Customizing the VPC

| Method | Support |
| --- | --- |
| Terraform | Set use_existing_vpc = true and provide your VPC/subnet IDs in terraform.tfvars (see the sketch below) |
| CloudFormation | Not supported; use manual setup |
| Manual setup | Follow the Manual Setup Guide with your existing VPC |
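A minimal terraform.tfvars sketch for reusing an existing VPC; use_existing_vpc comes from the table above, but the other variable names here are illustrative, so confirm the exact names in the repository's variables.tf.

# Reuse an existing VPC instead of creating a new one
use_existing_vpc = true

# Illustrative variable names -- check variables.tf for the exact ones
vpc_id     = "vpc-0123456789abcdef0"
subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]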

IAM Roles and Permissions

| Role | Attached To | Key Permissions |
| --- | --- | --- |
| ValohaiMaster | Valohai service | • Describe and create EC2 instances and launch templates • Manage autoscaling groups • Access Secrets Manager for the Redis password • Full access to the Valohai S3 bucket |
| ValohaiWorkerRole | Worker EC2 instances | • Set instance protection on itself • Describe its own instance metadata • Customizable: add policies for your own resources (S3 buckets, databases, etc.) |
| ValohaiQueueRole | Queue instance | • Read secrets tagged with valohai:1 from Secrets Manager |
| ValohaiS3MultipartRole | User uploads via the UI | • Multipart upload operations to the Valohai S3 bucket (for files over 5 GB) |

Queue Instance

The valohai-queue instance manages job scheduling and runs Redis for the job queue.

| Specification | Value |
| --- | --- |
| Instance type | t3.medium (2 vCPUs, 4 GB RAM) |
| Operating system | Ubuntu 20.04 LTS |
| Networking | Elastic IP attached for stable addressing |
| Services | Redis on port 63790 with TLS |

What It Does

  • Stores job queue and short-term logs

  • Receives job submissions from app.valohai.com

  • Workers pull jobs from this queue

  • Handles Let's Encrypt certificate renewal

💡 SSH access using your EC2 key pair is available for debugging but not required for normal operation.

S3 Bucket

| Property | Details |
| --- | --- |
| Naming convention | valohai-data-<AWS-ACCOUNT-ID> |
| CORS configuration | Applied automatically to allow uploads from app.valohai.com |

What It Stores

  • Git repository snapshots (for reproducibility)

  • Execution logs (moved from Redis after job completion)

  • Input datasets (if uploaded via Valohai)

  • Output artifacts (models, visualizations, processed data)

Access

| Principal | Permission Level |
| --- | --- |
| Workers | Read/write via ValohaiWorkerRole |
| Valohai service | Full access via the ValohaiMaster role |
| Your AWS account | Direct access with your own AWS credentials |


Worker Autoscaling

Workers are EC2 instances that execute your ML jobs. They scale automatically based on demand.

How It Works

| Step | Action |
| --- | --- |
| 1. Job submission | User creates an execution in app.valohai.com |
| 2. Queue | Valohai submits the job to your Redis queue |
| 3. Scale up | Valohai launches an appropriate worker instance |
| 4. Execution | Worker pulls the job, runs your code, and uploads results to S3 |
| 5. Scale down | After a default 15-minute grace period, idle workers terminate |

Default Configuration

| Setting | Value |
| --- | --- |
| Launch method | Autoscaling groups |
| Instance types | Configured per environment in Valohai |
| Spot instances | Supported for cost savings |
| Grace period | 15 minutes (configurable) |

💡 Contact [email protected] to customize instance types, spot/on-demand mix, or scaling behavior.

Next Steps

After deployment is complete:

1. Send information to Valohai

Share the ValohaiMaster role ARN with your Valohai contact:

# For Terraform
terraform output valohai_master_role_arn

# For CloudFormation
aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text

2. Valohai configures your organization

Your Valohai contact will:

  • Link your AWS resources to your Valohai organization

  • Create execution environments (e.g., aws-eu-west-1-t3.medium)

  • Configure available instance types

3. Verify the setup

Once Valohai confirms setup is complete:

  • Log in to app.valohai.com

  • Create a test project

  • Run a simple execution to verify workers launch correctly
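As a sketch of that last check, assuming you have valohai-cli installed and a project whose valohai.yaml defines a step (hypothetically named train here):

# Log in, link the current directory to your project, and run a step
vh login
vh project link
vh execution run train --adhoc

The --adhoc flag sends your local code directly, which is handy for a first smoke test before wiring up a Git repository.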

4. Configure additional resources

Consider setting up the common configurations described below: access to existing S3 buckets, private Docker registries, RDS databases, and GPU instances.

Common Configurations

Using Existing S3 Buckets

To access your existing S3 buckets from Valohai workers:

Add a policy to ValohaiWorkerRole granting access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
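One way to attach this is as an inline policy with the AWS CLI, assuming the JSON above is saved as bucket-access.json; the policy name is arbitrary.

# Attach the policy above to the worker role
aws iam put-role-policy \
  --role-name ValohaiWorkerRole \
  --policy-name valohai-worker-extra-bucket-access \
  --policy-document file://bucket-access.json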

Then add the bucket as a data store in Valohai's web UI.

Using Private Docker Registries

To use Amazon ECR or other private registries:

For ECR:

  1. Add the AmazonEC2ContainerRegistryReadOnly policy to ValohaiWorkerRole (a CLI example follows these steps)

  2. Configure the registry in Valohai's web UI with your ECR URL
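For step 1, a minimal CLI example using the AWS-managed policy:

# Attach the read-only ECR policy to the worker role
aws iam attach-role-policy \
  --role-name ValohaiWorkerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly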

For other registries:

  1. Create credentials in AWS Secrets Manager

  2. Grant ValohaiWorkerRole access to the secret

  3. Contact [email protected] to configure authentication

Accessing RDS Databases

To connect workers to RDS databases:

  1. Add the workers' security group (valohai-sg-workers) to the RDS security group's inbound rules (a CLI example follows these steps)

  2. Ensure workers are in the same VPC as RDS, or set up VPC peering

  3. Use the database endpoint in your execution code
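For step 1, a hedged CLI example; both group IDs are placeholders, and port 5432 assumes PostgreSQL, so adjust it for your engine.

# Allow inbound traffic from valohai-sg-workers to the RDS security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaabbbbccccdddd \
  --protocol tcp \
  --port 5432 \
  --source-group sg-0eeeeffff00001111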

GPU Instances

GPU instances work out of the box. Valohai will:

  • Use instance types with GPUs (e.g., p3.2xlarge, g4dn.xlarge)

  • Install NVIDIA drivers automatically

  • Make GPUs available to your containers

Ensure you have sufficient GPU quota in your AWS account.
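One way to check is via the Service Quotas CLI; the filter below simply narrows the output to the On-Demand P and G instance quotas.

# List the EC2 quotas that govern GPU instance families
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'P instances') || contains(QuotaName, 'G and VT instances')].[QuotaName,Value]" \
  --output table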

Troubleshooting

Workers not launching

Check the queue instance:

ssh -i your-key.pem ubuntu@<queue-ip>
sudo systemctl status valohai-queue

Verify network access:

  • Ensure port 63790 is open from app.valohai.com

  • Check security group rules (the commands below sketch a quick way to verify)
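Two quick checks, sketched under the assumption that you are SSHed into the queue instance for the first command and have AWS CLI credentials for the second:

# On the queue instance: confirm Redis answers TLS on port 63790
openssl s_client -connect localhost:63790 </dev/null

# From your workstation: inspect the queue security group's inbound rules
aws ec2 describe-security-groups \
  --filters Name=group-name,Values=valohai-sg-queue \
  --query 'SecurityGroups[0].IpPermissions'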

Jobs stuck in queue

Check IAM permissions:

  • Verify Valohai can assume the ValohaiMaster role

  • Check CloudTrail logs for permission errors (an example query follows this list)
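A hedged example of querying CloudTrail for recent AssumeRole activity; failed attempts show an errorCode field in the event records.

# Show the 20 most recent AssumeRole events
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --max-results 20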

Check instance limits:

  • Verify your AWS account has sufficient EC2 instance quota

  • Check for any AWS service limits

Cannot upload to S3

Verify CORS configuration:

aws s3api get-bucket-cors --bucket valohai-data-<account-id>

Check IAM roles:

  • Workers need write access to the bucket

  • Valohai needs read access for the web UI

Costs higher than expected

Review instance types:

  • Use spot instances for non-critical workloads

  • Adjust grace period to scale down faster

  • Use smaller instance types for lighter workloads

Monitor S3 usage:

  • Set lifecycle policies to archive or expire old artifacts (see the example below)

  • Delete unnecessary execution outputs
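As a sketch, you could expire old outputs with a lifecycle rule. The prefix and 90-day window below are illustrative, so match them to where your executions write outputs and how long you need them. Save the rule as lifecycle.json:

{
  "Rules": [
    {
      "ID": "expire-old-execution-outputs",
      "Status": "Enabled",
      "Filter": { "Prefix": "data/" },
      "Expiration": { "Days": 90 }
    }
  ]
}

Then apply it to the bucket:

aws s3api put-bucket-lifecycle-configuration \
  --bucket valohai-data-<account-id> \
  --lifecycle-configuration file://lifecycle.json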

Getting Help

Valohai Support: [email protected]

AWS Resources:

  • EC2 instance limits: Check AWS Service Quotas console

  • CloudFormation issues: Check stack events in AWS Console

  • Terraform issues: Review terraform.tfstate and run terraform plan
