Hybrid Deployment

Deploy Valohai's compute and data layer to your AWS account using CloudFormation or Terraform

Deploy Valohai workers and storage to your AWS account while Valohai manages the application layer at app.valohai.com.

What Gets Deployed

| Category | Resource | Purpose |
| --- | --- | --- |
| Networking | VPC with subnets | Spans multiple availability zones for worker high availability |
| Networking | Internet Gateway | Enables outbound internet access for workers |
| Networking | Route tables | Manage network routing within the VPC |
| Networking | Security groups | Two groups: one for the queue instance, one for workers |
| Networking | Elastic IP | Static IP address for the queue instance |
| Compute | EC2 instance (valohai-queue) | Manages the job queue and scheduling |
| Compute | Autoscaling groups | Dynamically scale worker instances based on job demand |
| Compute | Launch templates | Define worker instance configuration |
| Storage | S3 bucket | Stores execution logs, artifacts, and code snapshots |
| IAM & Security | ValohaiMaster role | Allows Valohai to manage autoscaling |
| IAM & Security | ValohaiWorkerRole | Worker instance permissions |
| IAM & Security | ValohaiQueueRole | Queue instance permissions |
| IAM & Security | ValohaiS3MultipartRole | Handles large file uploads to S3 |
| IAM & Security | AWS Secrets Manager secret | Stores the Redis password securely |

What Valohai Accesses

| Access Pattern | Details |
| --- | --- |
| Queue communication | Valohai connects to your valohai-queue instance via Redis over TLS (port 63790) |
| Worker isolation | Valohai never directly accesses the worker instances that download your data |
| Job execution | All ML jobs run on your EC2 instances within your VPC |
| Data residency | All training data and generated artifacts stay in your S3 bucket; nothing leaves your AWS account |

Prerequisites

Before starting, gather the following information:

Required from Valohai:

  • AssumeRoleARN - ARN of the Valohai user that will manage your resources

  • QueueAddress - DNS name for your queue (e.g., yourcompany.vqueue.net)

Required from your AWS account:

  • AWS account with admin access

  • EC2 key pair for SSH access to instances (create one in the AWS Console, or with the CLI as sketched below, if needed)

  • Region selected (consider GPU availability and data location)

Contact [email protected] to receive your AssumeRoleARN and QueueAddress before proceeding.

Installation Methods

Choose your preferred deployment method:

Terraform

Use Terraform for infrastructure as code with version control and repeatability.

Requirements:

  • Terraform installed

  • AWS CLI configured with appropriate credentials

  • AssumeRoleARN and QueueAddress from Valohai

Steps:

Clone the Valohai Terraform repository:

git clone https://github.com/valohai/aws-hybrid-workers-terraform.git
cd aws-hybrid-workers-terraform

Copy the terraform.tfvars file and fill in your values:

# Required values from Valohai
valohai_assume_user = "arn:aws:iam::635691382966:user/valohai-customer-yourcompany"
queue_address       = "yourcompany.vqueue.net"

# Your AWS settings
aws_region          = "us-east-1"
key_pair_name       = "your-key-pair-name"

# Optional: Customize resource names
# resource_suffix   = "production"

Initialize and apply:

terraform init
terraform plan
terraform apply

Review the planned changes. Type yes to create the resources.

After deployment completes, Terraform will output the ValohaiMaster role ARN. Send this to your Valohai contact to complete setup.

Repository: github.com/valohai/aws-hybrid-workers-terraform

CloudFormation

Deploy using AWS CloudFormation templates for a quick setup.

Requirements:

  • AWS CLI installed and configured

  • AssumeRoleARN and QueueAddress from Valohai

  • EC2 key pair created

Steps:

Deploy the IAM stack first:

# aws cloudformation deploy expects a local template, so download it first
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/iam.yml

aws cloudformation deploy \
  --template-file iam.yml \
  --stack-name ValohaiIAM \
  --parameter-overrides AssumeRoleARN=<your-assume-role-arn> \
  --capabilities CAPABILITY_NAMED_IAM

Note the ValohaiMasterRoleArn from the stack outputs:

aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text

Deploy the main stack:

# Download the main template, then deploy it
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/aws-hybrid-workers.yml

aws cloudformation deploy \
  --template-file aws-hybrid-workers.yml \
  --stack-name Valohai \
  --parameter-overrides \
    KeyPair=<your-key-pair> \
    QueueAddress=<your-queue-address> \
    ValohaiMasterRoleArn=<role-arn-from-previous-step>

After deployment, send the ValohaiMasterRoleArn to your Valohai contact.

Templates: github.com/valohai/aws-hybrid-workers-cloudformation

Manual Setup

Need full control? If you can't use CloudFormation or Terraform, or have specific customization requirements, follow the Manual Setup Guide.

The manual guide walks through creating each resource individually via the AWS Console or CLI.

Network Configuration

Here's what gets deployed with default settings.

VPC and Subnets

| Component | Configuration |
| --- | --- |
| VPC CIDR | 10.0.0.0/16 |
| Subnets | One subnet per availability zone in your region |
| Internet Gateway | Enabled for outbound connectivity |

Security Groups

| Security Group | Direction | Port | Source/Destination | Purpose |
| --- | --- | --- | --- | --- |
| valohai-sg-workers | Inbound | None | - | No ports open by default (add SSH for debugging if needed; see the example below) |
| valohai-sg-workers | Outbound | All | 0.0.0.0/0 | Workers pull Docker images and access S3 |
| valohai-sg-queue | Inbound | 80 | 0.0.0.0/0 | Let's Encrypt certificate renewal |
| valohai-sg-queue | Inbound | 63790 | 34.248.245.191/32 | app.valohai.com access to the queue |
| valohai-sg-queue | Inbound | 63790 | valohai-sg-workers | Worker access to the Redis queue |
| valohai-sg-queue | Outbound | All | 0.0.0.0/0 | General outbound connectivity |
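If you do add SSH access to workers for debugging, one option is an inbound rule on valohai-sg-workers scoped to your own IP; the group ID and CIDR below are placeholders.

# Allow SSH to workers from a single IP (replace the group ID and CIDR)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.10/32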

Customizing the VPC

| Method | Support |
| --- | --- |
| Terraform | Set use_existing_vpc = true and provide your VPC/subnet IDs in terraform.tfvars (see the sketch below) |
| CloudFormation | Not supported; use manual setup |
| Manual setup | Follow the Manual Setup Guide with your existing VPC |
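A minimal terraform.tfvars sketch for reusing an existing VPC; use_existing_vpc comes from the table above, but the other variable names here are illustrative, so confirm the exact names in the repository's variables.tf.

# Reuse an existing VPC instead of creating a new one
use_existing_vpc = true

# Illustrative variable names -- check variables.tf for the exact ones
vpc_id     = "vpc-0123456789abcdef0"
subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]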

IAM Roles and Permissions

| Role | Attached To | Key Permissions |
| --- | --- | --- |
| ValohaiMaster | Valohai service | • Describe and create EC2 instances and launch templates • Manage autoscaling groups • Access Secrets Manager for the Redis password • Full access to the Valohai S3 bucket |
| ValohaiWorkerRole | Worker EC2 instances | • Set instance protection on itself • Describe its own instance metadata • Customizable: add policies for your own resources (S3 buckets, databases, etc.) |
| ValohaiQueueRole | Queue instance | • Read secrets tagged with valohai:1 from Secrets Manager |
| ValohaiS3MultipartRole | User uploads via the UI | • Multipart upload operations to the Valohai S3 bucket (for files over 5 GB) |

Queue Instance

The valohai-queue instance manages job scheduling and runs Redis for the job queue.

| Specification | Value |
| --- | --- |
| Instance type | t3.medium (2 vCPUs, 4 GB RAM) |
| Operating system | Ubuntu 20.04 LTS |
| Networking | Elastic IP attached for stable addressing |
| Services | Redis on port 63790 with TLS |

What It Does

  • Stores job queue and short-term logs

  • Receives job submissions from app.valohai.com

  • Workers pull jobs from this queue

  • Handles Let's Encrypt certificate renewal

💡 SSH access using your EC2 key pair is available for debugging but not required for normal operation.

S3 Bucket

| Property | Details |
| --- | --- |
| Naming convention | valohai-data-<AWS-ACCOUNT-ID> |
| CORS configuration | Applied automatically to allow uploads from app.valohai.com |

What It Stores

  • Git repository snapshots (for reproducibility)

  • Execution logs (moved from Redis after job completion)

  • Input datasets (if uploaded via Valohai)

  • Output artifacts (models, visualizations, processed data)

Access

| Principal | Permission Level |
| --- | --- |
| Workers | Read/write via ValohaiWorkerRole |
| Valohai service | Full access via the ValohaiMaster role |
| Your AWS account | Direct access with your own AWS credentials |


Worker Autoscaling

Workers are EC2 instances that execute your ML jobs. They scale automatically based on demand.

How It Works

| Step | Action |
| --- | --- |
| 1. Job submission | User creates an execution in app.valohai.com |
| 2. Queue | Valohai submits the job to your Redis queue |
| 3. Scale up | Valohai launches an appropriate worker instance |
| 4. Execution | Worker pulls the job, runs your code, and uploads results to S3 |
| 5. Scale down | After a default 15-minute grace period, idle workers terminate |

Default Configuration

| Setting | Value |
| --- | --- |
| Launch method | Autoscaling groups |
| Instance types | Configured per environment in Valohai |
| Spot instances | Supported for cost savings |
| Grace period | 15 minutes (configurable) |

💡 Contact [email protected] to customize instance types, spot/on-demand mix, or scaling behavior.

Next Steps

After deployment is complete:

1. Send information to Valohai

Share the ValohaiMaster role ARN with your Valohai contact:

# For Terraform
terraform output valohai_master_role_arn

# For CloudFormation
aws cloudformation describe-stacks \
  --stack-name ValohaiIAM \
  --query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
  --output text

2. Valohai configures your organization

Your Valohai contact will:

  • Link your AWS resources to your Valohai organization

  • Create execution environments (e.g., aws-eu-west-1-t3.medium)

  • Configure available instance types

3. Verify the setup

Once Valohai confirms setup is complete:

  • Log in to app.valohai.com

  • Create a test project

  • Run a simple execution to verify workers launch correctly
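As a sketch of that last check, assuming you have valohai-cli installed and a project whose valohai.yaml defines a step (hypothetically named train here):

# Log in, link the current directory to your project, and run a step
vh login
vh project link
vh execution run train --adhoc

The --adhoc flag sends your local code directly, which is handy for a first smoke test before wiring up a Git repository.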

4. Configure additional resources

Consider setting up the common configurations described below: access to existing S3 buckets, private Docker registries, RDS databases, and GPU instances.

Common Configurations

Using Existing S3 Buckets

To access your existing S3 buckets from Valohai workers:

Add a policy to ValohaiWorkerRole granting access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
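One way to attach this is as an inline policy with the AWS CLI, assuming the JSON above is saved as bucket-access.json; the policy name is arbitrary.

# Attach the policy above to the worker role
aws iam put-role-policy \
  --role-name ValohaiWorkerRole \
  --policy-name valohai-worker-extra-bucket-access \
  --policy-document file://bucket-access.json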

Then add the bucket as a data store in Valohai's web UI.

Using Private Docker Registries

To use Amazon ECR or other private registries:

For ECR:

  1. Add the AmazonEC2ContainerRegistryReadOnly policy to ValohaiWorkerRole (a CLI example follows these steps)

  2. Configure the registry in Valohai's web UI with your ECR URL
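For step 1, a minimal CLI example using the AWS-managed policy:

# Attach the read-only ECR policy to the worker role
aws iam attach-role-policy \
  --role-name ValohaiWorkerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly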

For other registries:

  1. Create credentials in AWS Secrets Manager

  2. Grant ValohaiWorkerRole access to the secret

  3. Contact [email protected] to configure authentication

Accessing RDS Databases

To connect workers to RDS databases:

  1. Add the workers' security group (valohai-sg-workers) to the RDS security group's inbound rules (a CLI example follows these steps)

  2. Ensure workers are in the same VPC as RDS, or set up VPC peering

  3. Use the database endpoint in your execution code
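For step 1, a hedged CLI example; both group IDs are placeholders, and port 5432 assumes PostgreSQL, so adjust it for your engine.

# Allow inbound traffic from valohai-sg-workers to the RDS security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaabbbbccccdddd \
  --protocol tcp \
  --port 5432 \
  --source-group sg-0eeeeffff00001111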

GPU Instances

GPU instances work out of the box. Valohai will:

  • Use instance types with GPUs (e.g., p3.2xlarge, g4dn.xlarge)

  • Install NVIDIA drivers automatically

  • Make GPUs available to your containers

Ensure you have sufficient GPU quota in your AWS account.
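One way to check is via the Service Quotas CLI; the filter below simply narrows the output to the On-Demand P and G instance quotas.

# List the EC2 quotas that govern GPU instance families
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'P instances') || contains(QuotaName, 'G and VT instances')].[QuotaName,Value]" \
  --output table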

Troubleshooting

Workers not launching

Check the queue instance:

ssh -i your-key.pem ubuntu@<queue-ip>
sudo systemctl status valohai-queue

Verify network access:

  • Ensure port 63790 is open from app.valohai.com

  • Check security group rules (the commands below sketch a quick way to verify)
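Two quick checks, sketched under the assumption that you are SSHed into the queue instance for the first command and have AWS CLI credentials for the second:

# On the queue instance: confirm Redis answers TLS on port 63790
openssl s_client -connect localhost:63790 </dev/null

# From your workstation: inspect the queue security group's inbound rules
aws ec2 describe-security-groups \
  --filters Name=group-name,Values=valohai-sg-queue \
  --query 'SecurityGroups[0].IpPermissions'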

Jobs stuck in queue

Check IAM permissions:

  • Verify Valohai can assume the ValohaiMaster role

  • Check CloudTrail logs for permission errors (an example query follows this list)
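A hedged example of querying CloudTrail for recent AssumeRole activity; failed attempts show an errorCode field in the event records.

# Show the 20 most recent AssumeRole events
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --max-results 20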

Check instance limits:

  • Verify your AWS account has sufficient EC2 instance quota

  • Check for any AWS service limits

Cannot upload to S3

Verify CORS configuration:

aws s3api get-bucket-cors --bucket valohai-data-<account-id>

Check IAM roles:

  • Workers need write access to the bucket

  • Valohai needs read access for the web UI

Costs higher than expected

Review instance types:

  • Use spot instances for non-critical workloads

  • Adjust grace period to scale down faster

  • Use smaller instance types for lighter workloads

Monitor S3 usage:

  • Set lifecycle policies to archive or expire old artifacts (see the example below)

  • Delete unnecessary execution outputs
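As a sketch, you could expire old outputs with a lifecycle rule. The prefix and 90-day window below are illustrative, so match them to where your executions write outputs and how long you need them. Save the rule as lifecycle.json:

{
  "Rules": [
    {
      "ID": "expire-old-execution-outputs",
      "Status": "Enabled",
      "Filter": { "Prefix": "data/" },
      "Expiration": { "Days": 90 }
    }
  ]
}

Then apply it to the bucket:

aws s3api put-bucket-lifecycle-configuration \
  --bucket valohai-data-<account-id> \
  --lifecycle-configuration file://lifecycle.json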

Getting Help

Valohai Support: [email protected]

AWS Resources:

  • EC2 instance limits: Check AWS Service Quotas console

  • CloudFormation issues: Check stack events in AWS Console

  • Terraform issues: Review terraform.tfstate and run terraform plan
