Hybrid Deployment
Deploy Valohai's compute and data layer to your AWS account using CloudFormation or Terraform
Deploy Valohai workers and storage to your AWS account while Valohai manages the application layer at app.valohai.com.
What Gets Deployed
Networking
VPC with subnets
Spans multiple availability zones for worker high availability
Internet Gateway
Enables outbound internet access for workers
Route tables
Manages network routing within VPC
Security groups
Two groups: one for queue instance, one for workers
Elastic IP
Static IP address for queue instance
Compute
EC2 instance (valohai-queue)
Manages job queue and scheduling
Autoscaling groups
Dynamically scales worker instances based on job demand
Launch templates
Defines worker instance configuration
Storage
S3 bucket
Stores execution logs, artifacts, and code snapshots
IAM & Security
ValohaiMaster role
Allows Valohai to manage autoscaling
ValohaiWorkerRole role
Worker instance permissions
ValohaiQueueRole role
Queue instance permissions
ValohaiS3MultipartRole role
Handles large file uploads to S3
AWS Secrets Manager secret
Stores Redis password securely
What Valohai Accesses
Queue communication
Valohai connects to your valohai-queue instance via Redis over TLS (port 63790)
Worker isolation
Valohai never directly accesses worker instances that download your data
Job execution
All ML jobs run on your EC2 instances within your VPC
Data residency
All training data and generated artifacts stay in your S3 bucket, nothing leaves your AWS account
Prerequisites
Before starting, gather this information from Valohai:
Required from Valohai:
AssumeRoleARN - ARN of the Valohai user that will manage your resources
QueueAddress - DNS name for your queue (e.g., yourcompany.vqueue.net)
Required from your AWS account:
AWS account with admin access
EC2 key pair for SSH access to instances (create in AWS Console if needed)
Region selected (consider GPU availability and data location)
Contact [email protected] to receive your AssumeRoleARN and QueueAddress before proceeding.
Installation Methods
Choose your preferred deployment method:
Terraform (Recommended)
Use Terraform for infrastructure as code with version control and repeatability.
Requirements:
Terraform installed
AWS CLI configured with appropriate credentials
AssumeRoleARN and QueueAddress from Valohai
Steps:
Clone the Valohai Terraform repository:
git clone https://github.com/valohai/aws-hybrid-workers-terraform.git
cd aws-hybrid-workers-terraform
Copy the terraform.tfvars file and input your values:
# Required values from Valohai
valohai_assume_user = "arn:aws:iam::635691382966:user/valohai-customer-yourcompany"
queue_address = "yourcompany.vqueue.net"
# Your AWS settings
aws_region = "us-east-1"
key_pair_name = "your-key-pair-name"
# Optional: Customize resource names
# resource_suffix = "production"
Initialize and apply:
terraform init
terraform plan
terraform apply
Review the planned changes. Type yes to create the resources.
After deployment completes, Terraform will output the ValohaiMaster role ARN. Send this to your Valohai contact to complete setup.
Repository: github.com/valohai/aws-hybrid-workers-terraform
CloudFormation
Deploy using AWS CloudFormation templates for a quick setup.
Requirements:
AWS CLI installed and configured
AssumeRoleARN and QueueAddress from Valohai
EC2 key pair created
Steps:
Download the IAM template and deploy it first (aws cloudformation deploy requires a local template file):
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/iam.yml
aws cloudformation deploy \
--template-file iam.yml \
--stack-name ValohaiIAM \
--parameter-overrides AssumeRoleARN=<your-assume-role-arn> \
--capabilities CAPABILITY_NAMED_IAM
Note the ValohaiMasterRoleArn from the stack outputs:
aws cloudformation describe-stacks \
--stack-name ValohaiIAM \
--query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
--output text
Deploy the main stack:
curl -O https://valohai-cfn-templates-public.s3.eu-west-1.amazonaws.com/aws-hybrid-workers.yml
aws cloudformation deploy \
--template-file aws-hybrid-workers.yml \
--stack-name Valohai \
--parameter-overrides \
KeyPair=<your-key-pair> \
QueueAddress=<your-queue-address> \
ValohaiMasterRoleArn=<role-arn-from-previous-step>
After deployment, send the ValohaiMasterRoleArn to your Valohai contact.
Templates: github.com/valohai/aws-hybrid-workers-cloudformation
Manual Setup
Need full control? If you can't use CloudFormation or Terraform, or have specific customization requirements, follow the Manual Setup Guide.
The manual guide walks through creating each resource individually via the AWS Console or CLI.
Network Configuration
Here's what gets deployed with default settings.
VPC and Subnets
VPC CIDR
10.0.0.0/16
Subnets
One subnet per availability zone in your region
Internet Gateway
Enabled for outbound connectivity
Security Groups
valohai-sg-workers
Inbound: none. No ports open by default (add SSH for debugging if needed).
Outbound: all ports to 0.0.0.0/0. Workers pull Docker images and access S3.
valohai-sg-queue
Inbound: port 80 from 0.0.0.0/0 (Let's Encrypt certificate renewal)
Inbound: port 63790 from 34.248.245.191/32 (app.valohai.com access to the queue)
Inbound: port 63790 from valohai-sg-workers (worker access to the Redis queue)
Outbound: all ports to 0.0.0.0/0 (general outbound connectivity)
Customizing the VPC
Terraform
Set use_existing_vpc = true and provide VPC/subnet IDs in terraform.tfvars
CloudFormation
Not supported, use manual setup
Manual setup
Follow the Manual Setup Guide with your existing VPC
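With Terraform, reusing an existing VPC comes down to a few terraform.tfvars entries. A sketch: use_existing_vpc comes from the repository, but the exact variable names for the VPC and subnet IDs are illustrative, so check variables.tf in the repository for the real ones.

```hcl
# terraform.tfvars: reuse an existing VPC instead of creating one.
use_existing_vpc = true

# Illustrative variable names; confirm against variables.tf in the repo.
vpc_id     = "vpc-0123456789abcdef0"
subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
```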
IAM Roles and Permissions
ValohaiMaster (assumed by the Valohai service)
• Describe and create EC2 instances and launch templates
• Manage autoscaling groups
• Access Secrets Manager for the Redis password
• Full access to the Valohai S3 bucket
ValohaiWorkerRole (worker EC2 instances)
• Set instance protection on itself
• Describe its own instance metadata
• Customizable: add policies for your own resources (S3 buckets, databases, etc.)
ValohaiQueueRole (queue instance)
• Read secrets tagged with valohai:1 from Secrets Manager
ValohaiS3MultipartRole (user uploads via the UI)
• Multipart upload operations to the Valohai S3 bucket (for files over 5 GB)
Queue Instance
The valohai-queue instance manages job scheduling and runs Redis for the job queue.
Instance type
t3.medium (2 vCPUs, 4GB RAM)
Operating system
Ubuntu 20.04 LTS
Networking
Elastic IP attached for stable addressing
Services
Redis on port 63790 with TLS
What It Does
Stores job queue and short-term logs
Receives job submissions from app.valohai.com
Workers pull jobs from this queue
Handles Let's Encrypt certificate renewal
💡 SSH access using your EC2 key pair is available for debugging but not required for normal operation.
S3 Bucket
Naming convention
valohai-data-<AWS-ACCOUNT-ID>
CORS configuration
Applied automatically to allow uploads from app.valohai.com
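The applied CORS rules are roughly of this shape; this is a sketch of a typical S3 CORS document allowing browser uploads from app.valohai.com, and the exact methods and headers Valohai sets may differ:

```json
[
  {
    "AllowedOrigins": ["https://app.valohai.com"],
    "AllowedMethods": ["GET", "PUT", "POST"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["ETag"]
  }
]
```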
What It Stores
Git repository snapshots (for reproducibility)
Execution logs (moved from Redis after job completion)
Input datasets (if uploaded via Valohai)
Output artifacts (models, visualizations, processed data)
Access
Workers
Read/write via ValohaiWorkerRole
Valohai service
Full access via ValohaiMaster role
Your AWS account
Direct access with your AWS credentials
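Combined with the naming convention above, the bucket name can be derived from your account ID. A minimal sketch (a placeholder account ID is used here, since the real lookup needs AWS credentials):

```shell
# Normally: ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ACCOUNT_ID="123456789012"  # placeholder for illustration
BUCKET="valohai-data-${ACCOUNT_ID}"
echo "$BUCKET"
```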
Worker Autoscaling
Workers are EC2 instances that execute your ML jobs. They scale automatically based on demand.
How It Works
1. Job submission
User creates execution in app.valohai.com
2. Queue
Valohai submits job to your Redis queue
3. Scale up
Valohai launches appropriate worker instance
4. Execution
Worker pulls job, executes code, uploads results to S3
5. Scale down
After a default 15-minute grace period, idle workers terminate
Default Configuration
Launch method
Autoscaling groups
Instance types
Configured per environment in Valohai
Spot instances
Supported for cost savings
Grace period
15 minutes (configurable)
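The cost impact of the grace period is easy to estimate: each scale-down cycle costs roughly the instance's hourly rate times the idle window. A sketch with an assumed hourly rate (illustrative, not a real quote):

```shell
RATE_PER_HOUR="3.06"   # assumed hourly rate for a GPU instance, illustrative
GRACE_MINUTES=15       # default grace period
# idle cost = rate * (grace minutes / 60)
IDLE_COST=$(awk -v r="$RATE_PER_HOUR" -v g="$GRACE_MINUTES" \
  'BEGIN { printf "%.3f", r * g / 60 }')
echo "idle cost per scale-down: \$${IDLE_COST}"
```

A shorter grace period saves this amount per cycle but gives up warm-instance reuse for back-to-back jobs.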
💡 Contact [email protected] to customize instance types, spot/on-demand mix, or scaling behavior.
Next Steps
After deployment is complete:
1. Send information to Valohai
Share the ValohaiMaster role ARN with your Valohai contact:
# For Terraform
terraform output valohai_master_role_arn
# For CloudFormation
aws cloudformation describe-stacks \
--stack-name ValohaiIAM \
--query 'Stacks[0].Outputs[?OutputKey==`ValohaiMasterRoleArn`].OutputValue' \
--output text
2. Valohai configures your organization
Your Valohai contact will:
Link your AWS resources to your Valohai organization
Create execution environments (e.g., aws-eu-west-1-t3.medium)
Configure available instance types
3. Verify the setup
Once Valohai confirms setup is complete:
Log in to app.valohai.com
Create a test project
Run a simple execution to verify workers launch correctly
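For the test execution, a minimal valohai.yaml step is enough; a sketch (the step name and image are illustrative):

```yaml
# valohai.yaml: a minimal test step (name and image are illustrative)
- step:
    name: hello
    image: python:3.10
    command: python -c "print('hello from hybrid workers')"
```

If this execution launches a worker and streams logs back to app.valohai.com, the queue, autoscaling, and S3 wiring are all working.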
4. Configure additional resources
Consider setting up:
Additional S3 buckets for your datasets
Private Docker registries for custom images
Environment splitting for dev/prod separation
Shared cache for large datasets
Common Configurations
Using Existing S3 Buckets
To access your existing S3 buckets from Valohai workers:
Add a policy to ValohaiWorkerRole granting access:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}
Then add the bucket as a data store in Valohai's web UI.
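The attach step can be scripted. A sketch that writes the policy to a file, sanity-checks the JSON, and shows the attach command (the inline policy name is illustrative; replace your-bucket-name with your bucket):

```shell
# Write the bucket policy to a file (replace your-bucket-name).
cat > extra-bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
EOF

# Sanity-check the JSON before sending it to AWS.
python3 -m json.tool extra-bucket-policy.json > /dev/null && echo "policy JSON ok"

# Attach as an inline policy (requires AWS credentials; policy name is illustrative):
# aws iam put-role-policy --role-name ValohaiWorkerRole \
#   --policy-name valohai-extra-bucket-access \
#   --policy-document file://extra-bucket-policy.json
```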
Using Private Docker Registries
To use Amazon ECR or other private registries:
For ECR:
Add the AmazonEC2ContainerRegistryReadOnly policy to ValohaiWorkerRole
Configure the registry in Valohai's web UI with your ECR URL
For other registries:
Create credentials in AWS Secrets Manager
Grant ValohaiWorkerRole access to the secret
Contact [email protected] to configure authentication
Accessing RDS Databases
To connect workers to RDS databases:
Add the workers' security group (valohai-sg-workers) to the RDS security group's inbound rules
Ensure workers are in the same VPC as RDS, or set up VPC peering
Use the database endpoint in your execution code
GPU Instances
GPU instances work out of the box. Valohai will:
Use instance types with GPUs (e.g., p3.2xlarge, g4dn.xlarge)
Install NVIDIA drivers automatically
Make GPUs available to your containers
Ensure you have sufficient GPU quota in your AWS account.
Troubleshooting
Workers not launching
Check the queue instance:
ssh -i your-key.pem ubuntu@<queue-ip>
sudo systemctl status valohai-queue
Verify network access:
Ensure port 63790 is open from app.valohai.com
Check security group rules
Jobs stuck in queue
Check IAM permissions:
Verify Valohai can assume the ValohaiMaster role
Check CloudTrail logs for permission errors
Check instance limits:
Verify your AWS account has sufficient EC2 instance quota
Check for any AWS service limits
Cannot upload to S3
Verify CORS configuration:
aws s3api get-bucket-cors --bucket valohai-data-<account-id>
Check IAM roles:
Workers need write access to the bucket
Valohai needs read access for the web UI
Costs higher than expected
Review instance types:
Use spot instances for non-critical workloads
Adjust grace period to scale down faster
Use smaller instance types for lighter workloads
Monitor S3 usage:
Set lifecycle policies to archive old artifacts
Delete unnecessary execution outputs
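A lifecycle configuration of roughly this shape can archive old artifacts automatically; this is a sketch (the prefix and day count are illustrative) that would be applied with aws s3api put-bucket-lifecycle-configuration:

```json
{
  "Rules": [
    {
      "ID": "archive-old-artifacts",
      "Status": "Enabled",
      "Filter": { "Prefix": "data/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```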
Getting Help
Valohai Support: [email protected]
AWS Resources:
EC2 instance limits: Check AWS Service Quotas console
CloudFormation issues: Check stack events in AWS Console
Terraform issues: Review terraform.tfstate and run terraform plan