This guide explains how to prepare your AWS account for a self-hosted Valohai installation, which AWS resources will be set up, and which access control permissions need to be configured.
A Valohai self-hosted model allows you to run all components of Valohai inside your own network. This means that your users won’t use app.valohai.com to manage their ML projects but a version of Valohai that’s hosted by you.
Updates to the platform are delivered through Docker images.
What will get deployed?
- Security Groups
valohai-sg-workers
: Security Group defining the inbound and outbound rules of the Valohai workers that execute all user-launched machine learning jobs.
valohai-sg-master
: Security Group that’s attached to the Valohai web application instance.
valohai-sg-database
: Attached to the Postgres database instance that stores application data and job information (but not content).
valohai-sg-queue
: Attached to the Redis instance that’s used as a job queue and as short-term storage for machine learning job logs.
EC2 instance
: Hosts the core Valohai web application, deployment image building and scaling services. End users access the web application hosted here.
RDS PostgreSQL database
: A relational database that contains user data and saves execution details such as which worker type was used, what commands were run, what Docker image was used, which inputs were used and what the launch configuration was.
ElastiCache Redis
: Stores information about the job queue and short-term execution logs so they can be shown in the web app and API in real time. Each job is connected to a queue. The workers fetch jobs from the Redis job queue based on their queue name (e.g. machines that belong to the queue t3.medium will fetch only jobs that are marked for that queue).
LoadBalancer
- IAM Roles
valohai-master-role
: A role that’s attached to the EC2 instance running the core Valohai web app. This role has permissions to:
- create and edit existing auto scaling groups
- launch and terminate EC2 instances for the machine learning jobs
- upload and download files from the default S3 storage
- access Valohai-related secrets from the Secrets Manager
valohai-worker-role
: A role attached to all the EC2 instances that run user-scheduled machine learning jobs. By default, this role only has permissions to set instance protection on itself and describe itself.
valohai-multipart-role
: Used by the web app to allow users to upload large files (over 5 GB) to S3 using the user interface.
S3 Bucket
: Valohai stores Git commit snapshots in S3 to maintain reproducibility. Worker instances download the user code archives from this storage. Real-time logs are moved to persistent storage after the target execution finishes.
- AWS Secrets Manager Secret to store the RDS password and Valohai configuration secrets
- AWS SSM Parameter Store to store configuration details
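The queue behaviour described for ElastiCache Redis above (each job belongs to a named queue, and a worker only fetches jobs from its own queue) can be sketched in a few lines of Python. This is an in-memory illustration only, not Valohai's implementation and not a Redis client; the class name `JobQueues`, the job IDs and the second queue name are hypothetical.

```python
from collections import defaultdict, deque
from typing import Optional


class JobQueues:
    """In-memory stand-in for the Redis job queue (illustration only)."""

    def __init__(self) -> None:
        # One FIFO queue per queue name, e.g. "t3.medium".
        self._queues = defaultdict(deque)

    def submit(self, queue_name: str, job_id: str) -> None:
        """Append a job to the queue it is marked for."""
        self._queues[queue_name].append(job_id)

    def fetch(self, worker_queue: str) -> Optional[str]:
        """Return the oldest job from the worker's own queue, or None."""
        queue = self._queues.get(worker_queue)
        return queue.popleft() if queue else None


queues = JobQueues()
queues.submit("t3.medium", "job-1")
queues.submit("p3.2xlarge", "job-2")
queues.submit("t3.medium", "job-3")
```

A worker that belongs to the `t3.medium` queue would call `queues.fetch("t3.medium")` and never see `job-2`, which is marked for a different queue.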
Deployment Templates
Contact support
You’ll need to get the required details and access permissions for the Valohai images from Valohai support before you can deploy a self-hosted version of Valohai.
AWS Cloud Development Kit (CDK)
You can deploy the self-hosted environment and its components using the Self-Hosted CDK scripts.
See details in the public GitHub repository
Terraform
You can deploy the self-hosted environment and its components using the Self-Hosted Terraform scripts.
See details in the public GitHub repository
Manual Guide
VPC
Valohai can be deployed either in your existing VPC, or in a new separate VPC.
Security Groups
valohai-sg-workers
- Inbound: Allow SSH connection for admins for debugging purposes
- Outbound: Block outbound access if ML jobs are not allowed to access the public internet.
valohai-sg-master
- Port: 22
- Source: IP of admin who will do the installation.
- Port: 80
- Source: valohai-sg-loadbalancer
valohai-sg-database
- Port: 5432
- Source: valohai-sg-master
valohai-sg-queue
- Port: 6379
- Source: valohai-sg-master, valohai-sg-workers
valohai-sg-loadbalancer
- Port: 443
- Source: 0.0.0.0/0 (all IPv4 addresses)
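To make the rules above easier to review, they can be written down as a small table and queried. The sketch below is only a reading aid under stated assumptions: `ADMIN_IP` is a placeholder for the administrator's address, the SSH rule for workers is assumed to use the default port 22, and `Rule` and `is_allowed` are hypothetical names, not part of any AWS API.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    group: str   # security group that receives the traffic
    port: int    # inbound TCP port
    source: str  # security group name, CIDR block, or placeholder


# Inbound rules transcribed from the list above.
INBOUND_RULES = [
    Rule("valohai-sg-workers", 22, "ADMIN_IP"),
    Rule("valohai-sg-master", 22, "ADMIN_IP"),
    Rule("valohai-sg-master", 80, "valohai-sg-loadbalancer"),
    Rule("valohai-sg-database", 5432, "valohai-sg-master"),
    Rule("valohai-sg-queue", 6379, "valohai-sg-master"),
    Rule("valohai-sg-queue", 6379, "valohai-sg-workers"),
    Rule("valohai-sg-loadbalancer", 443, "0.0.0.0/0"),
]


def is_allowed(source: str, group: str, port: int) -> bool:
    """True if the table has an inbound rule for this source, group and port."""
    return any(
        rule.group == group
        and rule.port == port
        and rule.source in (source, "0.0.0.0/0")
        for rule in INBOUND_RULES
    )
```

For example, `is_allowed("valohai-sg-workers", "valohai-sg-queue", 6379)` holds, while workers have no rule that lets them reach the database on port 5432.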
IAM
ValohaiWorker - IAM Role
Default role for all EC2 instances that Valohai launches for ML jobs. The policy below is the minimum requirement.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Action": "autoscaling:SetInstanceProtection",
      "Resource": "*"
    },
    {
      "Sid": "2",
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
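The `autoscaling:SetInstanceProtection` permission lets a worker protect itself from scale-in, presumably while it is busy with a job. The following sketch shows how such a call could look with boto3. It is an illustration under assumptions, not Valohai's worker code: the instance ID and Auto Scaling group name are hypothetical, and boto3 is assumed to be installed wherever `set_protection` is actually called.

```python
def protection_request(instance_id: str, group_name: str, protected: bool) -> dict:
    """Build the keyword arguments for the SetInstanceProtection API call."""
    return {
        "InstanceIds": [instance_id],
        "AutoScalingGroupName": group_name,
        "ProtectedFromScaleIn": protected,
    }


def set_protection(instance_id: str, group_name: str, protected: bool) -> None:
    """Send the request; needs boto3 and credentials from the worker role."""
    import boto3  # imported lazily so protection_request works without boto3

    client = boto3.client("autoscaling")
    client.set_instance_protection(
        **protection_request(instance_id, group_name, protected)
    )


# Hypothetical identifiers, for illustration only.
request = protection_request("i-0123456789abcdef0", "example-worker-group", True)
```

Keeping the request construction separate from the API call makes it possible to inspect or log the request without AWS access.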
ValohaiMaster - IAM User
Used to create and scale the EC2 resources for ML jobs launched by users. This user also has access to the default Valohai S3 bucket and can read secrets from AWS Secrets Manager that carry the tag valohai with the value 1.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "2",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVpcs",
        "ec2:DescribeKeyPairs",
        "ec2:DescribeImages",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceAttribute",
        "ec2:CreateTags",
        "ec2:DescribeInternetGateways",
        "ec2:DescribeRouteTables",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeScalingActivities"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowUpdatingSpotLaunchTemplates",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateLaunchTemplate",
        "ec2:CreateLaunchTemplateVersion",
        "ec2:ModifyLaunchTemplate",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:RebootInstances",
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:CreateOrUpdateTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:CreateAutoScalingGroup"
      ],
      "Resource": "*",
      "Condition": {
        "ForAllValues:StringEquals": {
          "aws:ResourceTag/valohai": "1"
        }
      }
    },
    {
      "Sid": "ServiceLinkedRole",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
    },
    {
      "Sid": "4",
      "Effect": "Allow",
      "Action": [
        "iam:PassRole",
        "iam:GetRole"
      ],
      "Resource": "arn:aws:iam::ACCOUNT-ID:role/ValohaiWorkerRole"
    },
    {
      "Sid": "0",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "secretsmanager:ResourceTag/valohai": "1"
        }
      }
    },
    {
      "Sid": "1",
      "Effect": "Allow",
      "Action": "secretsmanager:GetRandomPassword",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::your S3 bucket name",
        "arn:aws:s3:::your S3 bucket name/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:*",
        "arn:aws:logs:*:*:log-group:*:log-stream:",
        "arn:aws:logs:*:*:log-group:*:log-stream:*"
      ]
    }
  ]
}
The last statement block (the one granting the logs: actions) is only required for CloudWatch. If you are not using CloudWatch, you can remove that block completely.
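If the policy is kept in a file or generated by a script, the CloudWatch block can be dropped programmatically instead of by hand. The helper below is a minimal sketch: `without_cloudwatch` is a hypothetical name, and the `example` policy is a shortened stand-in for the full policy above, not a complete document.

```python
import json


def _actions(statement: dict) -> list:
    """Return the statement's actions as a list (Action may be a string)."""
    value = statement.get("Action", [])
    return [value] if isinstance(value, str) else list(value)


def without_cloudwatch(policy: dict) -> dict:
    """Return a copy of the policy without statements that only grant logs: actions."""
    kept = []
    for statement in policy["Statement"]:
        actions = _actions(statement)
        only_logs = bool(actions) and all(a.startswith("logs:") for a in actions)
        if not only_logs:
            kept.append(statement)
    return {**policy, "Statement": kept}


# Shortened stand-in for the ValohaiMaster policy (illustration only).
example = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": "secretsmanager:GetRandomPassword",
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:PutLogEvents"],
            "Resource": ["arn:aws:logs:*:*:log-group:*"],
        },
    ],
}

trimmed = without_cloudwatch(example)
print(json.dumps(trimmed, indent=2))
```

The function returns a new dictionary and leaves the input policy unchanged, so the same source document can serve deployments with and without CloudWatch.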
ValohaiMultiPartUploadRole - IAM Role
Used to upload files over 5 GB to the S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1503921756000",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:ListMultipartUploadParts",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::your S3 bucket name",
        "arn:aws:s3:::your S3 bucket name/*"
      ]
    }
  ]
}
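Files above 5 GB need multipart upload because a single S3 PUT request is limited to 5 GB. S3 also limits a multipart upload to 10,000 parts, with each part between 5 MiB and 5 GiB (the last part may be smaller). The sketch below shows the resulting arithmetic; `plan_parts` and the 64 MiB default part size are assumptions chosen for illustration, not values Valohai is known to use.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART = 5 * MIB    # S3 minimum part size (the last part may be smaller)
MAX_PART = 5 * GIB    # S3 maximum part size
MAX_PARTS = 10_000    # S3 maximum number of parts per upload


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, without floating point."""
    return -(-a // b)


def plan_parts(file_size: int, preferred_part: int = 64 * MIB) -> tuple:
    """Return (part_size, part_count) for a multipart upload of file_size bytes."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Grow the part size if the preferred size would need more than 10,000 parts.
    part_size = max(preferred_part, MIN_PART, ceil_div(file_size, MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("file is too large for a single multipart upload")
    return part_size, ceil_div(file_size, part_size)


part_size, part_count = plan_parts(6 * GIB)
```

With the default part size, a 6 GiB file is split into 96 parts of 64 MiB each; for much larger files the part size grows so that the part count stays within the limit.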
Core Valohai Resources
Your configuration will depend on your organization’s requirements. The list below describes the minimum configuration needed for Valohai.
EC2
- Name: valohai-roi
- Security group: valohai-sg-master
- OS: Ubuntu 22.04 LTS
- Instance: m5a.xlarge
- Storage: 32GB
S3
- Name: yourbucketname-valohai
- Block all public access.
RDS
- Name: valohai-psql
- Class: db.t2.large
- Security Group: valohai-sg-database
- Port: 5432
- Public Accessibility: No
- Engine Version: 14.2
ElastiCache (Redis)
- Name: valohai-queue
- Node type: cache.m3.xlarge
- Number of nodes: 1
- Engine version: 6.2
EC2 Load Balancer
- The Valohai web application is served at port 8000 on the EC2 instance.
- HTTP/2 Enabled.
DNS Name
Provide a DNS name that points to the load balancer. It is used for the web application (e.g., valohai.yourdomain.net).