AWS Redshift
Query AWS Redshift from Valohai executions and save snapshots for reproducible ML pipelines.
Overview
AWS Redshift is a cloud data warehouse that you can query directly from Valohai executions. This guide shows you how to:
Set up authentication (IAM roles or credentials)
Query Redshift from your code
Save snapshots for reproducibility
Prerequisites
Before you begin:
Existing Redshift cluster with data to query
Network access — Inbound rule allowing connections from Valohai workers' security group
Credentials or IAM role for authentication
Authentication: Choose Your Method
Decision Tree
Do your Valohai workers run on EC2 instances in your own AWS account? If yes, use IAM role authentication (Option 1, recommended). If not, or if you can't modify IAM policies, store username/password credentials as Valohai environment variables (Option 2).
Option 1: IAM Role Authentication (Recommended)
Use your EC2 instances' IAM identity to authenticate without storing credentials.
How It Works
Your AWS account has an IAM role for the EC2 instances used by Valohai (e.g., ValohaiWorkerRole)
Grant this role permission to access Redshift
Executions automatically authenticate using the machine's identity
No credentials are stored in Valohai
Benefits
No credentials to manage or rotate
More secure (no secrets in environment variables)
Automatic authentication
Fine-grained permissions per Valohai environment
Step 1: Create IAM Policy
Grant the ValohaiWorkerRole permission to get Redshift credentials.
IAM Policy JSON:
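A minimal policy sketch assembled from the placeholders and resource explanation below; adjust the ARNs to your cluster before use:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "redshift:GetClusterCredentials",
      "Resource": [
        "arn:aws:redshift:<region>:<account-id>:dbuser:<cluster-identifier>/<db-username>",
        "arn:aws:redshift:<region>:<account-id>:dbname:<cluster-identifier>/<database-name>"
      ]
    }
  ]
}
```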
Replace these values:
<region> — AWS region (e.g., us-east-1)
<account-id> — Your AWS account ID
<cluster-identifier> — Redshift cluster identifier
<db-username> — Database user for Valohai
<database-name> — Database name
Policy explanation:
redshift:GetClusterCredentials — Allows temporary credential generation
dbuser resource — Specifies which database user Valohai can impersonate
dbname resource — Restricts access to a specific database
Step 2: Attach Policy to Valohai Role
In AWS IAM, find your Valohai worker role (e.g., ValohaiWorkerRole)
Attach the policy you created above
Valohai executions can now authenticate to Redshift
💡 Multiple environments: Create separate IAM roles for dev/staging/production Valohai environments to control access granularly.
Step 3: Query Using IAM Authentication
Python example with IAM authentication:
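A minimal sketch: boto3 requests temporary credentials using the worker's IAM role, then psycopg2 connects with them. The region, cluster identifier, endpoint, and the events table are placeholder values:

```python
import boto3
import psycopg2

# Placeholder values -- replace with your cluster's details.
REGION = "us-east-1"
CLUSTER_ID = "my-cluster"
DB_NAME = "analytics"
DB_USER = "valohai_user"
HOST = "my-cluster.abc123.us-east-1.redshift.amazonaws.com"
PORT = 5439

# boto3 picks up the worker's IAM role automatically on EC2;
# no secrets are read from the environment or disk.
redshift = boto3.client("redshift", region_name=REGION)
creds = redshift.get_cluster_credentials(
    DbUser=DB_USER,
    DbName=DB_NAME,
    ClusterIdentifier=CLUSTER_ID,
    DurationSeconds=3600,
)

conn = psycopg2.connect(
    host=HOST,
    port=PORT,
    dbname=DB_NAME,
    user=creds["DbUser"],          # Redshift returns this prefixed with "IAM:"
    password=creds["DbPassword"],  # temporary; expires after DurationSeconds
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM events;")  # 'events' is a placeholder table
    print(cur.fetchone()[0])
conn.close()
```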
⚠️ The example above only shows the "happy path", assuming that every request succeeds and returns the expected data.
For a full example, take a look at this GitHub page.
Install dependencies:
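For example, matching the libraries in the sketch above:

```bash
pip install boto3 psycopg2-binary
```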
Add to your valohai.yaml command:
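A sketch with an illustrative step name, image, and script name:

```yaml
- step:
    name: query-redshift
    image: python:3.11
    command:
      - pip install boto3 psycopg2-binary
      - python query.py
```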
Or include in your Docker image:
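For example:

```dockerfile
FROM python:3.11
RUN pip install boto3 psycopg2-binary
```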
Option 2: Username/Password Authentication
Store Redshift credentials as Valohai environment variables.
Step 1: Store Credentials in Valohai
Open your project in Valohai
Go to Settings → Env Variables
Add the following variables:
| Name | Example value | Secret |
| --- | --- | --- |
| REDSHIFT_HOST | my-cluster.abc123.us-east-1.redshift.amazonaws.com | No |
| REDSHIFT_DATABASE | analytics | No |
| REDSHIFT_PORT | 5439 | No |
| REDSHIFT_USER | ml_pipeline_user | No |
| REDSHIFT_PASSWORD | your-secure-password | Yes |
💡 Environment Variable Groups: Organization admins can create shared credential groups under Organization Settings instead of configuring each project separately.
Step 2: Query Using Credentials
Python example with psycopg2:
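A minimal sketch reading the environment variables from Step 1; the query is a placeholder:

```python
import os
import psycopg2

# Connection details come from the Valohai environment variables above.
conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=int(os.environ.get("REDSHIFT_PORT", "5439")),
    dbname=os.environ["REDSHIFT_DATABASE"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM events LIMIT 10;")  # placeholder query
    for row in cur.fetchall():
        print(row)
conn.close()
```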
Install dependencies:
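For example:

```bash
pip install psycopg2-binary
```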
Or in Dockerfile:
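```dockerfile
FROM python:3.11
RUN pip install psycopg2-binary
```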
Using psql Command-Line Client
For SQL-heavy workflows, use the psql command-line tool directly.
Example: Query to CSV Output
valohai.yaml:
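A sketch with an illustrative step name; the postgres image ships with psql, and the --csv flag requires psql 12 or newer:

```yaml
- step:
    name: redshift-to-csv
    image: postgres:16
    command:
      - psql "host=$REDSHIFT_HOST port=$REDSHIFT_PORT dbname=$REDSHIFT_DATABASE user=$REDSHIFT_USER password=$REDSHIFT_PASSWORD" --csv -f query.sql -o /valohai/outputs/results.csv
```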
query.sql (in your repository):
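For example (table and columns are placeholders):

```sql
-- Placeholder query; replace with your own.
SELECT user_id, event_type, created_at
FROM events
WHERE created_at >= '2024-01-01';
```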
What happens:
psql connects to Redshift
Executes the query from the query.sql file
Outputs results as CSV to /valohai/outputs/results.csv
The file is uploaded to your data store automatically
Install psql in custom image:
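For example, on a Debian-based image:

```dockerfile
FROM python:3.11
RUN apt-get update \
    && apt-get install -y --no-install-recommends postgresql-client \
    && rm -rf /var/lib/apt/lists/*
```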
Complete Workflow: Query → Snapshot → Train
Step 1: Query and Save Snapshot
fetch_data.py:
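A sketch assuming the Option 2 environment variables, a placeholder events table, and a hypothetical dataset named redshift-snapshots; the sidecar .metadata.json file follows Valohai's outputs metadata convention for creating dataset versions:

```python
import json
import os

import pandas as pd
import psycopg2

# Connect using the credentials stored as Valohai environment variables.
conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=int(os.environ.get("REDSHIFT_PORT", "5439")),
    dbname=os.environ["REDSHIFT_DATABASE"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)

# Query once and snapshot the result ('events' is a placeholder table).
df = pd.read_sql("SELECT * FROM events WHERE created_at >= '2024-01-01'", conn)
conn.close()

# Anything written to /valohai/outputs/ is uploaded to your data store.
out_path = "/valohai/outputs/snapshot.csv"
df.to_csv(out_path, index=False)

# Sidecar metadata attaches the file to a dataset version (hypothetical
# dataset name), so training runs can pin to this exact snapshot.
metadata = {"valohai.dataset-versions": ["dataset://redshift-snapshots/v1"]}
with open(out_path + ".metadata.json", "w") as f:
    json.dump(metadata, f)
```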
Step 2: Train on Snapshot
train.py:
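A sketch assuming an input named dataset that points at the snapshot created above:

```python
import pandas as pd

# Valohai downloads declared inputs under /valohai/inputs/<input-name>/.
df = pd.read_csv("/valohai/inputs/dataset/snapshot.csv")

# Train on the snapshot only -- no live Redshift queries here,
# so every re-run sees identical data.
print(f"Training on {len(df)} rows")
```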
Step 3: Pipeline Configuration
valohai.yaml:
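A sketch tying the two steps into a pipeline; step, node, and dataset names are illustrative:

```yaml
- step:
    name: fetch-data
    image: python:3.11
    command:
      - pip install pandas psycopg2-binary
      - python fetch_data.py

- step:
    name: train
    image: python:3.11
    inputs:
      - name: dataset
        default: dataset://redshift-snapshots/v1
    command:
      - pip install pandas
      - python train.py

- pipeline:
    name: query-snapshot-train
    nodes:
      - name: fetch
        type: execution
        step: fetch-data
      - name: train-model
        type: execution
        step: train
    edges:
      - [fetch.output.snapshot.csv, train-model.input.dataset]
```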
Maintaining Reproducibility
⚠️ Critical: Redshift data changes continuously. Query results today differ from results tomorrow.
The problem: training code that queries Redshift directly is not reproducible. Re-running the same SQL next week returns different results, because the warehouse has changed underneath it.
The solution: separate querying from training. Query once, save the result to /valohai/outputs/ as a versioned dataset, and have training read only that snapshot.
Best practices:
Query once — Run query in dedicated execution
Snapshot immediately — Save to /valohai/outputs/
Version snapshots — Create dataset versions
Train on snapshots — Use dataset as input, never query directly in training
Schedule snapshots — Create fresh snapshots daily/weekly/monthly
See: Databases for complete reproducibility patterns.
Common Issues & Fixes
Connection Timeout
Symptom: OperationalError: timeout expired
Causes & Fixes:
Security group blocking Valohai IPs → Add Valohai worker security group to Redshift inbound rules
Wrong host/port → Verify Redshift endpoint and port (usually 5439)
Network connectivity → Check VPC configuration
IAM Authentication Fails
Symptom: botocore.exceptions.ClientError: An error occurred (AccessDenied)
Causes & Fixes:
IAM policy missing → Verify policy attached to Valohai worker role
Wrong ARN in policy → Double-check cluster identifier, region, account ID
DB user doesn't exist → Create the user in Redshift:
CREATE USER valohai_user PASSWORD DISABLE;
Related Pages
Databases — Overview and reproducibility patterns
BigQuery — Connect to Google BigQuery
Snowflake — Connect to Snowflake
Create and Manage Datasets — Version your snapshots
Next Steps
Set up IAM authentication or store credentials
Create a test query execution
Save your first snapshot as a dataset version
Build a reproducible training pipeline using snapshots