AWS Redshift

Query AWS Redshift from Valohai executions and save snapshots for reproducible ML pipelines.


Overview

AWS Redshift is a cloud data warehouse that you can query directly from Valohai executions. This guide shows you how to:

  1. Set up authentication (IAM roles or credentials)

  2. Query Redshift from your code

  3. Save snapshots for reproducibility


Prerequisites

Before you begin:

  • Existing Redshift cluster with data to query

  • Network access — Inbound rule allowing connections from Valohai workers' security group

  • Credentials or IAM role for authentication


Authentication: Choose Your Method

Option 1: IAM Role Authentication (Recommended)

Use your EC2 instances' IAM identity to authenticate without storing credentials.

How It Works

  1. Your AWS account has an IAM Role for EC2 instances used by Valohai (e.g., ValohaiWorkerRole)

  2. Grant this role permission to access Redshift

  3. Executions automatically authenticate using the machine's identity

  4. No credentials stored in Valohai

Benefits

  • No credentials to manage or rotate

  • More secure (no secrets in environment variables)

  • Automatic authentication

  • Fine-grained permissions per Valohai environment


Step 1: Create IAM Policy

Grant the ValohaiWorkerRole permission to get Redshift credentials.

IAM Policy JSON:
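
A minimal policy sketch; the placeholder values in the ARNs are explained below:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "redshift:GetClusterCredentials",
      "Resource": [
        "arn:aws:redshift:<region>:<account-id>:dbuser:<cluster-identifier>/<db-username>",
        "arn:aws:redshift:<region>:<account-id>:dbname:<cluster-identifier>/<database-name>"
      ]
    }
  ]
}
```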

Replace these values:

  • <region> — AWS region (e.g., us-east-1)

  • <account-id> — Your AWS account ID

  • <cluster-identifier> — Redshift cluster identifier

  • <db-username> — Database user for Valohai

  • <database-name> — Database name

Policy explanation:

  • redshift:GetClusterCredentials — Allows temporary credential generation

  • dbuser resource — Specifies which database user Valohai can impersonate

  • dbname resource — Restricts access to specific database


Step 2: Attach Policy to Valohai Role

  1. In AWS IAM, find your Valohai worker role (e.g., ValohaiWorkerRole)

  2. Attach the policy you created above

  3. Valohai executions can now authenticate to Redshift

💡 Multiple environments: Create separate IAM roles for dev/staging/production Valohai environments to control access granularly.


Step 3: Query Using IAM Authentication

Python example with IAM authentication:
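
A sketch of the flow, assuming a cluster identifier my-cluster in us-east-1 and a database user valohai_user (all hypothetical; substitute your own values):

```python
# IAM authentication sketch: the cluster identifier, region, endpoint, and
# database user below are hypothetical -- replace them with your own.

def connection_params(host, creds, database, port=5439):
    """Build psycopg2 connection kwargs from temporary IAM credentials."""
    return {
        "host": host,
        "port": port,
        "dbname": database,
        "user": creds["DbUser"],          # AWS prefixes this with "IAM:"
        "password": creds["DbPassword"],  # valid only for DurationSeconds
    }

def query_with_iam(sql):
    """Fetch temporary credentials via the worker's IAM role, then run a query."""
    import boto3      # imported lazily so connection_params stays importable
    import psycopg2

    client = boto3.client("redshift", region_name="us-east-1")
    creds = client.get_cluster_credentials(
        ClusterIdentifier="my-cluster",
        DbUser="valohai_user",
        DbName="analytics",
        DurationSeconds=3600,
    )
    conn = psycopg2.connect(**connection_params(
        "my-cluster.abc123.us-east-1.redshift.amazonaws.com", creds, "analytics"))
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```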

⚠️ The example above only shows the "happy path": it assumes every request succeeds and returns the expected data.

For a full example, take a look at this GitHub page.

Install dependencies:

Add to your valohai.yaml command:
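
One way to wire this into a step (the step name, image, and script name are illustrative):

```yaml
- step:
    name: fetch-redshift-data
    image: python:3.11
    command:
      - pip install boto3 psycopg2-binary
      - python fetch_data.py
```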

Or include in your Docker image:
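
A minimal Dockerfile sketch with the dependencies baked in:

```dockerfile
FROM python:3.11-slim
RUN pip install --no-cache-dir boto3 psycopg2-binary
```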


Option 2: Username/Password Authentication

Store Redshift credentials as Valohai environment variables.

Step 1: Store Credentials in Valohai

  1. Open your project in Valohai

  2. Go to Settings → Env Variables

  3. Add the following variables:

| Name | Value | Secret |
| --- | --- | --- |
| REDSHIFT_HOST | my-cluster.abc123.us-east-1.redshift.amazonaws.com | No |
| REDSHIFT_DATABASE | analytics | No |
| REDSHIFT_PORT | 5439 | No |
| REDSHIFT_USER | ml_pipeline_user | No |
| REDSHIFT_PASSWORD | your-secure-password | Yes |

💡 Environment Variable Groups: Organization admins can create shared credential groups under Organization Settings instead of configuring each project separately.


Step 2: Query Using Credentials

Python example with psycopg2:
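
A sketch that reads the environment variables defined above (Redshift speaks the PostgreSQL wire protocol, so psycopg2 works unchanged):

```python
import os

def connection_kwargs(env=os.environ):
    """Map the Valohai environment variables to psycopg2 connection arguments."""
    return {
        "host": env["REDSHIFT_HOST"],
        "port": int(env.get("REDSHIFT_PORT", "5439")),
        "dbname": env["REDSHIFT_DATABASE"],
        "user": env["REDSHIFT_USER"],
        "password": env["REDSHIFT_PASSWORD"],
    }

def run_query(sql):
    """Open a connection, run one query, and return all rows."""
    import psycopg2  # imported lazily so connection_kwargs works without it

    conn = psycopg2.connect(**connection_kwargs())
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()
```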

Install dependencies:

Or in Dockerfile:
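
The equivalent Dockerfile line:

```dockerfile
RUN pip install --no-cache-dir psycopg2-binary
```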


Using psql Command-Line Client

For SQL-heavy workflows, use the psql command-line tool directly.

Example: Query to CSV Output

valohai.yaml:
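
A sketch of such a step; the step name is illustrative, and it assumes the environment variables from earlier plus an image that ships psql (psql's --csv flag requires version 12 or newer):

```yaml
- step:
    name: query-to-csv
    image: postgres:16   # includes the psql client
    command:
      - export PGPASSWORD="$REDSHIFT_PASSWORD"
      - psql --host "$REDSHIFT_HOST" --port "$REDSHIFT_PORT" --dbname "$REDSHIFT_DATABASE" --username "$REDSHIFT_USER" --csv --file query.sql --output /valohai/outputs/results.csv
```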

query.sql (in your repository):
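
An illustrative query file; the table and columns are placeholders for your own schema:

```sql
-- Placeholder query: replace with your own tables and filters.
SELECT user_id, event_type, created_at
FROM events
WHERE created_at >= '2024-01-01'
ORDER BY created_at;
```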

What happens:

  1. psql connects to Redshift

  2. Executes query from query.sql file

  3. Outputs results as CSV to /valohai/outputs/results.csv

  4. File uploaded to your data store automatically

Install psql in custom image:
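
A sketch for a Debian-based image:

```dockerfile
FROM python:3.11-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends postgresql-client \
    && rm -rf /var/lib/apt/lists/*
```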


Complete Workflow: Query → Snapshot → Train

Step 1: Query and Save Snapshot

fetch_data.py:
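
A sketch of the fetch script; the query, table, and output filename are illustrative, and main() is what the execution's command would invoke:

```python
# fetch_data.py -- query Redshift once and save the result as a CSV snapshot
# under /valohai/outputs/, where Valohai uploads and versions it automatically.
import csv
import os

def write_snapshot(rows, header, path):
    """Write query results as CSV; files in /valohai/outputs/ become outputs."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path

def main():
    import psycopg2

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DATABASE"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT user_id, event_type, created_at FROM events;")
            header = [col.name for col in cur.description]
            write_snapshot(cur.fetchall(), header,
                           "/valohai/outputs/training_data.csv")
    finally:
        conn.close()
```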


Step 2: Train on Snapshot

train.py:
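
A sketch of the training side; the input name training_data is illustrative and must match the input defined in valohai.yaml:

```python
# train.py -- read the snapshot passed in as a Valohai input instead of
# querying Redshift, so every re-run trains on identical data.
import csv
import glob
import os

def load_snapshot(input_dir="/valohai/inputs/training_data"):
    """Read the first CSV file found in the input directory."""
    path = sorted(glob.glob(os.path.join(input_dir, "*.csv")))[0]
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)

def main():
    header, rows = load_snapshot()
    print(f"Training on {len(rows)} rows with columns {header}")
    # ... model training on `rows` goes here ...
```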


Step 3: Pipeline Configuration

valohai.yaml:
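
A pipeline sketch connecting the two steps; the node, step, and input names are illustrative and assume steps named fetch-redshift-data and train-model are defined in the same valohai.yaml:

```yaml
- pipeline:
    name: query-and-train
    nodes:
      - name: fetch
        type: execution
        step: fetch-redshift-data
      - name: train
        type: execution
        step: train-model
    edges:
      - [fetch.output.training_data.csv, train.input.training_data]
```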


Maintaining Reproducibility

⚠️ Critical: Redshift data changes continuously. Query results today differ from results tomorrow.

The problem: if your training step queries Redshift directly, re-running the pipeline later pulls different data, so earlier results cannot be reproduced or debugged.

The solution: run the query once in a dedicated execution, save the results to /valohai/outputs/, and point training runs at that versioned snapshot instead of the live database.

Best practices:

  1. Query once — Run query in dedicated execution

  2. Snapshot immediately — Save to /valohai/outputs/

  3. Version snapshots — Create dataset versions

  4. Train on snapshots — Use dataset as input, never query directly in training

  5. Schedule snapshots — Create fresh snapshots daily/weekly/monthly

See: Databases for complete reproducibility patterns.


Common Issues & Fixes

Connection Timeout

Symptom: OperationalError: timeout expired

Causes & Fixes:

  • Security group blocking Valohai IPs → Add Valohai worker security group to Redshift inbound rules

  • Wrong host/port → Verify Redshift endpoint and port (usually 5439)

  • Network connectivity → Check VPC configuration


IAM Authentication Fails

Symptom: botocore.exceptions.ClientError: An error occurred (AccessDenied)

Causes & Fixes:

  • IAM policy missing → Verify policy attached to Valohai worker role

  • Wrong ARN in policy → Double-check cluster identifier, region, account ID

  • DB user doesn't exist → Create the user in Redshift: CREATE USER valohai_user PASSWORD DISABLE;



Next Steps

  • Set up IAM authentication or store credentials

  • Create a test query execution

  • Save your first snapshot as a dataset version

  • Build a reproducible training pipeline using snapshots
