Databases
Connect to production databases from your ML pipelines to query fresh data, but snapshot results to maintain reproducibility as data evolves.
The Challenge: Databases Change
Production databases are constantly updated with new data. This creates a reproducibility problem for ML:
Today's query:

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 50,000 rows
```

Same query next month:

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 65,000 rows (15,000 new customers!)
```

The problem:
Rerunning your training pipeline produces different results
Can't reproduce experiments from last week
Hard to debug: "Did the model improve, or did the data change?"
Compliance issues: "Which data was used to train the production model?"
The Solution: Query + Snapshot Pattern
Valohai solves this with a simple workflow:
Query — Fetch data from your database
Snapshot — Save results to /valohai/outputs/
Version — Create dataset version with snapshot
Train — Use versioned dataset as input
Benefits:
Reproducible: Same dataset version = same results forever
Traceable: Know exactly which data trained each model
Fast: Reuse snapshots instead of requerying database
Compliant: Audit trail of data used in production models
Authentication: Machine Identity vs. Credentials
Choose your authentication method based on security requirements and available infrastructure.
Decision Tree
In short: if your pipelines run on AWS or GCP and your database supports machine identity, prefer it; otherwise, fall back to stored credentials.
Machine Identity (Recommended)
What it is: Your execution environment's cloud identity authenticates automatically.
Pros:
No credentials to manage
More secure (no secrets in code or environment variables)
Automatic rotation via cloud IAM
Fine-grained permissions per environment
Cons:
Cloud-specific (AWS IAM, GCP Service Accounts)
Requires infrastructure setup
When to use: Production pipelines on AWS (Redshift) or GCP (BigQuery)
Supported:
AWS Redshift → IAM Roles
GCP BigQuery → Service Accounts
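As a minimal sketch of what this looks like in practice, here is BigQuery with an attached service account: the client authenticates from the environment, so no secrets appear in code. The project, dataset, and table names are placeholders.

```python
# Minimal sketch: BigQuery via machine identity (GCP service account).
# bigquery.Client() picks up the environment's service account
# credentials automatically; no secrets in code or env variables.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table name; replace with your own project.dataset.table.
query = "SELECT COUNT(*) AS n FROM `my_project.my_dataset.customer_data`"
for row in client.query(query).result():
    print(f"Rows available: {row.n}")
```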
Stored Credentials
What it is: Store database username/password as Valohai environment variables.
Pros:
Works with any database
Simple setup
Cloud-agnostic
Cons:
Credentials to manage and rotate
Secrets sit in environment variables (marked secret, but less isolated than machine identity)
When to use:
Databases without machine identity support
Quick prototypes
On-premises databases
Cross-cloud scenarios
Supported: All databases (Redshift, BigQuery, Snowflake, PostgreSQL, MySQL, etc.)
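A minimal sketch of the stored-credentials pattern, assuming a PostgreSQL-compatible database and the DB_* environment variables described later on this page:

```python
# Minimal sketch: connect using credentials stored as Valohai
# environment variables (names follow the DB_* pattern shown below).
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],  # marked as Secret in Valohai
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
```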
Complete Workflow Example
Here's the full pattern: query database → save snapshot → create dataset → train model.
Step 1: Query Database and Save Snapshot
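A sketch of this step, assuming a PostgreSQL-compatible database and the DB_* environment variables from the section below; the query and filename are illustrative:

```python
# Step 1 sketch: query the database and write the result to
# /valohai/outputs/, which Valohai uploads and versions automatically.
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
df = pd.read_sql(
    "SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'", conn
)

df.to_csv("/valohai/outputs/customer_data.csv", index=False)
print(f"Snapshot saved: {len(df)} rows")
```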
Step 2: Create Dataset Version with Snapshot
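One way to do this is Valohai's output metadata sidecar: write a `<filename>.metadata.json` next to the output and list the dataset versions the file should join. The dataset and version names here are placeholders.

```python
# Step 2 sketch: attach the snapshot to a dataset version via a
# metadata sidecar file saved alongside the output.
import json

metadata = {
    "valohai.dataset-versions": ["dataset://customer-data/2024-q1"],
}
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```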
Step 3: Train Using Versioned Dataset
valohai.yaml:
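An illustrative step definition; the step name, image, and dataset URI are placeholders:

```yaml
# Illustrative sketch: a training step that takes the versioned
# dataset as an input (names and image are placeholders).
- step:
    name: train-model
    image: python:3.11
    command:
      - pip install pandas scikit-learn
      - python train.py
    inputs:
      - name: training-data
        default: dataset://customer-data/2024-q1
```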
train.py:
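A matching sketch of the training script; it reads the snapshot from the Valohai input directory:

```python
# Illustrative train.py: Valohai places the "training-data" input
# under /valohai/inputs/training-data/ before the script runs.
import glob

import pandas as pd

path = glob.glob("/valohai/inputs/training-data/*.csv")[0]
df = pd.read_csv(path)
print(f"Training on {len(df)} rows from {path}")
# ...model training goes here...
```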
Step 4: Update Snapshot (Monthly/Weekly)
Create new dataset versions as data evolves:
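For example, a later run can tag its fresh snapshot as a new version while the old one stays immutable (a sketch; names are placeholders):

```python
# Step 4 sketch: the same query step, run later, tags its output as a
# new dataset version; earlier versions remain immutable.
import json

metadata = {
    "valohai.dataset-versions": ["dataset://customer-data/2024-q2"],
}
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```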
Result:
2024-q1 version: 50,000 rows (immutable)
2024-q2 version: 65,000 rows (immutable)
Can train on either version reproducibly
Can compare model performance across data versions
Why Snapshots Matter: Real Example
Without snapshots: rerunning January's training pipeline in March queries the live database, pulls different rows, and produces a model you can't compare to the original or reproduce for an audit.
With snapshots: the January dataset version is immutable, so you can retrain, debug, or audit against exactly the data the original model saw.
Best Practices
Snapshot Every Query: write every query result to /valohai/outputs/ so each training run can be traced back to its exact data.
Version Your Snapshots: attach each snapshot to a named dataset version instead of overwriting files, so older versions stay usable.
Tag Snapshots with Metadata
This metadata helps with:
Debugging ("Which data was this?")
Compliance ("Prove what data trained this model")
Optimization ("Do we need to requery or can we reuse?")
Schedule Regular Snapshots
Create a pipeline that runs automatically:
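An illustrative pipeline sketch, assuming two steps named query-snapshot and train-model exist in valohai.yaml; the node names and edge wildcard are placeholders:

```yaml
# Illustrative sketch: a pipeline whose snapshot node feeds the
# training node (step and node names are placeholders).
- pipeline:
    name: snapshot-and-train
    nodes:
      - name: snapshot
        type: execution
        step: query-snapshot
      - name: train
        type: execution
        step: train-model
    edges:
      - [snapshot.output.*.csv, train.input.training-data]
```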
Run weekly/monthly via Valohai scheduling or external orchestration (Airflow, GitHub Actions, etc.)
Separate Query from Training
Benefits:
Reuse snapshots across experiments
Iterate on models without requerying database
Clear separation of concerns
Environment Variables for Credentials
All database pages use environment variables for storing credentials. Here's the standard pattern:
Project-Level Variables
Open your project in Valohai
Go to Settings → Env Variables
Add database credentials
Mark sensitive values (passwords, tokens) as Secret
Example for any database:
| Variable | Example value | Secret |
| --- | --- | --- |
| DB_HOST | my-database.example.com | No |
| DB_NAME | production_db | No |
| DB_USER | ml_pipeline_user | No |
| DB_PASSWORD | your-secure-password | Yes |
| DB_PORT | 5432 | No |
Organization-Level Variable Groups
For shared credentials across projects:
An organization admin goes to Organization Settings → Environment Variables
Create a new environment variable group (e.g., "Production Database Credentials")
Add the shared variables to the group
Projects can then inherit the group
Benefits:
Central credential management
One update affects all projects
Easier rotation
Consistent naming
Common Patterns
Daily Snapshot for Real-Time Features
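A sketch: tag each day's snapshot as a dated dataset version so downstream training stays reproducible (dataset and file names are placeholders):

```python
# Sketch: date-stamped dataset versions for a daily feature snapshot.
import json
from datetime import date

version = date.today().isoformat()  # e.g. "2024-03-15"
metadata = {
    "valohai.dataset-versions": [f"dataset://customer-features/{version}"],
}
with open("/valohai/outputs/features.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```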
Incremental Snapshots
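A sketch of an incremental pull, assuming a watermark passed in via an environment variable; the column and variable names are illustrative:

```python
# Sketch: query only rows added since the last snapshot, using a
# watermark value (here an illustrative LAST_SNAPSHOT_DATE variable).
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
df = pd.read_sql(
    "SELECT * FROM customer_data WHERE signup_date > %(since)s",
    conn,
    params={"since": os.environ.get("LAST_SNAPSHOT_DATE", "2024-01-01")},
)
df.to_csv("/valohai/outputs/customer_data_increment.csv", index=False)
```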
Train/Validation/Test Splits from Database
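A sketch of a reproducible split performed on the snapshot, so the split itself can be versioned; ratios and filenames are illustrative:

```python
# Sketch: deterministic 80/10/10 split of a snapshot; writing each
# part to /valohai/outputs/ lets the split be versioned as a dataset.
import pandas as pd

df = pd.read_csv("/valohai/inputs/training-data/customer_data.csv")
df = df.sample(frac=1.0, random_state=42)  # reproducible shuffle

n = len(df)
splits = {
    "train": df.iloc[: int(0.8 * n)],
    "validation": df.iloc[int(0.8 * n) : int(0.9 * n)],
    "test": df.iloc[int(0.9 * n) :],
}
for name, part in splits.items():
    part.to_csv(f"/valohai/outputs/{name}.csv", index=False)
```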
Database-Specific Guides
Each database has unique setup requirements and best practices:
AWS Redshift
IAM Role authentication (recommended)
Username/password authentication
Security group configuration
Python and psql examples
BigQuery
Service Account authentication (recommended)
Cross-project access
Python client setup
Query optimization
Snowflake
Username/password authentication
Time Travel feature integration
Python connector setup
Dependency management
Related Pages
AWS Redshift — Connect to Amazon Redshift
BigQuery — Connect to Google BigQuery
Snowflake — Connect to Snowflake
Create and Manage Datasets — Dataset versioning details
Ingest & Save Files — Save outputs as datasets
Next Steps
Choose your database from the guides above
Set up authentication (machine identity or credentials)
Create your first query → snapshot → train pipeline
Schedule regular snapshots for fresh data
Build reproducible ML pipelines with versioned data