Databases

Connect to production databases from your ML pipelines to query fresh data, but snapshot results to maintain reproducibility as data evolves.


The Challenge: Databases Change

Production databases are constantly updated with new data. This creates a reproducibility problem for ML:

Today's query:

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 50,000 rows
```

Same query next month:

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 65,000 rows (15,000 new customers!)
```

The problem:

  • Rerunning your training pipeline produces different results

  • Can't reproduce experiments from last week

  • Hard to debug: "Did the model improve, or did the data change?"

  • Compliance issues: "Which data was used to train the production model?"


The Solution: Query + Snapshot Pattern

Valohai solves this with a simple workflow:

  1. Query — Fetch data from your database

  2. Snapshot — Save results to /valohai/outputs/

  3. Version — Create dataset version with snapshot

  4. Train — Use versioned dataset as input

Benefits:

  • Reproducible: Same dataset version = same results forever

  • Traceable: Know exactly which data trained each model

  • Fast: Reuse snapshots instead of requerying database

  • Compliant: Audit trail of data used in production models


Authentication: Machine Identity vs. Credentials

Choose your authentication method based on security requirements and available infrastructure.

Decision Tree

If your database and compute run in the same cloud (AWS Redshift or GCP BigQuery), prefer machine identity; otherwise, use stored credentials.

Machine Identity (Recommended)

What it is: Your execution environment's cloud identity authenticates automatically.

Pros:

  • No credentials to manage

  • More secure (no secrets in code or environment variables)

  • Automatic rotation via cloud IAM

  • Fine-grained permissions per environment

Cons:

  • Cloud-specific (AWS IAM, GCP Service Accounts)

  • Requires infrastructure setup

When to use: Production pipelines on AWS (Redshift) or GCP (BigQuery)

Supported:

  • AWS Redshift → IAM Roles

  • GCP BigQuery → Service Accounts
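
For example, with a GCP service account attached to the execution environment, the BigQuery client authenticates without any stored secrets (the table name here is illustrative):

```python
# No credentials in code or environment variables: the client picks up the
# execution environment's service account via Application Default Credentials.
from google.cloud import bigquery

client = bigquery.Client()
df = client.query("SELECT * FROM my_dataset.customer_data LIMIT 10").to_dataframe()
```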


Stored Credentials

What it is: Store database username/password as Valohai environment variables.

Pros:

  • Works with any database

  • Simple setup

  • Cloud-agnostic

Cons:

  • Credentials to manage and rotate

  • Stored as environment variables (marked secret)

When to use:

  • Databases without machine identity support

  • Quick prototypes

  • On-premises databases

  • Cross-cloud scenarios

Supported: All databases (Redshift, BigQuery, Snowflake, PostgreSQL, MySQL, etc.)


Complete Workflow Example

Here's the full pattern: query database → save snapshot → create dataset → train model.

Step 1: Query Database and Save Snapshot

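A minimal sketch in Python, assuming a PostgreSQL database and the DB_* environment variables described under Environment Variables for Credentials below; swap in your database's driver (redshift_connector, google-cloud-bigquery, snowflake-connector-python, etc.) as needed:

```python
# query_snapshot.py — query the live table, then persist the result as an output
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
df = pd.read_sql("SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'", conn)
conn.close()

# Anything written to /valohai/outputs/ is uploaded and versioned by Valohai
df.to_csv("/valohai/outputs/customer_data.csv", index=False)
print(f"Snapshot saved: {len(df)} rows")
```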

Step 2: Create Dataset Version with Snapshot

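Valohai turns an output into a dataset version when you save a sidecar metadata file next to it. A sketch, assuming the snapshot from Step 1 and a dataset named customer_data:

```python
# Runs in the same execution as Step 1, after the CSV has been written.
import json

# The sidecar file must be named <output filename>.metadata.json
metadata = {"valohai.dataset-versions": ["dataset://customer_data/2024-q1"]}
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```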

Step 3: Train Using Versioned Dataset

valohai.yaml:
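
A sketch of the step definition, pinning the input to the dataset version created in Step 2; the step name, input name, and Docker image are illustrative:

```yaml
- step:
    name: train-model
    image: python:3.11
    command: python train.py
    inputs:
      - name: training-data
        default: dataset://customer_data/2024-q1
```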

train.py:
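
A matching sketch that reads the pinned dataset version from the Valohai inputs directory:

```python
# train.py — Valohai downloads the dataset version to /valohai/inputs/<input name>/
import glob

import pandas as pd

path = glob.glob("/valohai/inputs/training-data/*.csv")[0]
df = pd.read_csv(path)
print(f"Training on {len(df)} rows")  # identical on every rerun of this version

# ... model training on df goes here ...
```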


Step 4: Update Snapshot (Monthly/Weekly)

Create new dataset versions as data evolves:
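
One way to sketch it: rerun the Step 1 query on a schedule and register the fresh snapshot under a new version name, leaving earlier versions untouched:

```python
# After rerunning the query and rewriting customer_data.csv as in Step 1:
import json

version = "2024-q2"  # e.g., derived from the current quarter
metadata = {"valohai.dataset-versions": [f"dataset://customer_data/{version}"]}
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```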

Result:

  • 2024-q1 version: 50,000 rows (immutable)

  • 2024-q2 version: 65,000 rows (immutable)

  • Can train on either version reproducibly

  • Can compare model performance across data versions


Why Snapshots Matter: Real Example

Without snapshots: you train a model in January on 50,000 rows. In March you rerun the pipeline to debug a metric drop, the same query now returns 65,000 rows, and you can no longer tell whether the change came from the model or from the data.

With snapshots: the January model's input is pinned to an immutable version such as dataset://customer_data/2024-q1. Rerunning the pipeline reads exactly the same rows, and the March data lands in a new version (2024-q2) that you can compare against deliberately.


Best Practices

Snapshot Every Query

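A hypothetical helper that enforces the habit: query results are always written to /valohai/outputs/ before they are used for anything else:

```python
import pandas as pd


def query_and_snapshot(conn, query: str, filename: str) -> pd.DataFrame:
    """Run a query and persist the result as a Valohai output before use."""
    df = pd.read_sql(query, conn)
    df.to_csv(f"/valohai/outputs/{filename}", index=False)  # snapshot first
    return df  # then train, analyze, etc. from the same frame
```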

Version Your Snapshots

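A sketch of deriving sortable, human-readable version names; the quarterly scheme matches the 2024-q1 / 2024-q2 examples above:

```python
from datetime import date

today = date.today()
quarterly = f"{today.year}-q{(today.month - 1) // 3 + 1}"  # e.g. 2024-q2
daily = today.isoformat()                                  # e.g. 2024-07-01

dataset_version = f"dataset://customer_data/{quarterly}"
```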

Tag Snapshots with Metadata

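A sketch extending the Step 2 sidecar file with descriptive properties; the non-valohai.* keys are illustrative custom fields, assuming arbitrary keys are allowed alongside the Valohai ones:

```python
import json

metadata = {
    "valohai.dataset-versions": ["dataset://customer_data/2024-q2"],
    # Custom, human-oriented context about the snapshot:
    "query": "SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'",
    "row_count": 65000,
    "snapshot_date": "2024-07-01",
    "source_table": "customer_data",
}
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump(metadata, f)
```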
This metadata helps with:

  • Debugging ("Which data was this?")

  • Compliance ("Prove what data trained this model")

  • Optimization ("Do we need to requery or can we reuse?")


Schedule Regular Snapshots

Create a pipeline that runs automatically:
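
A sketch of a two-node pipeline, assuming steps named query-snapshot and train-model are defined in the same valohai.yaml (node and edge names are illustrative):

```yaml
- pipeline:
    name: weekly-snapshot-and-train
    nodes:
      - name: snapshot
        type: execution
        step: query-snapshot
      - name: train
        type: execution
        step: train-model
    edges:
      - [snapshot.output.customer_data.csv, train.input.training-data]
```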

Run it weekly or monthly via Valohai scheduling or external orchestration (Airflow, GitHub Actions, etc.).


Separate Query from Training

Run the database query and model training as separate steps (separate executions or pipeline nodes, as in the pipeline sketch above), with the versioned snapshot as the hand-off point.

Benefits:

  • Reuse snapshots across experiments

  • Iterate on models without requerying database

  • Clear separation of concerns


Environment Variables for Credentials

All database pages use environment variables for storing credentials. Here's the standard pattern:

Project-Level Variables

  1. Open your project in Valohai

  2. Go to Settings → Env Variables

  3. Add database credentials

  4. Mark sensitive values (passwords, tokens) as Secret

Example for any database:

| Name | Value | Secret |
| --- | --- | --- |
| DB_HOST | my-database.example.com | No |
| DB_NAME | production_db | No |
| DB_USER | ml_pipeline_user | No |
| DB_PASSWORD | your-secure-password | Yes |
| DB_PORT | 5432 | No |
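
In code, these are read as ordinary environment variables; for example, with psycopg2:

```python
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],  # marked Secret in Valohai; never hardcode
)
```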


Organization-Level Variable Groups

For shared credentials across projects:

  1. Organization admin goes to Organization Settings → Environment Variables

  2. Add new environment variable group

  3. Create group (e.g., "Production Database Credentials")

  4. Add variables

  5. Projects can inherit these groups

Benefits:

  • Central credential management

  • One update affects all projects

  • Easier rotation

  • Consistent naming


Common Patterns

Daily Snapshot for Real-Time Features

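A sketch, assuming a hypothetical user_events table and a dataset named daily_events; one immutable version is created per day:

```python
import json
import os
from datetime import date

import pandas as pd
import psycopg2

conn = psycopg2.connect(  # credentials from environment variables, as above
    host=os.environ["DB_HOST"], dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
)
df = pd.read_sql("SELECT * FROM user_events WHERE event_date = CURRENT_DATE", conn)
conn.close()

today = date.today().isoformat()
df.to_csv("/valohai/outputs/events.csv", index=False)
with open("/valohai/outputs/events.csv.metadata.json", "w") as f:
    json.dump({"valohai.dataset-versions": [f"dataset://daily_events/{today}"]}, f)
```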

Incremental Snapshots

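A sketch that fetches only rows added since the previous snapshot, which arrives as a Valohai input (input and column names are illustrative):

```python
import glob
import json
import os

import pandas as pd
import psycopg2

# Previous snapshot is provided as a Valohai input
previous = pd.read_csv(glob.glob("/valohai/inputs/previous-snapshot/*.csv")[0])
cutoff = previous["signup_date"].max()

conn = psycopg2.connect(
    host=os.environ["DB_HOST"], dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
)
# Fetch only the delta since the last snapshot
new_rows = pd.read_sql(
    "SELECT * FROM customer_data WHERE signup_date > %s", conn, params=(cutoff,)
)
conn.close()

# New immutable version = previous snapshot + delta; the old version stays untouched
combined = pd.concat([previous, new_rows], ignore_index=True)
combined.to_csv("/valohai/outputs/customer_data.csv", index=False)
with open("/valohai/outputs/customer_data.csv.metadata.json", "w") as f:
    json.dump({"valohai.dataset-versions": ["dataset://customer_data/2024-q2"]}, f)
```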

Train/Validation/Test Splits from Database

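A sketch that splits a versioned snapshot deterministically by hashing a stable key, so rows never migrate between splits across reruns (customer_id and the input name are illustrative):

```python
import glob

import pandas as pd

# Read the versioned snapshot provided as a Valohai input
df = pd.read_csv(glob.glob("/valohai/inputs/snapshot/*.csv")[0])

# Stable 80/10/10 split: hash a key column into 10 buckets
bucket = pd.util.hash_pandas_object(df["customer_id"], index=False) % 10
df[bucket < 8].to_csv("/valohai/outputs/train.csv", index=False)
df[bucket == 8].to_csv("/valohai/outputs/validation.csv", index=False)
df[bucket == 9].to_csv("/valohai/outputs/test.csv", index=False)
```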

Database-Specific Guides

Each database has unique setup requirements and best practices:

Amazon Redshift

  • IAM Role authentication (recommended)

  • Username/password authentication

  • Security group configuration

  • Python and psql examples

Google BigQuery

  • Service Account authentication (recommended)

  • Cross-project access

  • Python client setup

  • Query optimization

Snowflake

  • Username/password authentication

  • Time Travel feature integration

  • Python connector setup

  • Dependency management



Next Steps

  • Choose your database from the guides above

  • Set up authentication (machine identity or credentials)

  • Create your first query → snapshot → train pipeline

  • Schedule regular snapshots for fresh data

  • Build reproducible ML pipelines with versioned data
