# Databases

Connect to production databases from your ML pipelines to query fresh data, but snapshot results to maintain reproducibility as data evolves.

***

### The Challenge: Databases Change

Production databases are constantly updated with new data. This creates a reproducibility problem for ML:

**Today's query:**

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 50,000 rows
```

**Same query next month:**

```sql
SELECT * FROM customer_data WHERE signup_date >= '2024-01-01'
-- Returns: 65,000 rows (15,000 new customers!)
```

**The problem:**

* Rerunning your training pipeline produces different results
* Can't reproduce experiments from last week
* Hard to debug: "Did the model improve, or did the data change?"
* Compliance issues: "Which data was used to train the production model?"

***

### The Solution: Query + Snapshot Pattern

Valohai solves this with a simple workflow:

1. **Query** — Fetch data from your database
2. **Snapshot** — Save results to `/valohai/outputs/`
3. **Version** — Create dataset version with snapshot
4. **Train** — Use versioned dataset as input

```
[Database] → [Query Execution] → [Snapshot in S3/GCS/Azure]
                ↓
         [Dataset Version]
                ↓
         [Training Execution]
```

**Benefits:**

* Reproducible: Same dataset version = same results forever
* Traceable: Know exactly which data trained each model
* Fast: Reuse snapshots instead of requerying database
* Compliant: Audit trail of data used in production models

***

### Authentication: Machine Identity vs. Credentials

Choose your authentication method based on security requirements and available infrastructure.

#### Decision Tree

```
Do you run on AWS and use Redshift?
├─ Yes: Use IAM Role (machine identity) **Recommended**
└─ No: ↓

Do you run on GCP and use BigQuery?
├─ Yes: Use Service Account (machine identity) **Recommended**
└─ No: ↓

Use stored credentials (username/password)
```

***

#### Machine Identity (Recommended)

**What it is:** Your execution environment's cloud identity authenticates automatically.

**Pros:**

* No credentials to manage
* More secure (no secrets in code or environment variables)
* Automatic rotation via cloud IAM
* Fine-grained permissions per environment

**Cons:**

* Cloud-specific (AWS IAM, GCP Service Accounts)
* Requires infrastructure setup

**When to use:** Production pipelines on AWS (Redshift) or GCP (BigQuery)

**Supported:**

* AWS Redshift → IAM Roles
* GCP BigQuery → Service Accounts

***

#### Stored Credentials

**What it is:** Store database username/password as Valohai environment variables.

**Pros:**

* Works with any database
* Simple setup
* Cloud-agnostic

**Cons:**

* Credentials to manage and rotate
* Stored as environment variables (marked secret)

**When to use:**

* Databases without machine identity support
* Quick prototypes
* On-premises databases
* Cross-cloud scenarios

**Supported:** All databases (Redshift, BigQuery, Snowflake, PostgreSQL, MySQL, etc.)

***

### Complete Workflow Example

Here's the full pattern: query database → save snapshot → create dataset → train model.

#### Step 1: Query Database and Save Snapshot

```python
import pandas as pd
import your_database_connector  # psycopg2, bigquery, snowflake, etc.

# Connect to database (see specific database guides for auth)
conn = connect_to_database()

# Query data
query = """
    SELECT
        customer_id,
        signup_date,
        total_purchases,
        churn_label
    FROM customer_data
    WHERE signup_date >= '2024-01-01'
"""

df = pd.read_sql(query, conn)
conn.close()

print(f"Fetched {len(df)} rows from database")

# Snapshot: Save to Valohai outputs
output_path = "/valohai/outputs/customer_data_snapshot.csv"
df.to_csv(output_path, index=False)

print(f"Saved snapshot to {output_path}")
```

***

#### Step 2: Create Dataset Version with Snapshot

```python
import json

# Add snapshot to dataset version
metadata = {
    "customer_data_snapshot.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://customer-data/2024-q1",
            },
        ],
        "valohai.tags": ["snapshot", "2024-q1", "production"],
        "query_date": "2024-01-15",
        "row_count": len(df),
    },
}

# Save metadata
metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as f:
    for filename, file_metadata in metadata.items():
        json.dump({"file": filename, "metadata": file_metadata}, f)
        f.write("\n")

print("Created dataset version: dataset://customer-data/2024-q1")
```

***

#### Step 3: Train Using Versioned Dataset

**valohai.yaml:**

```yaml
- step:
    name: query-database
    image: python:3.9
    command:
      - pip install pandas psycopg2-binary  # or your database connector
      - python fetch_data.py
    environment-variables:
      - name: DB_HOST
        default: my-database.example.com
      - name: DB_USER
        default: ml_pipeline

- step:
    name: train-model
    image: python:3.9
    command:
      - pip install pandas scikit-learn
      - python train.py
    inputs:
      - name: training-data
        default: dataset://customer-data/2024-q1  # Uses snapshot, not live database
```

**train.py:**

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data from snapshot (not database!)
data_path = "/valohai/inputs/training-data/customer_data_snapshot.csv"
df = pd.read_csv(data_path)

print(f"Training on {len(df)} rows from snapshot")

# Train model
X = df[["total_purchases", "days_since_signup"]]
y = df["churn_label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy:.4f}")

# Save model
import pickle

with open("/valohai/outputs/model.pkl", "wb") as f:
    pickle.dump(model, f)
```

***

#### Step 4: Update Snapshot (Monthly/Weekly)

Create new dataset versions as data evolves:

```python
# Next month: Create new snapshot
metadata = {
    "customer_data_snapshot.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://customer-data/2024-q2",  # New version
            },
        ],
        "valohai.tags": ["snapshot", "2024-q2", "production"],
        "query_date": "2024-04-15",
        "row_count": len(df),
    },
}
```

**Result:**

* `2024-q1` version: 50,000 rows (immutable)
* `2024-q2` version: 65,000 rows (immutable)
* Can train on either version reproducibly
* Can compare model performance across data versions

***

### Why Snapshots Matter: Real Example

**Without snapshots:**

```
Week 1: Train model → 85% accuracy
Week 2: Retrain "same" model → 87% accuracy
Question: Did the model improve, or did the data change?
```

**With snapshots:**

```
Week 1: Train on dataset://data/week1 → 85% accuracy
Week 2:
  - Retrain on dataset://data/week1 → 85% accuracy (reproducible)
  - Train on dataset://data/week2 → 87% accuracy (data improved model)
Answer: The data helped! We can prove it. 🎯
```

***

### Best Practices

#### Snapshot Every Query

```python
# Good: Always save snapshots
df = pd.read_sql(query, conn)
df.to_csv("/valohai/outputs/data_snapshot.csv")

# Bad: Query database directly in training
# (Can't reproduce if data changes)
df = pd.read_sql(query, conn)
model.fit(df[features], df[target])
```

***

#### Version Your Snapshots

```python
# Good: Clear versioning by time period
"uri": "dataset://sales-data/2024-q1"
"uri": "dataset://sales-data/2024-q2"

# Avoid: Arbitrary version names
"uri": "dataset://sales-data/working"
"uri": "dataset://sales-data/last"
"uri": "dataset://sales-data/last_1"
```

***

#### Tag Snapshots with Metadata

```python
metadata = {
    "snapshot.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://data/2024-q1",
            },
        ],
        "valohai.tags": ["production", "validated", "2024-q1"],
        "query_date": "2024-01-15T10:30:00Z",
        "query": "SELECT * FROM customers WHERE ...",
        "row_count": 50000,
        "date_range": "2024-01-01 to 2024-03-31",
        "source_table": "customers",
        "source_database": "production_db",
    },
}
```

This metadata helps with:

* Debugging ("Which data was this?")
* Compliance ("Prove what data trained this model")
* Optimization ("Do we need to requery or can we reuse?")

***

#### Schedule Regular Snapshots

Create a pipeline that runs automatically:

```yaml
- step:
    name: weekly-snapshot
    image: python:3.9
    command:
      - pip install pandas psycopg2-binary
      - python create_weekly_snapshot.py
    environment-variables:
      - name: SNAPSHOT_DATE
        default: "2024-01-15"  # Parameterize for automation
```

Run weekly/monthly via Valohai scheduling or external orchestration (Airflow, GitHub Actions, etc.)

***

#### Separate Query from Training

```yaml
# Good: Two-step pipeline
- step:
    name: snapshot-data
    image: python:3.9
    command: python fetch_and_save.py

- step:
    name: train
    image: python:3.9
    command: python train.py
    inputs:
      - name: data
        default: dataset://data/latest

# Avoid: Query inside training step
# (Harder to reproduce, slower iteration)
```

Benefits:

* Reuse snapshots across experiments
* Iterate on models without requerying database
* Clear separation of concerns

***

### Environment Variables for Credentials

All database pages use environment variables for storing credentials. Here's the standard pattern:

#### Project-Level Variables

1. Open your project in Valohai
2. Go to **Settings → Env Variables**
3. Add database credentials
4. Mark sensitive values (passwords, tokens) as **Secret**

**Example for any database:**

| Name          | Value                     | Secret  |
| ------------- | ------------------------- | ------- |
| `DB_HOST`     | `my-database.example.com` | No      |
| `DB_NAME`     | `production_db`           | No      |
| `DB_USER`     | `ml_pipeline_user`        | No      |
| `DB_PASSWORD` | `your-secure-password`    | **Yes** |
| `DB_PORT`     | `5432`                    | No      |

***

#### Organization-Level Variable Groups

**For shared credentials across projects:**

1. Organization admin goes to **Organization Settings → Environment Variables**
2. **Add new environment variable group**
3. Create group (e.g., "Production Database Credentials")
4. Add variables
5. Projects can inherit these groups

**Benefits:**

* Central credential management
* One update affects all projects
* Easier rotation
* Consistent naming

***

### Common Patterns

#### Daily Snapshot for Real-Time Features

```python
# Daily snapshot with date in filename
from datetime import datetime

today = datetime.now().strftime("%Y-%m-%d")
snapshot_path = f"/valohai/outputs/daily_snapshot_{today}.csv"

df.to_csv(snapshot_path)

metadata = {
    f"daily_snapshot_{today}.csv": {
        "valohai.dataset-versions": [
            {
                "uri": f"dataset://daily-features/{today}",
            },
        ],
    },
}
```

***

#### Incremental Snapshots

```python
# Only fetch new data since last snapshot
query = """
    SELECT * FROM events
    WHERE event_date > '{last_snapshot_date}'
"""

# Append to existing dataset version
metadata = {
    "incremental_data.csv": {
        "valohai.dataset-versions": [
            {
                "uri": "dataset://events/2024-q1",
                "from": "dataset://events/2024-q1-partial",
            },
        ],
    },
}
```

***

#### Train/Validation/Test Splits from Database

```python
# Create three snapshots from one query
df = pd.read_sql(query, conn)

train, temp = train_test_split(df, test_size=0.3)
val, test = train_test_split(temp, test_size=0.5)

train.to_csv("/valohai/outputs/train.csv")
val.to_csv("/valohai/outputs/validation.csv")
test.to_csv("/valohai/outputs/test.csv")

# Create three dataset versions
metadata = {
    "train.csv": {
        "valohai.dataset-versions": [{"uri": "dataset://data/train-2024-q1"}],
    },
    "validation.csv": {
        "valohai.dataset-versions": [{"uri": "dataset://data/val-2024-q1"}],
    },
    "test.csv": {
        "valohai.dataset-versions": [{"uri": "dataset://data/test-2024-q1"}],
    },
}
```

***

### Database-Specific Guides

Each database has unique setup requirements and best practices:

#### [AWS Redshift](/data/data-databases/aws-redshift.md)

* IAM Role authentication (recommended)
* Username/password authentication
* Security group configuration
* Python and psql examples

#### [Google BigQuery](/data/data-databases/bigquery.md)

* Service Account authentication (recommended)
* Cross-project access
* Python client setup
* Query optimization

#### [Snowflake](/data/data-databases/snowflake.md)

* Username/password authentication
* Time Travel feature integration
* Python connector setup
* Dependency management

***

### Related Pages

* [AWS Redshift](/data/data-databases/aws-redshift.md) — Connect to Amazon Redshift
* [BigQuery](/data/data-databases/bigquery.md) — Connect to Google BigQuery
* [Snowflake](/data/data-databases/snowflake.md) — Connect to Snowflake
* [Create and Manage Datasets](/data/datasets/creating-datasets.md) — Dataset versioning details
* [Ingest](/data/data-versioning/load-files-in-jobs.md) & [Save Files](/data/data-versioning/save-files-from-jobs.md) — Save outputs as datasets

***

### Next Steps

* Choose your database from the guides above
* Set up authentication (machine identity or credentials)
* Create your first **query** → **snapshot** → **train pipeline**
* Schedule regular snapshots for fresh data
* Build reproducible ML pipelines with versioned data


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.valohai.com/data/data-databases.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
