Snowflake

Query Snowflake from Valohai executions and save snapshots for reproducible ML pipelines.


Overview

Snowflake is a cloud data warehouse that you can query directly from Valohai executions. This guide shows you how to:

  1. Store Snowflake credentials securely

  2. Query Snowflake from your code

  3. Save snapshots for reproducibility

  4. Use Snowflake Time Travel features


Prerequisites

Before you begin:

  • Existing Snowflake account with a database containing your data

  • Snowflake credentials (username, password, account identifier)

  • Database access for the user account


Store Credentials in Valohai

Authenticate to Snowflake using username, password, and account identifier stored as environment variables.

Step 1: Find Your Snowflake Account Identifier

Your account identifier format depends on your Snowflake deployment:

Format examples:

  • xy12345.us-east-1 (AWS)

  • xy12345.us-central1.gcp (GCP)

  • xy12345.east-us-2.azure (Azure)

Find it in your Snowflake URL: https://<account_identifier>.snowflakecomputing.com


Step 2: Add Environment Variables

  1. Open your project in Valohai

  2. Go to Settings → Env Variables

  3. Add the following variables:

| Name | Value | Secret |
| --- | --- | --- |
| SNOWFLAKE_USER | Your Snowflake username | No |
| SNOWFLAKE_PASSWORD | Your Snowflake password | Yes |
| SNOWFLAKE_ACCOUNT | Account identifier (e.g., xy12345.us-east-1) | No |
| SNOWFLAKE_WAREHOUSE | Warehouse name (e.g., COMPUTE_WH) | No |
| SNOWFLAKE_DATABASE | Database name | No |
| SNOWFLAKE_SCHEMA | Schema name (e.g., PUBLIC) | No |

💡 Environment Variable Groups: Organization admins can create shared credential groups under Organization Settings → Environment Variable Groups instead of configuring each project separately.


Install Snowflake Connector

The Snowflake Python connector requires Python 3.8+. Install the connector and its dependencies in your execution.

Option 1: Install at Runtime

valohai.yaml:
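A minimal step sketch that installs the connector before running your script (the step name, image, and script path are placeholders — adjust them to your project):

```yaml
- step:
    name: query-snowflake
    image: python:3.9
    command:
      # Install the connector at the start of the execution
      - pip install snowflake-connector-python
      - python ./query_snowflake.py
```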


Option 2: Include in Docker Image

Dockerfile:
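Alternatively, bake the connector into a custom image so every execution starts ready to connect (base image is a placeholder):

```dockerfile
FROM python:3.9-slim

# Pre-install the Snowflake connector so executions skip the install step
RUN pip install --no-cache-dir snowflake-connector-python
```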


Query Snowflake

Basic Query Example

query_snowflake.py:
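A minimal sketch, assuming the environment variables configured above are set; the table name in the example query is a placeholder:

```python
import os


def connection_params():
    """Collect Snowflake connection settings from environment variables."""
    return {
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
        "database": os.environ["SNOWFLAKE_DATABASE"],
        "schema": os.environ["SNOWFLAKE_SCHEMA"],
    }


def run_query(sql):
    """Open a connection, run one query, and return all rows."""
    import snowflake.connector  # deferred so the module imports without the package

    conn = snowflake.connector.connect(**connection_params())
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()


# Example usage inside a Valohai execution:
#   rows = run_query("SELECT * FROM my_table LIMIT 10")  # placeholder table
#   for row in rows:
#       print(row)
```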


Complete Workflow: Query → Snapshot → Train

Step 1: Query and Save Snapshot

fetch_data.py:
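A sketch of the query-and-snapshot step. Files written to /valohai/outputs/ are uploaded when the execution finishes; the query and output filename here are placeholders, and the VH_OUTPUTS_DIR override is an assumption for local testing:

```python
import csv
import os

# /valohai/outputs is where Valohai collects output files after an execution
OUTPUT_DIR = os.getenv("VH_OUTPUTS_DIR", "/valohai/outputs")


def save_snapshot(rows, header, name="snapshot.csv", out_dir=OUTPUT_DIR):
    """Write query results to the outputs directory as CSV and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def fetch(sql):
    """Run the query against Snowflake and return (header, rows)."""
    import snowflake.connector  # deferred import

    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        schema=os.environ["SNOWFLAKE_SCHEMA"],
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        header = [col[0] for col in cur.description]
        return header, cur.fetchall()
    finally:
        conn.close()


# Example usage inside a Valohai execution:
#   header, rows = fetch("SELECT * FROM training_data")  # placeholder query
#   save_snapshot(rows, header)
```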


Step 2: Train on Snapshot

train.py:
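A sketch of the training step reading the snapshot from its input directory rather than querying Snowflake. The input name "dataset" is a placeholder that must match your valohai.yaml, and the VH_INPUTS_DIR override is an assumption for local testing:

```python
import csv
import os

# Valohai materializes inputs under /valohai/inputs/<input-name>/
INPUTS_DIR = os.getenv("VH_INPUTS_DIR", "/valohai/inputs")


def resolve_snapshot(input_name="dataset", inputs_dir=INPUTS_DIR):
    """Return the path of the first file in the given input directory."""
    input_dir = os.path.join(inputs_dir, input_name)
    files = sorted(os.listdir(input_dir))
    if not files:
        raise FileNotFoundError(f"no files in {input_dir}")
    return os.path.join(input_dir, files[0])


def load_rows(path):
    """Read the CSV snapshot into a header and a list of rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)


# Example usage inside a Valohai execution:
#   header, rows = load_rows(resolve_snapshot())
#   ... train a model on rows; never query Snowflake here ...
```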


Step 3: Pipeline Configuration

valohai.yaml:
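A pipeline sketch wiring the two steps together; the step names, output filename, and input name are assumptions matching the scripts above and must exist as steps in the same valohai.yaml:

```yaml
- pipeline:
    name: snowflake-training
    nodes:
      - name: fetch
        type: execution
        step: fetch-data       # runs fetch_data.py
      - name: train
        type: execution
        step: train-model      # runs train.py
    edges:
      # Feed the snapshot produced by fetch into train's "dataset" input
      - [fetch.output.snapshot.csv, train.input.dataset]
```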


Maintaining Reproducibility

⚠️ Critical: Snowflake data changes continuously. The same query can return different results tomorrow than it does today.

The problem: if your training step queries Snowflake directly, re-running the pipeline later can pull different data, so the run cannot be reproduced.

The solution: run the query once in a dedicated execution, save the results to /valohai/outputs/ as a snapshot, and train from that versioned file.

Best practices:

  1. Query once — Run query in dedicated execution

  2. Snapshot immediately — Save to /valohai/outputs/

  3. Version snapshots — Create dataset versions

  4. Train on snapshots — Use dataset as input, never query directly in training

  5. Use Time Travel for debugging — But snapshot for reproducibility
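For debugging, Snowflake Time Travel lets you query a table as it existed at an earlier point, within your account's data retention period. The table name and timestamp below are placeholders:

```sql
-- Query the table as it existed at a specific moment
SELECT * FROM training_data
  AT (TIMESTAMP => '2024-01-15 12:00:00'::TIMESTAMP_TZ);

-- Or relative to now, e.g. 24 hours ago
SELECT * FROM training_data
  AT (OFFSET => -60*60*24);
```

Time Travel helps you inspect what the data looked like at training time, but it expires with the retention period — only a saved snapshot guarantees long-term reproducibility.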

See: Databases for complete reproducibility patterns.


Common Issues & Fixes

Connection Failed

Symptom: snowflake.connector.errors.DatabaseError: 250001

Causes & Fixes:

  • Wrong account identifier → Verify format (include region and cloud)

  • Wrong username/password → Check credentials in Snowflake UI

  • Network connectivity → Check firewall/VPN settings


Warehouse Not Running

Symptom: SQL compilation error: Object does not exist

Causes & Fixes:

  • Warehouse suspended → Resume warehouse: ALTER WAREHOUSE COMPUTE_WH RESUME

  • Wrong warehouse name → Verify warehouse exists: SHOW WAREHOUSES


Insufficient Privileges

Symptom: SQL access control error: Insufficient privileges

Causes & Fixes:

  • User missing permissions → Grant access to the role, e.g. GRANT USAGE ON DATABASE analytics TO ROLE ml_role and GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public TO ROLE ml_role (SELECT is granted on tables/schemas, not directly on a database)

  • Wrong role active → Use correct role: cursor.execute("USE ROLE ml_role")


Python Version Incompatibility

Symptom: Import errors or dependency conflicts

Causes & Fixes:

  • Wrong requirements file → Use correct version for your Python (e.g., requirements_39.reqs for Python 3.9)

  • Missing dependencies → Install tested requirements before connector



Next Steps

  • Store Snowflake credentials in Valohai

  • Create a test query execution

  • Save your first snapshot as a dataset version
