BigQuery

Query Google BigQuery from Valohai executions and save snapshots for reproducible ML pipelines.


Overview

Google BigQuery is a serverless data warehouse that you can query directly from Valohai executions. This guide shows you how to:

  1. Set up authentication (service account or credentials)

  2. Query BigQuery from your code

  3. Save snapshots for reproducibility


Prerequisites

Before you begin:

  • Existing BigQuery dataset with data to query

  • GCP project with BigQuery API enabled

  • Service account or credentials for authentication


Authentication: Choose Your Method

Valohai supports two approaches. Prefer Option 1 (service account) when your executions run on GCP compute and you want to avoid storing credentials; use Option 2 (a JSON key stored as a secret environment variable) when an attached service account is not available.


Option 1: Service Account Authentication

Use your GCP compute instances' service account identity to authenticate without storing credentials.

How It Works

  1. Valohai VMs run with an attached service account

  2. Grant this service account BigQuery permissions

  3. Executions automatically authenticate using the machine's identity

  4. No credentials stored in Valohai

Benefits

  • No credentials to manage or rotate

  • More secure (no JSON keys in environment variables)

  • Automatic authentication

  • Fine-grained permissions per Valohai environment


Step 1: Create Service Account

  1. Go to your GCP project hosting Valohai resources

  2. Navigate to IAM & Admin → Service Accounts

  3. Click "Create Service Account"

  4. Name it (e.g., valohai-bigquery-access)

  5. Click "Create and Continue"


Step 2: Grant BigQuery Permissions

Grant the service account access to BigQuery:

  1. In the service account creation wizard, click Grant this service account access to project

  2. Add role: BigQuery User

  3. Add role: BigQuery Data Viewer (if you only need read access)

  4. Click Done

For write access: Add BigQuery Data Editor role


Step 3: Attach to Valohai Environment

  1. Copy the service account email (e.g., valohai-bigquery-access@your-project.iam.gserviceaccount.com)

  2. Share this with Valohai support to attach it to your environments

  3. Different Valohai environments can use different service accounts (dev/staging/production)

💡 Multiple environments: Use separate service accounts with different permissions for each Valohai environment to control access granularly.


Step 4: Query Using Service Account

Python example with automatic service account authentication:
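A minimal sketch, assuming Application Default Credentials; the project ID and table name are placeholders:

    from google.cloud import bigquery

    # On a Valohai GCP machine, Application Default Credentials resolve to the
    # attached service account, so no key file or secret is needed.
    client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

    query = "SELECT * FROM `your-gcp-project.analytics.events` LIMIT 1000"
    df = client.query(query).to_dataframe()
    print(df.head())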

Install dependencies:
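    pip install google-cloud-bigquery db-dtypes pandas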

Or in Dockerfile:
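    RUN pip install google-cloud-bigquery db-dtypes pandas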

💡 Note: The db-dtypes package improves BigQuery data type handling in pandas.


Accessing BigQuery from Another GCP Project

If your BigQuery data is in a different GCP project than your Valohai resources:

  1. In the BigQuery data project:

    • Grant your Valohai service account BigQuery User permissions

    • Grant BigQuery Data Viewer (or Editor) on specific datasets

  2. In the Valohai infrastructure project:

    • The service account doesn't need BigQuery permissions here

    • Only needs permissions in the project with actual data

Query cross-project data:
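A sketch with placeholder project and table names; the client runs as the Valohai service account while the fully qualified table name targets the data project:

    from google.cloud import bigquery

    # Query jobs run in the project passed to the client...
    client = bigquery.Client(project="valohai-infra-project")  # placeholder
    # ...but the table reference points at the project holding the data.
    query = "SELECT * FROM `data-project.sales.orders` LIMIT 1000"
    df = client.query(query).to_dataframe()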


Option 2: Credentials Authentication

Store BigQuery service account JSON key as a Valohai environment variable.

Step 1: Create Service Account Key

  1. In GCP Console, go to IAM & Admin → Service Accounts

  2. Find or create a service account with BigQuery permissions

  3. Click on the service account

  4. Go to Keys tab

  5. Click Add Key → Create new key

  6. Choose JSON format

  7. Download the JSON key file


Step 2: Store Key in Valohai

  1. Open your project in Valohai

  2. Go to Settings → Env Variables

  3. Add variable:

Name                                  Value                            Secret
GOOGLE_APPLICATION_CREDENTIALS_JSON   {entire JSON key file content}   Yes

Alternatively, for the key file path approach, write the key to a file and point the standard GOOGLE_APPLICATION_CREDENTIALS variable at its path. Either way, also store your project ID as a plain variable:

Name             Value            Secret
GCP_PROJECT_ID   my-gcp-project   No

💡 Environment Variable Groups: Organization admins can create shared credential groups under Organization Settings.


Step 3: Query Using Credentials

Python example with JSON credentials:
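A minimal sketch that builds credentials from the GOOGLE_APPLICATION_CREDENTIALS_JSON variable defined above; the table name is a placeholder:

    import json
    import os

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Parse the JSON key stored as a secret Valohai environment variable
    info = json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS_JSON"])
    credentials = service_account.Credentials.from_service_account_info(info)

    client = bigquery.Client(credentials=credentials, project=os.environ["GCP_PROJECT_ID"])
    query = "SELECT * FROM `my-gcp-project.analytics.events` LIMIT 1000"
    df = client.query(query).to_dataframe()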


Complete Workflow: Query → Snapshot → Train

Step 1: Query and Save Snapshot

fetch_data.py:
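A sketch of the snapshot step; the table, columns, and dataset name are assumptions to adapt:

    import datetime
    import json

    from google.cloud import bigquery

    client = bigquery.Client()  # resolves to the attached service account on GCP

    query = """
        SELECT user_id, feature_a, feature_b, label
        FROM `your-project.analytics.training_data`
        WHERE event_date >= '2024-01-01'
    """
    df = client.query(query).to_dataframe()

    # Write the snapshot to /valohai/outputs/ so Valohai stores and versions it
    stamp = datetime.date.today().isoformat()
    path = f"/valohai/outputs/snapshot_{stamp}.csv"
    df.to_csv(path, index=False)

    # Optionally tag the file into a dataset version with a metadata sidecar
    # (check Valohai's dataset docs for the exact sidecar format)
    with open(f"{path}.metadata.json", "w") as f:
        json.dump({"valohai.dataset-versions": [f"dataset://bigquery-snapshots/{stamp}"]}, f)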


Step 2: Train on Snapshot

train.py:
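A sketch of a training step that reads the snapshot as a Valohai input; the input name dataset matches the YAML below:

    import glob

    import pandas as pd

    # Read the frozen snapshot from the execution's inputs; never query BigQuery here
    path = glob.glob("/valohai/inputs/dataset/*.csv")[0]
    df = pd.read_csv(path)

    X = df.drop(columns=["label"])
    y = df["label"]
    # ...fit your model on X and y...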


Step 3: Pipeline Configuration

valohai.yaml:
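A sketch wiring both steps into a pipeline; the step names, Docker image, and edge wildcard are assumptions to adapt:

    - step:
        name: fetch-data
        image: python:3.10
        command:
          - pip install google-cloud-bigquery db-dtypes pandas
          - python fetch_data.py

    - step:
        name: train
        image: python:3.10
        command:
          - pip install pandas
          - python train.py
        inputs:
          - name: dataset

    - pipeline:
        name: query-snapshot-train
        nodes:
          - name: fetch
            type: execution
            step: fetch-data
          - name: train
            type: execution
            step: train
        edges:
          - [fetch.output.*, train.input.dataset]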


Maintaining Reproducibility

⚠️ Critical: BigQuery data changes continuously. Query results today differ from results tomorrow.

The problem:
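For example (hypothetical table), training directly against live data is not reproducible:

    # Anti-pattern: every run of this training code sees different rows
    df = client.query("SELECT * FROM `proj.analytics.events`").to_dataframe()
    model.fit(df)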

The solution:
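Query once in a fetch step, freeze the result, and train only on the frozen file:

    # Fetch step: run the query once and snapshot the result
    df.to_csv("/valohai/outputs/snapshot.csv", index=False)

    # Train step: read the frozen snapshot from inputs instead of querying
    df = pd.read_csv("/valohai/inputs/dataset/snapshot.csv")
    model.fit(df)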

Best practices:

  1. Query once — Run query in dedicated execution

  2. Snapshot immediately — Save to /valohai/outputs/

  3. Version snapshots — Create dataset versions

  4. Train on snapshots — Use dataset as input, never query directly in training

  5. Schedule snapshots — Create fresh snapshots daily/weekly/monthly

See: Databases for complete reproducibility patterns.


Common Issues & Fixes

Authentication Failed

Symptom: google.auth.exceptions.DefaultCredentialsError

Causes & Fixes:

  • Service account not attached → Contact Valohai support to attach service account

  • Wrong project ID → Verify GCP project ID in client initialization

  • Missing credentials → Check GOOGLE_APPLICATION_CREDENTIALS or service account setup


Permission Denied on Query

Symptom: google.api_core.exceptions.Forbidden: 403 Access Denied

Causes & Fixes:

  • Service account missing BigQuery User role → Add role in IAM

  • Missing dataset permissions → Grant BigQuery Data Viewer on specific datasets

  • Cross-project access not configured → Grant permissions in data project


Out of Memory

Symptom: MemoryError when converting to DataFrame

Causes & Fixes:

  • Result set too large → Add LIMIT clause or process in chunks

  • Execution environment too small → Use larger machine type in Valohai
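If the result set genuinely cannot fit in memory, you can stream it in page-sized chunks instead of materializing one DataFrame; a sketch using the client library's to_dataframe_iterable (process is a hypothetical per-chunk handler):

    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query("SELECT * FROM `proj.dataset.big_table`").result(page_size=50_000)

    # Handle one page-sized DataFrame at a time
    for chunk in rows.to_dataframe_iterable():
        process(chunk)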



Next Steps

  • Set up service account authentication

  • Create a test query execution

  • Save your first snapshot as a dataset version

  • Build a reproducible training pipeline using snapshots

  • Optimize query costs with partitioning and filtering
