BigQuery Connector
Run SQL queries on Google Cloud BigQuery and save results to your data store.
Why use this connector?
Query directly from Valohai: No need to export data manually. Write SQL, run execution, get CSV output.
Version your queries: Every query is saved with the execution. Reproduce results months later by checking which query ran when.
Feed downstream jobs: Query outputs get datum URLs. Use them as inputs in other executions or pipelines.
Requirements
A GCP project with BigQuery datasets containing the data you want to query.
GCP Service Account with BigQuery Data Viewer and BigQuery User roles.
Authentication via keyfile or machine identity.
Authentication options
Option 1: Machine identity (recommended)
If your Valohai workers run on GCP with a service account attached:
Attach the service account to your workers (contact Valohai support if needed)
Set environment variables:
GCP_PROJECT: Your GCP project ID
GCP_IAM: Set to 1
Option 2: Service account keyfile
If not using machine identity:
Create a service account with required roles
Download the JSON keyfile
Set environment variables:
GCP_PROJECT: Your GCP project ID
GCP_IAM: Set to 0
GCP_KEYFILE_CONTENTS_JSON: Paste the entire JSON keyfile (mark as secret)
Add environment variables
Environment variables can be added:
Project-wide: Project Settings → Environment Variables
Organization-wide: Admin users can create environment variable groups that can be shared across multiple projects.
Per-execution: Set when creating the execution
We recommend project or organization settings for credentials.
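If you copy or adapt the connector step into your own valohai.yaml, non-secret values can also be declared under the step's environment-variables key. A minimal sketch with placeholder values; keep GCP_KEYFILE_CONTENTS_JSON as a secret in the project settings rather than in YAML:

environment-variables:
  - name: GCP_PROJECT
    default: my-gcp-project   # placeholder; replace with your GCP project ID
  - name: GCP_IAM
    default: "1"              # 1 = machine identity, 0 = keyfile authentication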
Run a query
Open your project
Click Create Execution
Expand valohai-ecosystem → Select bigquery-query
Configure parameters (see the sketch after these steps):
query: Your SQL query
output-path: (Optional) Output filename, default is results.csv
datum-alias: (Optional) Alias for easy reference, e.g., latest-sales-data
Verify environment variables are set
Click Create Execution
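For reference, if you adapt the connector step into your own valohai.yaml, its parameters could be declared roughly as in the sketch below. This is illustrative only; the parameter types and the optional flag are assumptions, and the actual definition lives in the valohai-ecosystem repository:

parameters:
  - name: query
    type: string              # the SQL to run against BigQuery
  - name: output-path
    type: string
    default: results.csv      # filename of the uploaded CSV
  - name: datum-alias
    type: string
    optional: true            # e.g., latest-sales-data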
Example query
SELECT
user_id,
DATE(event_timestamp) as date,
COUNT(*) as events,
SUM(revenue) as total_revenue
FROM `my-project.analytics.events`
WHERE DATE(event_timestamp) >= '2025-01-01'
GROUP BY user_id, date
ORDER BY total_revenue DESC
LIMIT 1000
Results are saved as results.csv (or your custom output path) and uploaded to your data store.
Use query results
The output of the execution gets a datum URL. Reference it in other executions either directly by its URL or through the datum alias, as shown in the example below:
- step:
    name: train-model
    image: python:3.11
    command:
      - python train.py
    inputs:
      - name: training-data
        default: datum://latest-sales-data
Or use it in a pipeline by passing the execution output to the next node.
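For example, a minimal pipeline sketch, assuming the connector step is available as bigquery-query alongside the train-model step above; node and input names are illustrative, and the output filename in the edge must match the connector's output-path:

- pipeline:
    name: query-and-train
    nodes:
      - name: query
        type: execution
        step: bigquery-query
      - name: train
        type: execution
        step: train-model
    edges:
      # route the query's CSV output into the training node's training-data input
      - [query.output.results.csv, train.input.training-data]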
Troubleshooting
Authentication fails
Check:
GCP_PROJECT matches your BigQuery project ID
Service account has BigQuery Data Viewer and BigQuery User roles
If using keyfile: GCP_KEYFILE_CONTENTS_JSON contains the full JSON (not just the path)
If using machine identity: Worker service account is properly attached
Query returns no results
BigQuery queries run successfully even if they return zero rows. Check your WHERE clauses and table names.
Timeout on large queries
BigQuery has query execution limits. For very large datasets, consider:
Adding LIMIT clauses for testing
Breaking queries into smaller chunks
Using BigQuery's table partitioning
Next steps
Other database connectors:
Build your own: