Dataset Version Trigger
Automatically launch pipelines when new dataset versions are created
Launch a pipeline automatically when a new dataset version is created. This guide shows you how to parse notification payloads, handle dataset inputs, and configure pipeline and trigger settings for event-driven data processing.
Use Case
Process newly uploaded datasets automatically, without manual intervention. When a user uploads data to Valohai and creates a new dataset version from it, your preprocessing pipeline launches immediately and processes every file in that version.
Perfect for:
Auto-validating uploaded data
Preprocessing raw data for training
Generating data quality reports
Extracting features from new samples
How It Works
Upload dataset: User creates a new dataset version
Notification fires: Valohai generates a "dataset version created" event
Trigger launches: Pipeline starts automatically
Parse payload: First node extracts dataset URI from notification
Process data: Second node receives URI and processes files
Step 1: Create the Scripts
Parse Notification Payload
This script extracts the dataset version URI from the notification and outputs it as metadata for the next pipeline node.
Create parse-notification.py:
"""Parse dataset version URL from notification payload.
The payload is a JSON file that Valohai sends to the step.
"""
import json
import valohai
# notification payload is provided in a Valohai input file
input_file = valohai.inputs("payload").path()
# get the json "payload" content from the input file
with open(input_file) as file:
    payload = json.load(file)
# retrieve the new dataset version URI from the payload
dataset_version_uri = payload["data"]["version"]["uri"]
# output the URI as execution metadata
# this will be available to the next step
print(json.dumps({"dataset": dataset_version_uri}))
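If you prefer, the same metadata can be emitted through valohai-utils' logging helper instead of a raw print. A minimal sketch, assuming the valohai.logger() helper that valohai-utils provides; it would replace the final print in parse-notification.py:
# equivalent alternative: emit the metadata via the valohai-utils logger
with valohai.logger() as logger:
    logger.log("dataset", dataset_version_uri)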
Data Processing Script
This example lists all files in the dataset. Replace it with your actual processing logic (an illustrative sketch of such logic follows the script below).
Create list-inputs.py:
"""List all inputs given to the current step."""
import valohai
valohai.prepare(step="list-inputs")
for file_path in valohai.inputs("dataset").paths():
    print(file_path)
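For example, the auto-validation use case mentioned above could replace the listing logic with something like the following. This is a hypothetical sketch, not part of the guide's pipeline: the validate-dataset step name, the CSV focus, and the pass/fail counting are illustrative assumptions you would swap for your own logic.
"""Hypothetical example: validate CSV files in a new dataset version."""
import csv
import json
from itertools import islice

import valohai

valohai.prepare(step="validate-dataset")

valid, invalid = 0, 0
for file_path in valohai.inputs("dataset").paths():
    if not str(file_path).endswith(".csv"):
        continue
    try:
        with open(file_path, newline="") as f:
            # parse the first few rows to confirm the file is readable CSV
            for _row in islice(csv.reader(f), 5):
                pass
        valid += 1
    except (csv.Error, UnicodeDecodeError):
        invalid += 1

# a single-line JSON print is collected by Valohai as execution metadata
print(json.dumps({"valid_files": valid, "invalid_files": invalid}))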
Dependencies
Add to requirements.txt:
valohai-utils
Step 2: Configure Pipeline in YAML
Create your valohai.yaml with two steps and a pipeline that connects them:
- step:
    name: parse-notification
    image: python:3.12
    command:
      - pip install -r requirements.txt
      - python ./parse-notification.py {parameters}
    inputs:
      - name: payload

- step:
    name: list-inputs
    image: python:3.12
    command:
      - pip install -r requirements.txt
      - python ./list-inputs.py {parameters}
    parameters:
      - name: dataset_url
        type: string
    inputs:
      - name: dataset
        default: "{parameter:dataset_url}"

- pipeline:
    name: Dataset handling automation
    nodes:
      - name: parse-notification
        step: parse-notification
        type: execution
      - name: list-inputs
        step: list-inputs
        type: execution
    edges:
      - [parse-notification.metadata.dataset, list-inputs.parameter.dataset_url]
How the Pipeline Works
Node 1: parse-notification
Receives webhook payload as input
Extracts dataset version URI
Outputs URI as metadata (dataset)
Edge: Connect metadata to parameter
Takes parse-notification.metadata.dataset
Passes it to list-inputs.parameter.dataset_url
Node 2: list-inputs
Receives dataset URI via parameter
Parameter becomes input via {parameter:dataset_url}
Processes all files in the dataset
Naming matters: Step names, parameter names, and input names must match exactly between your YAML and trigger configuration.
Step 3: Create the Trigger
Go to Project Settings → Triggers
Click Create Trigger
Configure the trigger:
Basic Settings:
Title: Dataset version -> new data handler pipeline
Trigger Type: Notification
Conditions (optional):
Payload Filter if you want to filter by specific dataset:
Lookup Path: data.dataset.name
Operation: Equals
Invariant: your-dataset-name
Actions:
Action Type: Run Pipeline
Source Commit Reference: main (or your branch)
Pipeline Name: Dataset handling automation
Pipeline Title: (optional title for pipeline runs)
Payload Input Name: parse-notification.payload
Click Save Trigger
A Managed Trigger Channel is automatically created for this trigger.
Step 4: Set Up Notification Routing
Connect the "dataset version created" event to your trigger:
Go to Project Settings → Notifications → Project Notifications
Click Create new notification routing
Configure routing:
Event: dataset version is created
Filter events by users: All users (or select specific users)
Channel: Select Launches trigger: Dataset version -> new data handler pipeline
Click Save
Now your automation is live!
Step 5: Test the Workflow
Test Parsing Node Manually
Before testing the full trigger, verify the parsing step works:
Create a test payload file test-payload.json:
{
  "type": "dataset_version_created",
  "data": {
    "version": {
      "uri": "datum://your-dataset/version-id"
    }
  }
}
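Optionally, you can sanity-check the parsing logic locally against this file before running anything on Valohai. The script below is an assumed local helper (not part of the pipeline) that performs the same lookup without valohai-utils:
"""Local sanity check for the payload parsing logic (hypothetical helper script)."""
import json

with open("test-payload.json") as file:
    payload = json.load(file)

# same lookup that parse-notification.py performs inside the pipeline
print(json.dumps({"dataset": payload["data"]["version"]["uri"]}))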
Create a new execution:
Select step: parse-notification
Add input: Upload your test payload file to the payload input
Run execution
Check logs: Dataset version URL should be printed
Check metadata: Should show extracted URI
Test Processing Node Manually
Verify the data processing step works:
Create a dataset and add files to it
Copy the dataset version datum URL (from dataset page)
Create execution:
Select step: list-inputs
Paste the dataset version URI into the Data → Inputs → dataset → URL field
Set the dataset_url parameter to an empty string ("")
Run execution
Check logs: Should list all files in dataset
Test Full Trigger
Once manual tests pass, trigger the real automation:
Go to your dataset
Create a new version (upload files or create empty version)
Check Executions page: Pipeline should launch automatically
Verify both nodes execute correctly
Check trigger logs if anything fails: Project Settings → Triggers → ... menu → View Logs
Troubleshooting
Trigger Not Launching
Check notification routing:
Verify event type is "dataset version is created"
Confirm channel is selected correctly
Check if user filter is too restrictive
Check trigger status:
Ensure trigger is enabled
Look at trigger logs for errors
Verify payload filters aren't blocking events
Pipeline Fails at Parsing Node
Common issues:
Payload input not configured correctly
Input name mismatch between trigger config and YAML
JSON parsing error in script
Debug steps:
Check execution logs for Python errors
Verify payload file exists and is valid JSON
Print payload content to see structure
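For that last debug step, a temporary one-liner added to parse-notification.py right after json.load makes the full payload structure visible in the execution logs:
# temporary debugging aid: dump the whole payload so its structure shows up in the logs
print(json.dumps(payload, indent=2))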
Pipeline Fails at Processing Node
Common issues:
Dataset URI not passed correctly through edge
Parameter name mismatch
Input configuration error
Debug steps:
Check if the parameter received the correct value (see the snippet after this list)
Verify edge configuration in pipeline YAML
Test processing node manually with known-good dataset URI
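To confirm whether the parameter received the correct value, a line like the following near the top of list-inputs.py prints the incoming value; it assumes the valohai-utils parameters helper:
# print the incoming parameter so it can be compared with the expected dataset version URI
print("dataset_url =", valohai.parameters("dataset_url").value)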
Trigger Fires Too Often
Solutions:
Add payload filter to only trigger for specific datasets
Add user filter to only trigger for certain team members
Check if multiple notification routings are configured
Next Steps
Auto-deploy models: Model Version Trigger Guide
Schedule training: Scheduled Triggers
Notify team: Outgoing Webhooks