Pipelines: Chain Your Jobs

Connect your existing jobs into automated workflows. Define how data flows between steps and let Valohai handle the orchestration.

💡 Already have working steps? You're ready to build pipelines. Just define how outputs connect to inputs.

How Pipelines Work

A pipeline is a recipe for connecting jobs:

  • Nodes = Your jobs (preprocessing, training, evaluation, etc.)

  • Edges = Data / information flow (e.g. which outputs become which inputs)

When you run a pipeline, Valohai automatically:

  • Executes jobs in the right order

  • Passes outputs between steps into defined inputs

  • Handles parallel execution where possible

  • Tracks the complete lineage

Quick Example

Connect three existing steps into a pipeline:

- pipeline:
    name: ml-workflow
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-data
      - name: train
        type: execution
        step: train-model
      - name: evaluate
        type: execution
        step: evaluate-model
    edges:
      # Connect outputs → inputs
      - [preprocess.output.*, train.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [preprocess.output.*test*, evaluate.input.test-data]

Run it:
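
With the Valohai CLI, assuming the YAML above is saved as valohai.yaml in your project directory:

# --adhoc packages the current working directory and runs the pipeline from it
vh pipeline run ml-workflow --adhoc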

💡 If you have pushed the valohai.yaml to Git and fetched the commit to your Valohai project, you can omit the --adhoc flag.

Complete Example

Let's build a real pipeline with three steps:

1. Define Your Steps (if not already done)
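
A sketch of the three step definitions in valohai.yaml; the Docker image, commands, and input names are placeholders for your own:

- step:
    name: preprocess-data
    image: python:3.10
    command: python preprocess.py
    inputs:
      - name: raw-data

- step:
    name: train-model
    image: python:3.10
    command: python train.py
    inputs:
      - name: dataset

- step:
    name: evaluate-model
    image: python:3.10
    command: python evaluate.py
    inputs:
      - name: model
      - name: test-data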

2. Connect as Pipeline
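
Append the pipeline block to the same valohai.yaml. This is the quick example from above; the edge targets match the input names declared on the steps:

- pipeline:
    name: ml-workflow
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-data
      - name: train
        type: execution
        step: train-model
      - name: evaluate
        type: execution
        step: evaluate-model
    edges:
      - [preprocess.output.*, train.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [preprocess.output.*test*, evaluate.input.test-data]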

Optional: Use the valohai-utils Python helper tool

Using valohai-utils, define pipelines in Python:
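
A sketch using the valohai-utils Pipeline API; node and input names mirror the YAML above (verify the exact method names against the valohai-utils documentation for your version):

# pipeline.py
from valohai import Pipeline


def main(config) -> Pipeline:
    # Create the pipeline; each execution() call maps a node to an existing step
    pipe = Pipeline(name="ml-workflow", config=config)
    preprocess = pipe.execution("preprocess-data")
    train = pipe.execution("train-model")
    evaluate = pipe.execution("evaluate-model")

    # Edges: connect outputs to inputs
    preprocess.output("*").to(train.input("dataset"))
    train.output("model*").to(evaluate.input("model"))
    preprocess.output("*test*").to(evaluate.input("test-data"))
    return pipe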

Generate YAML:
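
Assuming the definition above is saved as pipeline.py:

# writes/updates the pipeline section of valohai.yaml
vh yaml pipeline pipeline.py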

Edge Patterns

Basic Output → Input
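
One named output feeding one input (file and input names are illustrative):

edges:
  - [preprocess.output.clean.csv, train.input.dataset]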

Wildcard Matching
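
Wildcards select several files at once, using the source patterns listed in the Quick Reference below:

edges:
  - [preprocess.output.*, train.input.dataset]      # all outputs
  - [preprocess.output.*.csv, train.input.dataset]  # only CSV files
  - [train.output.model*, evaluate.input.model]     # names starting with "model"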

Pass parameters and metrics between nodes

In addition to connecting outputs to inputs, edges can also pass parameter and metadata (metric) values between nodes.
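
For example, a parameter can be forwarded as-is, and a logged metric can become a downstream parameter (parameter and metric names are illustrative):

edges:
  - [train.parameter.learning_rate, evaluate.parameter.learning_rate]
  - [train.metadata.accuracy, evaluate.parameter.baseline-accuracy]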

Multiple Targets
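
A single source fans out by listing one edge per target (node names are illustrative):

edges:
  - [preprocess.output.clean.csv, train-model-a.input.dataset]
  - [preprocess.output.clean.csv, train-model-b.input.dataset]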

Advanced Features

Conditional Execution

You can define specific conditions for pipeline nodes; see the sketch after the When/If/Then breakdown below.

When: Actions trigger when certain events occur during pipeline execution. The available options include:

  • node-starting: When a node is about to start.

  • node-complete: When a node successfully completes.

  • node-error: When a node encounters an error.

If Condition: The condition that triggers the action can be based on either a metric or a parameter value.

Then: Depending on the condition being met, you can take one of the following actions:

  • stop-pipeline: Halts the entire pipeline.

  • require-approval: Pauses the pipeline until a user manually approves the previous results.
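
As a sketch, a training node that pauses the pipeline for manual approval when its logged accuracy falls below a threshold (metric name and threshold are illustrative):

nodes:
  - name: train
    type: execution
    step: train-model
    actions:
      - when: node-complete
        if: metadata.accuracy < 0.9
        then: require-approval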

Parallel Execution

In the example below, the nodes train-model-a and train-model-b run in parallel. The ensemble node starts only once both of them have finished.
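
A sketch of that shape; the training nodes have no edges between them, so they start together, while ensemble waits on edges from both (step and input names are illustrative):

- pipeline:
    name: parallel-training
    nodes:
      - name: train-model-a
        type: execution
        step: train-model
      - name: train-model-b
        type: execution
        step: train-model
      - name: ensemble
        type: execution
        step: ensemble-models
    edges:
      - [train-model-a.output.model*, ensemble.input.model-a]
      - [train-model-b.output.model*, ensemble.input.model-b]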

It is also possible to run Task nodes inside pipelines:
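
Only the node type changes; Valohai runs the step as a Task (e.g. a parameter sweep) before downstream nodes continue (sketch, names illustrative):

nodes:
  - name: train-sweep
    type: task
    step: train-model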

Deployments

In addition to execution and Task nodes, it is possible to create deployments from pipelines.

It is also possible to add pipeline nodes after a deployment node. This can be used, for example, to check the endpoint once it has been created or to clean up old endpoints within the pipeline.
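
A sketch of a deployment node receiving a model file from a training node; the deploy.file.predict-digit.model target follows the edge syntax listed in the Quick Reference below, while the deployment and endpoint names are placeholders:

nodes:
  - name: train
    type: execution
    step: train-model
  - name: deploy
    type: deployment
    deployment: my-deployment
    endpoints:
      - predict-digit
edges:
  - [train.output.model*, deploy.file.predict-digit.model]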

Running Pipelines

From CLI
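
Using the pipeline name from the examples above:

# run from the commit fetched to your Valohai project
vh pipeline run ml-workflow

# or package the current directory as an ad-hoc commit first
vh pipeline run ml-workflow --adhoc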

Quick Reference

Minimal Pipeline
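
Two nodes, one edge; names are illustrative:

- pipeline:
    name: minimal
    nodes:
      - name: first
        type: execution
        step: first-step
      - name: second
        type: execution
        step: second-step
    edges:
      - [first.output.*, second.input.data]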

Edge Syntax

Sources:

  • node-name.output.* — All outputs

  • node-name.output.*.csv — Only CSV files

  • node-name.output.name* — Starts with "name"

  • node-name.metadata.accuracy — Metadata value

  • node-name.parameter.learning_rate — Parameter value

  • deploy.deployment.id / deploy.deployment.version_id — Deployment / deployment version id

Targets:

  • node-name.input.input-name — Any input available on the node

  • node-name.parameter.learning_rate — Parameter value

  • deploy.file.predict-digit.model — File for deployment nodes

Node Types

  • execution — Run a step

  • task — Run parameter sweep

  • deployment — Create endpoint


Bottom line: If your steps work individually, connecting them into a pipeline takes just a few lines of YAML.
