Pipelines

Valohai pipelines transform complex ML workflows into modular, reusable components. Instead of running monolithic scripts, break your work into steps that can be versioned, reused, and optimized independently.

Why use pipelines?

Automatic checkpointing between steps

Each pipeline step runs as a separate execution, creating natural checkpoints. When something fails, you don't lose hours of computation; you simply restart from the last successful step.

Efficient resource allocation

Different steps need different resources. Your data preprocessing might need high CPU and memory, while training needs GPUs. Pipelines let you specify exact requirements per step, freeing up expensive resources when they're not needed.
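
For instance, preprocessing might target a high-memory CPU machine while training targets a GPU machine. The sketch below only illustrates that per-step mapping; the step names and environment identifiers are hypothetical, and in practice the environment is part of each step's configuration in Valohai.

```python
# Illustration only: step names and environment identifiers are hypothetical.
# The point is that resources are chosen per step, not for the whole workflow.
PER_STEP_ENVIRONMENT = {
    "preprocess-dataset": "cpu-16-cores-64gb",  # heavy CPU and memory
    "train-model": "gpu-1x-a100",               # expensive GPU, only while training
    "evaluate-model": "cpu-4-cores-16gb",       # lightweight
}


def environment_for(step: str, default: str = "cpu-4-cores-16gb") -> str:
    """Return the machine type a given step should run on."""
    return PER_STEP_ENVIRONMENT.get(step, default)
```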

Reuse previous work

Made a code change to step 4 of a 6-step pipeline? Use the "reuse nodes" capability to skip steps 1-3 and start directly from your fix. No more waiting for preprocessing to run again.

Built for experimentation and production

Pipelines aren't just for production workflows. During experimentation, you can:

  • Benchmark multiple models against different datasets in parallel

  • Add conditional logic to explore different paths based on results

  • Pause for human approval before expensive training steps

  • Run hyperparameter tuning as a pipeline step once data processing completes

Core concepts

Nodes

Individual jobs within your pipeline:

  • Executions: Standard Valohai executions running your code

  • Tasks: Collections of executions with the same code but different parameters/data (perfect for hyperparameter tuning or benchmarking models/datasets)

  • Deployments: Create new model endpoints as part of your workflow
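
If you define pipelines in Python, each node type corresponds to a helper on the pipeline object. The sketch below assumes the valohai-utils Python helpers (Pipeline, pipe.execution, pipe.task) and placeholder step names; the same nodes can also be declared directly in valohai.yaml.

```python
# A minimal sketch, assuming the valohai-utils Python helpers; step names
# are placeholders for steps you have already defined in valohai.yaml.
from valohai import Pipeline


def main(config) -> Pipeline:
    # `config` is the parsed project configuration supplied by the tooling.
    pipe = Pipeline(name="node-types-demo", config=config)

    # Execution node: a single run of the "preprocess-dataset" step.
    preprocess = pipe.execution("preprocess-dataset")

    # Task node: many runs of the same step with different parameters,
    # e.g. a hyperparameter sweep over "train-model".
    sweep = pipe.task("train-model")

    # Deployment nodes publish new endpoint versions; they are omitted here.
    preprocess.output("clean.csv").to(sweep.input("dataset"))  # edges: see below

    return pipe
```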

Edges

Connections that pass data between nodes:

  • Output → Input: Files produced by one node become inputs for the next

  • Input → Input: Share the same input files across multiple nodes

  • Parameters → Parameters: Pass parameter values between nodes

  • Metadata → Parameters: Use runtime-generated values (like optimal hyperparameters) as parameters to downstream nodes
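
Continuing the same assumed valohai-utils style (step, file, and input names are placeholders), an output → input edge is declared by piping one node's output into another node's input. Parameter and metadata edges are configured in the pipeline definition as well; their syntax is omitted from this sketch.

```python
# Sketch of output -> input edges, assuming valohai-utils; all names are
# placeholders for steps, outputs, and inputs defined in your project.
from valohai import Pipeline


def main(config) -> Pipeline:
    pipe = Pipeline(name="edge-demo", config=config)

    preprocess = pipe.execution("preprocess-dataset")
    train = pipe.execution("train-model")
    evaluate = pipe.execution("evaluate-model")

    # Output -> Input: files produced by one node feed the next node.
    preprocess.output("clean.csv").to(train.input("dataset"))
    train.output("model.pkl").to(evaluate.input("model"))

    # Feeding the same output to two nodes gives both nodes the same files,
    # much like sharing an input across nodes.
    preprocess.output("clean.csv").to(evaluate.input("dataset"))

    return pipe
```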

When to use pipelines

Pipelines excel when you have:

  • Multi-step workflows where each step has different resource requirements

  • Long-running processes where failure recovery matters

  • Workflows you'll run repeatedly (with small variations)

  • Complex dependencies between different processing stages

  • A need for conditional execution or human approval steps

Common patterns

Training pipeline

  1. Preprocess: Clean and transform raw data (CPU-intensive)

  2. Train: Train your model (GPU-intensive)

  3. Evaluate: Test model performance

  4. Deploy: Create endpoint if metrics pass threshold
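
A sketch of this pattern, again assuming valohai-utils and placeholder step, output, and input names. The final deploy step would typically be a deployment node gated on the evaluation metrics; it is left out of the sketch.

```python
# Training pipeline sketch: preprocess (CPU) -> train (GPU) -> evaluate.
# Assumes valohai-utils; all names are placeholders.
from valohai import Pipeline


def main(config) -> Pipeline:
    pipe = Pipeline(name="training-pipeline", config=config)

    preprocess = pipe.execution("preprocess-dataset")  # CPU-heavy environment
    train = pipe.execution("train-model")              # GPU environment
    evaluate = pipe.execution("evaluate-model")

    preprocess.output("train.csv").to(train.input("dataset"))
    preprocess.output("test.csv").to(evaluate.input("dataset"))
    train.output("model.pkl").to(evaluate.input("model"))

    # Deploy: a deployment node would publish a new endpoint version once
    # evaluation passes your metric threshold (omitted from this sketch).
    return pipe
```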

Experimentation pipeline

  1. Prepare datasets: Create train/validation/test splits

  2. Hyperparameter search: Run parallel training jobs with different parameters

  3. Compare results: Analyze performance across experiments

  4. Select best model: Automatically identify top performer
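
The hyperparameter search maps naturally onto a task node: one node that fans out into parallel training executions. A sketch under the same valohai-utils assumption, with placeholder names:

```python
# Experimentation pipeline sketch: prepare -> parallel search -> compare.
# Assumes valohai-utils; step, output, and input names are placeholders.
from valohai import Pipeline


def main(config) -> Pipeline:
    pipe = Pipeline(name="experimentation-pipeline", config=config)

    prepare = pipe.execution("prepare-datasets")

    # Task node: runs "train-model" many times with different parameters;
    # the parameter grid is chosen when you launch the pipeline.
    search = pipe.task("train-model")

    compare = pipe.execution("compare-results")

    prepare.output("train.csv").to(search.input("dataset"))
    prepare.output("validation.csv").to(search.input("validation"))

    # Outputs from the task's executions flow to the comparison node,
    # which can then pick the top performer.
    search.output("metrics.json").to(compare.input("metrics"))

    return pipe
```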

Production pipeline (scheduled)

  1. Fetch new data: Pull latest data from your warehouse

  2. Validate quality: Check data integrity and distributions

  3. Retrain model: Update model with new data

  4. A/B test: Deploy to staging for comparison

  5. Promote: Move to production after approval
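
Each of these steps is ordinary code running in its own node. As an example, the "Validate quality" node can be a small script that fails its execution, and therefore stops the pipeline, when the data looks wrong. The checks, column names, and thresholds below are hypothetical.

```python
# Hypothetical data-quality gate for a "Validate quality" node.
# Column names and thresholds are made up for illustration.
import sys

import pandas as pd


def validate(path: str) -> list[str]:
    df = pd.read_csv(path)
    problems = []

    expected = {"user_id", "timestamp", "label"}
    missing = expected - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")

    if "label" in df.columns and df["label"].isna().mean() > 0.01:
        problems.append("more than 1% of labels are null")

    return problems


if __name__ == "__main__":
    issues = validate(sys.argv[1])
    if issues:
        print("\n".join(issues))
        # A non-zero exit fails this node, which stops the downstream
        # retraining and deployment nodes from running on bad data.
        sys.exit(1)
```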
