Execution Reuse and Caching

Skip redundant computations by reusing results from previous executions. When Valohai detects an identical step configuration, it uses cached results instead of running the step again.

How execution reuse saves time

Consider this scenario: You're iterating on a model architecture, but your 3-hour data preprocessing step hasn't changed. With execution reuse:

  1. First run: All steps execute normally

  2. Second run (after model code changes): Preprocessing is skipped, saving 3 hours

  3. Result: you iterate on model development much faster, since each run skips the 3-hour preprocessing step

When executions are reused

Valohai reuses an execution when ALL of these match exactly:

  • Source code: Same Git commit or file contents

  • Parameters: Identical parameter values

  • Input data: Same files (verified by checksums)

  • Docker image: Same container environment

  • Step name: Same step definition

If any element differs, the step runs fresh to ensure reproducibility.
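The matching logic above can be sketched as a fingerprint over every element that must agree. This is a simplified illustration, not Valohai's internal implementation; the function and field names are hypothetical:

```python
import hashlib
import json

def cache_key(step, commit, parameters, input_checksums, image):
    """Build a deterministic fingerprint over everything that must match
    for a step to be reused. (Illustrative sketch, not Valohai's actual code.)"""
    payload = json.dumps(
        {
            "step": step,
            "commit": commit,
            "parameters": parameters,
            "inputs": sorted(input_checksums),  # file order must not matter
            "image": image,
        },
        sort_keys=True,  # dict key order must not affect the hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

base = cache_key("preprocess-dataset", "abc123", {"batch_size": 32},
                 ["sha256:aaa", "sha256:bbb"], "python:3.11")
same = cache_key("preprocess-dataset", "abc123", {"batch_size": 32},
                 ["sha256:bbb", "sha256:aaa"], "python:3.11")
changed = cache_key("preprocess-dataset", "abc123", {"batch_size": 64},
                    ["sha256:aaa", "sha256:bbb"], "python:3.11")
```

Here `base == same` (input order is irrelevant once checksums match), while `changed` differs because a single parameter value changed, forcing a fresh run.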

Enable execution reuse

Method 1: Pipeline-wide in valohai.yaml

Enable for all runs of a pipeline:

- pipeline:
    name: model-training
    reuse-executions: true  # Enable caching
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: execution
        step: train-model
      - name: evaluate
        type: execution
        step: evaluate-model
    edges:
      - [preprocess.output.*, train.input.dataset]
      - [train.output.model, evaluate.input.model]

Method 2: Per-run in the web interface

Toggle reuse for individual pipeline runs:

💡 Use the web interface to temporarily disable reuse when you need fresh results despite unchanged inputs.

Practical examples

Data science iteration workflow

- pipeline:
    name: experiment-pipeline
    reuse-executions: true
    nodes:
      # This rarely changes - perfect for reuse
      - name: fetch-and-clean
        type: execution
        step: data-preparation
        
      # This might change - but reuse when it doesn't
      - name: feature-engineering
        type: execution
        step: create-features
        
      # This changes frequently - but benefits from upstream reuse
      - name: train-experiment
        type: execution
        step: train-model

Reuse pattern:

  • Data preparation: Reused 95% of the time

  • Feature engineering: Reused 70% of the time

  • Model training: Runs fresh but starts immediately with cached inputs
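With the reuse rates above and some assumed step durations (3 h data preparation, 1 h feature engineering, 2 h training — illustrative numbers, not measurements), the expected wall-clock cost of a run can be estimated as:

```python
# Expected wall-clock time per pipeline run, using the reuse rates above.
# Step durations are illustrative assumptions, not measured values.
steps = {
    # name: (duration_hours, probability_of_reuse)
    "data-preparation": (3.0, 0.95),
    "create-features":  (1.0, 0.70),
    "train-model":      (2.0, 0.00),  # always runs fresh
}

expected = sum(hours * (1 - p_reuse) for hours, p_reuse in steps.values())
worst_case = sum(hours for hours, _ in steps.values())
print(f"expected {expected:.2f}h per run vs {worst_case:.1f}h without reuse")
```

Under these assumptions the average run costs about 2.45 hours instead of 6, with most of the remaining time spent in the step that genuinely changes.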

Understanding cache behavior

What triggers a fresh run?

Any change to:

# Parameters
parameters:
  - name: batch_size
    default: 32  # Changing to 64 = fresh run

# Inputs  
inputs:
  - name: dataset
    default: s3://bucket/v1/*.csv  # New files = fresh run

# Code
command: python train.py  # Different commit = fresh run

# Environment
environment: aws-p3-2xlarge  # Different instance = fresh run

Best practices

1. Structure pipelines for maximum reuse

# Good: Separate volatile and stable steps
nodes:
  - name: stable-preprocessing  # Changes monthly
  - name: volatile-training     # Changes daily
# Bad: Combining volatile and stable logic
nodes:
  - name: preprocess-and-train  # Any change reruns everything

2. Use deterministic operations

# Good: Deterministic preprocessing
def preprocess(data):
    return data.sort_values('id').reset_index(drop=True)

# Bad: Non-deterministic operations
def preprocess(data):
    return data.sample(frac=0.8)  # Random sampling = no reuse
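If a cached step genuinely needs sampling, pinning the seed makes it deterministic again (with pandas, `data.sample(frac=0.8, random_state=42)`). A minimal standalone sketch of the same idea:

```python
import random

def sample_rows(rows, frac, seed=42):
    """Down-sample deterministically: a fixed seed yields identical output
    on every run, so the step's outputs (and downstream cache hits) stay
    stable. (Illustrative helper, not part of any Valohai SDK.)"""
    rng = random.Random(seed)           # local RNG; avoids global state
    k = int(len(rows) * frac)
    return sorted(rng.sample(rows, k))  # sort for a stable row order

first = sample_rows(list(range(100)), 0.8)
second = sample_rows(list(range(100)), 0.8)
```

Here `first == second` on every invocation, so repeated runs of the step produce byte-identical outputs.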

3. Version your data explicitly

inputs:
  - name: dataset
    # Good: Versioned data
    default: s3://bucket/data/v2.1/train.parquet
    
    # Bad: Mutable references
    # default: s3://bucket/data/latest/train.parquet

4. Monitor reuse effectiveness

In the pipeline view, reused executions show a special indicator. Track reuse rates to optimize pipeline structure.

Manual execution reuse

Besides automatic reuse, you can manually select specific past executions to use as pipeline nodes. This is useful when:

  • You have a perfect execution from last week you want to reuse

  • You're building a pipeline incrementally, testing one node at a time

  • You want to skip expensive steps during development

Reuse via web interface

  1. Click the Reuse nodes button

  2. Select the pipeline from which to reuse executions

  3. Check the nodes you want to reuse

  4. The selected nodes will use those executions' outputs without running again

Reuse via API

For programmatic pipeline creation, use reuse_execution_id instead of a template:

import requests
import os

# Replace PROJECT_ID and exec_123456 with your own project and execution IDs.
pipeline_config = {
    "project": "PROJECT_ID",
    "title": "experiment-with-reuse",
    "nodes": [
        {
            "name": "preprocess",
            "type": "execution",
            "reuse_execution_id": "exec_123456",  # Reuse past execution
        },
        {
            "name": "train",
            "type": "execution",
            "template": {  # Run fresh
                "step": "train-model",
                "environment": "aws-p3-2xlarge",
                "commit": "main"
            }
        }
    ],
    "edges": [
        ["preprocess.output.*", "train.input.dataset"]
    ]
}

response = requests.post(
    'https://app.valohai.com/api/v0/pipelines/',
    json=pipeline_config,
    headers={
        'Authorization': f'Token {os.getenv("VH_TOKEN")}',
        'Content-Type': 'application/json'
    }
)
response.raise_for_status()  # fail loudly if the pipeline was rejected

Manual vs automatic reuse

Aspect      | Automatic Reuse                          | Manual Reuse
------------|------------------------------------------|----------------------------------------------
When to use | Iterative development with small changes | Building pipelines with known good executions
Selection   | System finds matching execution          | You choose specific execution
Flexibility | Based on exact parameter/input match     | Use any compatible execution
Use case    | "Run this again if nothing changed"      | "Use that great run from Tuesday"
