Execution Reuse and Caching
Skip redundant computations by reusing results from previous executions. When Valohai detects an identical step configuration, it uses cached results instead of running the step again.
How execution reuse saves time
Consider this scenario: You're iterating on a model architecture, but your 3-hour data preprocessing step hasn't changed. With execution reuse:
First run: All steps execute normally
Second run (after model code changes): Preprocessing is skipped, saving 3 hours
Result: Iterate on model development 5x faster
When executions are reused
Valohai reuses an execution when ALL of these match exactly:
Source code: Same Git commit or file contents
Parameters: Identical parameter values
Input data: Same files (verified by checksums)
Docker image: Same container environment
Step name: Same step definition
If any element differs, the step runs fresh to ensure reproducibility.
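Conceptually, you can picture the reuse check as a lookup keyed on a fingerprint of those five elements. The sketch below is illustrative only, not Valohai's actual implementation; the function and its arguments are hypothetical:

import hashlib
import json

def reuse_key(commit_id, step_name, image, parameters, input_checksums):
    """Illustrative only: fingerprint the five elements that must match."""
    payload = json.dumps(
        {
            "commit": commit_id,                # source code version
            "step": step_name,                  # step definition
            "image": image,                     # Docker image
            "parameters": parameters,           # e.g. {"batch_size": 32}
            "inputs": sorted(input_checksums),  # checksums of input files
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

If every element matches a previous execution, the key matches and the cached outputs can be served; otherwise the step runs fresh.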
Enable execution reuse
Method 1: Pipeline-wide in valohai.yaml
Enable for all runs of a pipeline:
- pipeline:
    name: model-training
    reuse-executions: true  # Enable caching
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: execution
        step: train-model
      - name: evaluate
        type: execution
        step: evaluate-model
    edges:
      - [preprocess.output.*, train.input.dataset]
      - [train.output.model, evaluate.input.model]

Method 2: Per-run in the web interface
Toggle reuse for individual pipeline runs:

💡 Use the web interface to temporarily disable reuse when you need fresh results despite unchanged inputs.
Practical examples
Data science iteration workflow
- pipeline:
    name: experiment-pipeline
    reuse-executions: true
    nodes:
      # This rarely changes - perfect for reuse
      - name: fetch-and-clean
        type: execution
        step: data-preparation
      # This might change - but reuse when it doesn't
      - name: feature-engineering
        type: execution
        step: create-features
      # This changes frequently - but benefits from upstream reuse
      - name: train-experiment
        type: execution
        step: train-model

Reuse pattern:
Data preparation: Reused 95% of the time
Feature engineering: Reused 70% of the time
Model training: Runs fresh but starts immediately with cached inputs
Understanding cache behavior
What triggers a fresh run?
Any change to:
# Parameters
parameters:
  - name: batch_size
    default: 32  # Changing to 64 = fresh run

# Inputs
inputs:
  - name: dataset
    default: s3://bucket/v1/*.csv  # New files = fresh run

# Code
command: python train.py  # Different commit = fresh run

# Environment
environment: aws-p3-2xlarge  # Different instance = fresh run
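Using the illustrative reuse_key sketch from the "When executions are reused" section, the effect of a parameter change is easy to see (all argument values here are made up):

old = reuse_key("abc123", "train-model", "python:3.11", {"batch_size": 32}, ["sha256:aa11"])
new = reuse_key("abc123", "train-model", "python:3.11", {"batch_size": 64}, ["sha256:aa11"])
assert old != new  # one changed parameter yields a new key, so the step runs fresh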
Best practices
1. Structure pipelines for maximum reuse
# Good: Separate volatile and stable steps
nodes:
  - name: stable-preprocessing  # Changes monthly
  - name: volatile-training     # Changes daily

# Bad: Combining volatile and stable logic
nodes:
  - name: preprocess-and-train  # Any change reruns everything

2. Use deterministic operations
# Good: Deterministic preprocessing
def preprocess(data):
    return data.sort_values('id').reset_index(drop=True)

# Bad: Non-deterministic operations
def preprocess(data):
    return data.sample(frac=0.8)  # Random sampling = no reuse
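If a step genuinely needs sampling, pinning the seed keeps its output identical across runs, so downstream steps see the same input checksums and can still be reused. A minimal sketch using pandas:

import pandas as pd

def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    # Fixed random_state: same input -> same sample -> stable output checksums
    sample = data.sample(frac=0.8, random_state=42)
    return sample.sort_values('id').reset_index(drop=True)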
3. Version your data explicitly
inputs:
  - name: dataset
    # Good: Versioned data
    default: s3://bucket/data/v2.1/train.parquet
    # Bad: Mutable references
    # default: s3://bucket/data/latest/train.parquet

4. Monitor reuse effectiveness
In the pipeline view, reused executions show a special indicator. Track reuse rates to optimize pipeline structure.
Manual execution reuse
Besides automatic reuse, you can manually select specific past executions to use as pipeline nodes. This is useful when:
You have a perfect execution from last week that you want to reuse
You're building a pipeline incrementally, testing one node at a time
You want to skip expensive steps during development
Reuse via web interface
Click on the Reuse nodes button
Select from the Pipeline from which to reuse
Click the checkboxes on what nodes you want to reuse
The node will use that execution's outputs without running again

Reuse via API
For programmatic pipeline creation, use reuse_execution_id instead of a template:
import requests
import os

pipeline_config = {
    "project": "PROJECT_ID",
    "title": "experiment-with-reuse",
    "nodes": [
        {
            "name": "preprocess",
            "type": "execution",
            "reuse_execution_id": "exec_123456",  # Reuse past execution
        },
        {
            "name": "train",
            "type": "execution",
            "template": {  # Run fresh
                "step": "train-model",
                "environment": "aws-p3-2xlarge",
                "commit": "main"
            }
        }
    ],
    "edges": [
        ["preprocess.output.*", "train.input.dataset"]
    ]
}

response = requests.post(
    'https://app.valohai.com/api/v0/pipelines/',
    json=pipeline_config,
    headers={
        'Authorization': f'Token {os.getenv("VH_TOKEN")}',
        'Content-Type': 'application/json'
    }
)
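A short follow-up to confirm the request succeeded; the exact shape of the response body is an assumption here, so inspect response.json() in your own environment:

response.raise_for_status()  # Raise on 4xx/5xx instead of failing silently
created = response.json()
print(created.get("id"), created.get("title"))  # field names are assumptions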
Manual vs automatic reuse

Automatic reuse:
When to use: Iterative development with small changes
Selection: System finds matching execution
Flexibility: Based on exact parameter/input match
Use case: "Run this again if nothing changed"

Manual reuse:
When to use: Building pipelines with known good executions
Selection: You choose a specific execution
Flexibility: Use any compatible execution
Use case: "Use that great run from Tuesday"