Compare Executions
Compare metrics from multiple experiments side by side. See all runs together on the same graphs and in a comparison table to quickly identify your best-performing models.
Quick Start
1. Select Executions to Compare
From your project's Executions tab:
Use the checkboxes at the left of each row to select executions
Select 2 or more executions you want to compare
Click the Compare button above the table
The comparison view opens.

2. View Comparison Graphs
All selected executions appear on the same visualization:
Time series graphs: Each execution's metrics plotted together
Same controls: Horizontal axis, smoothing, and vertical axes all work the same as in single-execution views
Color-coded: Each execution gets a unique color (or enable "One Color per Execution" to group all metrics from one execution)
3. View Comparison Table
Scroll down below the graphs to see the comparison table.
The table shows:
One row per execution
One column per metric
Latest value of each metric
Execution metadata (name, parameters, environment, etc.)
Sort by any column to quickly find:
Highest accuracy
Lowest loss
Fastest training time
Best F1 score
Comparison Workflow
1. Run Multiple Experiments
Train models with different settings:
# Experiment 1: learning_rate=0.001
# Experiment 2: learning_rate=0.01
# Experiment 3: learning_rate=0.1
Each logs metrics the same way:
import json

# Print one JSON object per epoch; each key becomes a metric
print(json.dumps({
    "epoch": epoch,
    "train_loss": train_loss,
    "val_loss": val_loss,
    "val_accuracy": val_acc
}))
2. Compare Visually
Select executions and click Compare
What you'll see:
All training curves overlaid on one graph
Which experiments converge faster
Which achieve better final performance
Which settings cause instability
3. Analyze the Table
Click column headers to sort:
Sort by val_accuracy descending → Find best model
Sort by train_loss ascending → See which converged best
Sort by epoch → See which finished training
Find patterns:
Do higher learning rates train faster but plateau lower?
Do larger batch sizes improve stability?
Does dropout improve generalization (lower gap between train/val)?
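The last question can be checked numerically once you export the comparison data (see Export Comparison Data below). A short pandas sketch, assuming train_accuracy and val_accuracy are among the metrics you logged:
import pandas as pd
# Assumes the comparison data has been exported as CSV (see Export Comparison Data below)
# and that train_accuracy / val_accuracy are among your logged metrics.
df = pd.read_csv('comparison_data.csv')
# Generalization gap: a large train/val gap suggests overfitting
df['train_val_gap'] = df['train_accuracy'] - df['val_accuracy']
# Smallest gap first
print(df.sort_values('train_val_gap').head())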
Comparison Features
Overlay Metrics on Graphs
All executions appear on the same graph with different colors:
Example: Comparing 3 learning rates
Blue line: LR=0.001 (slow but steady)
Green line: LR=0.01 (faster convergence)
Red line: LR=0.1 (unstable, diverges)
You can instantly see which learning rate works best.
Use "One Color per Execution"
Enable this option in Chart Options to:
Give all metrics from one execution the same color
Make it easier to track which line belongs to which execution
Reduce visual clutter when comparing many executions
Use when: Comparing 3+ executions with multiple metrics each
Filter with Smoothing
Apply smoothing to noisy metrics to see trends more clearly:
Select a metric in Vertical Axes
Adjust the Smoothing slider
Compare smoothed trends across executions
Especially useful when comparing runs with batch-level logging.
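Batch-level logging follows the same JSON pattern as the epoch-level examples above, just inside the batch loop. A minimal sketch, with random values standing in for real batch losses:
import json
import random

# Simulated batch loop; in practice this is your real training loop
for step in range(100):
    train_loss = random.random()  # placeholder for the actual batch loss
    # One JSON line per batch gives many noisy points per execution;
    # the Smoothing slider helps reveal the underlying trend
    print(json.dumps({"step": step, "train_loss": train_loss}))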
Create Multiple Comparison Views
Just like single executions, you can create multiple visualization tabs:
Click the + button
Name your tab (e.g., "Loss Comparison", "Accuracy Only")
Add different metrics to each tab
Use cases:
One tab for loss curves
One tab for accuracy metrics
One tab for learning rate comparison
Comparison Table
Understanding the Table
Rows: One per execution
Columns: One per unique metric across all executions
Values: Latest logged value of each metric
Example:
Execution   learning_rate   val_accuracy   train_loss   epoch
#142        0.001           0.95           0.23         100
#143        0.01            0.93           0.28         100
#144        0.1             0.78           0.65         50
Insight: Execution #142 has the best accuracy, but #143 might be acceptable and trains at similar speed.
Sort to Find Best
Click any column header to sort:
Sort by val_accuracy (descending): Immediately see which execution achieved the highest validation accuracy.
Sort by train_loss (ascending): See which execution had the best training convergence.
Sort by execution number: View in chronological order to see how recent changes affected performance.
Missing Values
If an execution didn't log a particular metric, the cell appears empty.
Example:
Execution #142 logs val_f1_score
Execution #143 does not log val_f1_score
The table shows a value for #142 and an empty cell for #143
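A minimal sketch of how this situation arises; the flag and values below are placeholders, not part of any required API:
import json

# Placeholder values; in practice these come from your evaluation code
epoch, val_acc, val_f1 = 10, 0.95, 0.91
compute_f1 = True  # e.g., only some executions evaluate F1

metrics = {"epoch": epoch, "val_accuracy": val_acc}
if compute_f1:
    metrics["val_f1_score"] = val_f1  # executions that skip this key show an empty cell
print(json.dumps(metrics))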
Common Comparison Scenarios
Hyperparameter Tuning
Goal: Find the best learning rate
Steps:
Run executions with learning_rate = [0.0001, 0.001, 0.01, 0.1] (see the sketch below)
Compare all executions
Sort the table by val_accuracy (descending)
Look at the graphs to see convergence speed
Choose the learning rate that balances accuracy and training time
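A minimal sketch of step 1, assuming the learning rate arrives as a command-line argument (for example, from a Valohai parameter); the argument name and default are illustrative:
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.001)
args = parser.parse_args()

# Logging the learning rate makes it appear as its own column in the comparison table
print(json.dumps({"learning_rate": args.learning_rate}))
# ...then log epoch metrics inside the training loop as in the examples above...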
Architecture Comparison
Goal: Compare ResNet50 vs. EfficientNet
Steps:
Train both architectures with the same settings
Log model_name as a metric or parameter
Compare executions
Sort by val_accuracy and training_time_minutes
Evaluate the tradeoff between accuracy and speed
import json

# Log the model architecture alongside the metrics
print(json.dumps({
    "model_name": "resnet50",  # or "efficientnet"
    "epoch": epoch,
    "val_accuracy": val_acc
}))
Optimizer Comparison
Goal: SGD vs. Adam vs. AdamW
Steps:
Train with each optimizer
Log the optimizer name (see the sketch below)
Compare convergence speed and final accuracy
Consider stability (look for spikes in loss)
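Logging the optimizer name works the same way as the model_name example above; a minimal sketch with placeholder values:
import json

optimizer_name = "adamw"  # or "sgd", "adam"
epoch, val_acc = 10, 0.93  # placeholders for values from your training loop
print(json.dumps({
    "optimizer": optimizer_name,
    "epoch": epoch,
    "val_accuracy": val_acc
}))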
Best Practices
Use Consistent Metric Names
Keep metric names identical across all executions:
# Good: All experiments use the same names
"val_accuracy"
"val_loss"
"train_accuracy"
# Avoid: Different names per experiment
"validation_accuracy"  # Experiment 1
"val_acc"  # Experiment 2
If metrics have different names, they appear as separate columns in the comparison table.
Use Descriptive Execution Names
Name your executions descriptively in the Valohai UI:
Good: "ResNet50-LR0.001-BS32"
Avoid: "Execution #142"
Makes it easier to identify executions in the comparison view.
Compare Small Batches First
Don't compare 50 executions at once. Start small:
Compare 2-5 related executions
Identify patterns
Drill down with more comparisons as needed
Too many executions create cluttered graphs and slow loading.
Export Comparison Data
Download the comparison table for external analysis:
Click Download raw data (top right)
Choose CSV or JSON
Get all metrics from all selected executions
Use exported data for:
Statistical analysis (e.g., significance tests)
Custom visualizations
Reporting to stakeholders
Creating summary tables
Example: Analysis in Python
import pandas as pd
import matplotlib.pyplot as plt
# Load comparison data
df = pd.read_csv('comparison_data.csv')
# Group by hyperparameter
grouped = df.groupby('learning_rate')['val_accuracy'].mean()
# Plot
grouped.plot(kind='bar')
plt.title('Average Val Accuracy by Learning Rate')
plt.xlabel('Learning Rate')
plt.ylabel('Val Accuracy')
plt.savefig('comparison_analysis.png')
Next Steps
Visualize time series metrics for individual executions
Create confusion matrices to compare classification performance
Compare output images across different runs
Use comparison results to inform your next experiments
Back to Experiment Tracking overview
