Compare Executions
Compare metrics from multiple experiments side by side. See all runs together on the same graphs and in a comparison table to quickly identify your best-performing models.
Quick Start
1. Select Executions to Compare
From your project's Executions tab:
Use the checkboxes at the left of each row to select executions
Select 2 or more executions you want to compare
Click the Compare button above the table
The comparison view opens.

2. View Comparison Graphs
All selected executions appear on the same visualization:
Time series graphs: Each execution's metrics plotted together
Same controls: Horizontal axis, smoothing, and vertical axes all work the same as in single-execution views
Color-coded: Each execution gets a unique color (or enable "One Color per Execution" to group all metrics from one execution)
3. View Comparison Table
Scroll down below the graphs to see the comparison table.
The table shows:
One row per execution
One column per metric
Latest value of each metric
Execution metadata (name, parameters, environment, etc.)
Sort by any column to quickly find:
Highest accuracy
Lowest loss
Fastest training time
Best F1 score
Comparison Workflow
1. Run Multiple Experiments
Train models with different settings:
# Experiment 1: learning_rate=0.001
# Experiment 2: learning_rate=0.01
# Experiment 3: learning_rate=0.1
Each logs metrics the same way:
import json

# Print one JSON object per epoch; each key becomes a metric
print(json.dumps({
    "epoch": epoch,
    "train_loss": train_loss,
    "val_loss": val_loss,
    "val_accuracy": val_acc
}))
2. Compare Visually
Select executions and click Compare
What you'll see:
All training curves overlaid on one graph
Which experiments converge faster
Which achieve better final performance
Which settings cause instability
3. Analyze the Table
Click column headers to sort:
Sort by val_accuracy descending → Find best model
Sort by train_loss ascending → See which converged best
Sort by epoch → See which finished training
Find patterns:
Do higher learning rates train faster but plateau lower?
Do larger batch sizes improve stability?
Does dropout improve generalization (lower gap between train/val)?
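The last question can be checked numerically once you export the comparison data (see Export Comparison Data below). A short pandas sketch, assuming train_accuracy and val_accuracy are among the metrics you logged:
import pandas as pd
# Assumes the comparison data has been exported as CSV (see Export Comparison Data below)
# and that train_accuracy / val_accuracy are among your logged metrics.
df = pd.read_csv('comparison_data.csv')
# Generalization gap: a large train/val gap suggests overfitting
df['train_val_gap'] = df['train_accuracy'] - df['val_accuracy']
# Smallest gap first
print(df.sort_values('train_val_gap').head())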
Comparison Features
Overlay Metrics on Graphs
All executions appear on the same graph with different colors:
Example: Comparing 3 learning rates
Blue line: LR=0.001 (slow but steady)
Green line: LR=0.01 (faster convergence)
Red line: LR=0.1 (unstable, diverges)
You can instantly see which learning rate works best.
Use "One Color per Execution"
Enable this option in Chart Options to:
Give all metrics from one execution the same color
Make it easier to track which line belongs to which execution
Reduce visual clutter when comparing many executions
Use when: Comparing 3+ executions with multiple metrics each
Filter with Smoothing
Apply smoothing to noisy metrics to see trends more clearly:
Select a metric in Vertical Axes
Adjust the Smoothing slider
Compare smoothed trends across executions
Especially useful when comparing runs with batch-level logging.
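Batch-level logging follows the same JSON pattern as the epoch-level examples above, just inside the batch loop. A minimal sketch, with random values standing in for real batch losses:
import json
import random

# Simulated batch loop; in practice this is your real training loop
for step in range(100):
    train_loss = random.random()  # placeholder for the actual batch loss
    # One JSON line per batch gives many noisy points per execution;
    # the Smoothing slider helps reveal the underlying trend
    print(json.dumps({"step": step, "train_loss": train_loss}))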
Create Multiple Comparison Views
Just like single executions, you can create multiple visualization tabs:
Click the + button
Name your tab (e.g., "Loss Comparison", "Accuracy Only")
Add different metrics to each tab
Use cases:
One tab for loss curves
One tab for accuracy metrics
One tab for learning rate comparison
Comparison Table
Understanding the Table
Rows: One per execution
Columns: One per unique metric across all executions
Values: Latest logged value of each metric
Example:
Execution   learning_rate   val_accuracy   train_loss   epoch
#142        0.001           0.95           0.23         100
#143        0.01            0.93           0.28         100
#144        0.1             0.78           0.65         50
Insight: Execution #142 has the best accuracy, but #143 might be acceptable and trains at similar speed.
Sort to Find Best
Click any column header to sort:
Sort by val_accuracy (descending): Immediately see which execution achieved the highest validation accuracy.
Sort by train_loss (ascending): See which execution had the best training convergence.
Sort by execution number: View in chronological order to see how recent changes affected performance.
Missing Values
If an execution didn't log a particular metric, the cell appears empty.
Example:
Execution #142 logs val_f1_score
Execution #143 does not log val_f1_score
The table shows a value for #142 and an empty cell for #143
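A minimal sketch of how this situation arises; the flag and values below are placeholders, not part of any required API:
import json

# Placeholder values; in practice these come from your evaluation code
epoch, val_acc, val_f1 = 10, 0.95, 0.91
compute_f1 = True  # e.g., only some executions evaluate F1

metrics = {"epoch": epoch, "val_accuracy": val_acc}
if compute_f1:
    metrics["val_f1_score"] = val_f1  # executions that skip this key show an empty cell
print(json.dumps(metrics))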
Common Comparison Scenarios
Hyperparameter Tuning
Goal: Find the best learning rate
Steps:
Run executions with learning_rate = [0.0001, 0.001, 0.01, 0.1] (see the sketch below)
Compare all executions
Sort the table by val_accuracy (descending)
Look at the graphs to see convergence speed
Choose the learning rate that balances accuracy and training time
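A minimal sketch of step 1, assuming the learning rate arrives as a command-line argument (for example, from a Valohai parameter); the argument name and default are illustrative:
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.001)
args = parser.parse_args()

# Logging the learning rate makes it appear as its own column in the comparison table
print(json.dumps({"learning_rate": args.learning_rate}))
# ...then log epoch metrics inside the training loop as in the examples above...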
Architecture Comparison
Goal: Compare ResNet50 vs. EfficientNet
Steps:
Train both architectures with the same settings
Log model_name as a metric or parameter
Compare executions
Sort by val_accuracy and training_time_minutes
Evaluate the tradeoff between accuracy and speed
import json

# Log the model architecture alongside the metrics
print(json.dumps({
    "model_name": "resnet50",  # or "efficientnet"
    "epoch": epoch,
    "val_accuracy": val_acc
}))
Optimizer Comparison
Goal: SGD vs. Adam vs. AdamW
Steps:
Train with each optimizer
Log the optimizer name (see the sketch below)
Compare convergence speed and final accuracy
Consider stability (look for spikes in loss)
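Logging the optimizer name works the same way as the model_name example above; a minimal sketch with placeholder values:
import json

optimizer_name = "adamw"  # or "sgd", "adam"
epoch, val_acc = 10, 0.93  # placeholders for values from your training loop
print(json.dumps({
    "optimizer": optimizer_name,
    "epoch": epoch,
    "val_accuracy": val_acc
}))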
Best Practices
Use Consistent Metric Names
Keep metric names identical across all executions:
# Good: All experiments use the same names
"val_accuracy"
"val_loss"
"train_accuracy"
# Avoid: Different names per experiment
"validation_accuracy"  # Experiment 1
"val_acc"  # Experiment 2
If metrics have different names, they appear as separate columns in the comparison table.
Use Descriptive Execution Names
Name your executions descriptively in the Valohai UI:
Good: "ResNet50-LR0.001-BS32"
Avoid: "Execution #142"
Makes it easier to identify executions in the comparison view.
Compare Small Batches First
Don't compare 50 executions at once. Start small:
Compare 2-5 related executions
Identify patterns
Drill down with more comparisons as needed
Too many executions create cluttered graphs and slow loading.
Export Comparison Data
Download the comparison table for external analysis:
Click Download raw data (top right)
Choose CSV or JSON
Get all metrics from all selected executions
Use exported data for:
Statistical analysis (e.g., significance tests)
Custom visualizations
Reporting to stakeholders
Creating summary tables
Example: Analysis in Python
import pandas as pd
import matplotlib.pyplot as plt
# Load comparison data
df = pd.read_csv('comparison_data.csv')
# Group by hyperparameter
grouped = df.groupby('learning_rate')['val_accuracy'].mean()
# Plot
grouped.plot(kind='bar')
plt.title('Average Val Accuracy by Learning Rate')
plt.xlabel('Learning Rate')
plt.ylabel('Val Accuracy')
plt.savefig('comparison_analysis.png')
Next Steps
Visualize time series metrics for individual executions
Create confusion matrices to compare classification performance
Compare output images across different runs
Use comparison results to inform your next experiments
Back to Experiment Tracking overview
