Evaluating Multiple Models and Providers

In modern GenAI work, evaluation is step zero. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep the comparison running continuously as providers evolve.

Why run multiple providers

  • Quality varies by task: summarization ≠ extraction ≠ reasoning.

  • Vendors evolve quickly: today’s top pick may lag tomorrow.

  • Cost & latency matter: measure $/request and p95 latency alongside accuracy.

  • Avoid lock-in: keep your pipelines provider-agnostic and data-driven.

Valohai turns this into a repeatable pipeline: one dataset, multiple runs, and comparable metrics stored in the Model Catalog.

Inputs: evaluation dataset

Use the same version of the evaluation dataset (prompts and gold answers) for every provider:

evaluation-prompts:v3/
 ├─ prompts.jsonl
 ├─ gold_answers.jsonl
 └─ metadata.yaml

Store it as a Valohai dataset. When it changes, create a new version (v4, v5) so you can easily reproduce and compare across versions.
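
For concreteness, the two JSONL files could look like the snippet below; the id / prompt / answer field names are only one possible layout, not something Valohai prescribes:

```
prompts.jsonl
{"id": "q-001", "prompt": "Summarize the following support ticket: <ticket text>"}

gold_answers.jsonl
{"id": "q-001", "answer": "The customer reports a duplicate charge and asks for a refund."}
```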

Provider-matrix pipeline (YAML)

Run the same step against different providers/models by passing parameters. You can parallelize with a task.
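
A minimal sketch of such a step in valohai.yaml; the Docker image, parameter defaults, and the evaluation-dataset input name are illustrative and should be adapted to your project:

```yaml
- step:
    name: evaluate-provider
    image: python:3.11
    command:
      - pip install -r requirements.txt
      - python eval_provider.py {parameters}
    parameters:
      - name: provider
        type: string
        default: openai
      - name: model
        type: string
        default: gpt-4-turbo
      - name: repeats
        type: integer
        default: 3
    inputs:
      - name: evaluation-dataset
        default: dataset://evaluation-prompts/v3
```

Launching this step as a task with a list of provider/model value pairs (for example openai/gpt-4-turbo, anthropic/claude-3-5-sonnet, meta/llama-3-70b-instruct) runs the whole matrix in parallel against the same dataset version.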

⚙️ Secrets: Store OPENAI_API_KEY, ANTHROPIC_API_KEY, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.


Provider-agnostic evaluation (pseudo‑Python)

eval_provider.py should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).
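
A rough sketch of what that could look like. It assumes the dataset arrives through the evaluation-dataset input shown above, that the JSONL files carry id / prompt / answer fields, and it uses a trivial exact-match score() purely as a stand-in:

```python
import argparse
import json
import statistics
import time
from pathlib import Path

# Valohai mounts each input under /valohai/inputs/<input-name>/
INPUT_DIR = Path("/valohai/inputs/evaluation-dataset")


def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def call_provider(provider, model, prompt):
    """Dispatch to the chosen SDK; API keys arrive as env vars via Valohai secrets."""
    if provider == "openai":
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        import anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        resp = client.messages.create(
            model=model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    # Add Meta Llama through your hosting of choice here; omitted for brevity.
    raise ValueError(f"Unsupported provider: {provider}")


def score(output, gold):
    """Placeholder metric: exact match. Swap in BLEU/ROUGE/BERTScore or a judge model."""
    return float(output.strip().lower() == gold.strip().lower())


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--provider", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--repeats", type=int, default=3)
    args = parser.parse_args()

    prompts = load_jsonl(INPUT_DIR / "prompts.jsonl")
    gold = {row["id"]: row["answer"] for row in load_jsonl(INPUT_DIR / "gold_answers.jsonl")}

    qualities, latencies, variances = [], [], []
    for item in prompts:
        repeat_scores = []
        for _ in range(args.repeats):
            started = time.time()
            output = call_provider(args.provider, args.model, item["prompt"])
            latencies.append(time.time() - started)
            repeat_scores.append(score(output, gold[item["id"]]))
        qualities.append(statistics.mean(repeat_scores))
        variances.append(statistics.pvariance(repeat_scores))

    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]

    metrics = {
        "provider": args.provider,
        "model": args.model,
        "quality": round(statistics.mean(qualities), 4),
        "p95_latency": round(p95, 3),
        "response_variance": round(statistics.mean(variances), 4),
    }

    # Valohai collects JSON lines printed to stdout as execution metadata.
    print(json.dumps(metrics))

    # Also save the metrics as an output file so a downstream leaderboard
    # node can consume it as an input.
    out_dir = Path("/valohai/outputs")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"metrics_{args.provider}.json").write_text(json.dumps(metrics))


if __name__ == "__main__":
    main()
```

Per-request cost can be added by reading the token usage that both SDKs return on the response object and multiplying by your contracted prices; it is left out here to keep the sketch short.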

Replace score() with the evaluation method of your choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, or task-specific accuracy). Log response_variance by repeating each prompt N times and averaging the per-prompt variance.


Building the leaderboard

Use the Model Catalog to compare runs, or write a small aggregator step that pulls the metrics from the provider executions and writes a table.
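
One way to sketch that aggregator, assuming each evaluation node saves a metrics_<provider>.json file to /valohai/outputs/ (as in the script above) and the pipeline wires those files into this node as an input named eval-results:

```python
import json
from pathlib import Path

# Each upstream evaluation node contributes one metrics_<provider>.json file;
# the pipeline delivers them to this node under the "eval-results" input.
RESULTS_DIR = Path("/valohai/inputs/eval-results")
OUTPUT_DIR = Path("/valohai/outputs")


def main():
    rows = [json.loads(p.read_text()) for p in sorted(RESULTS_DIR.glob("*.json"))]
    rows.sort(key=lambda r: r["quality"], reverse=True)

    header = "| Provider | Model | Quality | p95 Latency (s) | Response variance |"
    divider = "|---|---|---|---|---|"
    lines = [header, divider] + [
        f"| {r['provider']} | {r['model']} | {r['quality']:.2f} "
        f"| {r['p95_latency']:.2f} | {r['response_variance']:.3f} |"
        for r in rows
    ]

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    (OUTPUT_DIR / "leaderboard.md").write_text("\n".join(lines) + "\n")
    print("\n".join(lines))


if __name__ == "__main__":
    main()
```

The generated leaderboard.md doubles as the Markdown report for stakeholders mentioned under "Automate and iterate".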

Example metrics you might track:

| Provider  | Model                | Quality (0–1) | p95 Latency (s) | Avg Cost (USD) | Decision   |
| --------- | -------------------- | ------------- | --------------- | -------------- | ---------- |
| OpenAI    | gpt-4-turbo          | 0.86          | 1.40            | 0.0042         |            |
| Anthropic | claude-3-5-sonnet    | 0.88          | 1.25            | 0.0038         |            |
| Meta      | llama-3-70b-instruct | 0.80          | 1.05            | 0.0011         | 🔄 re-test |

Automate and iterate

  • Re-run the leaderboard weekly or when a dataset changes (webhook/trigger).

  • Add human approval before adopting a new winner.

  • Track response_variance alongside quality to avoid unstable models.

  • Export a CSV/Markdown report at the end of each run for stakeholders.
