Evaluating Multiple Models and Providers
Run the same evaluation suite across OpenAI, Anthropic, and Llama models in Valohai; compare quality, latency, and cost with a reproducible leaderboard.
In modern GenAI, evaluation is step 0. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep it running continually as providers evolve.
Why run multiple providers
Quality varies by task: summarization ≠ extraction ≠ reasoning.
Vendors evolve quickly: today’s top pick may lag tomorrow.
Cost & latency matter: measure $/request and p95 latency alongside accuracy.
Avoid lock-in: keep your pipelines provider-agnostic and data-driven.
Valohai turns this into a repeatable pipeline: one dataset, multiple runs, and comparable metrics stored in the Model Catalog.
Inputs: evaluation dataset
Use the same dataset version of prompts and gold answers for each provider:
evaluation-prompts:v3/
├─ prompts.jsonl
├─ gold_answers.jsonl
└─ metadata.yaml
Store it as a Valohai dataset. When it changes, create a new version (v4, v5) so you can easily reproduce and compare across versions.
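As a minimal sketch of an assumed file layout: say each prompts.jsonl line carries an id, prompt, and optional context, and gold_answers.jsonl maps the same id to a gold answer (these field names are illustrative, not a fixed spec). A loader could join the two files like this:

import json
from pathlib import Path

def load_evalset(dataset_dir):
    """Join prompts.jsonl with gold_answers.jsonl by id (assumed schema)."""
    root = Path(dataset_dir)
    gold = {}
    with open(root / "gold_answers.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            gold[rec["id"]] = rec["gold"]
    examples = []
    with open(root / "prompts.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            examples.append({
                "prompt": rec["prompt"],
                "context": rec.get("context", ""),
                "gold": gold.get(rec["id"], ""),
            })
    return examples

The resulting records match what the evaluation script below reads: ex["prompt"], ex.get("context"), and ex.get("gold").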
Provider-matrix pipeline (YAML)
Run the same step against different providers/models by passing parameters. You can parallelize with a task.
# Evaluate OpenAI, Anthropic, and Llama on the same dataset
- step:
    name: eval-provider
    image: python:3.12
    command:
      - pip install valohai-utils openai anthropic llamaapi numpy
      - python eval_provider.py
    inputs:
      - name: evalset
        default: dataset://evaluation-prompts/v3
      - name: model
        default: model://{parameter:model}/{parameter:model-version}
    parameters:
      - name: provider
        type: string
        default: "openai"      # openai | anthropic | meta
      - name: model
        type: string
        default: "gpt"         # claude | llama-3
      - name: model-version
        type: string
        default: "4-turbo"     # 3-5-sonnet | 70b-instruct

⚙️ Secrets: Store OPENAI_API_KEY, ANTHROPIC_API_KEY, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.
Provider-agnostic evaluation (pseudo‑Python)
eval_provider.py should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).
import json
import os
import time
from pathlib import Path

import numpy as np
import valohai  # valohai-utils helper for parameters and inputs

import anthropic
import openai
from llamaapi import Llama

PROVIDER = valohai.parameters("provider").value
# Combine the two parameters into the full model name,
# e.g. "gpt" + "4-turbo" -> "gpt-4-turbo", "claude" + "3-5-sonnet" -> "claude-3-5-sonnet"
MODEL = f'{valohai.parameters("model").value}-{valohai.parameters("model-version").value}'
DATASET = valohai.inputs("evalset").path()  # local path to the evaluation prompts file
def call_model(prompt, context=""):
    """Call the selected provider and return (text, latency_seconds, cost_usd)."""
    start = time.time()
    if PROVIDER == "openai":
        client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": f"{context}\n\n{prompt}"}],
        )
        text = r.choices[0].message.content
        usage = r.usage
        cost = 0.0  # compute from usage if you have a pricing table
    elif PROVIDER == "anthropic":
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        r = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": f"{context}\n\n{prompt}"}],
        )
        text = r.content[0].text
        usage = r.usage
        cost = 0.0
    else:  # meta / llama -- placeholder client; adapt to your Llama hosting/SDK
        llama = Llama(api_key=os.getenv("LLAMA_API_KEY", ""))
        r = llama.generate(prompt=f"{context}\n\n{prompt}")
        text = r.get("text", "")
        usage = {}
        cost = 0.0
    latency = time.time() - start
    return text, latency, cost
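
# Optional: replace the `cost = 0.0` placeholders above with a token-based estimate.
# The per-1K-token rates below are illustrative placeholders, not official pricing;
# maintain your own pricing table per provider/model.
PRICING_PER_1K_TOKENS = {
    # model name: (usd per 1K input tokens, usd per 1K output tokens)
    "gpt-4-turbo": (0.01, 0.03),
    "claude-3-5-sonnet": (0.003, 0.015),
}

def estimate_cost(model, input_tokens, output_tokens):
    # e.g. OpenAI: estimate_cost(MODEL, usage.prompt_tokens, usage.completion_tokens)
    #      Anthropic: estimate_cost(MODEL, usage.input_tokens, usage.output_tokens)
    rate_in, rate_out = PRICING_PER_1K_TOKENS.get(model, (0.0, 0.0))
    return (input_tokens / 1000.0) * rate_in + (output_tokens / 1000.0) * rate_out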
def score(pred, ref):
    # Example: exact match -- replace with your evaluation of choice
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

with open(DATASET) as f:
    prompts = [json.loads(line) for line in f]

scores, latencies, costs = [], [], []
for ex in prompts:
    pred, lat, cst = call_model(ex["prompt"], ex.get("context", ""))
    s = score(pred, ex.get("gold", ""))
    scores.append(s)
    latencies.append(lat)
    costs.append(cst)

metrics = {
    "provider": PROVIDER,
    "model": MODEL,
    "quality": float(np.mean(scores)),
    "p95_latency": float(np.percentile(latencies, 95)),
    "avg_cost_usd": float(np.mean(costs)),
}
print(json.dumps(metrics))  # printed as JSON so Valohai picks it up as execution metadata

# Save raw outputs for auditing
Path("/valohai/outputs/results.json").write_text(json.dumps({
    "provider": PROVIDER,
    "model": MODEL,
    "examples": len(prompts),
    "metrics": metrics,
}))
print("Logged:", metrics)

Replace score() with your evaluation of choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, task-specific accuracy). Log response_variance by repeating prompts N times and averaging per-prompt variance, as sketched below.
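A minimal sketch of that variance measurement, reusing call_model, score, np, and the prompts list from the script above (n_repeats=3 is just an example value):

def response_variance(examples, n_repeats=3):
    """Average per-prompt variance of scores across repeated samples."""
    per_prompt_variance = []
    for ex in examples:
        sample_scores = [
            score(call_model(ex["prompt"], ex.get("context", ""))[0], ex.get("gold", ""))
            for _ in range(n_repeats)
        ]
        per_prompt_variance.append(float(np.var(sample_scores)))
    return float(np.mean(per_prompt_variance))

# For example, add it to the logged metrics:
# metrics["response_variance"] = response_variance(prompts)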
Building the leaderboard
Use the Model Catalog to compare runs, or write a small aggregator step that pulls the metrics from the provider executions and writes a table; a sketch of such an aggregator follows the example below.
Example metrics you might track:
| Provider | Model | Quality | p95 latency (s) | Avg cost ($/request) | Status |
|---|---|---|---|---|---|
| OpenAI | gpt-4-turbo | 0.86 | 1.40 | 0.0042 | ✅ |
| Anthropic | claude-3-5-sonnet | 0.88 | 1.25 | 0.0038 | ✅ |
| Meta | llama-3-70b-instruct | 0.80 | 1.05 | 0.0011 | 🔄 re‑test |
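A minimal sketch of the aggregator step mentioned above, assuming the results.json files from the eval-provider executions are wired into it as a Valohai input named results (the input name and output filenames are illustrative):

import csv
import json
import valohai

rows = []
for path in valohai.inputs("results").paths():  # one results.json per provider execution
    data = json.loads(open(path).read())
    m = data["metrics"]
    rows.append({
        "provider": data["provider"],
        "model": data["model"],
        "quality": m["quality"],
        "p95_latency": m["p95_latency"],
        "avg_cost_usd": m["avg_cost_usd"],
    })

# Sort best quality first and write both CSV and Markdown reports as outputs.
rows.sort(key=lambda r: r["quality"], reverse=True)
fields = ["provider", "model", "quality", "p95_latency", "avg_cost_usd"]

with open("/valohai/outputs/leaderboard.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)

with open("/valohai/outputs/leaderboard.md", "w") as f:
    f.write("| " + " | ".join(fields) + " |\n")
    f.write("|" + " --- |" * len(fields) + "\n")
    for r in rows:
        f.write("| " + " | ".join(str(r[k]) for k in fields) + " |\n")

This also covers the CSV/Markdown report export recommended under Automate and iterate.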
Automate and iterate
Re-run the leaderboard weekly or when a dataset changes (webhook/trigger).
Add human approval before adopting a new winner.
Track response_variance alongside quality to avoid unstable models.
Export a CSV/Markdown report at the end of each run for stakeholders.