Evaluating Multiple Models and Providers

Run the same evaluation suite across OpenAI, Anthropic, and Llama models in Valohai; compare quality, latency, and cost with a reproducible leaderboard.

In modern GenAI, evaluation is step 0. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep the evaluation running continually as providers evolve.

Why run multiple providers

  • Quality varies by task: summarization ≠ extraction ≠ reasoning.

  • Vendors evolve quickly: today’s top pick may lag tomorrow.

  • Cost & latency matter: measure $/request and p95 latency alongside accuracy.

  • Avoid lock-in: keep your pipelines provider-agnostic and data-driven.

Valohai turns this into a repeatable pipeline: one dataset, multiple runs, comparable metrics stored in the Model Catalog.

Inputs: evaluation dataset

Use the same dataset version of prompts and gold answers for each provider:

evaluation-prompts:v3/
 ├─ prompts.jsonl
 ├─ gold_answers.jsonl
 └─ metadata.yaml

Store it as a Valohai dataset. When it changes, create a new version (v4, v5) so you can easily reproduce and compare across versions.
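
If an upstream Valohai step refreshes the evaluation set, it can register its outputs as that new version directly. A minimal sketch, assuming the sidecar-metadata convention for dataset versions; the example content and the v4 target are illustrative:

import json
from pathlib import Path

# Hypothetical upstream step: refresh the evaluation set and attach the files
# to a new version of the "evaluation-prompts" dataset.
out_dir = Path("/valohai/outputs")

examples = [
    {"id": "ex-001", "prompt": "Summarize the support ticket.", "context": "Login fails after the 2.3 update.", "gold": "Users cannot log in since upgrading to 2.3."},
]
(out_dir / "prompts.jsonl").write_text("\n".join(json.dumps(e) for e in examples))

# Valohai reads a <filename>.metadata.json sidecar next to each output file;
# the valohai.dataset-versions key adds the file to the given dataset version.
sidecar = {"valohai.dataset-versions": ["dataset://evaluation-prompts/v4"]}
(out_dir / "prompts.jsonl.metadata.json").write_text(json.dumps(sidecar))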

Provider-matrix pipeline (YAML)

Run the same step against different providers/models by passing parameters. You can parallelize with a task.

# Evaluate OpenAI, Anthropic, and Llama on the same dataset
- step:
    name: eval-provider
    image: python:3.12
    command:
      - pip install valohai-utils numpy openai anthropic
      - python eval_provider.py
    inputs:
      - name: evalset
        default: dataset://evaluation-prompts/v3
      - name: model
        default: model://{parameter:model}/{parameter:model-version}
    parameters:
      - name: provider
        default: "openai" # openai | anthropic | meta
      - name: model
        default: "gpt" # gpt | claude | llama-3
      - name: model-version
        default: "4-turbo" # 4-turbo | 3-5-sonnet | 70b-instruct

⚙️ Secrets: Store OPENAI_API_KEY, ANTHROPIC_API_KEY, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.
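
To fan the step out across all three providers, you can create the executions programmatically (the same matrix can also be set up as a Task in the UI). A rough sketch against the Valohai REST API, assuming a personal API token in VALOHAI_API_TOKEN; the project and commit identifiers are placeholders, and the exact payload should be checked against the API reference:

import os
import requests

PROJECT_ID = "your-project-id"   # placeholder
COMMIT = "main"                  # placeholder: branch, tag, or commit SHA

MATRIX = [
    {"provider": "openai", "model": "gpt", "model-version": "4-turbo"},
    {"provider": "anthropic", "model": "claude", "model-version": "3-5-sonnet"},
    {"provider": "meta", "model": "llama-3", "model-version": "70b-instruct"},
]

headers = {"Authorization": f"Token {os.environ['VALOHAI_API_TOKEN']}"}
for params in MATRIX:
    resp = requests.post(
        "https://app.valohai.com/api/v0/executions/",
        headers=headers,
        json={"project": PROJECT_ID, "commit": COMMIT, "step": "eval-provider", "parameters": params},
    )
    resp.raise_for_status()
    print("Queued", params["provider"], "as execution", resp.json().get("counter"))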


Provider-agnostic evaluation (pseudo‑Python)

eval_provider.py should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).

import os, time, json
import numpy as np
import valohai  # valohai-utils: parameters, inputs, outputs
import anthropic
from openai import OpenAI
from pathlib import Path

PROVIDER = valohai.parameters("provider").value
# Compose the full model identifier from the two parameters,
# e.g. "gpt" + "4-turbo" -> "gpt-4-turbo"
MODEL = f'{valohai.parameters("model").value}-{valohai.parameters("model-version").value}'

# The evalset input is a dataset with several files; pick prompts.jsonl
PROMPTS_PATH = next(p for p in valohai.inputs("evalset").paths() if p.endswith("prompts.jsonl"))

def call_model(prompt, context=""):
    """Call the selected provider and return (text, latency_seconds, cost_usd)."""
    start = time.time()
    messages = [{"role": "user", "content": f"{context}\n\n{prompt}"}]
    if PROVIDER == "openai":
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        r = client.chat.completions.create(model=MODEL, messages=messages)
        text = r.choices[0].message.content
        usage = r.usage  # prompt_tokens / completion_tokens
        cost = 0.0  # compute from usage if you have a pricing table
    elif PROVIDER == "anthropic":
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        r = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
        text = r.content[0].text
        usage = r.usage  # input_tokens / output_tokens
        cost = 0.0
    else:  # meta / llama: assumes an OpenAI-compatible endpoint at LLAMA_BASE_URL (vLLM, a hosted API, etc.)
        client = OpenAI(base_url=os.environ["LLAMA_BASE_URL"], api_key=os.getenv("LLAMA_API_KEY", "none"))
        r = client.chat.completions.create(model=MODEL, messages=messages)
        text = r.choices[0].message.content
        usage = r.usage
        cost = 0.0
    latency = time.time() - start
    return text, latency, cost

def score(pred, ref):
    # Example: exact match; swap in a semantic or task-specific scorer
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

prompts = [json.loads(line) for line in open(PROMPTS_PATH)]
scores, latencies, costs = [], [], []
for ex in prompts:
    pred, lat, cst = call_model(ex["prompt"], ex.get("context", ""))
    s = score(pred, ex.get("gold", ""))  # or join gold answers from gold_answers.jsonl by id
    scores.append(s); latencies.append(lat); costs.append(cst)

metrics = {
    "provider": PROVIDER,
    "model": MODEL,
    "quality": float(np.mean(scores)),
    "p95_latency": float(np.percentile(latencies, 95)),
    "avg_cost_usd": float(np.mean(costs)),
}

# Print a JSON line so Valohai picks these up as execution metrics
print(json.dumps(metrics))

# Save raw outputs for auditing
Path("/valohai/outputs/results.json").write_text(json.dumps({
    "provider": PROVIDER, "model": MODEL,
    "examples": len(prompts), "metrics": metrics,
}))
print("Logged:", metrics)

Replace score() with your evaluation of choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, task-specific accuracy). Log response_variance by repeating prompts N times and averaging per-prompt variance.
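
A rough sketch of that variance measurement, reusing call_model(), score(), prompts, and metrics from the script above; the repeat count and metric name are illustrative:

import json
import numpy as np

N_REPEATS = 3  # how many times to re-ask each prompt

def response_variance(ex):
    """Repeat one prompt N times and return the variance of its scores."""
    repeat_scores = []
    for _ in range(N_REPEATS):
        pred, _, _ = call_model(ex["prompt"], ex.get("context", ""))
        repeat_scores.append(score(pred, ex.get("gold", "")))
    return float(np.var(repeat_scores))

# Average per-prompt variance across the whole evaluation set
metrics["response_variance"] = float(np.mean([response_variance(ex) for ex in prompts]))
print(json.dumps(metrics))  # re-log so Valohai records the extra metric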


Building the leaderboard

Use the Model Catalog to compare runs. Or write a small aggregator step that pulls metrics from the three executions and writes a table.
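
A minimal sketch of such an aggregator, assuming the three results.json files are wired into a single input named results; the input name and CSV layout are illustrative:

import csv
import json
import valohai  # valohai-utils

# Each provider execution wrote a results.json; they arrive here as one input with several files.
rows = []
for path in valohai.inputs("results").paths():
    with open(path) as f:
        result = json.load(f)
    rows.append(result["metrics"])

# Best quality first, then write a CSV leaderboard for stakeholders.
rows.sort(key=lambda r: r["quality"], reverse=True)
with open("/valohai/outputs/leaderboard.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["provider", "model", "quality", "p95_latency", "avg_cost_usd"])
    writer.writeheader()
    writer.writerows(rows)

print(json.dumps({"best_provider": rows[0]["provider"], "best_model": rows[0]["model"]}))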

Example metrics you might track:

| Provider  | Model                | Quality (0–1) | p95 Latency (s) | Avg Cost (USD) | Decision   |
| --------- | -------------------- | ------------- | --------------- | -------------- | ---------- |
| OpenAI    | gpt-4-turbo          | 0.86          | 1.40            | 0.0042         |            |
| Anthropic | claude-3-5-sonnet    | 0.88          | 1.25            | 0.0038         |            |
| Meta      | llama-3-70b-instruct | 0.80          | 1.05            | 0.0011         | 🔄 re-test |
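
How the Decision column gets filled is up to your team; one option is a simple weighted ranking over the leaderboard before a human signs off. The weights below are illustrative, and the numbers are the example rows above:

# Illustrative decision heuristic: reward quality, penalize latency and cost.
WEIGHTS = {"quality": 1.0, "p95_latency": -0.1, "avg_cost_usd": -50.0}

def overall(row):
    return sum(weight * row[key] for key, weight in WEIGHTS.items())

candidates = [
    {"provider": "OpenAI", "model": "gpt-4-turbo", "quality": 0.86, "p95_latency": 1.40, "avg_cost_usd": 0.0042},
    {"provider": "Anthropic", "model": "claude-3-5-sonnet", "quality": 0.88, "p95_latency": 1.25, "avg_cost_usd": 0.0038},
    {"provider": "Meta", "model": "llama-3-70b-instruct", "quality": 0.80, "p95_latency": 1.05, "avg_cost_usd": 0.0011},
]
winner = max(candidates, key=overall)
print("Proposed winner (pending human approval):", winner["provider"], winner["model"])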

Automate and iterate

  • Re-run the leaderboard weekly or when a dataset changes (webhook/trigger).

  • Add human approval before adopting a new winner.

  • Track response_variance alongside quality to avoid unstable models.

  • Export a CSV/Markdown report at the end of each run for stakeholders.
