Evaluating Multiple Models and Providers

Run the same evaluation suite across OpenAI, Anthropic, and Llama models in Valohai; compare quality, latency, and cost with a reproducible leaderboard.

In modern GenAI, evaluation is step 0. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep the evaluation running continually as providers evolve.

Why run multiple providers

  • Quality varies by task: summarization ≠ extraction ≠ reasoning.

  • Vendors evolve quickly: today’s top pick may lag tomorrow.

  • Cost & latency matter: measure $/request and p95 latency alongside accuracy.

  • Avoid lock-in: keep your pipelines provider-agnostic and data-driven.

Valohai turns this into a repeatable pipeline: one dataset, multiple runs, comparable metrics stored in the Model Catalog.

Inputs: evaluation dataset

Use the same dataset version of prompts and gold answers for each provider:

evaluation-prompts:v3/
 ├─ prompts.jsonl
 ├─ gold_answers.jsonl
 └─ metadata.yaml

Store it as a Valohai dataset. When it changes, create a new version (v4, v5) so you can easily reproduce and compare across versions.
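
If an upstream Valohai step refreshes the evaluation set, it can register its outputs as that new version directly. A minimal sketch, assuming the sidecar-metadata convention for dataset versions; the example content and the v4 target are illustrative:

import json
from pathlib import Path

# Hypothetical upstream step: refresh the evaluation set and attach the files
# to a new version of the "evaluation-prompts" dataset.
out_dir = Path("/valohai/outputs")

examples = [
    {"id": "ex-001", "prompt": "Summarize the support ticket.", "context": "Login fails after the 2.3 update.", "gold": "Users cannot log in since upgrading to 2.3."},
]
(out_dir / "prompts.jsonl").write_text("\n".join(json.dumps(e) for e in examples))

# Valohai reads a <filename>.metadata.json sidecar next to each output file;
# the valohai.dataset-versions key adds the file to the given dataset version.
sidecar = {"valohai.dataset-versions": ["dataset://evaluation-prompts/v4"]}
(out_dir / "prompts.jsonl.metadata.json").write_text(json.dumps(sidecar))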

Provider-matrix pipeline (YAML)

Run the same step against different providers/models by passing parameters. You can parallelize with a task.

# Evaluate OpenAI, Anthropic, and Llama on the same dataset
- step:
    name: eval-provider
    image: python:3.12
    command:
      - pip install valohai-utils numpy openai anthropic
      - python eval_provider.py
    inputs:
      - name: evalset
        default: dataset://evaluation-prompts/v3
      - name: model
        default: model://{parameter:model}/{parameter:model-version}
    parameters:
      - name: provider
        default: "openai" # openai | anthropic | meta
      - name: model
        default: "gpt" # gpt | claude | llama-3
      - name: model-version
        default: "4-turbo" # 4-turbo | 3-5-sonnet | 70b-instruct

⚙️ Secrets: Store OPENAI_API_KEY, ANTHROPIC_API_KEY, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.
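
To fan the step out across all three providers, you can create the executions programmatically (the same matrix can also be set up as a Task in the UI). A rough sketch against the Valohai REST API, assuming a personal API token in VALOHAI_API_TOKEN; the project and commit identifiers are placeholders, and the exact payload should be checked against the API reference:

import os
import requests

PROJECT_ID = "your-project-id"   # placeholder
COMMIT = "main"                  # placeholder: branch, tag, or commit SHA

MATRIX = [
    {"provider": "openai", "model": "gpt", "model-version": "4-turbo"},
    {"provider": "anthropic", "model": "claude", "model-version": "3-5-sonnet"},
    {"provider": "meta", "model": "llama-3", "model-version": "70b-instruct"},
]

headers = {"Authorization": f"Token {os.environ['VALOHAI_API_TOKEN']}"}
for params in MATRIX:
    resp = requests.post(
        "https://app.valohai.com/api/v0/executions/",
        headers=headers,
        json={"project": PROJECT_ID, "commit": COMMIT, "step": "eval-provider", "parameters": params},
    )
    resp.raise_for_status()
    print("Queued", params["provider"], "as execution", resp.json().get("counter"))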


Provider-agnostic evaluation (pseudo‑Python)

eval_provider.py should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).

import os, time, json
import numpy as np
import valohai  # valohai-utils: parameters, inputs, outputs
import anthropic
from openai import OpenAI
from pathlib import Path

PROVIDER = valohai.parameters("provider").value
# Compose the full model identifier from the two parameters,
# e.g. "gpt" + "4-turbo" -> "gpt-4-turbo"
MODEL = f'{valohai.parameters("model").value}-{valohai.parameters("model-version").value}'

# The evalset input is a dataset with several files; pick prompts.jsonl
PROMPTS_PATH = next(p for p in valohai.inputs("evalset").paths() if p.endswith("prompts.jsonl"))

def call_model(prompt, context=""):
    """Call the selected provider and return (text, latency_seconds, cost_usd)."""
    start = time.time()
    messages = [{"role": "user", "content": f"{context}\n\n{prompt}"}]
    if PROVIDER == "openai":
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        r = client.chat.completions.create(model=MODEL, messages=messages)
        text = r.choices[0].message.content
        usage = r.usage  # prompt_tokens / completion_tokens
        cost = 0.0  # compute from usage if you have a pricing table
    elif PROVIDER == "anthropic":
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        r = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
        text = r.content[0].text
        usage = r.usage  # input_tokens / output_tokens
        cost = 0.0
    else:  # meta / llama: assumes an OpenAI-compatible endpoint at LLAMA_BASE_URL (vLLM, a hosted API, etc.)
        client = OpenAI(base_url=os.environ["LLAMA_BASE_URL"], api_key=os.getenv("LLAMA_API_KEY", "none"))
        r = client.chat.completions.create(model=MODEL, messages=messages)
        text = r.choices[0].message.content
        usage = r.usage
        cost = 0.0
    latency = time.time() - start
    return text, latency, cost

def score(pred, ref):
    # Example: exact match; swap in a semantic or task-specific scorer
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

prompts = [json.loads(line) for line in open(PROMPTS_PATH)]
scores, latencies, costs = [], [], []
for ex in prompts:
    pred, lat, cst = call_model(ex["prompt"], ex.get("context", ""))
    s = score(pred, ex.get("gold", ""))  # or join gold answers from gold_answers.jsonl by id
    scores.append(s); latencies.append(lat); costs.append(cst)

metrics = {
    "provider": PROVIDER,
    "model": MODEL,
    "quality": float(np.mean(scores)),
    "p95_latency": float(np.percentile(latencies, 95)),
    "avg_cost_usd": float(np.mean(costs)),
}

# Print a JSON line so Valohai picks these up as execution metrics
print(json.dumps(metrics))

# Save raw outputs for auditing
Path("/valohai/outputs/results.json").write_text(json.dumps({
    "provider": PROVIDER, "model": MODEL,
    "examples": len(prompts), "metrics": metrics,
}))
print("Logged:", metrics)

Replace score() with your evaluation of choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, task-specific accuracy). Log response_variance by repeating prompts N times and averaging per-prompt variance.
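
A rough sketch of that variance measurement, reusing call_model(), score(), prompts, and metrics from the script above; the repeat count and metric name are illustrative:

import json
import numpy as np

N_REPEATS = 3  # how many times to re-ask each prompt

def response_variance(ex):
    """Repeat one prompt N times and return the variance of its scores."""
    repeat_scores = []
    for _ in range(N_REPEATS):
        pred, _, _ = call_model(ex["prompt"], ex.get("context", ""))
        repeat_scores.append(score(pred, ex.get("gold", "")))
    return float(np.var(repeat_scores))

# Average per-prompt variance across the whole evaluation set
metrics["response_variance"] = float(np.mean([response_variance(ex) for ex in prompts]))
print(json.dumps(metrics))  # re-log so Valohai records the extra metric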


Building the leaderboard

Use the Model Catalog to compare runs. Or write a small aggregator step that pulls metrics from the three executions and writes a table.
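
A minimal sketch of such an aggregator, assuming the three results.json files are wired into a single input named results; the input name and CSV layout are illustrative:

import csv
import json
import valohai  # valohai-utils

# Each provider execution wrote a results.json; they arrive here as one input with several files.
rows = []
for path in valohai.inputs("results").paths():
    with open(path) as f:
        result = json.load(f)
    rows.append(result["metrics"])

# Best quality first, then write a CSV leaderboard for stakeholders.
rows.sort(key=lambda r: r["quality"], reverse=True)
with open("/valohai/outputs/leaderboard.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["provider", "model", "quality", "p95_latency", "avg_cost_usd"])
    writer.writeheader()
    writer.writerows(rows)

print(json.dumps({"best_provider": rows[0]["provider"], "best_model": rows[0]["model"]}))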

Example metrics you might track:

| Provider  | Model                | Quality (0–1) | p95 Latency (s) | Avg Cost (USD) | Decision   |
| --------- | -------------------- | ------------- | --------------- | -------------- | ---------- |
| OpenAI    | gpt-4-turbo          | 0.86          | 1.40            | 0.0042         |            |
| Anthropic | claude-3-5-sonnet    | 0.88          | 1.25            | 0.0038         |            |
| Meta      | llama-3-70b-instruct | 0.80          | 1.05            | 0.0011         | 🔄 re-test |
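
How the Decision column gets filled is up to your team; one option is a simple weighted ranking over the leaderboard before a human signs off. The weights below are illustrative, and the numbers are the example rows above:

# Illustrative decision heuristic: reward quality, penalize latency and cost.
WEIGHTS = {"quality": 1.0, "p95_latency": -0.1, "avg_cost_usd": -50.0}

def overall(row):
    return sum(weight * row[key] for key, weight in WEIGHTS.items())

candidates = [
    {"provider": "OpenAI", "model": "gpt-4-turbo", "quality": 0.86, "p95_latency": 1.40, "avg_cost_usd": 0.0042},
    {"provider": "Anthropic", "model": "claude-3-5-sonnet", "quality": 0.88, "p95_latency": 1.25, "avg_cost_usd": 0.0038},
    {"provider": "Meta", "model": "llama-3-70b-instruct", "quality": 0.80, "p95_latency": 1.05, "avg_cost_usd": 0.0011},
]
winner = max(candidates, key=overall)
print("Proposed winner (pending human approval):", winner["provider"], winner["model"])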

Automate and iterate

  • Re-run the leaderboard weekly or when a dataset changes (webhook/trigger).

  • Add human approval before adopting a new winner.

  • Track response_variance alongside quality to avoid unstable models.

  • Export a CSV/Markdown report at the end of each run for stakeholders.
