# Evaluating Multiple Models and Providers

In modern GenAI, valuation is step 0. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep it running continually as providers evolve.

### Why run multiple providers

* **Quality varies by task:** summarization ≠ extraction ≠ reasoning.
* **Vendors evolve quickly:** today's top pick may lag tomorrow.
* **Cost & latency matter:** measure $/request and p95 latency alongside accuracy.
* **Avoid lock-in:** keep your pipelines provider-agnostic and data-driven.

Valohai turns this into a repeatable pipeline: one dataset, multiple runs, comparable metrics stored in Model Catalog.

### Inputs: evaluation dataset

Use the same dataset version of prompts and gold answers for each provider:

```markdown
evaluation-prompts:v3/
 ├─ prompts.jsonl
 ├─ gold_answers.jsonl
 └─ metadata.yaml
```

Store it as a Valohai dataset. When it changes, create a new version (`v4`, `v5`) so you can easily reproduce and compare across versions.

### Provider-matrix pipeline (YAML)

Run the same step against different providers/models by passing parameters. You can parallelize with a task.

```yaml
# Evaluate OpenAI, Anthropic, and Llama on the same dataset
- step: eval-provider
  image: python:3.12
  command:
    - python eval_provider.py
  inputs:
    - name: evalset
      default: dataset://valuation-prompts/v3
    - name: model
      default: model://{parameter:model-name}/{parameter:model-version}
  parameters:
    - name: provider
      default: "openai" # openai | anthropic | meta
    - name: model
      default: "gpt" # claude | llama-3
    - name: model-version
      default: "4-turbo" # 3-5-sonnet | 70b-instruct
```

> ⚙️ **Secrets:** Store `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.

***

### Provider-agnostic evaluation (pseudo‑Python)

`eval_provider.py` should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).

```python
import os, time, json, numpy as np
import valohai  # Valohai helper
import openai, anthropic
from llamaapi import Llama
from pathlib import Path

PROVIDER = valohai.parameters("provider").value
MODEL = valohai.parameters("model").value
DATASET = valohai.inputs("evalset").path()


def call_model(prompt, context=""):
    start = time.time()
    if PROVIDER == "openai":
        r = openai.ChatCompletion.create(model=MODEL, messages=[{"role": "user", "content": f"{context}\n\n{prompt}"}])
        text = r["choices"][0]["message"]["content"]
        usage = r.get("usage", {})
        cost = 0.0  # compute from usage if you have a pricing table
    elif PROVIDER == "anthropic":
        client = anthropic.Anthropic()
        r = client.messages.create(model=MODEL, messages=[{"role": "user", "content": f"{context}\n\n{prompt}"}])
        text = r.content[0].text
        usage = {}
        cost = 0.0
    else:  # meta / llama
        llama = Llama(api_key=os.getenv("LLAMA_API_KEY", ""))
        r = llama.generate(prompt=f"{context}\n\n{prompt}")
        text = r.get("text", "")
        usage = {}
        cost = 0.0
    latency = time.time() - start
    return text, latency, cost


def score(pred, ref):
    # Example: exact-match or semantic score placeholder
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0


prompts = [json.loads(l) for l in open(DATASET)]
scores, latencies, costs = [], [], []
for ex in prompts:
    pred, lat, cst = call_model(ex["prompt"], ex.get("context", ""))
    s = score(pred, ex.get("gold", ""))
    scores.append(s)
    latencies.append(lat)
    costs.append(cst)

metrics = {
    "provider": PROVIDER,
    "model": MODEL,
    "quality": float(np.mean(scores)),
    "p95_latency": float(np.percentile(latencies, 95)),
    "avg_cost_usd": float(np.mean(costs)),
}

print(json.dumps({metrics}))

# Save raw outputs for auditing
Path("/valohai/outputs/results.json").write_text(
    json.dumps(
        {
            "provider": PROVIDER,
            "model": MODEL,
            "examples": len(prompts),
            "metrics": metrics,
        },
    ),
)
print("Logged:", metrics)
```

> Replace `score()` with your evaluation of choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, task-specific accuracy). Log response\_variance by repeating prompts N times and averaging per-prompt variance.

***

### Building the leaderboard

Use the Model Catalog to compare runs. Or write a small aggregator step that pulls metrics from the three executions and writes a table.

Example metrics you might track:

| Provider  | Model                | Quality (0–1) | p95 Latency (s) | Avg Cost (USD) | Decision   |
| --------- | -------------------- | ------------- | --------------- | -------------- | ---------- |
| OpenAI    | gpt-4-turbo          | 0.86          | 1.40            | 0.0042         | ✅          |
| Anthropic | claude-3-5-sonnet    | 0.88          | 1.25            | 0.0038         | ✅          |
| Meta      | llama-3-70b-instruct | 0.80          | 1.05            | 0.0011         | 🔄 re‑test |

### Automate and iterate

* Re-run the leaderboard weekly or when a dataset changes (webhook/trigger).
* Add **human approval** before adopting a new winner.
* Track **response\_variance** alongside quality to avoid unstable models.
* Export a CSV/Markdown report at the end of each run for stakeholders.

### Next steps

* [Evaluating and Validating GenAI Applications](/genai/evaluate-and-validate-genai.md)
* [RAG and Context Pipelines in Valohai](/genai/rag-context-pipelines.md)
* [Finetuning LLMs in Valohai](/genai/finetune-llms.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.valohai.com/genai/evaluate-multiple-models-and-providers.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
