In modern GenAI, evaluation is step 0. Rather than fine-tuning first, compare out-of-the-box models (OpenAI, Anthropic, Meta Llama) on the same versioned evaluation dataset and pick the best fit. Keep the comparison running continuously as providers evolve.
Why run multiple providers
Quality varies by task: summarization ≠ extraction ≠ reasoning.
Vendors evolve quickly: today’s top pick may lag tomorrow.
Store the evaluation dataset as a Valohai dataset. When it changes, create a new version (e.g., v4, v5) so you can easily reproduce and compare results across versions.
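As a sketch of how an execution can publish such a version, the snippet below writes a prompts file plus a sidecar metadata file that tags it into the dataset. The dataset name (eval-prompts), version label (v5), and file layout are illustrative assumptions based on Valohai's dataset-version convention; verify the sidecar key against your Valohai setup.

```python
"""Tag refreshed evaluation prompts into a new Valohai dataset version (sketch)."""
import json

# Example content only -- replace with your real evaluation prompts.
new_prompts = [
    {"prompt": "Summarize the attached support ticket in two sentences."},
    {"prompt": "Extract the invoice number and total amount as JSON."},
]

# Anything written under /valohai/outputs/ is stored as an execution output.
out_file = "/valohai/outputs/prompts.jsonl"
with open(out_file, "w") as f:
    for item in new_prompts:
        f.write(json.dumps(item) + "\n")

# The sidecar *.metadata.json file asks Valohai to attach this output
# to a new version of the eval-prompts dataset.
with open(out_file + ".metadata.json", "w") as f:
    json.dump({"valohai.dataset-versions": ["dataset://eval-prompts/v5"]}, f)
```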
Provider-matrix pipeline (YAML)
Run the same step against different providers and models by passing parameters; you can parallelize the provider/model combinations with a Valohai task.
⚙️ Secrets: Store OPENAI_API_KEY, ANTHROPIC_API_KEY, and any Meta/Ollama tokens as Valohai secrets. They are injected at runtime, never committed to code or YAML.
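A minimal valohai.yaml sketch of such a step; the step name, Docker image, script name, parameter defaults, and the eval-prompts dataset URI are illustrative assumptions, not required names:

```yaml
- step:
    name: evaluate-provider
    image: python:3.11
    command:
      - pip install openai anthropic
      # {parameters} expands to --provider=... --model=... on the command line
      - python eval_provider.py {parameters}
    parameters:
      - name: provider
        type: string
        default: openai
      - name: model
        type: string
        default: gpt-4-turbo
    inputs:
      # Pinned, versioned evaluation dataset (v4, v5, ...)
      - name: eval-dataset
        default: dataset://eval-prompts/v4
```

Create a task over this step with the provider/model pairs you want to compare (e.g., openai/gpt-4-turbo, anthropic/claude-3-5-sonnet, meta/llama-3-70b-instruct) and the executions run in parallel against the same dataset version. The API keys themselves come from the project-level secret environment variables noted above, so nothing sensitive appears in the YAML.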
Provider-agnostic evaluation (pseudo‑Python)
eval_provider.py should load prompts, call the provider, score outputs, and log metrics (quality, latency, cost).
Replace score() with your evaluation of choice (BLEU/ROUGE/BERTScore, GPT-as-a-judge, or task-specific accuracy). Log response_variance by repeating each prompt N times and averaging the per-prompt variance.
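A runnable sketch along those lines, assuming the OpenAI and Anthropic Python SDKs, a JSONL prompts file mounted at the eval-dataset input, and a placeholder score(); cost is omitted here because it depends on token usage and your provider's pricing:

```python
"""eval_provider.py -- sketch: load prompts, call the provider, score, log metrics."""
import argparse
import json
import os
import statistics
import time

import anthropic                   # pip install anthropic
from openai import OpenAI          # pip install openai


def call_provider(provider: str, model: str, prompt: str) -> str:
    """Send one prompt to the chosen provider and return the raw text answer."""
    if provider == "openai":
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        resp = client.messages.create(
            model=model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    # Add a branch here for your Llama deployment (hosted API or self-served endpoint).
    raise ValueError(f"Unknown provider: {provider}")


def score(prompt: str, output: str) -> float:
    """Placeholder -- swap in BLEU/ROUGE/BERTScore, GPT-as-a-judge, or task accuracy."""
    return float(bool(output.strip()))


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--provider", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompts", default="/valohai/inputs/eval-dataset/prompts.jsonl")
    parser.add_argument("--repeats", type=int, default=3)
    args = parser.parse_args()

    with open(args.prompts) as f:
        prompts = [json.loads(line)["prompt"] for line in f]

    qualities, latencies, variances = [], [], []
    for prompt in prompts:
        per_prompt_scores = []
        for _ in range(args.repeats):            # repeat to estimate stability
            start = time.time()
            output = call_provider(args.provider, args.model, prompt)
            latencies.append(time.time() - start)
            per_prompt_scores.append(score(prompt, output))
        qualities.append(statistics.mean(per_prompt_scores))
        variances.append(statistics.pvariance(per_prompt_scores))

    # Printing a JSON object to stdout is how Valohai records execution metrics.
    print(json.dumps({
        "provider": args.provider,
        "model": args.model,
        "quality": statistics.mean(qualities),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "response_variance": statistics.mean(variances),
    }))


if __name__ == "__main__":
    main()
```

The printed JSON line is picked up as execution metadata, which is what the leaderboard below compares.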
Building the leaderboard
Use the Model Catalog to compare runs, or write a small aggregator step that pulls metrics from the three executions and writes a table (a sketch follows the example below).
Example metrics you might track:
| Provider  | Model                | Quality (0–1) | p95 Latency (s) | Avg Cost (USD) | Decision   |
|-----------|----------------------|---------------|-----------------|----------------|------------|
| OpenAI    | gpt-4-turbo          | 0.86          | 1.40            | 0.0042         | ✅         |
| Anthropic | claude-3-5-sonnet    | 0.88          | 1.25            | 0.0038         | ✅         |
| Meta      | llama-3-70b-instruct | 0.80          | 1.05            | 0.0011         | 🔄 re-test |
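The aggregator step itself can stay small. This sketch assumes each evaluation execution exported its metrics JSON into a shared metrics input, and it writes a Markdown leaderboard to the outputs directory; the paths and field names are assumptions matching the earlier sketch:

```python
"""aggregate_leaderboard.py -- minimal sketch of the aggregator step."""
import glob
import json
import os

METRICS_DIR = "/valohai/inputs/metrics"          # assumed input mount
OUT_PATH = "/valohai/outputs/leaderboard.md"     # Valohai stores files written here

rows = []
for path in glob.glob(os.path.join(METRICS_DIR, "*.json")):
    with open(path) as f:
        rows.append(json.load(f))

# Sort best quality first so the leaderboard reads top-down.
rows.sort(key=lambda r: r.get("quality", 0.0), reverse=True)

with open(OUT_PATH, "w") as out:
    out.write("| Provider | Model | Quality | p95 Latency (s) | Response variance |\n")
    out.write("|---|---|---|---|---|\n")
    for r in rows:
        out.write(
            f"| {r['provider']} | {r['model']} | {r['quality']:.2f} "
            f"| {r['p95_latency_s']:.2f} | {r['response_variance']:.4f} |\n"
        )

print(f"Wrote leaderboard with {len(rows)} rows to {OUT_PATH}")
```

The resulting file can double as the stakeholder report mentioned in the next section.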
Automate and iterate
Re-run the leaderboard weekly or whenever the evaluation dataset changes (via a webhook or trigger).
Add human approval before adopting a new winner.
Track response_variance alongside quality to avoid unstable models.
Export a CSV/Markdown report at the end of each run for stakeholders.