LLM Evaluations (Local Execution)
Evaluate large language models directly from your local environment and analyze the results in Valohai LLM without running your code inside the Valohai platform.
This tool is designed for teams who want:
Structured multi-model evaluation
Clear comparison views across models and datasets
Custom business-relevant metrics
Lightweight local execution
A seamless path to scaling with Valohai
Overview
The video demonstrates a full example. The documentation below explains how the tool works in general terms so you can adapt it to your own use case.
Why Valohai LLM Evaluations?
Evaluating LLMs is often messy.
Scripts live in notebooks
Metrics are scattered across spreadsheets
Comparing models requires manual aggregation
Decisions rely on intuition instead of structured data
Valohai LLM Evaluations provides structure without forcing you into heavy infrastructure.
You keep your local workflow. Valohai adds organization and comparison.
How It Works
Valohai LLM Evaluations consists of two simple components:
The valohai-llm Python library, which handles local execution and metric reporting
The Valohai LLM web interface, which handles task configuration and structured comparison
Your evaluation logic runs locally. Results are streamed to Valohai in real time.
You stay in control of:
Which models are evaluated
Which prompts or datasets are used
How metrics are computed
How results are interpreted
The result is a lightweight but structured evaluation workflow that bridges experimentation and production decision-making.
Access the Valohai LLM Interface
Go to: https://llm.valohai.com
Log in using your Google or GitHub account.
After logging in, you can create datasets, configure tasks and analyze results.
Define an Evaluation Dataset
An evaluation dataset represents your benchmark.
The format is flexible - Valohai does not enforce a strict schema.
Typical structure:
Each row becomes one evaluation unit.
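For example, a question-answering benchmark could be stored as JSON Lines, one row per evaluation unit. The field names below are purely illustrative; Valohai does not require any of them:

```json
{"prompt": "What is the capital of France?", "expected": "Paris", "category": "geography"}
{"prompt": "Who wrote Hamlet?", "expected": "William Shakespeare", "category": "literature"}
```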
💡 Your evaluation function defines which fields are required.
Datasets are stored in S3 (Valohai-managed or your own storage).
Configure a Task
A task defines what to test.
It specifies:
Parameters (e.g., model names, temperatures, prompts)
The dataset to evaluate against
Example parameter:
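As an illustration, a task could define a model parameter with several values to compare. The exact configuration format is set in the Valohai LLM UI; the values below are hypothetical:

```json
{"model": ["gpt-4o-mini", "claude-3-haiku", "my-self-hosted-model"]}
```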
Think of a task as a contract between the UI configuration and your local evaluation script.
If you define a parameter called model, your script must expect it.
Installation and Authentication
Install the Python library with pip, or using uv:
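Assuming the package is published under the library's name, valohai-llm (check the Valohai LLM UI for the exact install command):

```shell
pip install valohai-llm
# or, with uv:
uv pip install valohai-llm
```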
Generate an API key from the Valohai LLM interface and set it as an environment variable:
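For example (the variable name here is an assumption; use whatever name the Valohai LLM interface tells you to set):

```shell
# Hypothetical variable name; copy the exact one shown next to your generated key.
export VALOHAI_LLM_API_KEY="paste-your-key-here"
```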
This key allows your local script to post evaluation results to Valohai.
Implement the Evaluation Logic
Your script is responsible for:
Running inference
Comparing outputs to ground truth
Computing metrics
Returning metrics as a dictionary
Minimal metric reporting example:
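A minimal sketch of the evaluation function, assuming the pattern described above: run inference, compare against ground truth, and return metrics as a dictionary. Here call_model is a placeholder for your real inference call, and the metric names are illustrative:

```python
import time

def call_model(model: str, prompt: str) -> str:
    """Stand-in for your real inference call (OpenAI, Anthropic, self-hosted, ...)."""
    return "Paris"  # placeholder response

def evaluate(row: dict, model: str) -> dict:
    """Run one evaluation unit and return metrics as a plain dict."""
    start = time.perf_counter()
    output = call_model(model, row["prompt"])
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "accuracy": float(output.strip() == row["expected"]),
        "latency_ms": latency_ms,
        "output_tokens": len(output.split()),  # crude stand-in for a real token count
    }

metrics = evaluate(
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    model="my-model",
)
```

The valohai-llm library is then responsible for streaming the returned dictionary to Valohai; see the library's own documentation for the exact reporting call.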
Common metric categories:
Quality: accuracy, relevance, faithfulness
Coverage: completeness
Efficiency: latency_ms
Cost: output_tokens
Valohai does not restrict how metrics are computed.
Running Structured Evaluations with task.run()
For parameterized multi-model evaluations, use task.run().
This pattern:
Fetches task configuration
Iterates over parameter × dataset combinations
Streams results automatically
Example:
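Conceptually, task.run() fetches the task configuration and calls your evaluation function once per parameter × dataset-row combination. The toy Task class below is a local stand-in that illustrates this pattern; it is not the real valohai-llm API, whose signatures may differ:

```python
from itertools import product

class Task:
    """Toy stand-in for the valohai-llm task object; the real API may differ."""
    def __init__(self, parameters: dict, dataset: list):
        self.parameters = parameters  # e.g. {"model": [...]}, fetched from the UI in reality
        self.dataset = dataset        # rows of your evaluation dataset

    def run(self, evaluate):
        """Call evaluate() once per parameter combination x dataset row."""
        results = []
        keys = list(self.parameters)
        for combo in product(*(self.parameters[k] for k in keys)):
            params = dict(zip(keys, combo))
            for row in self.dataset:
                # In the real library, each result is streamed to Valohai as it completes.
                results.append({**params, "metrics": evaluate(row, **params)})
        return results

task = Task(
    parameters={"model": ["model-a", "model-b", "model-c"]},
    dataset=[{"prompt": f"question {i}"} for i in range(20)],
)
results = task.run(lambda row, model: {"accuracy": 1.0})
```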
For example, if you evaluate 3 models against 20 dataset rows, you automatically produce 60 structured evaluation results.
Analyze and Compare
Evaluation is only useful if comparison is structured.
The Results and Compare views help you move from raw metrics to clear decisions.
Start Broad
Use Group by → model to see aggregated metrics like:
Quality
Latency
Token usage or cost
This answers: Which model performs best overall?
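The grouping the UI performs can be sketched in plain Python, assuming you have raw per-row results in hand (the field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Illustrative raw results; in practice these come from your evaluation runs.
results = [
    {"model": "model-a", "accuracy": 0.9, "latency_ms": 120},
    {"model": "model-a", "accuracy": 0.7, "latency_ms": 150},
    {"model": "model-b", "accuracy": 0.8, "latency_ms": 90},
]

# Group by model, mirroring the "Group by -> model" view.
by_model = defaultdict(list)
for row in results:
    by_model[row["model"]].append(row)

summary = {
    model: {
        "accuracy": mean(r["accuracy"] for r in rows),
        "latency_ms": mean(r["latency_ms"] for r in rows),
    }
    for model, rows in by_model.items()
}
```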
Break It Down
Add a second grouping dimension such as:
dataset
category
difficulty
This reveals where a model underperforms, even if its overall average looks strong.

Filter What Matters
Filter specific subsets, for example:
Complex prompts only
A single category
High-latency responses
This helps evaluate tradeoffs in the scenarios that matter to your application.
Compare Side by Side
In the Compare view, select multiple models to inspect:
Aggregated metric tables
Direct side-by-side comparisons
Visual summaries for fast interpretation

Typical Evaluation Patterns
Multi-provider comparison: OpenAI vs Anthropic vs self-hosted
Prompt testing: compare prompt versions
Cost vs quality analysis: balance accuracy against token usage
Routing validation: validate hybrid strategies
Pre-production benchmarking: validate before rollout
Scaling Further with Valohai
The LLM Evaluation tool is intentionally lightweight.
When you need more advanced workflows, use the full Valohai platform for:
Dataset versioning and lineage
Parallel large-scale evaluation
Automated recurring evaluation jobs
Pipelines and orchestration
Model training or fine-tuning
Hybrid and multi-cloud execution
LLM Evaluations → fast local experimentation
Valohai Platform → orchestration, governance, infrastructure