LLM Evaluations (Local Execution)

Evaluate large language models directly from your local environment and analyze the results in Valohai LLM without running your code inside the Valohai platform.

This tool is designed for teams who want:

  • Structured multi-model evaluation

  • Clear comparison views across models and datasets

  • Custom business-relevant metrics

  • Lightweight local execution

  • A seamless path to scaling with Valohai

Overview

The video above walks through a complete worked example. The documentation below explains how the tool works in general terms so you can adapt it to your own use case.

Why Valohai LLM Evaluations?

Evaluating LLMs is often messy.

  • Scripts live in notebooks

  • Metrics are scattered across spreadsheets

  • Comparing models requires manual aggregation

  • Decisions rely on intuition instead of structured data

Valohai LLM Evaluations provides structure without forcing you into heavy infrastructure.

You keep your local workflow. Valohai adds organization and comparison.

How It Works

Valohai LLM Evaluations consists of two simple components:

Component                      Responsibility
valohai-llm Python library     Local execution and metric reporting
Valohai LLM web interface      Task configuration and structured comparison

Your evaluation logic runs locally. Results are streamed to Valohai in real time.

You stay in control of:

  1. Which models are evaluated

  2. Which prompts or datasets are used

  3. How metrics are computed

  4. How results are interpreted

The result is a lightweight but structured evaluation workflow that bridges experimentation and production decision-making.

Access the Valohai LLM Interface

Go to: https://llm.valohai.com

Log in using your Google or GitHub account.

After logging in, you can create datasets, configure tasks and analyze results.

Define an Evaluation Dataset

An evaluation dataset represents your benchmark.

The format is flexible: Valohai does not enforce a strict schema.

Typical structure:
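For example, a question-answering benchmark could be stored as JSONL, one object per row. The field names here are illustrative only, since Valohai does not require any particular schema:

```json
{"id": 1, "question": "What is the capital of France?", "expected": "Paris", "category": "geography"}
{"id": 2, "question": "Who wrote Hamlet?", "expected": "William Shakespeare", "category": "literature"}
```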

Each row becomes one evaluation unit.

💡 Your evaluation function defines which fields are required.

Datasets are stored in S3 (Valohai-managed or your own storage).

Configure a Task

A task defines what to test.

It specifies:

  • Parameters (e.g., model names, temperatures, prompts)

  • The dataset to evaluate against

Example parameter:
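A task might define a model parameter listing the models to compare. The values below are illustrative; use whatever identifiers your evaluation script understands:

```yaml
model: ["gpt-4o-mini", "claude-3-5-haiku", "my-self-hosted-model"]
```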

Think of a task as a contract between the UI configuration and your local evaluation script.

If you define a parameter called model, your script must expect it.

Installation and Authentication

Install the Python library:
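Assuming the package is published under the same name as the library shown above (valohai-llm):

```shell
pip install valohai-llm
```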

Using uv:
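With uv, add the package to your project (package name assumed to be valohai-llm):

```shell
uv add valohai-llm
```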

Generate an API key from the Valohai LLM interface and set it as an environment variable:
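The variable name below is illustrative; use the exact name shown in the interface when you generate the key:

```shell
export VALOHAI_LLM_API_KEY="your-api-key"
```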

This key allows your local script to post evaluation results to Valohai.

Implement the Evaluation Logic

Your script is responsible for:

  1. Running inference

  2. Comparing outputs to ground truth

  3. Computing metrics

  4. Returning metrics as a dictionary

Minimal metric reporting example:
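A minimal sketch in plain Python, assuming an exact-match comparison against a ground-truth "expected" field. The model call and the reporting step are placeholders; how you post the returned dictionary depends on the valohai-llm API:

```python
import time

def evaluate_row(row: dict, model_output: str) -> dict:
    """Compare one model output to ground truth and return metrics as a dict."""
    start = time.perf_counter()
    # ... normally you would run model inference here ...
    latency_ms = (time.perf_counter() - start) * 1000

    # Exact-match quality metric; swap in relevance/faithfulness scoring as needed.
    correct = model_output.strip().lower() == row["expected"].strip().lower()

    return {
        "accuracy": 1.0 if correct else 0.0,
        "latency_ms": latency_ms,
        "output_tokens": len(model_output.split()),  # rough whitespace-token proxy
    }
```

The returned dictionary is what gets reported to Valohai, so every key becomes a comparable metric in the Results view.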

Common metric categories:

Metric Type    Example
Quality        accuracy, relevance, faithfulness
Coverage       completeness
Efficiency     latency_ms
Cost           output_tokens

Valohai does not restrict how metrics are computed.

Running Structured Evaluations with task.run()

For parameterized multi-model evaluations, use task.run().

This pattern:

  • Fetches task configuration

  • Iterates over parameter × dataset combinations

  • Streams results automatically

Example:
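The valohai-llm library drives this loop for you; the plain-Python sketch below only illustrates the parameter × dataset iteration pattern that task.run() performs (all names are illustrative):

```python
def run_evaluations(models, rows, evaluate):
    """Iterate over every parameter x dataset combination and collect results."""
    results = []
    for model in models:        # parameter dimension (from the task configuration)
        for row in rows:        # dataset dimension (one evaluation unit per row)
            metrics = evaluate(model, row)
            results.append({"model": model, "row_id": row["id"], **metrics})
    return results

models = ["model-a", "model-b", "model-c"]
rows = [{"id": i} for i in range(20)]
results = run_evaluations(models, rows, lambda m, r: {"accuracy": 1.0})
print(len(results))  # 3 models x 20 rows = 60 structured results
```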

For example, evaluating 3 models against 20 dataset rows automatically produces 60 structured evaluation results.

Analyze and Compare

Evaluation is only useful if comparison is structured.

The Results and Compare views help you move from raw metrics to clear decisions.

Start Broad

Use Group by → model to see aggregated metrics like:

  • Quality

  • Latency

  • Token usage or cost

This answers: Which model performs best overall?

Break It Down

Add a second grouping dimension such as:

  • dataset

  • category

  • difficulty

This reveals where a model underperforms, even if its overall average looks strong.

Filter What Matters

Filter specific subsets, for example:

  • Complex prompts only

  • A single category

  • High-latency responses

This helps evaluate tradeoffs in the scenarios that matter to your application.

Compare Side by Side

In the Compare view, select multiple models to inspect:

  • Aggregated metric tables

  • Direct side-by-side comparisons

  • Visual summaries for fast interpretation

Typical Evaluation Patterns

Pattern                        Description
Multi-provider comparison      OpenAI vs Anthropic vs self-hosted
Prompt testing                 Compare prompt versions
Cost vs quality analysis       Balance accuracy against token usage
Routing validation             Validate hybrid strategies
Pre-production benchmarking    Validate before rollout

Scaling Further with Valohai

The LLM Evaluation tool is intentionally lightweight.

When you need more advanced workflows, use the full Valohai platform for:

  • Dataset versioning and lineage

  • Parallel large-scale evaluation

  • Automated recurring evaluation jobs

  • Pipelines and orchestration

  • Model training or fine-tuning

  • Hybrid and multi-cloud execution

LLM Evaluations → fast local experimentation
Valohai Platform → orchestration, governance, infrastructure
