LLM Evaluations (Local Execution)

Evaluate large language models directly from your local environment and analyze the results in Valohai LLM without running your code inside the Valohai platform.

This tool is designed for teams who want:

  • Structured multi-model evaluation

  • Clear comparison views across models and datasets

  • Custom business-relevant metrics

  • Lightweight local execution

  • A seamless path to scaling with Valohai

Overview

The video above walks through a complete worked example. The documentation below explains how the tool works in general terms so you can adapt it to your own use case.

Why Valohai LLM Evaluations?

Evaluating LLMs is often messy.

  • Scripts live in notebooks

  • Metrics are scattered across spreadsheets

  • Comparing models requires manual aggregation

  • Decisions rely on intuition instead of structured data

Valohai LLM Evaluations provides structure without forcing you into heavy infrastructure.

You keep your local workflow. Valohai adds organization and comparison.

How It Works

Valohai LLM Evaluations consists of two simple components:

Component                      Responsibility
valohai-llm Python library     Local execution and metric reporting
Valohai LLM web interface      Task configuration and structured comparison

Your evaluation logic runs locally. Results are streamed to Valohai in real time.

You stay in control of:

  1. Which models are evaluated

  2. Which prompts or datasets are used

  3. How metrics are computed

  4. How results are interpreted

The result is a lightweight but structured evaluation workflow that bridges experimentation and production decision-making.

Access the Valohai LLM Interface

Go to: https://llm.valohai.com

Log in using your Google or GitHub account.

After logging in, you can create datasets, configure tasks and analyze results.

Define an Evaluation Dataset

An evaluation dataset represents your benchmark.

The format is flexible: Valohai does not enforce a strict schema.

Typical structure:
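For example, a question-answering benchmark could be stored as JSONL, one object per row. The field names here are illustrative only, since Valohai does not require any particular schema:

```json
{"id": 1, "question": "What is the capital of France?", "expected": "Paris", "category": "geography"}
{"id": 2, "question": "Who wrote Hamlet?", "expected": "William Shakespeare", "category": "literature"}
```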

Each row becomes one evaluation unit.

💡 Your evaluation function defines which fields are required.

Datasets are stored in S3 (Valohai-managed or your own storage).

Configure a Task

A task defines what to test.

It specifies:

  • Parameters (e.g., model names, temperatures, prompts)

  • The dataset to evaluate against

Example parameter:
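A task might define a model parameter listing the models to compare. The values below are illustrative; use whatever identifiers your evaluation script understands:

```yaml
model: ["gpt-4o-mini", "claude-3-5-haiku", "my-self-hosted-model"]
```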

Think of a task as a contract between the UI configuration and your local evaluation script.

If you define a parameter called model, your script must expect it.

Installation and Authentication

Install the Python library:
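Assuming the package is published under the same name as the library shown above (valohai-llm):

```shell
pip install valohai-llm
```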

Using uv:
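With uv, add the package to your project (package name assumed to be valohai-llm):

```shell
uv add valohai-llm
```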

Generate an API key from the Valohai LLM interface and set it as an environment variable:
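The variable name below is illustrative; use the exact name shown in the interface when you generate the key:

```shell
export VALOHAI_LLM_API_KEY="your-api-key"
```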

This key allows your local script to post evaluation results to Valohai.

Implement the Evaluation Logic

Your script is responsible for:

  1. Running inference

  2. Comparing outputs to ground truth

  3. Computing metrics

  4. Returning metrics as a dictionary

Minimal metric reporting example:
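A minimal sketch in plain Python, assuming an exact-match comparison against a ground-truth "expected" field. The model call and the reporting step are placeholders; how you post the returned dictionary depends on the valohai-llm API:

```python
import time

def evaluate_row(row: dict, model_output: str) -> dict:
    """Compare one model output to ground truth and return metrics as a dict."""
    start = time.perf_counter()
    # ... normally you would run model inference here ...
    latency_ms = (time.perf_counter() - start) * 1000

    # Exact-match quality metric; swap in relevance/faithfulness scoring as needed.
    correct = model_output.strip().lower() == row["expected"].strip().lower()

    return {
        "accuracy": 1.0 if correct else 0.0,
        "latency_ms": latency_ms,
        "output_tokens": len(model_output.split()),  # rough whitespace-token proxy
    }
```

The returned dictionary is what gets reported to Valohai, so every key becomes a comparable metric in the Results view.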

Common metric categories:

Metric Type    Example
Quality        accuracy, relevance, faithfulness
Coverage       completeness
Efficiency     latency_ms
Cost           output_tokens

Valohai does not restrict how metrics are computed.

Running Structured Evaluations with task.run()

For parameterized multi-model evaluations, use task.run().

This pattern:

  • Fetches task configuration

  • Iterates over parameter × dataset combinations

  • Streams results automatically

Example:
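The valohai-llm library drives this loop for you; the plain-Python sketch below only illustrates the parameter × dataset iteration pattern that task.run() performs (all names are illustrative):

```python
def run_evaluations(models, rows, evaluate):
    """Iterate over every parameter x dataset combination and collect results."""
    results = []
    for model in models:        # parameter dimension (from the task configuration)
        for row in rows:        # dataset dimension (one evaluation unit per row)
            metrics = evaluate(model, row)
            results.append({"model": model, "row_id": row["id"], **metrics})
    return results

models = ["model-a", "model-b", "model-c"]
rows = [{"id": i} for i in range(20)]
results = run_evaluations(models, rows, lambda m, r: {"accuracy": 1.0})
print(len(results))  # 3 models x 20 rows = 60 structured results
```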

For example, evaluating 3 models against 20 dataset rows automatically produces 60 structured evaluation results.

Analyze and Compare

Evaluation is only useful if comparison is structured.

The Results and Compare views help you move from raw metrics to clear decisions.

Start Broad

Use Group by → model to see aggregated metrics like:

  • Quality

  • Latency

  • Token usage or cost

This answers: Which model performs best overall?

Break It Down

Add a second grouping dimension such as:

  • dataset

  • category

  • difficulty

This reveals where a model underperforms, even if its overall average looks strong.

Filter What Matters

Filter specific subsets, for example:

  • Complex prompts only

  • A single category

  • High-latency responses

This helps evaluate tradeoffs in the scenarios that matter to your application.

Compare Side by Side

In the Compare view, select multiple models to inspect:

  • Aggregated metric tables

  • Direct side-by-side comparisons

  • Visual summaries for fast interpretation

Typical Evaluation Patterns

Pattern                        Description
Multi-provider comparison      OpenAI vs Anthropic vs self-hosted
Prompt testing                 Compare prompt versions
Cost vs quality analysis       Balance accuracy against token usage
Routing validation             Validate hybrid strategies
Pre-production benchmarking    Validate before rollout

Scaling Further with Valohai

The LLM Evaluation tool is intentionally lightweight.

When you need more advanced workflows, use the full Valohai platform for:

  • Dataset versioning and lineage

  • Parallel large-scale evaluation

  • Automated recurring evaluation jobs

  • Pipelines and orchestration

  • Model training or fine-tuning

  • Hybrid and multi-cloud execution

LLM Evaluations → fast local experimentation
Valohai Platform → orchestration, governance, infrastructure
