Evaluating and Validating GenAI Applications

In an AI Factory, evaluation is the control system. GenAI models change quickly, and their quality can’t be captured by a single metric. Valohai turns evaluation into a reproducible workflow: versioned datasets → automated metrics → human approval.

In modern GenAI systems, evaluation comes before and after model selection. Many teams compare multiple model providers (OpenAI, Anthropic, Meta Llama) on a fixed evaluation dataset before deciding whether fine-tuning is even necessary.

Define Evaluation Datasets

Every benchmark (prompt + expected answer set) should live as a Valohai dataset.

Each dataset version represents one evaluation suite.

dataset://evaluation-prompts/v3
├── prompts.jsonl
├── gold_answers.jsonl
└── metadata.yaml

When updating the benchmark:

  1. Create a new version based on the previous one.

  2. Add or remove only the files that changed.

  3. Publish the new version.

In Valohai, you update only the files that changed instead of duplicating the whole dataset. This keeps a precise lineage of each file’s origin and usage, making it easy to trace dependencies (e.g., which models used a specific gold standard).
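When the update itself runs as a Valohai execution, the new version can also be published from code. Below is a minimal sketch, assuming the `*.metadata.json` sidecar convention with the `valohai.dataset-versions` key and the `VH_OUTPUTS_DIR` environment variable; the `v4` version name and the revised record are hypothetical.

```python
import json
import os
from pathlib import Path

# Valohai collects anything written under the outputs directory after the execution.
OUTPUTS = Path(os.getenv("VH_OUTPUTS_DIR", "/valohai/outputs"))

NEW_VERSION = "dataset://evaluation-prompts/v4"  # hypothetical next version name

# Example revised record; in practice this comes from your curation step.
revised = [{"id": "q-001", "gold_answer": "Refunds are available within 30 days of purchase."}]

# Save only the file that changed (step 2 above).
path = OUTPUTS / "gold_answers.jsonl"
path.write_text("".join(json.dumps(r) + "\n" for r in revised))

# A sidecar metadata file assigns the output file to the new dataset version.
sidecar = OUTPUTS / "gold_answers.jsonl.metadata.json"
sidecar.write_text(json.dumps({"valohai.dataset-versions": [NEW_VERSION]}))
```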

💡 You can export evaluation data from a database, but Valohai datasets are recommended for versioning and reproducibility.

Recommended structure

| File | Purpose |
| --- | --- |
| prompts.jsonl | Evaluation prompts |
| gold_answers.jsonl | Reference or “gold” answers |
| metadata.yaml | Optional context (metrics, reviewer, notes) |
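For illustration, one possible record layout, matched by an `id` field (the field names here are an assumption, not a required schema):

prompts.jsonl:

```
{"id": "q-001", "prompt": "Summarize the refund policy in one sentence."}
```

gold_answers.jsonl:

```
{"id": "q-001", "gold_answer": "Refunds are available within 30 days of purchase."}
```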

Each new dataset version forms a frozen, reproducible benchmark.

Build an Evaluation Step

Use a Valohai step to run automated metrics for a model version.
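A minimal step definition in valohai.yaml could look like the following; the step name, image, and input name are illustrative.

```yaml
- step:
    name: evaluate-llm
    image: python:3.11
    command: python evaluate.py
    inputs:
      - name: evaluation-dataset
        default: dataset://evaluation-prompts/v3
```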

Your evaluate.py script should:

  1. Load the dataset files.

  2. Generate model responses.

  3. Compute metrics.

  4. Log them as Valohai execution metadata (for example, by printing JSON to stdout), as in the sketch below.

In GenAI, a “baseline” is a suite of metrics, not just one number.
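A minimal sketch of evaluate.py along these lines, assuming the evaluation-dataset input name from the step definition above, a placeholder generate() in place of your model or provider SDK, and an illustrative exact-match metric; printing JSON objects to stdout is how the values become Valohai metadata:

```python
import json
from pathlib import Path

# Files from the "evaluation-dataset" input defined in valohai.yaml.
INPUTS = Path("/valohai/inputs/evaluation-dataset")

def generate(prompt: str) -> str:
    # Placeholder: call your model or provider SDK here.
    return "model response for: " + prompt

def exact_match(response: str, gold: str) -> float:
    return 1.0 if response.strip().lower() == gold.strip().lower() else 0.0

prompts = [json.loads(line) for line in (INPUTS / "prompts.jsonl").read_text().splitlines() if line]
golds = {rec["id"]: rec["gold_answer"]
         for rec in (json.loads(line) for line in (INPUTS / "gold_answers.jsonl").read_text().splitlines() if line)}

scores = [exact_match(generate(p["prompt"]), golds[p["id"]]) for p in prompts]

# Each JSON line printed to stdout is collected as execution metadata.
print(json.dumps({"exact_match": sum(scores) / len(scores), "n_prompts": len(scores)}))
```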

Attach and Compare Metrics

All metrics logged in executions are visible in both the Execution view and the Model Catalog entry.

Use consistent field names (bleu, factuality, response_variance, etc.) so models can be compared automatically.

The Model Catalog keeps a historical record of:

  • Metrics per model version

  • Datasets and code used

  • Pipelines with human approvals (if present)

Handle Non-Determinism

Large language models produce stochastic outputs. Measure stability by running the same prompts multiple times.

Aggregate results in your script and log a response_variance metric:
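A sketch with made-up numbers; in practice each score list would come from one repeated run, for example one execution per run in a Valohai Task:

```python
import json
import statistics

# Per-prompt scores from repeating the same prompt set in several runs.
run_scores = [
    [0.82, 0.75, 0.91],  # run 1
    [0.80, 0.78, 0.88],  # run 2
    [0.85, 0.74, 0.90],  # run 3
]

# Variance of the per-run mean score: lower means more consistent behavior.
per_run_means = [statistics.mean(scores) for scores in run_scores]
response_variance = statistics.pvariance(per_run_means)

print(json.dumps({"response_variance": response_variance}))
```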

This quantifies how consistent the model is across runs or temperatures.

Add Human Approval

Add a pause for human approval after automated evaluation:
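One way to express this in valohai.yaml is a pipeline action on the evaluation node. The sketch below assumes the evaluate-llm step from earlier plus a hypothetical promote-model step downstream:

```yaml
- pipeline:
    name: evaluate-and-promote
    nodes:
      - name: evaluate
        type: execution
        step: evaluate-llm
        actions:
          # Pause the pipeline here until someone approves or rejects in the UI.
          - when: node-complete
            then: require-approval
      - name: promote
        type: execution
        step: promote-model
    edges:
      - [evaluate.output.*, promote.input.evaluation-results]
```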

Use this gate when a subjective check is required:

  1. Review generated outputs externally (e.g., side-by-side A/B view).

  2. Optionally record qualitative ratings in the Model Catalog card.

  3. Approve or reject to continue the pipeline.

The approval step acts as a lightweight, auditable “human-in-the-loop” checkpoint, similar to RLHF oversight.

Compare Models Side-by-Side

In Model Catalog → Compare Models:

  • Select two or more versions.

  • Inspect metrics and metadata differences.

  • Verify the new version meets or exceeds baseline thresholds.

  • Promote the model when satisfied.

GenAI Considerations

| Topic | Recommendation |
| --- | --- |
| Prompt–response pairs | Store as JSONL files within datasets. |
| Gold answers | Keep inline or in a separate file; version them. |
| Human evaluation | Use the approval gate and add qualitative notes to the Model Catalog. |
| Consistency checks | Launch Tasks with multiple executions; log variance. |
| Baseline tracking | Compare suites of metrics, not single scores. |
