Evaluating and Validating GenAI Applications

Learn how to evaluate and validate GenAI applications using Valohai datasets, executions, and the Model Catalog.

In an AI Factory, evaluation is the control system. GenAI models change quickly, and their quality can’t be captured by a single metric. Valohai turns evaluation into a reproducible workflow: versioned datasets → automated metrics → human approval.

In modern GenAI systems, evaluation comes before and after model selection. Many teams compare multiple model providers (OpenAI, Anthropic, Meta Llama) on a fixed evaluation dataset before deciding whether fine-tuning is even necessary.

Define Evaluation Datasets

Every benchmark (prompt + expected answer set) should live as a Valohai dataset.

Each dataset version represents one evaluation suite.

dataset://evaluation-prompts/v3
├── prompts.jsonl
├── gold_answers.jsonl
└── metadata.yaml
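
Each line in prompts.jsonl and gold_answers.jsonl is typically a small JSON record. Below is a minimal sketch of writing such records; the id, prompt, and answer field names are only illustrative, not a schema Valohai requires.

import json

# Hypothetical record layout; pick field names that fit your benchmark and keep them stable.
prompt_record = {"id": "q-001", "prompt": "Summarize the following support ticket: ..."}
gold_record = {"id": "q-001", "answer": "The customer reports a duplicate charge on their latest invoice."}

with open("prompts.jsonl", "a") as prompts_file:
    prompts_file.write(json.dumps(prompt_record) + "\n")

with open("gold_answers.jsonl", "a") as answers_file:
    answers_file.write(json.dumps(gold_record) + "\n")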

When updating the benchmark:

  1. Create a new version based on the previous one.

  2. Add or remove only the files that changed.

  3. Publish the new version.

In Valohai, you only update what changed instead of duplicating the whole dataset. This keeps a precise lineage of each file’s origin and usage, making it easy to trace dependencies (e.g., which models were evaluated against a specific gold standard).
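
If you prepare the updated files inside a Valohai execution, you can attach a changed output to a new dataset version by saving a *.metadata.json sidecar next to it. The sketch below assumes the valohai.dataset-versions sidecar key and a hypothetical v4 version; check the dataset documentation for the exact sidecar format and for how a new version inherits files from the previous one.

import json
from pathlib import Path

OUTPUTS = Path("/valohai/outputs")

# Hypothetical new benchmark records to add in this update.
new_prompts = [{"id": "q-101", "prompt": "Draft a polite refund rejection email."}]

# Write the changed file as an execution output.
with (OUTPUTS / "prompts.jsonl").open("w") as f:
    for record in new_prompts:
        f.write(json.dumps(record) + "\n")

# Sidecar file that asks Valohai to attach this output to a dataset version.
sidecar = {"valohai.dataset-versions": ["dataset://evaluation-prompts/v4"]}
(OUTPUTS / "prompts.jsonl.metadata.json").write_text(json.dumps(sidecar))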

💡 You can export evaluation data from a database, but Valohai datasets are recommended for versioning and reproducibility.

Recommended structure

File                  Purpose
prompts.jsonl         Evaluation prompts
gold_answers.jsonl    Reference or “gold” answers
metadata.yaml         Optional context (metrics, reviewer, notes)

Each new dataset version forms a frozen, reproducible benchmark.

Build an Evaluation Step

Use a Valohai step to run automated metrics for a model version.

- step: evaluate-genai
  image: valohai/python:3.10
  command:
    - python evaluate.py
  inputs:
    - name: model
      default: model://model-name/v3
    - name: evaluation_dataset
      default: dataset://evaluation-prompts/v3

Your evaluate.py script should:

  1. Load the dataset files.

  2. Generate model responses.

  3. Compute metrics.

  4. Log them as Valohai metadata, for example:

import json

print(json.dumps({
    "bleu": 0.41,
    "bert_score": 0.89,
    "factuality": 0.83,
}))

# or using valohai-utils
import valohai

with valohai.metadata.logger() as logger:
    logger.log("bleu", 0.41)
    logger.log("bert_score", 0.89)
    logger.log("factuality", 0.83)

In GenAI, a “baseline” is a suite of metrics, not just one number.

Attach and Compare Metrics

All metrics logged in executions are visible in both the Execution view and the model version’s Model Catalog entry.

Use consistent field names (bleu, factuality, response_variance, etc.) so models can be compared automatically.
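
One lightweight way to enforce this is a small shared helper that every evaluation step imports; the log_metrics name and the METRIC_KEYS allowlist below are suggestions, not part of Valohai’s API.

import json

# Suggested canonical metric names; keep them identical across model versions so the
# Model Catalog can line the numbers up automatically.
METRIC_KEYS = {"bleu", "bert_score", "factuality", "response_variance"}


def log_metrics(**metrics: float) -> None:
    """Print one JSON object per call so Valohai records the values as execution metadata."""
    unknown = set(metrics) - METRIC_KEYS
    if unknown:
        raise ValueError(f"Unexpected metric names: {sorted(unknown)}")
    print(json.dumps(metrics))


log_metrics(bleu=0.41, bert_score=0.89, factuality=0.83)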

The Model Catalog keeps a historical record of:

  • Metrics per model version

  • Datasets and code used

  • Pipelines with human approvals (if present)

Handle Non-Determinism

Large language models produce stochastic outputs. Measure stability by running the same prompts multiple times.

- pipeline:
    name: evaluate-stability
    nodes:
      - name: generate-seeds
        type: execution
        step: Generate Random Seeds # generates for example [1,2,3,4,5,6,7,8,9,10]
      - name: evaluate-genai
        type: task  # This node runs multiple executions
        step: evaluate-genai
      - name: aggregate-results
        type: execution
        step: aggregate-results
        actions:
          - when: node-complete
            if: metadata.response_variance > 0.02 # Stop the pipeline if variance is greater than 0.02
            then: stop-pipeline
      - name: promote-model
        type: execution
        step: promote-model
    edges:
    - [generate-seeds.metadata.seeds, evaluate-genai.parameters.seed]
    - [evaluate-genai.outputs.*, aggregate-results.inputs.results]
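
For the generate-seeds node, a minimal sketch (assuming the seeds metadata key used in the edge above) is simply to print the seed list as metadata:

import json

# Emit the seed list as execution metadata; the pipeline edge maps
# generate-seeds.metadata.seeds onto the evaluate-genai task's seed parameter.
print(json.dumps({"seeds": list(range(1, 11))}))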

Aggregate results in your script and log a response_variance metric:

import numpy as np
import valohai

# metric_scores: the per-execution scores gathered from the task (e.g., one BLEU score per seed)
variance = np.var(metric_scores)

with valohai.metadata.logger() as logger:
    logger.log("response_variance", variance)

This quantifies how consistent the model is across runs or temperatures.

Add Human Approval

Add a pause for human approval after automated evaluation:

- pipeline:
    name: evaluate-stability
    nodes:
      - name: generate-seeds
        type: execution
        step: Generate Random Seeds # generates for example [1,2,3,4,5,6,7,8,9,10]
      - name: evaluate-genai
        type: task  # This node runs multiple executions
        step: evaluate-genai
      - name: aggregate-results
        type: execution
        step: aggregate-results
        actions:
          - when: node-complete
            if: metadata.response_variance > 0.02 # Stop the pipeline if variance is greater than 0.02
            then: stop-pipeline
      - name: promote-model
        type: execution
        step: promote-model
        actions:
          - when: node-starting
            then: require-approval
    edges:
    - [generate-seeds.metadata.seeds, evaluate-genai.parameters.seed]
    - [evaluate-genai.outputs.*, aggregate-results.inputs.results]

Use this gate when a subjective check is required:

  1. Review generated outputs externally (e.g., side-by-side A/B view).

  2. Optionally record qualitative ratings in the Model Catalog card.

  3. Approve or reject to continue the pipeline.

The approval step acts as a lightweight, auditable “human-in-the-loop” checkpoint, similar to RLHF oversight.

Compare Models Side-by-Side

In Model Catalog → Compare Models:

  • Select two or more versions.

  • Inspect metrics and metadata differences.

  • Verify the new version meets or exceeds baseline thresholds.

  • Promote the model when satisfied.

[ Model A (v5) ]   BLEU 0.40   Factuality 0.82
[ Model B (v6) ]   BLEU 0.44   Factuality 0.85 ✅

GenAI Considerations

Topic                    Recommendation
Prompt–response pairs    Store as JSONL files within datasets.
Gold answers             Keep inline or in a separate file; version them.
Human evaluation         Use the approval gate and add qualitative notes to the Model Catalog.
Consistency checks       Launch Tasks with multiple executions; log variance.
Baseline tracking        Compare suites of metrics, not single scores.
