Evaluating and Validating GenAI Applications
Learn how to evaluate and validate GenAI applications using Valohai datasets, executions, and the Model Catalog.
In an AI Factory, evaluation is the control system. GenAI models change quickly, and their quality can’t be captured by a single metric. Valohai turns evaluation into a reproducible workflow: versioned datasets → automated metrics → human approval.
In modern GenAI systems, evaluation comes before and after model selection. Many teams compare multiple model providers (OpenAI, Anthropic, Meta Llama) on a fixed evaluation dataset before deciding whether fine-tuning is even necessary.
Define Evaluation Datasets
Every benchmark (prompt + expected answer set) should live as a Valohai dataset.
Each dataset version represents one evaluation suite.
dataset://evaluation-prompts/v3
├── prompts.jsonl
├── gold_answers.jsonl
└── metadata.yaml

When updating the benchmark:
Create a new version based on the previous one.
Add or remove only the files that changed.
Publish the new version.
In Valohai, you update only the files that changed instead of duplicating the whole dataset. This keeps a precise lineage of each file’s origin and usage, making it easy to trace dependencies (e.g., which models used a specific gold standard).
💡 You can export evaluation data from a database, but Valohai datasets are recommended for versioning and reproducibility.
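If you prefer to publish the updated files from an execution rather than through the UI, a minimal sketch is shown below. It assumes Valohai's output metadata sidecar convention for dataset versions; the dataset URI and file name are illustrative.

import json
import shutil
from pathlib import Path

OUTPUTS = Path("/valohai/outputs")

# Copy the changed benchmark file into the execution outputs.
shutil.copy("gold_answers.jsonl", OUTPUTS / "gold_answers.jsonl")

# A sidecar metadata file next to the output asks Valohai to add the file
# to a new dataset version (the URI below is illustrative).
sidecar = {"valohai.dataset-versions": ["dataset://evaluation-prompts/v4"]}
(OUTPUTS / "gold_answers.jsonl.metadata.json").write_text(json.dumps(sidecar))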
Recommended structure
prompts.jsonl: evaluation prompts
gold_answers.jsonl: reference or “gold” answers
metadata.yaml: optional context (metrics, reviewer, notes)
Each new dataset version forms a frozen, reproducible benchmark.
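For reference, one possible shape for the JSONL files pairs each prompt with a gold answer through a shared id (the field names are illustrative, not a required schema):

prompts.jsonl
{"id": "q-001", "prompt": "Summarize the refund policy in two sentences."}

gold_answers.jsonl
{"id": "q-001", "answer": "Refunds are available within 30 days of purchase for unused items."}

Keeping a shared id per pair makes it easy to join prompts and gold answers inside the evaluation script.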
Build an Evaluation Step
Use a Valohai step to run automated metrics for a model version.
- step:
    name: evaluate-genai
    image: valohai/python:3.10
    command:
      - python evaluate.py
    inputs:
      - name: model
        default: model://model-name/v3
      - name: evaluation_dataset
        default: dataset://evaluation-prompts/v3

Your evaluate.py script should:
Load the dataset files.
Generate model responses.
Compute metrics.
Log them as Valohai metadata, either by printing JSON to stdout or by using valohai-utils:
import json

print(json.dumps({
    "bleu": 0.41,
    "bert_score": 0.89,
    "factuality": 0.83,
}))

# or using valohai-utils
import valohai

with valohai.metadata.logger() as logger:
    logger.log("bleu", 0.41)
    logger.log("bert_score", 0.89)
    logger.log("factuality", 0.83)

In GenAI, a “baseline” is a suite of metrics, not just one number.
Attach and Compare Metrics
All metrics logged in executions are visible both in the Execution view and in the model’s Model Catalog entry.
Use consistent field names (bleu, factuality, response_variance, etc.) so models can be compared automatically.
The Model Catalog keeps a historical record of:
Metrics per model version
Datasets and code used
Pipelines with human approvals (if present)
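One way to keep field names consistent is to route every evaluation run through the same small logging helper, so metric keys never drift between model versions (a sketch; the key set and helper name are examples, not requirements):

import valohai

# Fixed metric schema shared by all evaluation runs, so versions
# can be compared automatically in the Model Catalog.
METRIC_KEYS = ("bleu", "bert_score", "factuality", "response_variance")


def log_metrics(values):
    with valohai.metadata.logger() as logger:
        for key in METRIC_KEYS:
            if key in values:
                logger.log(key, values[key])

For example, log_metrics({"bleu": 0.41, "factuality": 0.83}) always produces the same field names as every other run.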
Handle Non-Determinism
Large language models produce stochastic outputs. Measure stability by running the same prompts multiple times.
- pipeline:
    name: evaluate-stability
    nodes:
      - name: generate-seeds
        type: execution
        step: Generate Random Seeds # generates, for example, [1,2,3,4,5,6,7,8,9,10]
      - name: evaluate-genai
        type: task # This node runs multiple executions
        step: evaluate-genai
      - name: aggregate-results
        type: execution
        step: aggregate-results
        actions:
          - when: node-complete
            if: metadata.response_variance > 0.02 # Stop the pipeline if variance is greater than 0.02
            then: stop-pipeline
      - name: promote-model
        type: execution
        step: promote-model
    edges:
      - [generate-seeds.metadata.seeds, evaluate-genai.parameters.seed]
      - [evaluate-genai.outputs.*, aggregate-results.inputs.results]

Aggregate results in your script and log a response_variance metric:
import numpy as np
import valohai

variance = np.var(metric_scores)
with valohai.metadata.logger() as logger:
    logger.log("response_variance", variance)

This quantifies how consistent the model is across runs or temperatures.
Add Human Approval
Add a pause for human approval after automated evaluation:
- pipeline:
    name: evaluate-stability
    nodes:
      - name: generate-seeds
        type: execution
        step: Generate Random Seeds # generates, for example, [1,2,3,4,5,6,7,8,9,10]
      - name: evaluate-genai
        type: task # This node runs multiple executions
        step: evaluate-genai
      - name: aggregate-results
        type: execution
        step: aggregate-results
        actions:
          - when: node-complete
            if: metadata.response_variance > 0.02 # Stop the pipeline if variance is greater than 0.02
            then: stop-pipeline
      - name: promote-model
        type: execution
        step: promote-model
        actions:
          - when: node-starting
            then: require-approval
    edges:
      - [generate-seeds.metadata.seeds, evaluate-genai.parameters.seed]
      - [evaluate-genai.outputs.*, aggregate-results.inputs.results]

Use this gate when a subjective check is required:
Review generated outputs externally (e.g., side-by-side A/B view).
Optionally record qualitative ratings in the Model Catalog card.
Approve or reject to continue the pipeline.
The approval step acts as a lightweight, auditable “human-in-the-loop” checkpoint, similar to RLHF oversight.
Compare Models Side-by-Side
In Model Catalog → Compare Models:
Select two or more versions.
Inspect metrics and metadata differences.
Verify the new version meets or exceeds baseline thresholds.
Promote the model when satisfied.
[ Model A (v5) ] BLEU 0.40 Factuality 0.82
[ Model B (v6) ] BLEU 0.44 Factuality 0.85 ✅

GenAI Considerations
Prompt–response pairs: store as JSONL files within datasets.
Gold answers: keep inline or in a separate file; version them.
Human evaluation: use the approval gate and add qualitative notes to the Model Catalog.
Consistency checks: launch Tasks with multiple executions; log variance.
Baseline tracking: compare suites of metrics, not single scores.