Retraining and Updating GenAI Models

Learn how to retrain, test, and promote GenAI models safely using Valohai pipelines and Model Catalog integrations.

In an AI Factory, retraining is continuous but promotion must be deliberate. Valohai lets you chain training, evaluation, and approval into one reproducible pipeline, ensuring every new GenAI model is validated before it’s released.

Structure Your Retraining Pipeline

A retraining pipeline typically includes:

  • Training step

  • Evaluation step

  • Conditional promotion logic

  • Human approval gate

Example pipeline:

- step: train-model
  image: pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime
  command: python train.py
  inputs:
    - name: training_data
      default: dataset://domain-data/v5

- step: evaluate-model
  image: python:3.10
  command: python evaluate.py
  inputs:
    - name: model
      default: model://my-gen-ai-model/candidate

- step: promote-model
  image: python:3.10
  command: python promote.py
  inputs:
    - name: results

- pipeline:
    name: train-and-evaluate-model
    nodes:
      - name: train-model
        type: execution
        step: train-model
      - name: evaluate-model
        type: execution
        step: evaluate-model
      - name: promote-model
        type: execution
        step: promote-model
        actions:
          - when: node-starting
            then: require-approval
    edges:
      - [train-model.outputs.*, evaluate-model.inputs.model]
      - [evaluate-model.outputs.*, promote-model.inputs.results]
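
As a minimal sketch of the consuming end, promote.py could start by reading the evaluation results delivered along the last edge. This assumes evaluate.py wrote its metrics to a JSON file in /valohai/outputs/; the file name is not fixed, so the sketch globs for it:

import glob
import json

# Files sent along a pipeline edge land under /valohai/inputs/<input-name>/.
results_path = glob.glob("/valohai/inputs/results/*.json")[0]
with open(results_path) as f:
    results = json.load(f)

print("Candidate metrics:", results)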

Each step is tracked automatically: datasets, parameters, outputs, and lineage are all recorded.

Automate Regression Checks

During the evaluation step, compare new metrics against the previous model’s baseline from the Model Catalog.

# Fail the execution if the candidate underperforms the stored baseline
if new_bleu < baseline_bleu:
    raise RuntimeError(f"Regression detected: BLEU dropped from {baseline_bleu:.4f} to {new_bleu:.4f}")

Valohai will mark the execution as failed, preventing promotion. You can trigger notifications or pipeline conditions based on these checks.

In GenAI, regression may also mean higher variance, longer latency, or reduced factuality, not just lower numeric scores.
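
A minimal sketch of such a multi-metric gate, assuming the previous model's metrics arrive as metrics.json through a hypothetical baseline-metrics input, and using placeholder candidate values:

import json

# Previous model's metrics, assumed to be wired up in valohai.yaml as a
# "baseline-metrics" input (hypothetical name).
with open("/valohai/inputs/baseline-metrics/metrics.json") as f:
    baseline = json.load(f)

# Placeholder values; in practice these come from the evaluation step.
candidate = {"bleu": 31.2, "latency_p95_ms": 480.0, "factuality": 0.87}

regressions = []
if candidate["bleu"] < baseline["bleu"]:
    regressions.append("BLEU score dropped")
if candidate["latency_p95_ms"] > baseline["latency_p95_ms"] * 1.10:
    regressions.append("p95 latency regressed by more than 10%")
if candidate["factuality"] < baseline["factuality"]:
    regressions.append("factuality score dropped")

if regressions:
    # Raising fails the execution, which blocks the downstream promote step.
    raise RuntimeError("Regression detected: " + "; ".join(regressions))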

Add a Human Approval Step

Insert a pause for human approval after automated evaluation:

      - name: promote-model
        type: execution
        step: promote-model
        actions:
          - when: node-starting
            then: require-approval

Use this gate to ensure governance before promotion:

  1. Review evaluation results in Valohai or externally.

  2. Add comments or qualitative scores to the model’s card in Model Catalog.

  3. Approve to continue or reject to stop the pipeline.

The approval step acts as a lightweight, auditable checkpoint, perfect for subjective or business-critical GenAI evaluations.
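
To make that review easy, have the evaluation step print its metrics as JSON lines. Valohai collects JSON printed to stdout as execution metadata, so the numbers show up in the UI next to the pending approval; the metric names and values here are illustrative:

import json

# JSON lines printed to stdout are collected as Valohai execution metadata.
print(json.dumps({"rougeL": 0.412, "bleu": 28.9, "latency_p95_ms": 512.0}))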



Update and Promote in Model Catalog

After evaluation and approval, you can register a new model version directly from your Python code.

In a GenAI pipeline, the final promote-model step might save the trained or fine-tuned model folder to /valohai/outputs/model/ and generate the necessary metadata.

Example:

import os
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from evaluate import load  # Hugging Face evaluate library (a local file named evaluate.py would shadow it)

# Load data and base model
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:200]")
model = AutoModelForCausalLM.from_pretrained("/valohai/inputs/base-model")
tokenizer = AutoTokenizer.from_pretrained("/valohai/inputs/base-model")

# Generate predictions for a small evaluation sample; truncation keeps
# long articles within the model's context length
prompts = dataset["article"][:10]
references = dataset["highlights"][:10]
outputs = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
    generated = model.generate(input_ids, max_new_tokens=100)
    outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))

# Example metric
metric = load("rouge")
scores = metric.compute(predictions=outputs, references=references)
rougeL = scores["rougeL"]

print(f"ROUGE-L: {rougeL:.4f}")

# Save model files
save_dir = "/valohai/outputs/model"
os.makedirs(save_dir, exist_ok=True)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"Saved model to {save_dir}")

# Create a Valohai metadata file; the "model/" key is the output folder
# path relative to /valohai/outputs/
metadata = {
    "model/": {
        "valohai.model-versions": ["model://summarizer/v5"],
        "valohai.tags": ["genai", "summarization", "production-candidate"],
        "rougeL": rougeL,
        "training_dataset": "vh://dataset/domain-news:v2",
        "evaluation_dataset": "vh://dataset/evaluation-prompts:v3"
    }
}

metadata_path = "/valohai/outputs/valohai.metadata.jsonl"
with open(metadata_path, "w") as f:
    for file, meta in metadata.items():
        json.dump({"file": file, "metadata": meta}, f)
        f.write("\n")

print("Created model version in model://summarizer/v5")

When the step completes:

  • The model folder (/valohai/outputs/model/) is uploaded as the new model version.

  • The metadata file (valohai.metadata.jsonl) tells Valohai how to tag and track it (see the sample line after this list).

  • The Model Catalog entry automatically links all relevant context:

    • Datasets and pipeline lineage

    • Metrics (ROUGE-L, BLEU, etc.)

    • Custom tags (genai, summarization, etc.)

    • Version references (model://summarizer/v5)
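
For reference, the script above writes a single line like this into valohai.metadata.jsonl (metric value illustrative):

{"file": "model/", "metadata": {"valohai.model-versions": ["model://summarizer/v5"], "valohai.tags": ["genai", "summarization", "production-candidate"], "rougeL": 0.412, "training_dataset": "vh://dataset/domain-news:v2", "evaluation_dataset": "vh://dataset/evaluation-prompts:v3"}}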

Use the same pattern for any GenAI workflow, from prompt-tuned adapters to fine-tuned instruction models, to promote them directly from your pipeline code.
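
For the adapter case, a hedged sketch using the peft library; the model name, paths, and LoRA settings are assumptions for illustration:

import json
import os

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap the base model with a LoRA adapter (illustrative settings; uncommon
# architectures also need explicit target_modules).
base = AutoModelForCausalLM.from_pretrained("/valohai/inputs/base-model")
config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
peft_model = get_peft_model(base, config)
# ... fine-tuning happens here ...

# save_pretrained() on a PeftModel writes only the adapter weights.
adapter_dir = "/valohai/outputs/adapter"
os.makedirs(adapter_dir, exist_ok=True)
peft_model.save_pretrained(adapter_dir)

# Register the adapter folder as a model version, exactly as above.
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    json.dump({"file": "adapter/", "metadata": {
        "valohai.model-versions": ["model://summarizer-lora/v1"],
        "valohai.tags": ["genai", "lora-adapter"],
    }}, f)
    f.write("\n")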

Maintain Reproducibility and Traceability

Every retraining run in Valohai automatically preserves:

  • Dataset lineage: which training and evaluation datasets were used

  • Parameter traceability: hyperparameters and environment details

  • Model lineage: which model version the new one replaced

  • Approval records: human validation history

This full chain of evidence makes it easy to answer questions like: “Which models were trained with the flawed evaluation dataset v2?”

GenAI Considerations

  • Continuous improvement: Trigger retraining when new labeled data or human feedback arrives.

  • Baseline comparison: Use the same evaluation dataset for old and new models to ensure fair comparisons.

  • Human-in-the-loop: Require manual approval for subjective quality checks.

  • Automation balance: Automate quantitative checks, but keep qualitative approvals human.

  • Lineage tracking: Use Model Catalog to visualize model–dataset–approval chains.

