Finetuning LLMs in Valohai
Finetuning adapts a foundation model to a specific domain, tone, or dataset. In Valohai, you can run finetuning jobs reproducibly and track datasets, parameters, and resulting models exactly as you would in traditional ML.
When to finetune: Finetuning is often a last resort. Before committing to a specific base model, use Valohai pipelines to compare out‑of‑the‑box models (OpenAI, Anthropic, Llama) or build lightweight RAG pipelines. If retrieval and prompt engineering cannot reach your target quality, finetuning becomes relevant.
Prepare Training Data
Create a Valohai dataset containing your training and evaluation splits.
dataset://finetune-data/v1
├── train.jsonl
├── eval.jsonl
└── metadata.yaml
Each JSONL file should contain prompt–response pairs:
{"prompt": "Translate 'hello' to French", "response": "bonjour"}
{"prompt": "Summarize: Large language models are...", "response": "They are models trained to generate text."}
When the dataset evolves, create a new version (v2, v3, …) instead of overwriting files. This preserves reproducibility across experiments.
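Before uploading a split, it can help to check that every line follows the prompt–response layout shown above. The following is a minimal sketch using only the Python standard library; the function name `validate_jsonl` and the rule that both fields must be non-empty strings are assumptions made for illustration, not Valohai requirements.

```python
import json
from pathlib import Path

# Keys every record is expected to carry, per the JSONL example above.
REQUIRED_KEYS = ("prompt", "response")


def validate_jsonl(path):
    """Return a list of human-readable problems found in a prompt-response JSONL file."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                # Assumption: both fields must be non-empty strings.
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {line_number}: missing or empty '{key}'")
    return problems
```

An empty list means the file passed; running this over `train.jsonl` and `eval.jsonl` before creating a dataset version catches malformed records before they reach a training job.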
Managing evolving datasets (versioned, append/replace/remove)
When your finetuning dataset changes, create a new dataset version (v2, v3, …) based on the previous version instead of overwriting files.
Update only what changed: You can append new samples, replace corrected files, or remove deprecated files in the new version.
No full duplication: Unchanged files are reused from the previous version; only modified or new files are stored again.
Clear lineage: Each file retains where it originated (e.g., an execution output or upload) and where it is used (which models or evaluations depend on it).
Why this matters
Reproducibility: Every experiment and model can be traced back to the exact dataset version used.
Audit & impact analysis: If a gold answer or training sample is wrong, you can instantly find which models or evaluations relied on it via dataset history.
Rollbacks: Promote or re-run against a previous version without guesswork.
Storage efficiency: You don’t create massive duplicates of unchanged files.
Governance: Versioned changes form an auditable trail of how your benchmark or training data evolved.
Example evolution
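As a hypothetical illustration of the append/replace/remove model, the sketch below represents each dataset version as a plain mapping from file name to a stored-file reference. It does not call any Valohai API; the helper names (`derive_version`, `stored_again`) and the `datum-…` references are invented for the example.

```python
def derive_version(previous, added=None, replaced=None, removed=()):
    """Build a new version manifest from the previous one without mutating it."""
    added = added or {}
    replaced = replaced or {}
    unknown = [name for name in list(replaced) + list(removed) if name not in previous]
    if unknown:
        raise KeyError(f"not in previous version: {unknown}")
    clash = [name for name in added if name in previous]
    if clash:
        raise KeyError(f"already in previous version: {clash}")
    # Unchanged files keep their existing reference, so nothing is duplicated.
    manifest = {name: ref for name, ref in previous.items() if name not in removed}
    manifest.update(replaced)
    manifest.update(added)
    return manifest


def stored_again(previous, current):
    """Names of files whose content is new or changed relative to the previous version."""
    return sorted(name for name, ref in current.items() if previous.get(name) != ref)


# Hypothetical references; in practice these would point at stored files.
v1 = {"train.jsonl": "datum-a1", "eval.jsonl": "datum-b1", "metadata.yaml": "datum-c1"}
# v2: append new samples and replace a corrected evaluation file.
v2 = derive_version(
    v1,
    added={"train_extra.jsonl": "datum-d1"},
    replaced={"eval.jsonl": "datum-b2"},
)
# v3: remove a deprecated file.
v3 = derive_version(v2, removed=["metadata.yaml"])
```

Here only `eval.jsonl` and `train_extra.jsonl` count as newly stored in v2, while `train.jsonl` keeps the same reference across all three versions, and v1 itself stays untouched for later re-runs.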
💡 TL;DR: Version the dataset, don’t overwrite it. In each new version, append, replace, or remove just the changed parts. You get precise lineage and reproducibility without duplicating everything.
Define a Finetuning Step
Example YAML snippet for finetuning a Hugging Face model:
Your finetune.py script can follow Hugging Face’s Trainer API or a custom loop:
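A minimal sketch of such a script using the Trainer API might look like the following. It assumes the `transformers` and `datasets` packages are installed, that the step defines inputs named `train` and `eval` (so the files appear under `/valohai/inputs/`), and that the model is written to `/valohai/outputs/model`. The parameter names and the prompt/response template are illustrative choices, not part of the original guide.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Finetune a causal LM on prompt-response pairs")
    parser.add_argument("--base_model", required=True)
    # Assumption: inputs are named "train" and "eval" in the step definition.
    parser.add_argument("--train_file", default="/valohai/inputs/train/train.jsonl")
    parser.add_argument("--eval_file", default="/valohai/inputs/eval/eval.jsonl")
    parser.add_argument("--output_dir", default="/valohai/outputs/model")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--max_length", type=int, default=512)
    return parser.parse_args(argv)


def load_pairs(path):
    """Read prompt-response records from a JSONL file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def to_text(pair, eos_token=""):
    """Join one pair into a single training string (this template is an assumption)."""
    return f"{pair['prompt']}\n{pair['response']}{eos_token}"


def finetune(args):
    # Imported lazily so the helpers above work without the heavy dependencies.
    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(args.base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(args.base_model)

    def tokenized(path):
        eos = tokenizer.eos_token or ""
        texts = [to_text(pair, eos) for pair in load_pairs(path)]
        dataset = Dataset.from_dict({"text": texts})
        return dataset.map(
            lambda batch: tokenizer(batch["text"], truncation=True, max_length=args.max_length),
            batched=True,
            remove_columns=["text"],
        )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=args.output_dir,
            num_train_epochs=args.epochs,
            learning_rate=args.learning_rate,
            per_device_train_batch_size=args.batch_size,
        ),
        train_dataset=tokenized(args.train_file),
        eval_dataset=tokenized(args.eval_file),
        # mlm=False makes the collator build next-token (causal LM) labels.
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)


# To use this as a step command, end the script with: finetune(parse_args())
```

Exposing the base model and hyperparameters as command-line arguments lets each run record them as parameters, so two finetuning runs can be compared by their settings as well as their outputs.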
Evaluate the Finetuned Model
After finetuning, add an evaluation step similar to the one described in Evaluating and Validating GenAI Applications.
The evaluation script should compute relevant metrics (BLEU, ROUGE, BERTScore, factuality) and log them with:
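One common way to log metrics is to print them as a JSON object on standard output, which Valohai collects as execution metadata. The sketch below assumes that mechanism and uses two deliberately simple metrics, exact match and a unigram-overlap F1, as stand-ins for library implementations of BLEU, ROUGE, or BERTScore. The function names are illustrative.

```python
import json
from collections import Counter


def unigram_f1(prediction, reference):
    """Token-overlap F1, a rough stand-in for ROUGE-1 (not the official implementation)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average the per-sample scores over aligned prediction/reference lists."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and the same length")
    f1_scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    exact = [p.strip() == r.strip() for p, r in zip(predictions, references)]
    return {
        "exact_match": sum(exact) / len(exact),
        "unigram_f1": sum(f1_scores) / len(f1_scores),
        "num_samples": len(f1_scores),
    }


def log_metrics(metrics):
    # One JSON object per line on stdout is picked up as execution metadata.
    print(json.dumps(metrics))
```

A typical call is `log_metrics(evaluate(model_outputs, gold_responses))`, where both lists come from running the model over `eval.jsonl`.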
Use the same evaluation dataset for all model versions to make metrics comparable.
Save and Register the Model
When training completes, you can register the finetuned model directly from your code:
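One way to do this is to write a metadata sidecar file next to each saved output. The sketch below assumes Valohai's `<filename>.metadata.json` sidecar convention and the `valohai.model-versions` property; treat both as assumptions to verify against the current Valohai documentation. The model URI `model://support-assistant/` is a hypothetical name.

```python
import json
from pathlib import Path


def tag_model_files(model_dir, model_uri):
    """Write a <file>.metadata.json sidecar next to every file under model_dir."""
    sidecars = []
    # sorted() materializes the listing first, so sidecars written below are not revisited.
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_dir() or path.name.endswith(".metadata.json"):
            continue
        sidecar = path.with_name(path.name + ".metadata.json")
        # ASSUMPTION: property name and URI format; check the Valohai docs before relying on it.
        metadata = {"valohai.model-versions": [model_uri]}
        sidecar.write_text(json.dumps(metadata), encoding="utf-8")
        sidecars.append(sidecar)
    return sidecars
```

Calling `tag_model_files("/valohai/outputs/model", "model://support-assistant/")` after saving the model would tag the weights, config, and tokenizer files so they all belong to the same model version.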
This automatically creates a new model version in the Model Catalog and links it to the datasets and metrics from the pipeline.
Integrate with Retraining Pipelines
You can reuse the finetuning step inside a larger retraining pipeline. See an example in the "Retraining and Updating GenAI Models" page.
This keeps the whole continuous-improvement loop (retraining, finetuning, and evaluation) captured under one Valohai lineage.
GenAI Considerations
Dataset structure: Use prompt–response JSONL files with clear splits for training and evaluation.
Model formats: Save full Hugging Face model folders (.safetensors, config.json, tokenizer.json).
Evaluation: Reuse the evaluation workflow described in the previous guide for consistency.
Storage: For large weights, link external URIs but keep metadata inside Valohai.
Reproducibility: Each dataset, base model, and pipeline version is automatically tracked in lineage.
