Finetuning LLMs in Valohai

Learn how to finetune large language models (LLMs) in Valohai using reproducible datasets, pipelines, and model catalog entries.

Finetuning adapts a foundation model to a specific domain, tone, or dataset. In Valohai, you can run finetuning jobs reproducibly and track datasets, parameters, and resulting models exactly as you would in traditional ML.

When to finetune: Finetuning is often a last resort. Before committing to a specific base model, use Valohai pipelines to compare out‑of‑the‑box models (OpenAI, Anthropic, Llama) or build lightweight RAG pipelines. If retrieval and prompt engineering cannot reach your target quality, finetuning becomes relevant.

Prepare Training Data

Create a Valohai dataset containing your training and evaluation splits.

dataset://finetune-data/v1
 ├── train.jsonl
 ├── eval.jsonl
 └── metadata.yaml

Each JSONL file should contain prompt–response pairs:

{"prompt": "Translate 'hello' to French", "response": "bonjour"}
{"prompt": "Summarize: Large language models are...", "response": "They are models trained to generate text."}


Managing evolving datasets (versioned, append/replace/remove)

When your finetuning dataset changes, create a new dataset version (v2, v3, …) based on the previous version instead of overwriting files.

  • Update only what changed: You can append new samples, replace corrected files, or remove deprecated files in the new version.

  • No full duplication: Unchanged files are reused from the previous version; only modified or new files are stored again.

  • Clear lineage: Each file retains where it originated (e.g., an execution output or upload) and where it is used (which models or evaluations depend on it).

Why this matters

  • Reproducibility: Every experiment and model can be traced back to the exact dataset version used.

  • Audit & impact analysis: If a gold answer or training sample is wrong, you can instantly find which models or evaluations relied on it via dataset history.

  • Rollbacks: Promote or re-run against a previous version without guesswork.

  • Storage efficiency: You don’t create massive duplicates of unchanged files.

  • Governance: Versioned changes form an auditable trail of how your benchmark or training data evolved.

Example evolution

finetune-data (dataset)
 ├─ v1/
 │   ├─ train.jsonl
 │   └─ eval.jsonl
 ├─ v2/   # based on v1; append and fix
 │   ├─ train.jsonl          # replaced with new/extra samples
 │   ├─ eval.jsonl           # reused from v1 (unchanged)
 │   └─ notes.md             # new file (changelog/rationale)
 └─ v3/   # based on v2; retire a subset
     ├─ train.jsonl          # replaced (removed deprecated block)
     └─ eval.jsonl           # reused from v2

💡 TL;DR: Version the dataset, don’t overwrite it. In each new version, append, replace, or remove just the changed parts. You get precise lineage and reproducibility without duplicating everything.
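
In practice, a new version is usually created either in the UI or from an execution that writes the changed files to its outputs and tags them with Valohai's sidecar metadata file, the same mechanism used for model registration later in this guide. A minimal sketch (paths and version names are illustrative):

import json
import shutil

# Write the corrected split to outputs (here simply copied from the input as a stand-in).
shutil.copy("/valohai/inputs/training_data/train.jsonl", "/valohai/outputs/train.jsonl")

# Tag the output file into a new dataset version.
metadata = {
    "train.jsonl": {"valohai.dataset-versions": ["dataset://finetune-data/v2"]},
}
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for filename, meta in metadata.items():
        json.dump({"file": filename, "metadata": meta}, f)
        f.write("\n")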

Define a Finetuning Step

Example YAML snippet for finetuning a Hugging Face model:

- step:
    name: finetune-llm
    image: valohai/pytorch:2.1
    command:
      - python finetune.py {parameters}
    inputs:
      - name: training_data
        default: dataset://finetune-data/v1
    parameters:
      - name: base_model
        default: "mistralai/Mistral-7B-v0.1"
        type: string
      - name: epochs
        default: 3
        type: integer
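
With the step committed to valohai.yaml, you can launch a run from the Valohai UI or the CLI, for example:

vh execution run finetune-llm --adhoc

(--adhoc packages your local working copy with the run; omit it to run against a committed version.)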

Your finetune.py script can use Hugging Face’s Trainer API or a custom training loop:

from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset
import valohai

# Load the base model named by the step parameter.
base_model = valohai.parameters("base_model").value
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some causal LMs ship without a pad token

data = load_dataset("json", data_files="/valohai/inputs/training_data/train.jsonl")

# Tokenize each prompt–response pair into a single training text.
def tokenize(example):
    return tokenizer(example["prompt"] + "\n" + example["response"], truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=["prompt", "response"])

args = TrainingArguments(
    output_dir="/valohai/outputs/model",
    per_device_train_batch_size=4,
    num_train_epochs=valohai.parameters("epochs").value,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    # mlm=False makes the collator pad batches and build causal-LM labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("/valohai/outputs/model")
tokenizer.save_pretrained("/valohai/outputs/model")

Evaluate the Finetuned Model

After finetuning, add an evaluation step similar to the one described in Evaluating and Validating GenAI Applications.

- step:
    name: evaluate-model
    image: valohai/python:3.10
    command:
      - python evaluate.py
    inputs:
      - name: model
        default: model://domain-llm/v5
      - name: eval_data
        default: dataset://finetune-data/v1

The evaluation script should compute relevant metrics (BLEU, ROUGE, BERTScore, factuality) and log them, either by printing JSON to stdout or via valohai-utils:

import json

print(json.dumps({
    "rougeL": 0.42,
    "bleu": 0.37,
    "factuality": 0.83,
}))

# or using valohai-utils
import valohai

with valohai.logger() as logger:
    logger.log("rougeL", 0.42)
    logger.log("bleu", 0.37)
    logger.log("factuality", 0.83)

Use the same evaluation dataset for all model versions to make metrics comparable.
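
To make the computation concrete, here is a minimal evaluate.py sketch using Hugging Face's evaluate library for ROUGE; generate() is a hypothetical stand-in for your model's inference call:

import json
import evaluate  # Hugging Face evaluation library (pip install evaluate rouge_score)

rows = [json.loads(line) for line in open("/valohai/inputs/eval_data/eval.jsonl")]
references = [row["response"] for row in rows]
predictions = [generate(row["prompt"]) for row in rows]  # generate() is a placeholder, not shown here

# Compute ROUGE between model predictions and gold answers.
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)

print(json.dumps({"rougeL": scores["rougeL"]}))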

Save and Register the Model

When training completes, you can register the finetuned model directly from your code:

import json

metadata = {
    "model/": {  # key: the output path; value: the metadata to attach to it
        "valohai.model-versions": ["model://domain-llm/v6"],
        "valohai.tags": ["genai", "finetuned"],
        "base_model": "mistralai/Mistral-7B-v0.1",
        "epochs": 3,
    }
}

# valohai.metadata.jsonl is a sidecar file that Valohai reads to attach
# metadata to the outputs of this execution.
with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file, meta in metadata.items():
        json.dump({"file": file, "metadata": meta}, f)
        f.write("\n")

This automatically creates a new model version in the Model Catalog and links it to the datasets and metrics from the pipeline.

Integrate with Retraining Pipelines

You can reuse the finetuning step inside a larger retraining pipeline. See an example in the "Retraining and Updating GenAI Models" page.

This ensures continuous improvement: retraining, finetuning, and evaluation are all captured under one Valohai lineage.
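
For orientation, a pipeline that chains the two steps defined above could look roughly like this in valohai.yaml (node names and the edge file pattern are illustrative):

- pipeline:
    name: retrain-llm
    nodes:
      - name: finetune
        type: execution
        step: finetune-llm
      - name: evaluate
        type: execution
        step: evaluate-model
    edges:
      - [finetune.output.model/*, evaluate.input.model]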


GenAI Considerations

  • Dataset structure: Use prompt–response JSONL files with clear splits for training and evaluation.

  • Model formats: Save full Hugging Face model folders (.safetensors, config.json, tokenizer.json).

  • Evaluation: Reuse the evaluation workflow described in the previous guide for consistency.

  • Storage: For large weights, link external URIs but keep metadata inside Valohai.

  • Reproducibility: Each dataset, base model, and pipeline version is automatically tracked in lineage.

