# Finetuning LLMs in Valohai

Finetuning adapts a foundation model to a specific domain, tone, or dataset.\
In Valohai, you can run finetuning jobs reproducibly and track datasets, parameters, and resulting models exactly as you would in traditional ML.

**When to finetune:** Finetuning is often a last resort. Before committing to a specific base model, use Valohai pipelines to compare out‑of‑the‑box models (OpenAI, Anthropic, Llama) or build lightweight RAG pipelines.\
If retrieval and prompt engineering cannot reach your target quality, finetuning becomes relevant.

### Prepare Training Data

Create a Valohai dataset containing your training and evaluation splits.

```
dataset://finetune-data/v1
 ├── train.jsonl
 ├── eval.jsonl
 └── metadata.yaml
```

Each JSONL file should contain prompt–response pairs:

```json
{"prompt": "Translate 'hello' to French", "response": "bonjour"}
{"prompt": "Summarize: Large language models are...", "response": "They are models trained to generate text."}
```
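
Malformed lines tend to surface only deep into a training run, so it can help to check the files before uploading them. The following is a minimal sketch, not part of Valohai: `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the only assumption about the data is the `prompt`/`response` layout shown above.

```python
import json
from pathlib import Path

# Field names taken from the prompt-response example above.
REQUIRED_KEYS = ("prompt", "response")


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{key}'"))
    return problems
```

Calling `validate_jsonl("train.jsonl")` and failing the job when the returned list is non-empty keeps broken samples out of a dataset version.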

When the dataset evolves, create a new version (v2, v3, …) instead of overwriting files.\
This preserves reproducibility across experiments.

#### Managing evolving datasets (versioned, append/replace/remove)

Rather than rebuilding the dataset from scratch, base each new version (`v2`, `v3`, …) on the previous one and change only what needs to change.

* **Update only what changed:** You can append new samples, replace corrected files, or remove deprecated files in the new version.
* **No full duplication:** Unchanged files are reused from the previous version; only modified or new files are stored again.
* **Clear lineage:** Each file retains where it originated (e.g., an execution output or upload) and where it is used (which models or evaluations depend on it).

**Why this matters**

* **Reproducibility:** Every experiment and model can be traced back to the exact dataset version used.
* **Audit & impact analysis:** If a gold answer or training sample is wrong, you can instantly find which models or evaluations relied on it via dataset history.
* **Rollbacks:** Promote or re-run against a previous version without guesswork.
* **Storage efficiency:** You don't create massive duplicates of unchanged files.
* **Governance:** Versioned changes form an auditable trail of how your benchmark or training data evolved.

**Example evolution**

```
finetune-data (dataset)
 ├─ v1/
 │   ├─ train.jsonl
 │   └─ eval.jsonl
 ├─ v2/   # based on v1; append and fix
 │   ├─ train.jsonl          # replaced with new/extra samples
 │   ├─ eval.jsonl           # reused from v1 (unchanged)
 │   └─ notes.md             # new file (changelog/rationale)
 └─ v3/   # based on v2; retire a subset
     ├─ train.jsonl          # replaced (removed deprecated block)
     └─ eval.jsonl           # reused from v2
```
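
If the changed files are produced by a Valohai execution, one way to attach them to a new version is a metadata sidecar file. The sketch below mirrors the `valohai.metadata.jsonl` format used for model registration later on this page. It assumes that the `valohai.dataset-versions` key accepts a list of dataset version URIs in the same way `valohai.model-versions` does. `write_dataset_version_metadata` is a hypothetical helper, and carrying over unchanged files from the previous version is not covered here.

```python
import json
from pathlib import Path


def write_dataset_version_metadata(files, dataset_version, outputs_dir="/valohai/outputs"):
    """Write a valohai.metadata.jsonl sidecar tagging each output file with a dataset version.

    Assumption: `valohai.dataset-versions` takes a list of dataset URIs,
    analogous to `valohai.model-versions`.
    """
    metadata_path = Path(outputs_dir) / "valohai.metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for name in files:
            entry = {
                "file": name,
                "metadata": {"valohai.dataset-versions": [dataset_version]},
            }
            f.write(json.dumps(entry) + "\n")
    return metadata_path
```

For the `v2` example above, the call would be `write_dataset_version_metadata(["train.jsonl", "notes.md"], "dataset://finetune-data/v2")`, made after both files have been written to the outputs directory.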

> 💡 **TL;DR:** Version the dataset, don't overwrite it.\
> In each new version, **append**, **replace**, or **remove** just the changed parts.\
> You get precise lineage and reproducibility without duplicating everything.

### Define a Finetuning Step

Example YAML snippet for finetuning a Hugging Face model:

```yaml
- step: finetune-llm
  image: valohai/pytorch:2.1
  command:
    - python finetune.py {parameters}
  inputs:
    - name: training_data
      default: dataset://finetune-data/v1
  parameters:
    - name: base_model
      default: "mistral-7b"
    - name: epochs
      default: 3
```

Your `finetune.py` script can follow Hugging Face's Trainer API or a custom loop:

```python
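from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset
import valohai

# Assumes the step also declares a `base-model` input holding the base model files.
model = AutoModelForCausalLM.from_pretrained("/valohai/inputs/base-model")
tokenizer = AutoTokenizer.from_pretrained("/valohai/inputs/base-model")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed to pad batches
data = load_dataset("json", data_files="/valohai/inputs/training_data/train.jsonl")


def tokenize(example):
    # Join each prompt-response pair into one training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)  # max_length is an example value


train_data = data["train"].map(tokenize, remove_columns=data["train"].column_names)

args = TrainingArguments(
    output_dir="/valohai/outputs/model",
    per_device_train_batch_size=4,
    num_train_epochs=valohai.parameters("epochs").value,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    # Pads each batch and derives labels from input_ids for causal LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("/valohai/outputs/model")
tokenizer.save_pretrained("/valohai/outputs/model")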

***

### Evaluate the Finetuned Model

After finetuning, add an evaluation step similar to the one described in [Evaluating and Validating GenAI Applications](https://github.com/valohai/dokuhai/blob/main/docs/how-to/evaluating-and-validating-genai-applications.md).

```yaml
- step: evaluate-model
  image: valohai/python:3.10
  command:
    - python evaluate.py
  inputs:
    - name: model
      default: model://finetune-llm/v5
    - name: eval_data
      default: dataset://finetune-data/v1
```

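The evaluation script should compute relevant metrics (BLEU, ROUGE, BERTScore, factuality) and log them as execution metadata, either by printing a JSON object to standard output or by using the `valohai-utils` logger: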

```python
import json

print(
    json.dumps(
        {
            "rougeL": 0.42,
            "bleu": 0.37,
            "factuality": 0.83,
        },
    ),
)

# or using valohai-utils
import valohai

with valohai.metadata.logger() as logger:
    logger.log("rougeL", 0.42)
    logger.log("bleu", 0.37)
    logger.log("factuality", 0.83)
```
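
The values in the example above are placeholders. As an illustration of where such a number could come from, here is a simplified, dependency-free ROUGE-L F1 based on the longest common subsequence of whitespace-separated tokens. It is a sketch only: it skips the stemming and tokenization rules of standard ROUGE implementations, so its scores will not match them exactly, and `lcs_length`, `rouge_l_f1`, `evaluate`, and `log_metrics` are hypothetical names.

```python
import json


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F1 on lowercased, whitespace-split tokens."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Average the score over (reference, prediction) string pairs."""
    scores = [rouge_l_f1(ref, pred) for ref, pred in pairs]
    return {"rougeL": sum(scores) / len(scores) if scores else 0.0}


def log_metrics(metrics):
    # One JSON object on stdout, matching the logging example above.
    print(json.dumps(metrics))
```

In an evaluation script, `pairs` would be built from the `response` field of each line in `eval.jsonl` and the model's generated answer for the matching `prompt`, followed by `log_metrics(evaluate(pairs))`.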

> Use the same evaluation dataset for all model versions to make metrics comparable.

### Save and Register the Model

When training completes, you can register the finetuned model directly from your code:

```python
import json

metadata = {
    "model/": {
        "valohai.model-versions": ["model://domain-llm/v6"],
        "valohai.tags": ["genai", "finetuned"],
        "base_model": "mistral-7b",
        "epochs": 3,
    },
}

with open("/valohai/outputs/valohai.metadata.jsonl", "w") as f:
    for file, meta in metadata.items():
        json.dump({"file": file, "metadata": meta}, f)
        f.write("\n")
```

This automatically creates a new model version in the Model Catalog and links it to the datasets and metrics from the pipeline.

### Integrate with Retraining Pipelines

You can reuse the finetuning step inside a larger retraining pipeline. See an example on the "Retraining and Updating GenAI Models" page.

This way, retraining, finetuning, and evaluation run as one continuous improvement loop, with every run captured in a single Valohai lineage.

***

### GenAI Considerations

| Topic                 | Recommendation                                                                          |
| --------------------- | --------------------------------------------------------------------------------------- |
| **Dataset structure** | Use prompt–response JSONL files with clear splits for training and evaluation.          |
| **Model formats**     | Save full Hugging Face model folders (`.safetensors`, `config.json`, `tokenizer.json`). |
| **Evaluation**        | Reuse the workflow from Evaluating and Validating GenAI Applications for consistency.   |
| **Storage**           | For large weights, link external URIs but keep metadata inside Valohai.                 |
| **Reproducibility**   | Each dataset, base model, and pipeline version is automatically tracked in lineage.     |
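
The **Model formats** recommendation can be enforced with a small check before the model is registered. This is a minimal sketch: `missing_model_files` is a hypothetical helper, and the expected file names are simply the ones listed in the table. Some tokenizers save different files, so adjust the list to your model.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the names of expected artifacts that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = []
    # File names taken from the "Model formats" row above.
    for name in ("config.json", "tokenizer.json"):
        if not (model_dir / name).is_file():
            missing.append(name)
    # Weights may be split across several shards, so any .safetensors file counts.
    if not any(model_dir.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

Running `missing_model_files("/valohai/outputs/model")` after training and stopping on a non-empty result keeps incomplete model folders from being registered.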

***

### Next Steps

* [Example: Fine-Tuning Mistral 7B LLM](/project-gallery/nlp-and-llm/mistral-example.md)

