Collect Metrics from Output Files (YOLO & Others)

Some frameworks like YOLOv8 write metrics to output files instead of printing them. Use a file watcher to monitor these files and stream their contents to Valohai metadata automatically.

This pattern works for any framework that writes metrics to CSV, JSON, or text files during training.


The Problem

YOLOv8 (and similar frameworks) write training metrics to CSV files in the outputs directory:

runs/train/exp/
├── results.csv          # Training metrics per epoch
├── weights/
│   ├── best.pt          # Best model
│   └── last.pt          # Latest model
└── ...

Challenge: These metrics aren't printed as JSON, so Valohai doesn't capture them automatically.

Solution: Use a file watcher script that monitors output files and prints their contents as JSON.


Quick Example

valohai.yaml

- step:
    name: train-yolov8
    image: ultralytics/yolov8:latest
    command:
      - git clone https://github.com/ultralytics/yolov8.git
      - tar -xf /valohai/inputs/dataset/coco128.tar.xz
      - pip install watchdog
      - nohup python ./scripts/valohai_watch.py &  # Start watcher in background
      - python yolov8/train.py --data coco128.yaml {parameters}
    inputs:
      - name: dataset
        default: https://github.com/ultralytics/yolov8/releases/download/v1.0/coco128.tar.xz
    parameters:
      - name: epochs
        type: integer
        default: 10
    environment: aws-eu-west-1-g4dn-xlarge

scripts/valohai_watch.py
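
The following is a minimal sketch of what such a watcher could look like, not the canonical script. It assumes the training run writes its CSV somewhere under /valohai/outputs, and the helper names (to_number, last_row_as_json, watch) are illustrative. The watchdog import sits inside watch() so the parsing helper can be tried out without the dependency installed.

```python
import csv
import json
import os
import time

WATCH_DIR = "/valohai/outputs"  # assumption: results.csv is written under this directory


def to_number(text):
    """Convert a CSV cell to int or float when possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def last_row_as_json(csv_path):
    """Return the newest data row as a one-line JSON string, or None if there is none yet."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return None
    record = {
        key.strip(): to_number(value.strip())
        for key, value in rows[-1].items()
        if key is not None and value is not None  # skip ragged cells
    }
    return json.dumps(record)


def watch(directory):
    """Print the latest row of any modified CSV under `directory` until interrupted."""
    # Imported here so the helpers above stay usable without watchdog installed.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class CsvHandler(FileSystemEventHandler):
        def on_modified(self, event):
            if event.is_directory or not event.src_path.endswith(".csv"):
                return
            line = last_row_as_json(event.src_path)
            if line:
                print(line, flush=True)  # one JSON object per line on stdout

    observer = Observer()
    observer.schedule(CsvHandler(), directory, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()


# Only start watching when the outputs directory exists (i.e. inside an execution).
if __name__ == "__main__" and os.path.isdir(WATCH_DIR):
    watch(WATCH_DIR)
```

The header and cell values are stripped because some frameworks pad CSV columns with spaces.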


How It Works

  1. Start watcher in background: nohup python ./scripts/valohai_watch.py & launches the watcher script as a background process, so the training command on the next line can start immediately

  2. Monitor output directory: The watcher uses watchdog to detect file changes in /valohai/outputs/

  3. Parse and log: When a CSV is modified, the watcher reads the latest row and prints it as JSON

  4. Valohai captures: Valohai sees the printed JSON and records it as metadata


Complete Working Example

Here's a full implementation with model aliasing:

scripts/valohai_watch.py (Complete)
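
What follows is a hedged sketch of a fuller watcher, not the script from the example repository. It logs each new row of results.csv once and, when best.pt appears, writes a sidecar file requesting an alias. It assumes that Valohai reads a `<file>.metadata.json` sidecar with a `valohai.alias` key; check the current Valohai documentation before relying on that. The alias name yolo-best and all function names are hypothetical.

```python
import csv
import json
import os
import time

WATCH_DIR = "/valohai/outputs"  # assumption: training writes its run folder under here
METRICS_FILE = "results.csv"
BEST_WEIGHTS = "best.pt"
MODEL_ALIAS = "yolo-best"       # hypothetical alias name


def to_number(text):
    """Convert a CSV cell to int or float when possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def new_rows(csv_path, already_seen):
    """Return the data rows after the first `already_seen` rows, as cleaned dicts."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [
        {k.strip(): to_number(v.strip())
         for k, v in row.items() if k is not None and v is not None}
        for row in rows[already_seen:]
    ]


def write_alias_sidecar(weights_path, alias):
    """Write <weights>.metadata.json next to the weights file and return its path."""
    sidecar = weights_path + ".metadata.json"
    with open(sidecar, "w") as f:
        json.dump({"valohai.alias": alias}, f)  # assumed sidecar key
    return sidecar


def watch(directory):
    """Log new metric rows and alias the best weights until interrupted."""
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    seen = {}  # csv path -> number of data rows already printed

    class OutputHandler(FileSystemEventHandler):
        def on_modified(self, event):
            if event.is_directory:
                return
            path = event.src_path
            name = os.path.basename(path)
            try:
                if name == METRICS_FILE:
                    rows = new_rows(path, seen.get(path, 0))
                    for row in rows:
                        print(json.dumps(row), flush=True)
                    seen[path] = seen.get(path, 0) + len(rows)
                elif name == BEST_WEIGHTS:
                    write_alias_sidecar(path, MODEL_ALIAS)
            except OSError as err:
                print(f"watcher: skipped {path}: {err}", flush=True)

        on_created = on_modified  # treat newly created files the same way

    observer = Observer()
    observer.schedule(OutputHandler(), directory, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()


# Only start watching when the outputs directory exists (i.e. inside an execution).
if __name__ == "__main__" and os.path.isdir(WATCH_DIR):
    watch(WATCH_DIR)
```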


Adapting for Other File Formats

JSON Files
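
For a framework that rewrites a single JSON metrics file, the watcher can load the file and print it back as one line. This is a sketch under that assumption. The function name is illustrative, and returning None for incomplete files is a design choice so the next change event can retry.

```python
import json


def json_file_to_metadata_line(path):
    """Return the file's JSON object as a single-line string, or None if not usable yet."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None  # missing or half-written file; retry on the next event
    if not isinstance(data, dict):
        return None  # only key/value objects are logged in this sketch
    return json.dumps(data)
```

In a watcher handler, print the returned line whenever it is not None.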


Text Files with Key-Value Pairs
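
Plain-text logs vary widely, so the sketch below assumes a hypothetical format in which each line carries pairs such as `epoch=3 loss=0.25` or `acc: 0.9`. The regular expression and function names are illustrative and will likely need adjusting to your framework's log layout.

```python
import json
import re

# key, then ':' or '=', then a value that ends at whitespace, comma, or semicolon
PAIR = re.compile(r"([\w./-]+)\s*[:=]\s*([^\s,;]+)")


def parse_key_values(line):
    """Extract key/value pairs from one log line, converting numeric values to float."""
    record = {}
    for key, raw in PAIR.findall(line):
        try:
            record[key] = float(raw)
        except ValueError:
            record[key] = raw
    return record


def last_line_as_json(path):
    """Return the pairs on the last non-empty line as JSON, or None if there are none."""
    with open(path) as f:
        lines = [line for line in f.read().splitlines() if line.strip()]
    if not lines:
        return None
    record = parse_key_values(lines[-1])
    return json.dumps(record) if record else None
```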


TensorBoard Event Files

For TensorBoard logs, use the tensorboard library:
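
One possible approach, sketched here, reads scalar summaries with tensorboard's EventAccumulator and prints one JSON line per step. It assumes the metrics were written as classic scalar summaries; scalars written by some newer writers may show up under a different tag category. The helper names are illustrative.

```python
import json


def group_by_step(scalars):
    """Turn {tag: [(step, value), ...]} into a step-sorted list of {"step": s, tag: value}."""
    by_step = {}
    for tag, points in scalars.items():
        for step, value in points:
            by_step.setdefault(step, {"step": step})[tag] = value
    return [by_step[step] for step in sorted(by_step)]


def read_scalars(logdir):
    """Load every scalar series found in a TensorBoard log directory."""
    # Imported lazily so group_by_step can be used without tensorboard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    accumulator = EventAccumulator(logdir)
    accumulator.Reload()  # parse the event files currently on disk
    return {
        tag: [(event.step, event.value) for event in accumulator.Scalars(tag)]
        for tag in accumulator.Tags()["scalars"]
    }


def print_tensorboard_metrics(logdir):
    """Print one JSON object per training step."""
    for record in group_by_step(read_scalars(logdir)):
        print(json.dumps(record), flush=True)
```

Calling print_tensorboard_metrics repeatedly re-prints earlier steps, so in a watcher you would also keep track of the last step already logged.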


Best Practices

Start Watcher Before Training

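Always start the watcher before your training script, as the Quick Example does by placing the nohup line ahead of the train.py line. A watcher listed after the training command would only start once training has finished, too late to report anything.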


Use nohup for Background Execution

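nohup makes the watcher ignore the hangup signal, so it keeps running if the shell that launched it exits. The trailing & sends it to the background so the next command can start right away.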


Handle Partial Writes

Files might be written incrementally. Add a small delay:
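
A sketch of two complementary guards: wait until the file size stops changing, then ignore any trailing line that has no newline yet. The delay and attempt counts are arbitrary starting values, and the function names are illustrative.

```python
import os
import time


def wait_until_stable(path, delay=0.2, attempts=5):
    """Return True once the file size is unchanged across one `delay` interval."""
    previous = -1
    for _ in range(attempts):
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file vanished while waiting
        if size == previous:
            return True
        previous = size
        time.sleep(delay)
    return False


def complete_lines(text):
    """Return non-empty lines, dropping a final fragment that lacks its newline."""
    lines = text.split("\n")
    return [line for line in lines[:-1] if line]
```

In a handler, call wait_until_stable(event.src_path) first and parse only the lines returned by complete_lines.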


Filter by Filename Pattern

Only watch specific files to avoid unnecessary processing:
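
A small filter like the sketch below can run at the top of the event handler. The patterns are hypothetical examples. watchdog also ships a PatternMatchingEventHandler that takes a patterns list, which may suit you better if you prefer to filter at the handler level.

```python
import fnmatch
import os

WATCHED_PATTERNS = ("results.csv", "*.metrics.json")  # hypothetical patterns


def should_process(path, patterns=WATCHED_PATTERNS):
    """Return True when the file name matches one of the shell-style patterns."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
```

Inside on_modified, return early when should_process(event.src_path) is False, so weight files and images never reach the parser.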


Error Handling

Always wrap file operations in try-except:
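
An unhandled exception inside an event handler can stop the watcher from processing further events, so one option is to route every read through a wrapper like this sketch. The function name and the choice of caught exceptions are illustrative.

```python
import csv
import sys


def safe_call(reader, path):
    """Run reader(path); on a file or parse error, log to stderr and return None."""
    try:
        return reader(path)
    except (OSError, ValueError, KeyError, IndexError, csv.Error) as err:
        # ValueError also covers json.JSONDecodeError and UnicodeDecodeError.
        print(f"watcher: could not process {path}: {err}", file=sys.stderr, flush=True)
        return None
```

Errors go to stderr so they show up in the execution log without being mistaken for metadata lines.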


Common Issues

Watcher Not Starting

Symptom: No metrics logged, watcher script never runs

Causes & Fixes:

  • Missing dependency → Install watchdog: pip install watchdog

  • Script not in correct location → Check path in command

  • Background process killed → Use nohup and &

Debug:
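
One way to narrow this down is a short preflight check that runs before the watcher is launched and prints what is missing. This is a sketch; the function name and the example paths in the usage note are illustrative.

```python
import importlib.util
import os


def preflight(script_path, watch_dir):
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    if importlib.util.find_spec("watchdog") is None:
        problems.append("watchdog is not installed (pip install watchdog)")
    if not os.path.isfile(script_path):
        problems.append(f"watcher script not found: {script_path}")
    if not os.path.isdir(watch_dir):
        problems.append(f"watch directory not found: {watch_dir}")
    return problems
```

For example, print the result of preflight("./scripts/valohai_watch.py", "/valohai/outputs") in a command placed just before the nohup line. Running the watcher once in the foreground, without nohup and &, also surfaces import errors and tracebacks directly in the log.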


Metrics Logged Multiple Times

Symptom: Same epoch metrics appear repeatedly

Cause: CSV file modified multiple times per epoch

Solution: Track last processed row:
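
A sketch of row tracking: remember how many data rows have already been reported per file and return only the rows beyond that count. The class name is illustrative. If the file shrinks, the sketch assumes it was rewritten and starts again from the first row.

```python
import csv


class RowTracker:
    """Remembers, per CSV file, how many data rows have already been returned."""

    def __init__(self):
        self._seen = {}  # path -> number of data rows already returned

    def unseen_rows(self, path):
        """Return rows added since the previous call, as dicts keyed by header name."""
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header = [cell.strip() for cell in rows[0]] if rows else []
        data = rows[1:]
        start = self._seen.get(path, 0)
        if len(data) < start:  # file was truncated or rewritten
            start = 0
        self._seen[path] = len(data)
        return [dict(zip(header, (cell.strip() for cell in row))) for row in data[start:]]
```

Create one RowTracker when the watcher starts and print only what unseen_rows returns, so repeated modification events for the same epoch produce no output.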


File Not Found Errors

Symptom: Watcher crashes when trying to read files

Cause: File deleted or moved before watcher can read it

Solution: Check file exists before reading:
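
A minimal sketch of that check. Because the file can still disappear between the check and the open call, the sketch also catches FileNotFoundError. The function name is illustrative.

```python
import os


def read_if_present(path):
    """Return the file's text, or None if it does not exist (or vanished mid-read)."""
    if not os.path.isfile(path):
        return None
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None  # deleted between the check and the open
```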


When to Use This Pattern

Use file watchers when:

  • Framework writes metrics to files (YOLOv8, MMDetection, etc.)

  • You can't modify the framework's code

  • Metrics are in CSV, JSON, or structured text

Don't use file watchers when:

  • You can modify your training code (use direct JSON printing instead)

  • Framework has callback/hook system (use callbacks)

  • Metrics are printed to stdout (already captured by Valohai)


Example Project

Check out our complete working example on GitHub:

valohai/yolo-example

The repository includes:

  • Complete watcher script

  • YOLOv5 and YOLOv8 training configuration

  • valohai.yaml with proper setup

  • Step-by-step instructions

