Multiple YAML Files & Monorepos

Organize ML workflows in repositories shared by multiple teams or services

Your repository can contain multiple valohai.yaml files. This is useful when different teams or services share one repository but need separate ML configurations.

Each Valohai project connects to one valohai.yaml file, which can live anywhere in your repository.


Why Multiple YAML Files?

Monorepo management: Different teams (data engineering, ML research, inference) maintain their own configurations without conflicts.

Service separation: Each microservice or model has its own isolated workflow definition.

Environment isolation: Dev, staging, and production pipelines use different YAML files with different resource requirements.


How It Works

By default, Valohai looks for valohai.yaml in your repository root:

my-repo/
├── valohai.yaml          # Default location
├── train.py
└── preprocess.py
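
For reference, a minimal root-level valohai.yaml might look like this (a sketch built from the scripts in the tree above):

- step:
    name: preprocess
    image: python:3.9
    command:
      - python preprocess.py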

But you can point Valohai to any subfolder:

my-repo/
├── data-engineering/
│   └── valohai.yaml      # ETL pipelines
├── model-training/
│   └── valohai.yaml      # ML training jobs
├── inference/
│   └── valohai.yaml      # Batch prediction
└── shared/
    └── utils.py

Configure Custom YAML Path

Go to Project Settings > Repository and set the YAML path:

data-engineering/valohai.yaml

Valohai will now use that file instead of the root valohai.yaml.
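
That file follows the normal valohai.yaml format. As a sketch, data-engineering/valohai.yaml might define an ETL step like this (the step name and script are hypothetical):

- step:
    name: etl
    image: python:3.9
    command:
      - python ./data-engineering/etl.py   # etl.py is hypothetical; the path is relative to the repo root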


Default Working Directory During Execution

Even if your valohai.yaml is in a subfolder, Valohai clones your entire repository for each execution. The default working directory is the root of the Git repository, not the folder that contains valohai.yaml.

You can reference code from anywhere:

# File: model-training/valohai.yaml
- step:
    name: train
    image: python:3.9
    command:
      - python ./shared/utils.py           # Paths resolve from the repository root...
      - python ./model-training/train.py   # ...not from the folder containing valohai.yaml

The full Git commit is checked out at runtime, so imports and file references across folders work, as long as paths are written relative to the repository root.
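
For example, one way to make the shared utilities importable from Python is to put shared/ on PYTHONPATH in the step command (a sketch; adjust to your layout):

- step:
    name: train
    image: python:3.9
    command:
      - PYTHONPATH=./shared python ./model-training/train.py   # `import utils` now resolves to shared/utils.py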


Launch Jobs from CLI with Custom YAML

When running jobs from the command line, specify the YAML path.

Ad-hoc Execution

For ad-hoc jobs (no Git commit required), use the --yaml flag:

vh exec run train-model --adhoc --yaml model-training/valohai.yaml

From Git Commit

For jobs based on a Git commit, first set the YAML path in the web UI (as shown above).

Then run:

vh exec run train-model --commit abc123

Or use project mode to launch from the latest fetched commit:

vh --project-mode remote --project <project-id> exec run train-model

Example: Monorepo Structure

Here's a real-world monorepo setup:

ml-platform/
├── data-ingestion/
│   ├── valohai.yaml       # ETL and data validation
│   └── ingest.py
├── training/
│   ├── valohai.yaml       # Model training
│   ├── train.py
│   └── evaluate.py
├── inference/
│   ├── valohai.yaml       # Batch and real-time inference
│   └── predict.py
└── shared/
    ├── preprocessing.py
    └── metrics.py

Create three Valohai projects:

1. Data Ingestion Project → points to data-ingestion/valohai.yaml
2. Training Project → points to training/valohai.yaml
3. Inference Project → points to inference/valohai.yaml

Each team works independently while reusing the utilities in shared/.
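
As a sketch, inference/valohai.yaml could define a batch prediction step like this (predict.py comes from the tree above):

- step:
    name: batch-predict
    image: python:3.9
    command:
      - python ./inference/predict.py   # runs from the repo root, so shared/ is also reachable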


Best Practices

Use descriptive paths: Name folders by function (training/, inference/) not by team (team-a/, team-b/).

Share common code: Put reusable utilities in a shared/ or common/ directory accessible to all projects.

Coordinate dependencies: If one YAML depends on outputs from another, use dataset versioning to pass data between projects (see the sketch after this list).

Keep YAML close to code: Place valohai.yaml in the directory where the relevant Python scripts live for easier navigation.
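
For the dependency point above, a consuming step can pin its input to a dataset version. This sketch assumes a dataset named training-data with a version v1 (both hypothetical names):

- step:
    name: train
    image: python:3.9
    inputs:
      - name: training-data
        default: dataset://training-data/v1   # hypothetical dataset name and version
    command:
      - python ./training/train.py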

