Multiple YAML Files & Monorepos
Organize ML workflows in repositories with multiple teams or services
Your repository can contain multiple valohai.yaml files. This is useful when different teams or services share one repository but need separate ML configurations.
Each Valohai project connects to one valohai.yaml file, which can live anywhere in your repository.
Why Multiple YAML Files?
Monorepo management: Different teams (data engineering, ML research, inference) maintain their own configurations without conflicts.
Service separation: Each microservice or model has its own isolated workflow definition.
Environment isolation: Dev, staging, and production pipelines use different YAML files with different resource requirements.
How It Works
By default, Valohai looks for valohai.yaml in your repository root:
my-repo/
├── valohai.yaml # Default location
├── train.py
└── preprocess.py
But you can point Valohai to any subfolder:
my-repo/
├── data-engineering/
│   └── valohai.yaml      # ETL pipelines
├── model-training/
│   └── valohai.yaml      # ML training jobs
├── inference/
│   └── valohai.yaml      # Batch prediction
└── shared/
    └── utils.py
Configure Custom YAML Path
Go to Project Settings > Repository and set the YAML path:
data-engineering/valohai.yaml
Valohai will now use that file instead of the root valohai.yaml.
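For reference, a minimal sketch of what such a file might contain; the step name, requirements file, and script path below are illustrative, not taken from your repository:
# File: data-engineering/valohai.yaml (illustrative sketch)
- step:
    name: run-etl
    image: python:3.9
    command:
      - pip install -r data-engineering/requirements.txt
      - python data-engineering/etl.py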
Execution Default Working Directory
Even if your valohai.yaml is in a subfolder, Valohai clones your entire repository during execution. The default working directory is the root of the Git repository, not the folder containing valohai.yaml.
You can reference code from anywhere:
# File: model-training/valohai.yaml
- step:
    name: train
    image: python:3.9
    command:
      - python ./shared/utils.py            # access code outside the YAML's own folder
      - python ./model-training/train.py    # paths are relative to the repository root
The full Git commit is available, so imports and relative paths work as expected.
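If you would rather run commands relative to the YAML's own folder, a step can change directory first; a minimal sketch, assuming the model-training/ layout above:
# File: model-training/valohai.yaml
- step:
    name: train-in-subfolder
    image: python:3.9
    command:
      - cd model-training && python train.py   # runs train.py with model-training/ as the working directory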
Launch Jobs from CLI with Custom YAML
When running jobs from the command line, specify the YAML path.
Adhoc Execution
For adhoc jobs (no Git commit), use the --yaml flag:
vh exec run train-model --adhoc --yaml model-training/valohai.yaml
From Git Commit
For jobs based on a Git commit, first set the YAML path in the web UI (as shown above).
Then run:
vh exec run train-model --commit abc123
Or use project mode to launch from the latest fetched commit:
vh --project-mode remote --project <project-id> exec run train-model
Example: Monorepo Structure
Here's a real-world monorepo setup:
ml-platform/
├── data-ingestion/
│   ├── valohai.yaml      # ETL and data validation
│   └── ingest.py
├── training/
│   ├── valohai.yaml      # Model training
│   ├── train.py
│   └── evaluate.py
├── inference/
│   ├── valohai.yaml      # Batch and real-time inference
│   └── predict.py
└── shared/
    ├── preprocessing.py
    └── metrics.py
Create three Valohai projects:
Data Ingestion Project → points to data-ingestion/valohai.yaml
Training Project → points to training/valohai.yaml
Inference Project → points to inference/valohai.yaml
Each team works independently but shares the shared/ utilities.
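For instance, a sketch of what training/valohai.yaml might contain, reusing the shared utilities; the pip install line and the PYTHONPATH prefix are illustrative choices, not Valohai requirements:
# File: training/valohai.yaml (illustrative sketch)
- step:
    name: train-model
    image: python:3.9
    command:
      - pip install -r training/requirements.txt      # illustrative; install whatever the training code needs
      - PYTHONPATH=. python training/train.py          # PYTHONPATH=. lets train.py import from shared/
      - PYTHONPATH=. python training/evaluate.py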
Best Practices
Use descriptive paths: Name folders by function (training/, inference/), not by team (team-a/, team-b/).
Share common code: Put reusable utilities in a shared/ or common/ directory accessible to all projects.
Coordinate dependencies: If one YAML depends on outputs from another, use dataset versioning to pass data between projects (see the sketch after this list).
Keep YAML close to code: Place valohai.yaml in the directory where the relevant Python scripts live for easier navigation.
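For the dependency point above, a sketch of how a step in one project might consume a dataset version produced by another; the dataset name and version alias are placeholders:
# File: inference/valohai.yaml (illustrative sketch)
- step:
    name: batch-predict
    image: python:3.9
    command:
      - python inference/predict.py
    inputs:
      - name: features
        default: dataset://training-features/latest   # placeholder dataset maintained by the training project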
What's Next?
Generate YAML with valohai-utils to skip writing YAML by hand (Python users)
Validate your YAML with the linter
Manage large YAML files with anchors