Managing Large YAML Files

Use YAML anchors and aliases to reduce repetition and keep configs maintainable.

As your project grows to 30+ steps, your valohai.yaml can become repetitive and hard to maintain.

YAML anchors and aliases let you define reusable blocks once and reference them everywhere, keeping your config clean and consistent.


Why This Matters

Reduce duplication: Define common inputs, parameters, or commands once instead of copying them across dozens of steps.

Easier updates: Change a dataset path in one place, and it updates everywhere that references it.

Better readability: A 500-line YAML with anchors is easier to scan than a 2000-line file with repetition.


YAML Anchors & Aliases: The Basics

Define a reusable block with &anchor, then reference it with *alias

- definitions:
    my-common-inputs: &common_inputs  # <- Anchor named "common_inputs"
      - name: dataset
        default: s3://my-bucket/train.csv
      - name: config
        default: s3://my-bucket/config.yaml

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py
    inputs: *common_inputs  # Uses the block defined above

- step:
    name: evaluate-model
    image: tensorflow/tensorflow:2.6.0
    command: python evaluate.py
    inputs: *common_inputs  # Same inputs, no repetition

Both steps now share the same input definitions. Update &common_inputs once, and both steps inherit the change.
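
To make the effect concrete, here is roughly what a YAML parser sees for train-model once the alias has been resolved. This is only an illustrative expansion of the example above, not something you need to write yourself:

```yaml
# Equivalent expanded form of the train-model step after alias resolution.
- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py
    inputs:
      - name: dataset
        default: s3://my-bucket/train.csv
      - name: config
        default: s3://my-bucket/config.yaml
```

The evaluate-model step expands the same way, which is why a single edit under the anchor reaches both steps.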


Common Use Cases

Shared input datasets

- definitions:
    standard-datasets: &datasets
      - name: train-set
        default: s3://data/train/*
      - name: test-set
        default: s3://data/test/*

- step:
    name: model-a
    image: python:3.13
    command: python train_a.py
    inputs: *datasets

- step:
    name: model-b
    image: python:3.13
    command: python train_b.py
    inputs: *datasets

Repeated parameters

- definitions:
    tuning-params: &hyperparams
      - name: learning_rate
        default: 0.001
        type: float
      - name: weight_decay
        default: 0.0001
        type: float

- step:
    name: train-cnn
    image: python:3.13
    command: python train_cnn.py {parameters}
    parameters: *hyperparams

- step:
    name: train-transformer
    image: python:3.13
    command: python train_transformer.py {parameters}
    parameters: *hyperparams

Standard commands

- definitions:
    setup-commands: &setup
      - apt-get update
      - pip install -r requirements.txt
      - pip install valohai-utils

- step:
    name: preprocess
    image: python:3.13
    command:
      - *setup
      - python preprocess.py

- step:
    name: train
    image: python:3.13
    command:
      - *setup
      - python train.py
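
One detail worth checking: in plain YAML, an alias to a list used as a list item (- *setup) produces a nested list rather than splicing the commands in place. Whether that nesting is flattened depends on the tool reading the file, so confirm the result with vh lint. If you prefer a form that is flat in any YAML parser, a possible sketch is to anchor each command string individually. The anchor names below (&apt_update, &install_reqs, &install_utils) are hypothetical and chosen only for this illustration:

```yaml
- definitions:
    setup-commands:
      - &apt_update apt-get update                      # anchor on a single string
      - &install_reqs pip install -r requirements.txt
      - &install_utils pip install valohai-utils

- step:
    name: preprocess
    image: python:3.13
    command:
      - *apt_update        # each alias resolves to one string,
      - *install_reqs      # so the command list stays flat
      - *install_utils
      - python preprocess.py
```

This is more verbose than a single *setup alias, but each command string is still defined in one place.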

Merge and Override with <<: *anchor

The merge key (<<) copies the keys of an anchored block into a mapping, so you can reuse a parameter and add more parameters alongside it:

- definitions:
    base-params: &base
      - name: epochs
        default: 10
        type: integer

- step:
    name: quick-test
    image: python:3.13
    command: python train.py {parameters}
    parameters:
      - <<: *base  # Merge base parameters
      - name: debug_mode  # Add a new parameter
        default: true
        type: flag

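This keeps the epochs parameter from &base and adds debug_mode. The merge copies keys into a single list item, so it works here because &base holds exactly one parameter; if it held several, their keys would collapse into one mapping, with earlier entries taking precedence.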
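
Overriding works the same way: a key written explicitly next to the merge key takes precedence over the merged value. The sketch below assumes the anchor is placed on a single mapping rather than a list; the names epochs-param and &epochs_param are hypothetical:

```yaml
- definitions:
    epochs-param: &epochs_param   # anchor on one parameter mapping
      name: epochs
      default: 10
      type: integer

- step:
    name: quick-test
    image: python:3.13
    command: python train.py {parameters}
    parameters:
      - <<: *epochs_param   # brings in name, default, and type
        default: 1          # explicit key overrides the merged default
```

Here quick-test gets an epochs parameter of type integer with a default of 1, while any other step that aliases the same block keeps the default of 10.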


Tips for Large YAML Files

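Define anchors at the top: Keep all reusable blocks in a definitions section at the start of your file. An alias can only reference an anchor that appears earlier in the file, so this ordering also prevents undefined-alias errors.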

# Anchor definitions
- definitions:
    common-inputs: &inputs
      - name: dataset
        default: s3://bucket/data.csv
    
    training-params: &params
      - name: epochs
        default: 10
        type: integer

# Steps
- step:
    name: train
    image: python:3.13
    command: python train.py {parameters}
    inputs: *inputs
    parameters: *params

Use descriptive anchor names: &training_params is clearer than &params1.

Don't over-anchor: If a block is only used once, don't create an anchor. They're for repeated content.

Lint regularly: Run vh lint after editing anchors to catch syntax mistakes.

