Managing Large YAML Files

As your project grows to 30+ steps, your valohai.yaml can become repetitive and hard to maintain.

YAML anchors and aliases let you define reusable blocks once and reference them everywhere, keeping your config clean and consistent.


Why This Matters

Reduce duplication: Define common inputs, parameters, or commands once instead of copying them across dozens of steps.

Easier updates: Change a dataset path in one place, and it updates everywhere that references it.

Better readability: A 500-line YAML file with anchors is easier to scan than a 2,000-line file full of repetition.


YAML Anchors & Aliases: The Basics

Define a reusable block with &anchor

- definitions:
    my-common-inputs: &common_inputs  # <- Anchor named "common_inputs"
      - name: dataset
        default: s3://my-bucket/train.csv
      - name: config
        default: s3://my-bucket/config.yaml

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py
    inputs: *common_inputs  # Uses the block defined above

- step:
    name: evaluate-model
    image: tensorflow/tensorflow:2.6.0
    command: python evaluate.py
    inputs: *common_inputs  # Same inputs, no repetition

Both steps now share the same input definitions. Update &common_inputs once, and both steps inherit the change.


Common Use Cases

Shared input datasets
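
A minimal sketch of this use case, assuming the definitions entry is accepted the same way as in the basics example above. The anchor name training_data, the preprocess step, and preprocess.py are hypothetical. Anchoring a single input item rather than the whole list lets each step add its own inputs next to the shared one:

```yaml
- definitions:
    training-data: &training_data   # hypothetical anchor on one input item
      name: dataset
      default: s3://my-bucket/train.csv

- step:
    name: preprocess                # hypothetical step
    image: python:3.9
    command: python preprocess.py
    inputs:
      - *training_data              # shared dataset only

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py
    inputs:
      - *training_data              # shared dataset
      - name: config                # step-specific input alongside it
        default: s3://my-bucket/config.yaml
```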

Repeated parameters
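
A sketch of sharing one parameter list between two steps. The anchor name training_params, the parameter names and defaults, and the fine-tune-model step are assumptions for illustration, not values from a real project:

```yaml
- definitions:
    training-params: &training_params   # hypothetical anchor name
      - name: epochs
        type: integer
        default: 10
      - name: learning_rate
        type: float
        default: 0.001

- step:
    name: train-model
    image: tensorflow/tensorflow:2.6.0
    command: python train.py {parameters}
    parameters: *training_params        # full list reused as-is

- step:
    name: fine-tune-model               # hypothetical step
    image: tensorflow/tensorflow:2.6.0
    command: python fine_tune.py {parameters}
    parameters: *training_params        # same list, defined once
```

Because the alias stands in for the whole list, a step that needs a different set of parameters has to spell out its own list instead.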

Standard commands
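
YAML cannot concatenate lists, so one way to share a standard command is to anchor the single command string and reference it inside each step's command list. In this sketch the anchor name install_deps and the requirements.txt file are assumptions:

```yaml
- definitions:
    install-deps: &install_deps pip install -r requirements.txt   # anchored string

- step:
    name: train-model
    image: python:3.9
    command:
      - *install_deps          # shared setup command
      - python train.py        # step-specific command

- step:
    name: evaluate-model
    image: python:3.9
    command:
      - *install_deps
      - python evaluate.py
```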


Merge and Override with <<: *anchor

You can merge an anchored mapping into another mapping with <<: *anchor and add extra fields alongside it.
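
A plain-YAML sketch of the merge key, not tied to the valohai.yaml schema. The keys base, debug_run, and batch_size are illustrative assumptions. The merge key works on mappings only, not lists, and comes from YAML 1.1, so it depends on parser support:

```yaml
base: &base
  epochs: 10
  batch_size: 32

debug_run:
  <<: *base            # merges epochs and batch_size from &base
  debug_mode: true     # extra field added next to the merged ones
  batch_size: 8        # a key set locally overrides the merged value
```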

This keeps the epochs parameter from &base and adds debug_mode.


Tips for Large YAML Files

Define anchors at the top: Keep all reusable blocks in a definitions section at the start of your file for easy reference.

Use descriptive anchor names: &training_params is clearer than &params1.

Don't over-anchor: If a block is only used once, don't create an anchor for it. Anchors are for repeated content.

Lint regularly: Run vh lint after editing anchors to catch syntax mistakes.


