Git Submodules

Access multiple Git repositories from the same Valohai project

Git submodules let you include one repository as a subdirectory within another. This is useful when your ML project depends on shared code, models, or configurations stored in separate repositories.

What Are Git Submodules?

Submodules are pointers to specific commits in external repositories. They let you:

  • Reuse code across multiple projects without duplication

  • Version external dependencies alongside your project

  • Keep repositories separate while maintaining relationships

When to use submodules:

  • Shared utility libraries used by multiple ML projects

  • Common preprocessing pipelines

  • Vendored dependencies that need version tracking

When not to use submodules:

  • Frequent changes to both repos (consider merging instead)

  • Simple one-time code sharing (just copy the files)

  • External packages available via pip/conda

Add a Submodule

Add a submodule to your repository:

# Add submodule in a subdirectory
git submodule add git@github.com:username/shared-utils.git libs/shared-utils

# Commit the submodule reference
git add .gitmodules libs/shared-utils
git commit -m "Add shared-utils submodule"
git push

Your repository now contains:

  • .gitmodules – Configuration file listing submodules

  • libs/shared-utils/ – Directory pointing to the external repo
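To see what "pointing to the external repo" means in practice, the following sketch adds a submodule between two throwaway local repositories and inspects what Git records. The temporary paths, the example identity, and the `protocol.file.allow=always` override (needed only because the demo uses a local path instead of an SSH URL) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: show what `git submodule add` records, using throwaway local repos.
set -eu

work="$(mktemp -d)"
# Hypothetical identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# A stand-in for the shared-utils repository.
git init -q "$work/shared-utils"
echo "def helper(): pass" > "$work/shared-utils/utils.py"
git -C "$work/shared-utils" add utils.py
git -C "$work/shared-utils" commit -q -m "Initial commit"

# The main repository that will contain the submodule.
git init -q "$work/main-repo"
cd "$work/main-repo"
# Local-path submodules need the file transport enabled explicitly on newer Git.
git -c protocol.file.allow=always submodule --quiet add "$work/shared-utils" libs/shared-utils
git commit -q -m "Add shared-utils submodule"

# Mode 160000 marks a gitlink: a pointer to a commit, not a copy of the files.
entry_mode="$(git ls-files --stage libs/shared-utils | cut -d' ' -f1)"
# The commit the main repository has pinned for the submodule.
pinned_commit="$(git rev-parse HEAD:libs/shared-utils)"
upstream_commit="$(git -C "$work/shared-utils" rev-parse HEAD)"
echo "mode=$entry_mode pinned=$pinned_commit"
```

The main repository stores only the path, the URL (in .gitmodules), and one commit hash per submodule, which is why the submodule repository must be reachable separately whenever the project is fetched.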

Configure Valohai Access

Valohai needs access to both your main repository and all submodule repositories.

Option 1: Use One SSH Key for All Repositories

The simplest approach is to use one SSH key that has access to the main repository and every submodule repository.

For GitHub: A GitHub deploy key can be attached to only one repository, so a single deploy key can't cover the main repo and its submodules. Instead:

  1. Generate an SSH key pair in Valohai

  2. Add the public key to your GitHub account (not as a deploy key):

    • Go to Settings (your account) → SSH and GPG keys → New SSH key

  3. Add the private key to Valohai project settings

Now Valohai can access your main repo and all submodules.

For GitLab: GitLab allows shared deploy keys:

  1. Add the deploy key to your main repository

  2. Go to each submodule repository

  3. Navigate to Settings → Repository → Deploy keys → Privately accessible deploy keys

  4. Enable the same key for each submodule

For Bitbucket: Bitbucket allows reusing deploy keys:

  1. Add the same deploy key to your main repository

  2. Add the same deploy key to each submodule repository

No further configuration is needed; Bitbucket accepts the same key on multiple repositories.

Option 2: Clone Submodules During Execution

If you can't use the same SSH key, clone submodules manually in your step commands:

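- step:
    name: train-with-submodules
    image: python:3.11
    command:
      # Make sure git is available in the image
      - apt-get update && apt-get install -y git
      # Write the key from the secret variable; %b expands literal \n sequences
      - printf '%b\n' "$SUBMODULE_KEY" > ~/submodule_key
      - chmod 600 ~/submodule_key
      # Use this key for all git-over-SSH operations that follow
      - export GIT_SSH_COMMAND="ssh -o StrictHostKeyChecking=no -i ~/submodule_key"
      - git submodule update --init --recursive
      - python train.py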

Store SUBMODULE_KEY as a secret environment variable in project settings.
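As a rough illustration of the key-handling part, the sketch below writes a key from an environment variable to a file and checks the result. It assumes the secret stores line breaks as literal `\n` sequences (which is what expanding escapes in the step command implies); the key value here is a placeholder, not a real key.

```shell
#!/bin/sh
# Sketch: write a key stored with literal "\n" separators to a file.
set -eu

# Placeholder value for illustration only; in a real execution this would
# come from the secret environment variable, not from the script.
SUBMODULE_KEY='-----BEGIN OPENSSH PRIVATE KEY-----\nexamplebody\n-----END OPENSSH PRIVATE KEY-----'

key_file="$(mktemp)"
# %b expands the backslash escapes; quoting keeps the value as one argument.
printf '%b\n' "$SUBMODULE_KEY" > "$key_file"
# ssh refuses private keys that are readable by other users.
chmod 600 "$key_file"

line_count="$(wc -l < "$key_file" | tr -d ' ')"
first_line="$(head -n 1 "$key_file")"
echo "wrote $line_count lines to $key_file"
```

If the secret is stored with real line breaks instead, the same command still works because there are no escapes left to expand.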

Working with Submodules

After adding a submodule, team members must initialize it:

# Clone the main repository
git clone git@github.com:username/main-repo.git
cd main-repo

# Initialize submodules
git submodule update --init --recursive
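The effect of the initialization step can be tried locally. This sketch builds two throwaway repositories, shows that a plain clone leaves the submodule directory empty, and then populates it. It also uses `git clone --recurse-submodules`, which clones and initializes in one step. The paths, the example identity, and the `protocol.file.allow=always` override (required for local-path submodules on newer Git) are assumptions of the demo.

```shell
#!/bin/sh
# Sketch: plain clone vs. initialized submodules, using throwaway local repos.
set -eu

work="$(mktemp -d)"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Stand-ins for the shared and main repositories.
git init -q "$work/shared-utils"
echo "shared code" > "$work/shared-utils/utils.txt"
git -C "$work/shared-utils" add utils.txt
git -C "$work/shared-utils" commit -q -m "Initial commit"
git init -q "$work/main-repo"
git -C "$work/main-repo" -c protocol.file.allow=always \
  submodule --quiet add "$work/shared-utils" libs/shared-utils
git -C "$work/main-repo" commit -q -m "Add shared-utils submodule"

# A plain clone creates the submodule directory but leaves it empty.
git clone -q "$work/main-repo" "$work/plain"
plain_files="$(ls -A "$work/plain/libs/shared-utils" | wc -l | tr -d ' ')"

# Initializing afterwards fetches the pinned commit into that directory.
git -C "$work/plain" -c protocol.file.allow=always \
  submodule --quiet update --init --recursive

# Alternatively, clone and initialize in a single command.
git -c protocol.file.allow=always clone -q --recurse-submodules \
  "$work/main-repo" "$work/full"
echo "files in plain clone before init: $plain_files"
```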

Update a submodule to a newer commit:

cd libs/shared-utils
git pull origin main
cd ../..
git add libs/shared-utils
git commit -m "Update shared-utils to latest"
git push

Valohai automatically fetches submodules if your SSH key has access to all repositories. Note that the repository size limits still apply:

  • Maximum compressed commit size: 1 GB

  • Warning threshold: 100 MB (may cause slow fetches or timeouts)

Submodule URL Formats

Submodule URLs for private repositories must use the SSH format:

# Correct - SSH format
git@github.com:username/repo.git

# Incorrect - HTTPS won't work with deploy keys
https://github.com/username/repo.git

Check your .gitmodules file:

[submodule "libs/shared-utils"]
    path = libs/shared-utils
    url = git@github.com:username/shared-utils.git
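If a .gitmodules file already contains HTTPS URLs, they can be rewritten with `git config --file`. The sketch below works on a sample .gitmodules in a temporary directory; the `to_ssh_url` helper and the repository names are hypothetical, and the conversion assumes the common `https://host/owner/repo.git` layout.

```shell
#!/bin/sh
# Sketch: find HTTPS submodule URLs in .gitmodules and rewrite them to SSH form.
set -eu

cd "$(mktemp -d)"
# Example .gitmodules with one HTTPS URL (hypothetical repository name).
git config --file .gitmodules submodule.libs/shared-utils.path libs/shared-utils
git config --file .gitmodules submodule.libs/shared-utils.url \
  https://github.com/username/shared-utils.git

# Turn https://host/owner/repo.git into git@host:owner/repo.git.
to_ssh_url() {
  printf '%s\n' "$1" | sed -E 's#^https://([^/]+)/(.+)$#git@\1:\2#'
}

# List every submodule URL key and rewrite the HTTPS ones in place.
git config --file .gitmodules --get-regexp '^submodule\..*\.url$' |
while read -r key url; do
  case "$url" in
    https://*) git config --file .gitmodules "$key" "$(to_ssh_url "$url")" ;;
  esac
done

new_url="$(git config --file .gitmodules --get submodule.libs/shared-utils.url)"
echo "$new_url"
```

In an existing checkout, run `git submodule sync` after changing .gitmodules so the local submodule configuration picks up the new URL, then commit the file.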

Troubleshooting

"Submodule initialization failed"

  • Valohai's SSH key doesn't have access to the submodule repository

  • Solution: Add the public key to the submodule repo (see provider-specific guides)

"Submodule directory is empty"

  • The submodule wasn't initialized

  • Solution: Make sure .gitmodules is committed and Valohai has fetched the latest commit

"Permission denied for submodule"

  • SSH key works for main repo but not submodules

  • Solution: Use a single SSH key with access to all repos (see Option 1 above)

"Submodule points to wrong commit"

  • You updated the submodule but didn't commit the reference

  • Solution: After updating a submodule, commit the change in the main repo
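The last case can be checked locally with `git submodule status`, which prefixes the line with `+` when the checked-out submodule commit differs from the one recorded in the main repository. The sketch below reproduces the situation with throwaway local repositories; the paths, the example identity, and the `protocol.file.allow=always` override are assumptions of the demo.

```shell
#!/bin/sh
# Sketch: detect a submodule whose checked-out commit differs from the pinned one.
set -eu

work="$(mktemp -d)"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Throwaway stand-ins for the shared and main repositories.
git init -q "$work/shared-utils"
git -C "$work/shared-utils" commit -q --allow-empty -m "v1"
git init -q "$work/main-repo"
cd "$work/main-repo"
git -c protocol.file.allow=always submodule --quiet add "$work/shared-utils" libs/shared-utils
git commit -q -m "Add shared-utils submodule"

# Move the submodule forward without committing the new pointer in the main repo.
git -C libs/shared-utils commit -q --allow-empty -m "v2"

# A leading "+" means the checked-out commit is not the one the main repo records.
status_before="$(git submodule status libs/shared-utils | cut -c1)"

# Committing the path records the new pointer; the marker becomes a space.
git add libs/shared-utils
git commit -q -m "Update shared-utils pointer"
status_after="$(git submodule status libs/shared-utils | cut -c1)"
echo "before='$status_before' after='$status_after'"
```

Only the committed (and pushed) pointer is visible to anything that fetches the main repository, so a `+` marker before pushing is a sign that the remote will still reference the old commit.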

Alternative: Clone During Execution

If submodules are too complex, consider cloning additional repositories during execution:

- step:
    name: train-with-external-code
    image: python:3.11
    command:
      - apt-get update && apt-get install -y git
      - git clone https://github.com/username/shared-utils.git /shared-utils
      - python train.py

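This is simpler but loses version tracking: each run clones whatever the default branch contains at that moment, and an HTTPS URL without credentials only works for public repositories. See Clone Repositories During Execution for details.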
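One way to recover some reproducibility with this approach is to check out a fixed commit after cloning. The following is a minimal sketch, not a Valohai feature: `clone_at_commit` is a hypothetical helper, and the demo runs against a throwaway local repository rather than a real remote.

```shell
#!/bin/sh
# Sketch: clone a repository and check out one fixed commit.
set -eu

# clone_at_commit URL COMMIT DEST (hypothetical helper for illustration)
clone_at_commit() {
  git clone -q "$1" "$3"
  # Detached checkout of the exact commit, so newer upstream commits are ignored.
  git -C "$3" -c advice.detachedHead=false checkout -q "$2"
}

# Demo against a throwaway local repository instead of a real remote.
work="$(mktemp -d)"
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
git init -q "$work/shared-utils"
git -C "$work/shared-utils" commit -q --allow-empty -m "v1"
pinned="$(git -C "$work/shared-utils" rev-parse HEAD)"
git -C "$work/shared-utils" commit -q --allow-empty -m "v2"

clone_at_commit "$work/shared-utils" "$pinned" "$work/checkout"
checked_out="$(git -C "$work/checkout" rev-parse HEAD)"
echo "checked out $checked_out"
```

The commit hash then has to be maintained by hand in the step command, which is the bookkeeping that submodules otherwise do for you.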

Best Practices

Pin submodules to specific commits

  • Don't track moving branches such as main or master (for example, with git submodule update --remote)

  • Use specific commit hashes for reproducibility

  • Update deliberately, not automatically

Keep submodules stable

  • Submodules should change infrequently

  • Frequent changes indicate the code should be merged

Document submodule setup

  • Add initialization instructions to your README

  • Explain why each submodule exists

Use shallow clones for large submodules

git submodule update --init --depth 1
