Git Integration

Why Git matters for machine learning reproducibility and how Valohai uses it

Valohai connects directly to your Git repository to version your ML code alongside your experiments. Every execution automatically tracks the exact commit used, making your work fully reproducible.

Why Git for Machine Learning?

Machine learning projects face a distinctive challenge: results depend on both code and data. Traditional software development is code-centric, and Git handles that well. ML also requires versioning your datasets, models, and metrics alongside your training scripts.

Valohai solves this by:

  • Code versioning through Git – Your scripts, configurations, and dependencies

  • Data versioning through Valohai – Your datasets, model artifacts, and experiment outputs

  • Automatic lineage tracking – Every execution links code commits to data artifacts

Because every execution links the code commit to the data it used, you can rerun an experiment months later from the same code and inputs.
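To make the idea of lineage concrete, here is a minimal sketch of a record that ties one run to its code commit and to fingerprints of its data. The `ExecutionRecord` class and the helper names are hypothetical and only illustrate the concept; they are not Valohai's data model or API.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a blob, used here as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ExecutionRecord:
    """Hypothetical record linking one run to its code and data."""
    commit: str                                  # Git commit hash of the code that ran
    inputs: dict = field(default_factory=dict)   # input name -> content hash
    outputs: dict = field(default_factory=dict)  # output name -> content hash


def same_experiment(a: ExecutionRecord, b: ExecutionRecord) -> bool:
    """Two runs are comparable reruns when both the commit and the input data match."""
    return a.commit == b.commit and a.inputs == b.inputs
```

With a record like this, a rerun counts as a reproduction only when the commit and every input fingerprint are identical; a change to either one shows up as a mismatch.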

How Valohai Uses Git

When you connect a repository to a Valohai project:

  1. Valohai fetches your code at the start of each execution

  2. The commit hash is recorded automatically, so no manual tracking is needed

  3. You can run any commit from the UI or API, which is useful for comparing branches

  4. Code is ephemeral on workers: no state persists between runs

Your repository should contain:

  • Training and inference scripts

  • valohai.yaml configuration

  • Requirements files (e.g., requirements.txt, environment.yml)

  • Documentation and READMEs

Your repository should not contain:

  • Training datasets (use Valohai data stores instead)

  • Trained models (Valohai versions these as outputs)

  • Secrets or credentials (use environment variables)

Repository Requirements

Valohai works with any Git provider and supports both public and private repositories.

Supported providers:

  • GitHub, GitLab, Bitbucket, Azure DevOps

  • Self-hosted Git servers (GitLab, Gitea, etc.)

  • Any Git-compatible service with SSH or HTTPS access

Size considerations:

  • Maximum compressed commit size: 1 GB

  • Commits over 100 MB may cause slow fetch times

  • Use .vhignore to exclude large files from workers (similar to .gitignore)
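Before committing, it can help to scan the working tree for files that would push a commit toward these limits. The sketch below is one way to do that with the standard library. The function name is hypothetical, and it measures uncompressed file sizes on disk, so treat the result as a rough proxy for the compressed commit size rather than an exact check.

```python
import os

# 100 MB, the size above which the text warns about slow fetches
SLOW_FETCH_BYTES = 100 * 1024 * 1024


def find_large_files(root, threshold_bytes=SLOW_FETCH_BYTES):
    """Return (relative_path, size_in_bytes) pairs at or above the threshold, largest first."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; it is not part of the working tree
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks so targets are not counted twice
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                large.append((os.path.relpath(path, root), size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Any file this reports is a candidate for .vhignore or for moving out of Git into data storage.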

SSH Keys vs HTTPS

Valohai recommends SSH keys for private repositories:

SSH (Recommended)

  • More secure—no passwords in URLs

  • Works with deploy keys (read-only access)

  • Standard for production environments

HTTPS

  • Simpler for public repositories

  • Requires embedding credentials in the URL for private repositories

  • Less secure for private repos

For private repositories, see our provider-specific guides for setting up SSH keys.
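To check which access style a repository URL uses, and whether it carries embedded credentials, a small classifier like the following can help. It is an illustrative sketch built on the standard library; the function name and the returned labels are hypothetical.

```python
from urllib.parse import urlsplit


def classify_repo_url(url):
    """Classify a Git remote URL as 'ssh', 'https', 'https-with-credentials', or 'unknown'."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https"):
        # user:password@host in an HTTPS URL means credentials are embedded
        if parts.username or parts.password:
            return "https-with-credentials"
        return "https"
    if parts.scheme == "ssh":
        return "ssh"
    # scp-like syntax (git@host:org/repo.git) has no scheme at all
    if "://" not in url and "@" in url and ":" in url.split("@", 1)[1]:
        return "ssh"
    return "unknown"
```

For example, `classify_repo_url("git@github.com:org/repo.git")` returns `"ssh"`, while an HTTPS URL containing `user:token@` is flagged as `"https-with-credentials"`.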

Best Practices

Keep commits focused and small

  • One logical change per commit

  • Clear commit messages help when debugging old experiments

  • Avoid "WIP" or "fixes" as messages—be specific

Don't commit large files

  • No datasets in Git—use Valohai's data versioning

  • No model checkpoints—these are execution outputs

  • No notebooks with outputs—clear cells before committing
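Clearing notebook outputs can be scripted. The sketch below assumes the notebook uses the nbformat 4 JSON layout, in which each code cell has `outputs` and `execution_count` fields; the function name is hypothetical.

```python
import json


def clear_notebook_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results, plots, and logs
            cell["execution_count"] = None  # reset the In [n] counter
    return cleaned
```

To apply it to a file, load the notebook with `json.load`, pass it through the function, and write the result back with `json.dump`. Running `jupyter nbconvert --clear-output --inplace` on the notebook achieves a similar result from the command line.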

Use .vhignore strategically

  • Exclude files Git tracks but workers don't need

  • Useful for documentation, tests, or local configs

  • Reduces fetch time and worker disk usage

Separate data from code

  • Store datasets in cloud storage (S3, GCS, Azure Blob)

  • Reference them as inputs in valohai.yaml

  • Let Valohai handle data versioning and lineage
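On the script side, keeping data out of Git means reading files from wherever the platform downloads the inputs instead of from the repository. The sketch below assumes inputs are placed under `<root>/<input_name>/<filename>` and that the root directory is exposed through a `VH_INPUTS_DIR` environment variable; both are assumptions to verify against your own setup. The helper name and the local fallback directory are hypothetical.

```python
import os


def input_path(input_name, filename, default_root="./inputs"):
    """Build the path to a downloaded input file.

    Assumes the layout <root>/<input_name>/<filename>, where root is taken
    from the VH_INPUTS_DIR environment variable when it is set (assumption).
    """
    root = os.environ.get("VH_INPUTS_DIR", default_root)
    return os.path.join(root, input_name, filename)
```

The fallback lets the same script run locally against a folder that is excluded from Git, so the training code never needs a dataset path inside the repository.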

Fetch regularly

  • Click "Fetch Repository" after pushing new commits

  • Valohai caches repository state—fetching updates it

  • Automatic fetching is available via webhooks

Common Pitfalls

"My execution uses old code"

  • You forgot to fetch after pushing

  • Solution: Always fetch after new commits

"Fetch is slow or times out"

  • Your commit is too large (>100 MB)

  • Solution: Use .vhignore or move large files to data storage

"Can't connect my private repo"

  • SSH key not added correctly

  • Solution: Follow the provider-specific guides, paying particular attention to the \n (newline) formatting of the key

"My secrets are exposed in logs"

  • Credentials hardcoded in scripts

  • Solution: Use Valohai environment variables (marked as secrets)
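In a script, this means reading credentials from the environment at runtime and never printing them in full. A minimal sketch, in which the helper names and the `MY_API_TOKEN` variable are hypothetical:

```python
import os


def require_secret(name):
    """Read a secret from the environment, failing fast if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def redact(value, visible=4):
    """Mask a secret for logging, keeping only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

A script would call `token = require_secret("MY_API_TOKEN")` once at startup and log `redact(token)` if it needs to confirm which credential is in use.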

Next Steps

New to Valohai? Start by connecting a public repository to see how it works.

Using private repositories? Follow the guide for your Git provider.

