Git Integration

Valohai connects directly to your Git repository to version your ML code alongside your experiments. Every execution automatically tracks the exact commit used, making your work fully reproducible.

Why Git for Machine Learning?

Machine learning projects have a unique challenge: the duality between code and data. Traditional software development is code-centric, and Git works well there. ML, however, requires versioning not only your training scripts but also your datasets, models, and metrics.

Valohai solves this by:

Code versioning through Git – Your scripts, configurations, and dependencies
Data versioning through Valohai – Your datasets, model artifacts, and experiment outputs
Automatic lineage tracking – Every execution links code commits to data artifacts

This means you can rerun an experiment from six months ago with confidence, because the recorded commit and the linked data artifacts identify exactly what was used.

How Valohai Uses Git

When you connect a repository to a Valohai project:

Valohai fetches your code at the start of each execution
The commit hash is recorded automatically, so no manual tracking is needed
You can run any commit from the UI or API—great for comparing branches
Code is ephemeral on workers—no state persists between runs
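
Valohai records the commit hash for you, but it can help to see which commit your local checkout is on before comparing it with an execution. The sketch below is only an illustration of that idea, not how Valohai resolves commits. It reads the hash straight from a plain .git directory, and the function name head_commit is hypothetical. It does not handle worktrees or submodules, where .git is a file rather than a directory.

```python
from pathlib import Path


def head_commit(repo_dir: str) -> str:
    """Return the commit hash that HEAD points to in a plain Git checkout."""
    git_dir = Path(repo_dir) / ".git"
    head = (git_dir / "HEAD").read_text().strip()

    # Detached HEAD: the file holds the commit hash itself.
    if not head.startswith("ref: "):
        return head

    ref_name = head[len("ref: "):]

    # Loose ref: for example .git/refs/heads/main holds the hash.
    loose_ref = git_dir / ref_name
    if loose_ref.is_file():
        return loose_ref.read_text().strip()

    # Packed ref: look the name up in .git/packed-refs.
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip the header comment and peeled-tag lines.
            if line.startswith(("#", "^")):
                continue
            commit, _, name = line.partition(" ")
            if name == ref_name:
                return commit

    raise ValueError(f"could not resolve {ref_name!r} in {git_dir}")
```

Calling head_commit(".") from the repository root returns the full hash, which you can compare with the commit shown on an execution.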

Your repository should contain:

Training and inference scripts
valohai.yaml configuration
Requirements files (e.g., requirements.txt, environment.yml)
Documentation and READMEs

Your repository should not contain:

Training datasets (use Valohai data stores instead)
Trained models (Valohai versions these as outputs)
Secrets or credentials (use environment variables)

Repository Requirements

Valohai works with any Git provider and supports both public and private repositories.

Supported providers:

GitHub, GitLab, Bitbucket, Azure DevOps
Self-hosted Git servers (GitLab, Gitea, etc.)
Any Git-compatible service with SSH or HTTPS access

Size considerations:

Maximum compressed commit size: 1 GB
Commits over 100 MB may cause slow fetch times
Use .vhignore to exclude large files from workers (similar to .gitignore)
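
To spot files worth excluding or moving to data storage, you can scan the working tree before committing. The following is a minimal sketch; the function name find_large_files is hypothetical. It measures uncompressed file sizes, so it is only a rough proxy for the compressed commit size that the limits above refer to.

```python
import os


def find_large_files(root: str, limit_bytes: int = 100 * 1024 * 1024) -> list[tuple[str, int]]:
    """Return (relative_path, size_in_bytes) for files under root above limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; it is not part of the working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore symlinks so broken links do not raise errors.
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((os.path.relpath(path, root), size))
    # Largest first, so the worst offenders are easy to spot.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Running find_large_files(".") from the repository root lists the candidates, largest first. Pass a smaller limit_bytes to be stricter.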

SSH Keys vs HTTPS

Valohai recommends SSH keys for private repositories:

SSH (Recommended)

More secure—no passwords in URLs
Works with deploy keys (read-only access)
Standard for production environments

HTTPS

Simpler for public repositories
Requires embedding credentials in the URL for private repositories
Less secure for private repos

For private repositories, see our provider-specific guides for setting up SSH keys.

Best Practices

Keep commits focused and small

One logical change per commit
Clear commit messages help when debugging old experiments
Avoid "WIP" or "fixes" as messages—be specific

Don't commit large files

No datasets in Git—use Valohai's data versioning
No model checkpoints—these are execution outputs
No notebooks with outputs—clear cells before committing
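
Clearing notebook cells can be scripted. The sketch below assumes the notebook uses the standard nbformat 4 JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"). The function name clear_notebook_outputs is hypothetical, and the file is rewritten in place.

```python
import json


def clear_notebook_outputs(path: str) -> int:
    """Strip outputs and execution counts from code cells; return how many cells changed."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)

    changed = 0
    for cell in notebook.get("cells", []):
        # Markdown and raw cells carry no outputs.
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            cell["outputs"] = []
            cell["execution_count"] = None
            changed += 1

    # Write the cleaned notebook back to the same file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
    return changed
```

Run it on each notebook before committing. A return value of 0 means the notebook was already clean.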

Use .vhignore strategically

Exclude files Git tracks but workers don't need
Useful for documentation, tests, or local configs
Reduces fetch time and worker disk usage

Separate data from code

Store datasets in cloud storage (S3, GCS, Azure Blob)
Reference them as inputs in valohai.yaml
Let Valohai handle data versioning and lineage
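
On the script side, keeping data out of the repository means your code reads datasets from wherever the platform places the declared inputs. The sketch below rests on assumptions you should check against your own setup: that input files are made available under /valohai/inputs/<input name>/ and that the VH_INPUTS_DIR environment variable can override that root. The input name "dataset" in the usage note and the function name input_files are hypothetical.

```python
import os
from pathlib import Path


def input_files(input_name: str) -> list[Path]:
    """List the files available for one named input, sorted by path."""
    # Assumption: inputs live under /valohai/inputs/<name>/ unless VH_INPUTS_DIR overrides the root.
    inputs_root = Path(os.environ.get("VH_INPUTS_DIR", "/valohai/inputs"))
    input_dir = inputs_root / input_name
    if not input_dir.is_dir():
        raise FileNotFoundError(f"no input directory at {input_dir}")
    return sorted(p for p in input_dir.iterdir() if p.is_file())
```

A training script could call input_files("dataset") and open the returned paths, so no dataset path or file is ever committed to Git.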

Fetch regularly

Click "Fetch Repository" after pushing new commits
Valohai caches repository state—fetching updates it
Automatic fetching is available via webhooks

Common Pitfalls

"My execution uses old code"

The repository was not fetched after the push, so Valohai still sees the cached state
Solution: Fetch after pushing new commits, or set up webhooks for automatic fetching

"Fetch is slow or times out"

Your commit is too large (>100 MB)
Solution: Use .vhignore or move large files to data storage

"Can't connect my private repo"

SSH key not added correctly
Solution: Follow the provider-specific guides, paying particular attention to the \n (newline) formatting for keys

"My secrets are exposed in logs"

Credentials hardcoded in scripts
Solution: Use Valohai environment variables (marked as secrets)
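
In your scripts, this means reading credentials from the environment at runtime and never printing them in full. Here is a minimal sketch; the helper names require_secret and mask and the variable name EXAMPLE_API_TOKEN are hypothetical.

```python
import os


def require_secret(name: str) -> str:
    """Read a secret from the environment and fail early if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def mask(value: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

For example, token = require_secret("EXAMPLE_API_TOKEN") followed by print("Using token", mask(token)) keeps the full value out of the execution logs.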

Next Steps

New to Valohai? Start by connecting a public repository to see how it works:

Connect Your Repository

Using private repositories? Follow the setup guide for your Git provider.

Advanced Git workflows:

Git Submodules – Access multiple repos
Clone During Execution – Fetch additional repos at runtime

