Git Integration
Why Git matters for machine learning reproducibility and how Valohai uses it
Valohai connects directly to your Git repository to version your ML code alongside your experiments. Every execution automatically tracks the exact commit used, making your work fully reproducible.
Why Git for Machine Learning?
Machine learning projects have a unique challenge: a result depends on both code and data. Traditional software development is code-centric, and Git handles that well. ML also requires versioning your datasets, models, and metrics alongside your training scripts.
Valohai addresses this with:
Code versioning through Git – Your scripts, configurations, and dependencies
Data versioning through Valohai – Your datasets, model artifacts, and experiment outputs
Automatic lineage tracking – Every execution links code commits to data artifacts
Because each execution records both its commit and its data artifacts, you can rerun an experiment from six months ago using the same code and inputs.
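To make the lineage idea concrete, here is a minimal sketch of the kind of record that ties a commit to data artifacts. The `ExecutionRecord` class, its field names, and the example values are hypothetical and chosen for illustration only; they are not Valohai's internal schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRecord:
    """Hypothetical lineage record: one execution = one commit + its data."""

    commit: str                      # Git commit hash the code ran from
    inputs: tuple[str, ...] = ()     # URIs of the datasets the run read
    outputs: tuple[str, ...] = ()    # names of the artifacts the run wrote


def same_code_and_data(a: ExecutionRecord, b: ExecutionRecord) -> bool:
    """Two runs are comparable reruns when commit and inputs both match."""
    return a.commit == b.commit and sorted(a.inputs) == sorted(b.inputs)


# Placeholder values, not real commits or buckets.
original = ExecutionRecord(
    commit="3f2a9c1",
    inputs=("s3://example-bucket/train.csv",),
    outputs=("model.pkl",),
)
rerun = ExecutionRecord(
    commit="3f2a9c1",
    inputs=("s3://example-bucket/train.csv",),
)

print(same_code_and_data(original, rerun))  # True
```

The point of the sketch is that reproducibility needs both halves of the record: a matching commit with different inputs is a different experiment.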
How Valohai Uses Git
When you connect a repository to a Valohai project:
Valohai fetches your code at the start of each execution
The commit hash is recorded automatically, so no manual tracking is needed
You can run any commit from the UI or API—great for comparing branches
Code is ephemeral on workers—no state persists between runs
Your repository should contain:
Training and inference scripts
valohai.yaml configuration
Requirements files (e.g., requirements.txt, environment.yml)
Documentation and READMEs
Your repository should not contain:
Training datasets (use Valohai data stores instead)
Trained models (Valohai versions these as outputs)
Secrets or credentials (use environment variables)
Repository Requirements
Valohai works with any Git provider and supports both public and private repositories:
Supported providers:
GitHub, GitLab, Bitbucket, Azure DevOps
Self-hosted Git servers (GitLab, Gitea, etc.)
Any Git-compatible service with SSH or HTTPS access
Size considerations:
Maximum compressed commit size: 1 GB
Commits over 100 MB may cause slow fetch times
Use .vhignore to exclude large files from workers (similar to .gitignore)
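Before pushing, it can help to see which files are likely to inflate a commit. The following is a rough sketch that lists files in a working tree above a size threshold. It measures on-disk size, which is only a proxy for the compressed commit size that the limits above refer to, and it reads "100 MB" as 100 * 1024 * 1024 bytes, which is an assumption. The function name `find_large_files` is illustrative.

```python
from pathlib import Path

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit_bytes=LIMIT_BYTES):
    """Return (relative path, size in bytes) for files larger than limit_bytes."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip Git's own object store; it is not part of the working tree.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((relative.as_posix(), size))
    # Largest files first, so the worst offenders are at the top.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the repository root returns candidates for .vhignore or for moving to data storage.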
SSH Keys vs HTTPS
Valohai recommends SSH keys for private repositories:
SSH (Recommended)
More secure—no passwords in URLs
Works with deploy keys (read-only access)
Standard for production environments
HTTPS
Simpler for public repositories
Requires embedding credentials in the URL for private repositories
Less secure for private repos
For private repositories, see our provider-specific guides for setting up SSH keys.
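As an illustration of the difference, the sketch below classifies a Git remote URL as SSH or HTTPS and flags HTTPS URLs that carry embedded credentials. The function name `classify_repo_url` and the example URLs are hypothetical; the parsing relies only on Python's standard `urllib.parse`.

```python
from urllib.parse import urlsplit


def classify_repo_url(url):
    """Return (transport, has_embedded_credentials) for a Git remote URL."""
    # scp-like syntax such as git@host:org/repo.git has no "://" separator.
    if "://" not in url and "@" in url and ":" in url:
        return "ssh", False

    parts = urlsplit(url)
    if parts.scheme == "ssh":
        return "ssh", False
    if parts.scheme in ("http", "https"):
        # Any user info in the URL means a credential is stored with it.
        has_credentials = parts.username is not None or parts.password is not None
        return parts.scheme, has_credentials
    return "unknown", False


print(classify_repo_url("git@github.com:example-org/example-repo.git"))
# ('ssh', False)
print(classify_repo_url("https://user:token@github.com/example-org/example-repo.git"))
# ('https', True)
```

A check like this could run wherever repository URLs are configured, to catch a token pasted into an HTTPS URL before it is saved.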
Best Practices
Keep commits focused and small
One logical change per commit
Clear commit messages help when debugging old experiments
Avoid "WIP" or "fixes" as messages—be specific
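One way to enforce specific messages is a small check that could be wired into a commit-msg hook. This is a sketch only: the list of vague words and the minimum word count are arbitrary assumptions, and `is_vague` is a hypothetical helper name.

```python
# Assumption: these subjects carry too little information to debug from later.
VAGUE_MESSAGES = {"wip", "fix", "fixes", "update", "updates", "changes", "misc"}


def is_vague(message, min_words=3):
    """Return True when the subject line is a stock phrase or very short."""
    stripped = message.strip()
    subject = stripped.splitlines()[0] if stripped else ""
    normalized = subject.lower().strip(" .!")
    return normalized in VAGUE_MESSAGES or len(subject.split()) < min_words


print(is_vague("WIP"))                                          # True
print(is_vague("Lower learning rate to 1e-4 for fine-tuning"))  # False
```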
Don't commit large files
No datasets in Git—use Valohai's data versioning
No model checkpoints—these are execution outputs
No notebooks with outputs—clear cells before committing
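Clearing notebook cells can be automated. The sketch below strips outputs and execution counts from code cells, assuming the standard Jupyter notebook JSON layout (nbformat 4) with a top-level `cells` list. The function names are illustrative; dedicated tools such as nbstripout handle more edge cases.

```python
import json


def clear_outputs(notebook):
    """Strip outputs and execution counts from code cells, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Rewrite the .ipynb file at `path` with all code-cell outputs removed."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(notebook, fh, indent=1)
        fh.write("\n")
```

Markdown cells are left untouched, so the notebook's narrative survives while embedded images and large printed tables do not end up in Git.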
Use .vhignore strategically
Exclude files Git tracks but workers don't need
Useful for documentation, tests, or local configs
Reduces fetch time and worker disk usage
Separate data from code
Store datasets in cloud storage (S3, GCS, Azure Blob)
Reference them as inputs in valohai.yaml
Let Valohai handle data versioning and lineage
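On the script side, separating data from code means the training script reads from an inputs directory rather than from files in the repository. The sketch below assumes the worker exposes that directory through the `VH_INPUTS_DIR` environment variable and that each input is placed in a subdirectory named after it; the input name `dataset` and the local fallback path are hypothetical.

```python
import os
from pathlib import Path


def input_dir(name, environ=os.environ):
    """Directory where the input called `name` is expected to be available."""
    # Assumption: VH_INPUTS_DIR is set on workers; the fallback is for local runs.
    base = environ.get("VH_INPUTS_DIR", "./inputs")
    return Path(base) / name


def list_input_files(name, environ=os.environ):
    """Return the files found for one input, sorted by name."""
    directory = input_dir(name, environ)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())
```

A training script would then call `list_input_files("dataset")` and never hardcode a path to a data file inside the repository.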
Fetch regularly
Click "Fetch Repository" after pushing new commits
Valohai caches repository state—fetching updates it
Automatic fetching is available via webhooks
Common Pitfalls
"My execution uses old code"
You forgot to fetch after pushing
Solution: Always fetch after new commits
"Fetch is slow or times out"
Your commit is too large (>100 MB)
Solution: Use .vhignore or move large files to data storage
"Can't connect my private repo"
SSH key not added correctly
Solution: Follow the provider-specific guides, paying particular attention to the \n formatting for keys
"My secrets are exposed in logs"
Credentials hardcoded in scripts
Solution: Use Valohai environment variables (marked as secrets)
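For the last pitfall, a script can read credentials from the environment instead of hardcoding them, and mask them if they ever need to appear in a log line. This is a minimal sketch; the helper names and the variable name `STORAGE_API_TOKEN` are hypothetical.

```python
import os


def require_secret(name, environ=os.environ):
    """Read a credential from the environment, failing early if it is missing."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask(secret, visible=4):
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Hypothetical variable name; the value here stands in for a real token.
token = require_secret("STORAGE_API_TOKEN", {"STORAGE_API_TOKEN": "abcd1234efgh"})
print(f"Using token {mask(token)}")  # Using token ********efgh
```

Failing early with a clear message is a deliberate choice: a missing variable surfaces at startup rather than as an authentication error partway through training.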
Next Steps
New to Valohai? Start with connecting a public repository to see how it works:
Using private repositories? Follow your Git provider's guide:
Advanced Git workflows:
Git Submodules – Access multiple repos
Clone During Execution – Fetch additional repos at runtime
