# Git Integration

Valohai connects directly to your Git repository to version your ML code alongside your experiments. Every execution automatically tracks the exact commit used, making your work fully reproducible.

## Why Git for Machine Learning?

Machine learning projects face a distinctive challenge: the **duality between code and data**. Traditional software development is code-centric, and Git handles that well. ML, however, requires versioning both your training scripts *and* your datasets, models, and metrics.

Valohai solves this by:

* **Code versioning through Git** – Your scripts, configurations, and dependencies
* **Data versioning through Valohai** – Your datasets, model artifacts, and experiment outputs
* **Automatic lineage tracking** – Every execution links code commits to data artifacts

This means you can trace an experiment from six months ago back to the exact commit and data artifacts it used, and rerun it from there.

## How Valohai Uses Git

When you connect a repository to a Valohai project:

1. **Valohai fetches your code** at the start of each execution
2. **The commit hash is recorded** automatically, so no manual tracking is needed
3. **You can run any commit** from the UI or API—great for comparing branches
4. **Code is ephemeral** on workers—no state persists between runs
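
To make the idea of a recorded commit hash concrete, here is a minimal local sketch of what pinning code to a commit looks like. It is not Valohai's implementation; the helper names `is_full_sha` and `current_commit` are hypothetical, and the check assumes a repository using Git's default SHA-1 object format (40 hexadecimal characters).

```python
import re
import subprocess

# Assumption: SHA-1 object format, so a full commit hash is 40 hex characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_sha(ref: str) -> bool:
    """Return True if ref looks like a full commit hash rather than a branch or tag name."""
    return FULL_SHA.fullmatch(ref) is not None


def current_commit(repo_dir: str = ".") -> str:
    """Return the hash of the commit checked out in repo_dir, using `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

A branch name such as `main` moves over time, while a full hash always points at the same snapshot. That is why a recorded hash, not a branch name, is what makes a run reproducible.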

Your repository should contain:

* Training and inference scripts
* `valohai.yaml` configuration
* Requirements files (e.g., `requirements.txt`, `environment.yml`)
* Documentation and READMEs

Your repository should **not** contain:

* Training datasets (use Valohai data stores instead)
* Trained models (Valohai versions these as outputs)
* Secrets or credentials (use environment variables)

## Repository Requirements

Valohai works with any Git provider and supports both public and private repositories.

**Supported providers:**

* GitHub, GitLab, Bitbucket, Azure DevOps
* Self-hosted Git servers (GitLab, Gitea, etc.)
* Any Git-compatible service with SSH or HTTPS access

**Size considerations:**

* Maximum compressed commit size: **1 GB**
* Commits over 100 MB may cause slow fetch times
* Use `.vhignore` to exclude large files from workers (similar to `.gitignore`)
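
Before pushing, it can help to see which files are likely to push a commit toward these limits. The sketch below is only an approximation: it measures uncompressed on-disk sizes in the working tree (including untracked files), not the compressed commit size that Valohai checks. The function name `large_files` is hypothetical.

```python
import os

# The 100 MB warning threshold mentioned above, in bytes.
WARN_BYTES = 100 * 1024 * 1024


def large_files(root: str, threshold: int = WARN_BYTES) -> list:
    """List (relative_path, size_in_bytes) for files at or above threshold, largest first."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; it is not part of the working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may be broken
            size = os.path.getsize(path)
            if size >= threshold:
                hits.append((os.path.relpath(path, root), size))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Calling `large_files(".")` from the repository root returns the candidates to move to data storage or list in `.vhignore`.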

## SSH Keys vs HTTPS

Valohai recommends **SSH keys** for private repositories:

**SSH (Recommended)**

* More secure—no passwords in URLs
* Works with deploy keys (read-only access)
* Standard for production environments

**HTTPS**

* Simpler for public repositories
* Requires embedding credentials in the URL for private repositories
* Less secure for private repos

For private repositories, see our provider-specific guides for setting up SSH keys.
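
As an illustration of the difference, the following sketch classifies a repository URL by transport and flags HTTPS URLs that carry embedded credentials. The function name `describe_repo_url` and the example URLs are hypothetical; the parsing uses only Python's standard `urllib.parse`.

```python
import re
from urllib.parse import urlsplit

# scp-like SSH syntax such as git@github.com:example/repo.git has no scheme.
SCP_LIKE = re.compile(r"^[\w.-]+@[\w.-]+:")


def describe_repo_url(url: str) -> dict:
    """Report the transport of a repository URL and whether credentials are embedded in it."""
    if "://" not in url:
        transport = "ssh" if SCP_LIKE.match(url) else "unknown"
        return {"transport": transport, "embedded_credentials": False}

    parts = urlsplit(url)
    transport = parts.scheme if parts.scheme in ("ssh", "https", "http") else "unknown"
    # For HTTP(S), any user or password in the URL is a credential stored in plain text.
    # For ssh://, the user part (usually "git") is not a secret.
    embedded = transport in ("https", "http") and (
        parts.username is not None or parts.password is not None
    )
    return {"transport": transport, "embedded_credentials": embedded}
```

For example, `describe_repo_url("https://user:token@github.com/example/repo.git")` reports embedded credentials, while the SSH form `git@github.com:example/repo.git` does not.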

## Best Practices

**Keep commits focused and small**

* One logical change per commit
* Clear commit messages help when debugging old experiments
* Avoid "WIP" or "fixes" as messages—be specific

**Don't commit large files**

* No datasets in Git—use Valohai's data versioning
* No model checkpoints—these are execution outputs
* No notebooks with outputs—clear cells before committing
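
Clearing notebook outputs can be scripted. The sketch below assumes the notebook is in the nbformat 4 JSON layout, where each code cell has `outputs` and `execution_count` fields; the function names are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts of all code cells, modifying the dict in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path: str) -> None:
    """Rewrite the notebook at path with its outputs cleared."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    strip_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

If Jupyter is installed, `jupyter nbconvert --clear-output --inplace notebook.ipynb` achieves the same result without custom code.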

**Use .vhignore strategically**

* Exclude files Git tracks but workers don't need
* Useful for documentation, tests, or local configs
* Reduces fetch time and worker disk usage

**Separate data from code**

* Store datasets in cloud storage (S3, GCS, Azure Blob)
* Reference them as inputs in `valohai.yaml`
* Let Valohai handle data versioning and lineage
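
On the script side, separating data from code means reading files from wherever the platform places them instead of from paths inside the repository. The sketch below rests on assumptions to verify against the Valohai documentation: that each input's files are placed in a directory named after the input, and that the base directory is given by a `VH_INPUTS_DIR` environment variable with `/valohai/inputs` as the fallback. The input name `dataset` and the helper names are hypothetical.

```python
import os
from pathlib import Path


def input_dir(name: str) -> Path:
    """Return the directory expected to hold the files of the input called name."""
    # Assumption: base directory comes from VH_INPUTS_DIR, else /valohai/inputs.
    base = Path(os.environ.get("VH_INPUTS_DIR", "/valohai/inputs"))
    return base / name


def input_files(name: str) -> list:
    """Return the sorted files of an input, or an empty list if its directory is missing."""
    directory = input_dir(name)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())
```

A training script would then call `input_files("dataset")` rather than opening a file committed to Git, which keeps the repository free of data.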

**Fetch regularly**

* Click "Fetch Repository" after pushing new commits
* Valohai caches repository state—fetching updates it
* Automatic fetching is available via webhooks

## Common Pitfalls

**"My execution uses old code"**

* Valohai's cached copy of the repository predates your latest push
* Solution: Click "Fetch Repository" after pushing new commits, or set up automatic fetching via webhooks

**"Fetch is slow or times out"**

* Your commit is too large (>100 MB)
* Solution: Use `.vhignore` or move large files to data storage

**"Can't connect my private repo"**

* SSH key not added correctly
* Solution: Follow provider-specific guides—especially the `\n` formatting for keys

**"My secrets are exposed in logs"**

* Credentials hardcoded in scripts
* Solution: Use Valohai environment variables (marked as secrets)
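
On the script side, this means reading credentials from the environment at runtime and never printing their values. A minimal sketch, with hypothetical helper names and a hypothetical variable name `DB_PASSWORD`:

```python
import os


def get_secret(name: str) -> str:
    """Read a required secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask(value: str) -> str:
    """Return a log-safe placeholder that reveals only the length of the secret."""
    return f"<redacted: {len(value)} chars>"
```

A script would call `get_secret("DB_PASSWORD")` once at startup and log `mask(...)` if it needs to confirm the value was loaded.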

## Next Steps

**New to Valohai?** Start by connecting a public repository to see how it works:

* [Connect Your Repository](https://docs.valohai.com/git-integration/connect-repo)

**Using private repositories?** Follow your Git provider's guide:

* [GitHub Private Repos](https://docs.valohai.com/git-integration/private-repositories/github)
* [GitLab Private Repos](https://docs.valohai.com/git-integration/private-repositories/gitlab)
* [Azure DevOps Private Repos](https://docs.valohai.com/git-integration/private-repositories/azure-devops)
* [Bitbucket Private Repos](https://docs.valohai.com/git-integration/private-repositories/bitbucket)

**Advanced Git workflows:**

* [Git Submodules](https://docs.valohai.com/git-integration/advanced-topics/submodules) – Access multiple repos
* [Clone During Execution](https://docs.valohai.com/git-integration/advanced-topics/clone-during-execution) – Fetch additional repos at runtime
