# Git Integration

Valohai connects directly to your Git repository to version your ML code alongside your experiments. Every execution automatically tracks the exact commit used, making your work fully reproducible.

## Why Git for Machine Learning?

Machine learning projects face a challenge that most software projects do not: a **duality between code and data**. Traditional software development is code-centric, and Git handles that well. ML also requires versioning your datasets, models, and metrics *alongside* your training scripts.

Valohai solves this by:

* **Code versioning through Git** – Your scripts, configurations, and dependencies
* **Data versioning through Valohai** – Your datasets, model artifacts, and experiment outputs
* **Automatic lineage tracking** – Every execution links code commits to data artifacts

This means you can trace an experiment from six months ago back to the exact commit and data artifacts it used, and rerun it with confidence.
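The link between code commits and data artifacts can be pictured as a small record per execution. The sketch below is purely illustrative: `ExecutionRecord`, `runs_for_commit`, the commit hashes, and the file names are all hypothetical and are not Valohai's data model or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExecutionRecord:
    """Hypothetical record tying one run to its code and data."""
    commit: str                     # Git commit hash the run used
    inputs: Tuple[str, ...] = ()    # data the run read
    outputs: Tuple[str, ...] = ()   # artifacts the run produced


def runs_for_commit(records: List[ExecutionRecord], commit: str) -> List[ExecutionRecord]:
    """Return every record that was produced from the given commit."""
    return [r for r in records if r.commit == commit]


# Made-up example data: two runs on the same dataset, different commits.
records = [
    ExecutionRecord("a1b2c3d", ("s3://bucket/train.csv",), ("model-v1.pkl",)),
    ExecutionRecord("e4f5a6b", ("s3://bucket/train.csv",), ("model-v2.pkl",)),
]
```

With records like these, answering "which code produced this model?" becomes a lookup rather than guesswork.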

## How Valohai Uses Git

When you connect a repository to a Valohai project:

1. **Valohai fetches your code** at the start of each execution
2. **Commit hash is recorded** automatically—no manual tracking needed
3. **You can run any commit** from the UI or API—great for comparing branches
4. **Code is ephemeral** on workers—no state persists between runs

Your repository should contain:

* Training and inference scripts
* `valohai.yaml` configuration
* Requirements files (e.g., `requirements.txt`, `environment.yml`)
* Documentation and READMEs

Your repository should **not** contain:

* Training datasets (use Valohai data stores instead)
* Trained models (Valohai versions these as outputs)
* Secrets or credentials (use environment variables)

## Repository Requirements

Valohai works with any Git provider and supports both public and private repositories:

**Supported providers:**

* GitHub, GitLab, Bitbucket, Azure DevOps
* Self-hosted Git servers (GitLab, Gitea, etc.)
* Any Git-compatible service with SSH or HTTPS access

**Size considerations:**

* Maximum compressed commit size: **1 GB**
* Commits over 100 MB may cause slow fetch times
* Use `.vhignore` to exclude large files from workers (similar to `.gitignore`)
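To spot files that are likely to push a commit past these limits, a local check along the following lines may help. This is a minimal sketch, not a Valohai tool: `find_large_files` is a hypothetical helper, and it measures uncompressed file sizes in the working tree, which is only a rough proxy for compressed commit size.

```python
import os


def find_large_files(root, threshold_bytes=100 * 1024 * 1024):
    """Return (relative_path, size) pairs for files larger than threshold_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's internal object store; it is not part of the working tree.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may point outside the repo
            size = os.path.getsize(path)
            if size > threshold_bytes:
                large.append((os.path.relpath(path, root), size))
    # Largest files first, so the worst offenders are easy to see.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the repository root lists candidates to move to data storage or exclude with `.vhignore`.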

## SSH Keys vs HTTPS

Valohai recommends **SSH keys** for private repositories:

**SSH (Recommended)**

* More secure—no passwords in URLs
* Works with deploy keys (read-only access)
* Standard for production environments

**HTTPS**

* Simpler for public repositories
* Requires embedding credentials in the URL for private repositories
* Less secure for private repos

For private repositories, see our provider-specific guides for setting up SSH keys.

## Best Practices

**Keep commits focused and small**

* One logical change per commit
* Clear commit messages help when debugging old experiments
* Avoid "WIP" or "fixes" as messages—be specific

**Don't commit large files**

* No datasets in Git—use Valohai's data versioning
* No model checkpoints—these are execution outputs
* No notebooks with outputs—clear cells before committing
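Clearing notebook outputs can be scripted. The sketch below relies only on the standard `.ipynb` JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the helper names `strip_outputs` and `strip_file` are hypothetical.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path):
    """Rewrite the notebook at `path` in place without outputs."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(strip_outputs(notebook), fh, indent=1)
        fh.write("\n")
```

Running `strip_file("analysis.ipynb")` before committing keeps embedded plots and data dumps out of Git history; the file name here is only an example.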

**Use .vhignore strategically**

* Exclude files Git tracks but workers don't need
* Useful for documentation, tests, or local configs
* Reduces fetch time and worker disk usage

**Separate data from code**

* Store datasets in cloud storage (S3, GCS, Azure Blob)
* Reference them as inputs in `valohai.yaml`
* Let Valohai handle data versioning and lineage
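Inside a training script, an input declared in `valohai.yaml` is then read from local disk rather than from the repository. The sketch below assumes inputs are placed under a per-input directory beneath the path given by the `VH_INPUTS_DIR` environment variable, falling back to `/valohai/inputs`; verify both against your own setup. The input name `dataset` and file name `train.csv` are hypothetical.

```python
import os


def input_path(name, filename, inputs_dir=None):
    """Build the local path of a file belonging to a named input.

    Assumption: inputs live at <inputs_dir>/<input name>/<file name>.
    """
    base = inputs_dir or os.environ.get("VH_INPUTS_DIR", "/valohai/inputs")
    return os.path.join(base, name, filename)


# Example (hypothetical names): path to the training data of the "dataset" input.
train_csv = input_path("dataset", "train.csv")
```

Passing `inputs_dir` explicitly makes the same script runnable on a laptop, where the data sits in a local folder instead.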

**Fetch regularly**

* Click "Fetch Repository" after pushing new commits
* Valohai caches repository state—fetching updates it
* Automatic fetching is available via webhooks

## Common Pitfalls

**"My execution uses old code"**

* You forgot to fetch after pushing
* Solution: Always fetch after new commits

**"Fetch is slow or times out"**

* Your commit is too large (>100 MB)
* Solution: Use `.vhignore` or move large files to data storage

**"Can't connect my private repo"**

* SSH key not added correctly
* Solution: Follow provider-specific guides—especially the `\n` formatting for keys

**"My secrets are exposed in logs"**

* Credentials hardcoded in scripts
* Solution: Use Valohai environment variables (marked as secrets)
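On the script side, this means reading credentials from the process environment instead of writing them into the code. A minimal sketch using only the Python standard library follows; `get_secret`, `mask`, and the variable name `DB_PASSWORD` are hypothetical.

```python
import os


def get_secret(name):
    """Read a required secret from the environment, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask(value, visible=4):
    """Hide all but the last `visible` characters, for safe logging."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Usage (assumes DB_PASSWORD was configured as an environment variable):
#   password = get_secret("DB_PASSWORD")
#   print("Using password", mask(password))
```

Logging only the masked form keeps the full value out of execution logs even when debugging connection problems.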

## Next Steps

**New to Valohai?** Start with connecting a public repository to see how it works:

* [Connect Your Repository](/git-integration/connect-repo.md)

**Using private repositories?** Follow your Git provider's guide:

* [GitHub Private Repos](/git-integration/private-repositories/github.md)
* [GitLab Private Repos](/git-integration/private-repositories/gitlab.md)
* [Azure DevOps Private Repos](/git-integration/private-repositories/azure-devops.md)
* [Bitbucket Private Repos](/git-integration/private-repositories/bitbucket.md)

**Advanced Git workflows:**

* [Git Submodules](/git-integration/advanced-topics/submodules.md) – Access multiple repos
* [Clone During Execution](/git-integration/advanced-topics/clone-during-execution.md) – Fetch additional repos at runtime

