# Image Best Practices

Follow these guidelines to build efficient, reproducible Docker images for Valohai.

## Use specific version tags

Always pin versions for reproducibility.

**Good:**

```dockerfile
FROM python:3.11.4
RUN pip install tensorflow==2.13.0
```

**Avoid:**

```dockerfile
FROM python:latest
RUN pip install tensorflow
```

Why? `latest` tags change over time. Six months from now, `latest` might be Python 3.13 with breaking changes. Pinned versions ensure your executions stay reproducible.

## Start with minimal base images

Smaller images download faster and use less disk space.

**Good choices:**

* `python:3.11-slim` (smaller than `python:3.11`)
* `nvidia/cuda:12.1.0-base-ubuntu22.04` (only CUDA runtime, not full SDK)
* `alpine` variants when compatible (note: Alpine uses musl libc, so many prebuilt Python wheels won't install)

**Compare sizes:**

* `python:3.11` → ~1.0 GB
* `python:3.11-slim` → ~130 MB

For GPU workloads, use NVIDIA's official base images to ensure CUDA compatibility.
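
Putting both ideas together, a GPU image can start from NVIDIA's runtime-only base and add a pinned framework on top. A minimal sketch (the CUDA tag and TensorFlow version here are illustrative; match them to your framework's compatibility matrix):

```dockerfile
# Runtime-only CUDA base: much smaller than the -devel variants
FROM nvidia/cuda:12.1.0-base-ubuntu22.04

# Install Python and pip from the Ubuntu repositories
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Pin the framework version for reproducibility
RUN pip3 install --no-cache-dir tensorflow==2.13.0
```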

## Leverage Docker layer caching

Docker builds images in layers. Each instruction in your Dockerfile creates a layer that can be cached.

**Order matters:**

```dockerfile
FROM python:3.11-slim

# 1. Install system dependencies (changes rarely)
RUN apt-get update && apt-get install -y \
    git \
    libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*

# 2. Copy requirements first (changes occasionally)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 3. Copy code last (changes frequently)
COPY . /workspace
WORKDIR /workspace
```

When you change your code, only the last layer rebuilds; requirements and system packages stay cached. (On Valohai, you can usually skip step 3 entirely, since code is fetched from Git at execution time; see the next section.)

## Don't include code or data in the image

Your Docker image should only contain the environment, not your code or data.

**Your image:**

* Python runtime
* System libraries
* Python packages

**Not your image:**

* Training scripts (come from Git)
* Datasets (come from data stores)
* Model files (generated during execution)

Why? Separating code from environment makes images reusable and keeps them small.
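
Following this separation, an environment-only Dockerfile might look like the sketch below (package names are illustrative):

```dockerfile
FROM python:3.11-slim

# System libraries the environment needs
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*

# Pinned Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Note: no COPY of training scripts or data. Valohai fetches code
# from Git and mounts data from your data stores at execution time.
```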

## Pin Python package versions

Use a `requirements.txt` with exact versions:

```
tensorflow==2.13.0
transformers==4.30.2
numpy==1.24.3
```

Avoid version ranges like `tensorflow>=2.0` in production images. Ranges are fine for experimentation, but pinned versions ensure reproducibility.
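
To catch unpinned requirements before an image ships, a small check like the following can run in CI. This is a hypothetical helper, not part of Valohai:

```python
import re

# Matches "package==1.2.3", optionally with extras like "package[extra]==1.2.3"
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==\S+$")

def unpinned_requirements(text):
    """Return requirement lines that are not pinned to an exact version."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if not PINNED.match(line):
            bad.append(line)
    return bad

reqs = "tensorflow==2.13.0\nnumpy>=1.24\n# comment\ntransformers==4.30.2\n"
print(unpinned_requirements(reqs))  # → ['numpy>=1.24']
```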

## Cache control in Valohai

Valohai caches Docker images on worker machines by default. This means the first execution downloads the image, and subsequent executions reuse the cached version.

### Force a fresh image pull

If you've updated an image in your registry (using the same tag), force Valohai to pull the latest version:

Set the environment variable **`VH_NO_IMAGE_CACHE=1`** on your execution.

This ignores the cached image and pulls fresh from the registry.
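
If you define executions in `valohai.yaml`, the variable can be set on the step itself. A sketch, where the step name and image are placeholders:

```yaml
- step:
    name: train-model
    image: myorg/ml-base:1.2.0
    command: python train.py
    environment-variables:
      - name: VH_NO_IMAGE_CACHE
        default: "1"
```

You can also set the variable in the web UI when creating an execution.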

### Clear all caches

To clear both image and data caches from a worker machine:

Set **`VH_CLEAN=1`** on your execution.

This forcibly removes all Docker images and cached data before and after execution. Use sparingly, as it adds significant time to each execution.

### When to use cache controls

* `VH_NO_IMAGE_CACHE=1` → You pushed a new version with the same tag (not recommended, but sometimes necessary)
* `VH_CLEAN=1` → Debugging disk space issues or testing fresh environments

For normal workflows, let Valohai's default caching work. It's fast and efficient.

## Speed up image downloads with a pull-through cache

If you frequently build or pull large Docker images, a pull-through cache can significantly reduce download times.

### When to use this

Consider a pull-through cache if:

* You build or update Docker images frequently
* Download speeds are slow or you hit timeouts
* You want to reduce bandwidth costs

### How it works

Valohai sets up a caching server in your VPC. When workers pull images, they first check the cache. If the image exists, it's served locally (fast). If not, it's fetched once and cached for future use.

### Setup

This requires a dedicated machine in your VPC and network configuration to route traffic through the cache.

Contact `support@valohai.com` to set up a pull-through cache for your organization.

## Custom container runtime options

Valohai controls the `docker run` command and its arguments. This ensures executions work consistently across environments.

**You cannot:**

* Pass custom `docker run` flags
* Override the entrypoint Valohai sets
* Modify container networking or volume mounts

**You can:**

* Use any Docker image
* Pass parameters to your code
* Set environment variables
  * See [system environment variables](https://docs.valohai.com/executions/system-environment-variables)
* Mount data from your data stores

This design keeps infrastructure management outside your containers, so you focus on code, not configuration.

If you have a use case requiring custom `docker run` arguments, contact our support team to discuss alternatives.

## Building images without Docker installed

You don't need Docker installed locally to build images. Use the [Docker Image Builder](https://docs.valohai.com/reusable-step-libraries/build-your-own-library/docker-image-builder) from Valohai's [Reusable Step Libraries](https://docs.valohai.com/reusable-step-libraries).

This library step:

* Takes your Dockerfile as input
* Builds the image on Valohai infrastructure
* Pushes to your registry

Perfect for teams without Docker experience or for CI/CD pipelines.

## Summary

**Fastest path:**

* Start with `python:3.11-slim` or similar
* Pin all versions
* Install extra packages at runtime while iterating (e.g., `pip install` at the start of your command)
* Build a custom image once dependencies stabilize

**Production-ready:**

* Use multi-stage builds
* Clean up in the same layer
* Leverage layer caching
* Never include code or data in the image

**Troubleshooting:**

* When launching an execution
  * Use `VH_NO_IMAGE_CACHE=1` to pull fresh images
  * Use `VH_CLEAN=1` to clear all caches (rarely needed)
* Contact support for pull-through cache setup
