Image Best Practices

Follow these guidelines to build efficient, reproducible Docker images for Valohai.

Use specific version tags

Always pin versions for reproducibility.

Good:

FROM python:3.11.4
RUN pip install tensorflow==2.13.0

Avoid:

FROM python:latest
RUN pip install tensorflow

Why? latest tags change over time. Six months from now, latest might be Python 3.13 with breaking changes. Pinned versions ensure your executions stay reproducible.

Start with minimal base images

Smaller images download faster and use less disk space.

Good choices:

  • python:3.11-slim (smaller than python:3.11)

  • nvidia/cuda:12.1.0-base-ubuntu22.04 (only CUDA runtime, not full SDK)

  • alpine variants when compatible

Compare sizes:

  • python:3.11 → 1.0 GB

  • python:3.11-slim → 130 MB

For GPU workloads, use NVIDIA's official base images to ensure CUDA compatibility.
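
For example, a GPU image can start from the CUDA base variant and add only a system Python plus pinned frameworks. This is a minimal sketch; the tag and versions are illustrative, so check your framework's CUDA support matrix before pinning.

FROM nvidia/cuda:12.1.0-base-ubuntu22.04

# System Python and pip from Ubuntu's repositories
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Pin a framework build that matches the CUDA version of the base image
# (example version only, not a recommendation)
RUN pip3 install --no-cache-dir torch==2.1.0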

Leverage Docker layer caching

Docker builds images in layers. Each instruction in your Dockerfile creates a layer that can be cached.

Order matters:

FROM python:3.11-slim

# 1. Install system dependencies (changes rarely)
RUN apt-get update && apt-get install -y \
    git \
    libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*

# 2. Copy requirements first (changes occasionally)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 3. Copy code last (changes frequently)
COPY . /workspace
WORKDIR /workspace

When you change your code, only the last layer rebuilds. Requirements and system packages stay cached.
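
A related habit: because COPY . /workspace invalidates its layer whenever any file in the build context changes, a .dockerignore keeps data, caches, and Git metadata out of that layer. The entries below are common examples; adjust them to your project.

# .dockerignore (example entries)
.git
__pycache__/
*.pyc
.venv/
data/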

Don't include code or data in the image

Your Docker image should only contain the environment, not your code or data.

Your image:

  • Python runtime

  • System libraries

  • Python packages

Not your image:

  • Training scripts (fetched from Git)

  • Datasets (fetched from your data stores)

  • Model files (generated during execution)

Why? Separating code from environment makes images reusable and keeps them small.
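
In Valohai terms, the split usually shows up in valohai.yaml: the step references an environment-only image, the command runs code fetched from your linked Git repository, and data arrives as inputs. This is a sketch; the image name, input name, and URL below are placeholders.

- step:
    name: train-model
    # Environment-only image: runtime, system libraries, Python packages
    image: myregistry.example.com/ml-env:1.2.0
    # Code comes from your linked Git repository at execution time
    command:
      - python train.py
    # Data is pulled from your data store, not baked into the image
    inputs:
      - name: training-data
        default: s3://example-bucket/datasets/train.csv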

Pin Python package versions

Use a requirements.txt with exact versions:

tensorflow==2.13.0
transformers==4.30.2
numpy==1.24.3

Avoid version ranges like tensorflow>=2.0 in production images. Ranges are fine for experimentation, but pinned versions ensure reproducibility.
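
If you already have a working environment, pip freeze captures the exact versions installed in it; trim the output to your direct dependencies if you want a shorter file.

pip freeze > requirements.txt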

Cache control in Valohai

Valohai caches Docker images on worker machines by default. This means the first execution downloads the image, and subsequent executions reuse the cached version.

Force a fresh image pull

If you've updated an image in your registry (using the same tag), force Valohai to pull the latest version:

Set the environment variable VH_NO_IMAGE_CACHE=1 on your execution.

This ignores the cached image and pulls fresh from the registry.
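
One way to set it is in the step definition in valohai.yaml, so every execution of that step pulls a fresh image; you can also add the variable to an individual execution when launching it. This is a sketch, and the step name and image are placeholders. VH_CLEAN, described next, is set the same way.

- step:
    name: train-model
    image: myregistry.example.com/ml-env:1.2.0
    command:
      - python train.py
    environment-variables:
      # Skip the cached image and pull fresh from the registry
      - name: VH_NO_IMAGE_CACHE
        default: "1"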

Clear all caches

To clear both image and data caches from a worker machine:

Set VH_CLEAN=1 on your execution.

This forcibly removes all Docker images and cached data before and after execution. Use sparingly—it adds significant time.

When to use cache controls

  • VH_NO_IMAGE_CACHE=1 → You pushed a new version with the same tag (not recommended, but sometimes necessary)

  • VH_CLEAN=1 → Debugging disk space issues or testing fresh environments

For normal workflows, let Valohai's default caching work. It's fast and efficient.

Speed up image downloads with a pull-through cache

If you frequently build or pull large Docker images, a pull-through cache can significantly reduce download times.

When to use this

Consider a pull-through cache if:

  • You build or update Docker images frequently

  • Download speeds are slow or you hit timeouts

  • You want to reduce bandwidth costs

How it works

Valohai sets up a caching server in your VPC. When workers pull images, they first check the cache. If the image exists, it's served locally (fast). If not, it's fetched once and cached for future use.

Setup

This requires a dedicated machine in your VPC and network configuration to route traffic through the cache.

Contact [email protected] to set up a pull-through cache for your organization.

Custom container runtime options

Valohai controls the docker run command and its arguments. This ensures executions work consistently across environments.

You cannot:

  • Pass custom docker run flags

  • Override the entrypoint Valohai sets

  • Modify container networking or volume mounts

You can:

  • Choose the Docker image each step runs on

  • Define the commands that run inside the container

  • Set environment variables on your executions

This design keeps infrastructure management outside your containers, so you focus on code, not configuration.

If you have a use case requiring custom docker run arguments, contact our support team to discuss alternatives.

Building images without Docker installed

You don't need Docker installed locally to build images. Use the Docker Image Builder from Valohai's Reusable Step Libraries.

This library step:

  • Takes your Dockerfile as input

  • Builds the image on Valohai infrastructure

  • Pushes to your registry

Perfect for teams without Docker experience or for CI/CD pipelines.

Summary

Fastest path:

  • Start with python:3.11-slim or similar

  • Pin all versions

  • Install extra packages in your execution commands while iterating

  • Build a custom image once dependencies stabilize

Production-ready:

  • Use multi-stage builds (see the sketch after this list)

  • Clean up in the same layer

  • Leverage layer caching

  • Never include code or data in the image
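
A minimal sketch of the multi-stage pattern mentioned above: compilers and headers stay in the first stage, and the final image keeps only the installed packages. Versions and paths are illustrative.

# Stage 1: build wheels with the build toolchain available
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: runtime image without the build toolchain
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels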

Troubleshooting:

  • When launching an execution:

    • Use VH_NO_IMAGE_CACHE=1 to pull fresh images

    • Use VH_CLEAN=1 to clear all caches (rarely needed)

  • Contact support for pull-through cache setup
