Linux Workers

Deploy Valohai workers on your on-premises Linux servers to run machine learning jobs on your own hardware.

Overview

The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:

  • Use your own on-premises machines to run machine learning jobs

  • Use your own cloud storage for storing training artifacts (trained models, preprocessed datasets, visualizations)

  • Mount local data to your on-premises workers

  • Access databases and data warehouses directly from workers inside your network

Valohai doesn't have direct access to the on-premises machines that execute ML jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that stores the job queue, job states, and short-term logs.

Prerequisites

Server requirements:

  • Linux server (Ubuntu 24.04 recommended)

  • Python 3.10+ installed

  • For GPU workloads: NVIDIA drivers and NVIDIA Container Toolkit installed

From Valohai:

Contact Valohai support to receive:

  • queue-name - Name for this worker or group of workers

  • queue-address - Address of your job queue

  • redis-password - Password for the queue

  • url - Download URL for the Valohai worker installer

Network requirements:

  • Worker can connect to the queue machine on port 63790

  • Worker can access your object storage (S3, Azure Blob, GCS)

  • Optional: Outbound internet access for pulling Docker images
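The first network requirement can be checked from the worker before installing anything. The sketch below uses Bash's built-in /dev/tcp redirection, so it needs no extra tools. The host name is a hypothetical placeholder and check_port is a helper defined here for illustration; only the port number comes from this guide.

```shell
# Hypothetical placeholder; substitute the queue-address provided by Valohai.
QUEUE_ADDRESS="${QUEUE_ADDRESS:-queue.example.internal}"
QUEUE_PORT=63790

# Return 0 if a TCP connection to host:port opens within 5 seconds.
check_port() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if check_port "$QUEUE_ADDRESS" "$QUEUE_PORT"; then
  echo "OK: $QUEUE_ADDRESS:$QUEUE_PORT is reachable"
else
  echo "FAIL: cannot reach $QUEUE_ADDRESS:$QUEUE_PORT (check DNS, routing, and firewall rules)"
fi
```

A successful connection only shows that the port is open. It does not validate the Redis password.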

Understanding Queue Names

The queue name identifies this worker or group of workers in Valohai.

Examples:

  • myorg-onprem-1

  • myorg-onprem-machine-name

  • myorg-onprem-gpus

  • myorg-onprem-gpus-prod

Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.

Installation Methods

Choose your installation method based on your operating system and preferences.

Ubuntu Installer

Automated installer for Ubuntu systems.

What it installs:

  • Valohai agent (Peon)

  • Docker (if not already installed)

  • NVIDIA Container Toolkit (if needed for GPU workloads)

  • System service configuration

Installation:

Replace the placeholder values with the information from Valohai.

After installation, the Valohai agent will start automatically and begin pulling jobs from the queue.

💡 Setting PEON_WARDEN_ENABLED=true enables monitoring of the agent's status in the Valohai UI. To disable it, set the value to false.

Manual Installation

For non-Ubuntu systems or custom configurations.

Manual Installation Steps

Step 1: Install Dependencies

Python 3.10+

Verify Python is installed:
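A minimal check, assuming the interpreter is available as python3 on the PATH. The PY_STATUS variable is only an illustration.

```shell
# Show the interpreter version found on PATH.
python3 --version

# The exit status is 0 only when the interpreter is 3.10 or newer.
if python3 -c 'import sys; sys.exit(sys.version_info < (3, 10))'; then
  PY_STATUS="ok"
else
  PY_STATUS="too-old-or-missing"
fi
echo "Python check: $PY_STATUS"
```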

Docker

Install Docker for your Linux distribution. Visit the Docker installation guide and select your distribution.

NVIDIA Drivers (GPU only)

If using GPUs, install NVIDIA drivers appropriate for your GPU model.

Verify installation:

NVIDIA Container Toolkit (GPU only)

Install NVIDIA Container Toolkit to enable GPU access in containers.

Follow the NVIDIA documentation for your distribution.

Verify it works:

Step 2: Download and Install Peon

Download the Peon agent using the URL provided by Valohai. We recommend installing the agent inside a virtual environment.

Replace <URL> with the download URL from Valohai.

⚠️ The distutils module was removed in Python 3.12, which is the default Python on Ubuntu 24.04. Because some Valohai components depend on it, install setuptools inside the virtual environment. setuptools ships a compatible distutils replacement.
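The virtual environment and the setuptools workaround can be prepared as sketched below. The directory name is an assumption, not a Valohai convention. The Peon install command itself is left out because it depends on the package format behind your download URL.

```shell
# Assumed location for the virtual environment; any writable path works.
VENV_DIR="${VENV_DIR:-$HOME/peon-venv}"

# Create the environment (on Ubuntu this may require the python3-venv package).
python3 -m venv "$VENV_DIR" \
  || echo "venv creation failed; is python3-venv installed?" >&2

# setuptools supplies a distutils replacement on Python 3.12 and newer.
"$VENV_DIR/bin/python" -m pip install --upgrade pip setuptools \
  || echo "pip install failed; check network access" >&2

# Next, install Peon with "$VENV_DIR/bin/python -m pip" so that the
# valohai-peon binary ends up in "$VENV_DIR/bin".
```

Calling the environment's own interpreter avoids depending on an activated shell. This matters later, because systemd does not activate virtual environments.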

Step 3: Configure Peon

Create the configuration file /etc/peon.config:

Configuration values:

Replace these placeholders:

  • <queue-name> - Your queue name from Valohai

  • <redis-password> - Redis password from Valohai (stored in your cloud Secret Manager)

  • <queue-address> - Queue address from Valohai

Step 4: Create Systemd Service

Create the service file /etc/systemd/system/peon.service:

Important: Update these values:

  • ExecStart - Path to valohai-peon binary (use sudo find / -iname "valohai-peon" -print to find it)

    • If you installed Peon inside a virtual environment, the binary should be there

    • Common locations for global installation: /home/valohai/.local/bin/valohai-peon or /usr/local/bin/valohai-peon

  • User - Linux user that will run the service

  • Group - Linux group for the user
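For orientation, a unit file for this step might look like the skeleton below. It is a sketch: only EnvironmentFile, ExecStart, User, and Group are named in this guide. The description, ordering, and restart policy are assumptions, and the ExecStart path and account name are examples. If Valohai supplied a unit file, prefer that one. The snippet writes a draft to the current directory so you can review it before installing.

```shell
# Draft the unit locally; nothing is written to /etc at this point.
cat > peon.service <<'EOF'
[Unit]
Description=Valohai Peon agent
After=network-online.target docker.service
Wants=network-online.target

[Service]
# Settings from Step 3 are read from this file.
EnvironmentFile=/etc/peon.config
# Example path: replace with the location of your valohai-peon binary.
ExecStart=/usr/local/bin/valohai-peon
# Example account: replace with the user and group that run Peon.
User=valohai
Group=valohai
# Assumption: restart the agent if it exits with an error.
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# After review, copy the draft into place (requires root):
#   sudo install -m 644 peon.service /etc/systemd/system/peon.service
```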

Step 5: Create Cleanup Service

Create /etc/systemd/system/peon-clean.service:

Update ExecStart, User, and Group as needed.

Step 6: Create Cleanup Timer

Create /etc/systemd/system/peon-clean.timer:

This runs the cleanup service every 10 minutes to remove stale caches and Docker images.
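A timer with a 10-minute interval could be drafted as follows. It uses only standard systemd timer directives. The description text and the choice of OnBootSec/OnUnitActiveSec (rather than OnCalendar) are assumptions.

```shell
# Draft the timer locally; copy it to /etc/systemd/system/ after review.
cat > peon-clean.timer <<'EOF'
[Unit]
Description=Run the Valohai Peon cleanup periodically

[Timer]
# First run 10 minutes after boot, then 10 minutes after the
# cleanup service was last started.
OnBootSec=10min
OnUnitActiveSec=10min
# Optional: a timer activates the service with the same base name by default.
Unit=peon-clean.service

[Install]
WantedBy=timers.target
EOF
```

Because the timer triggers the service, enable and start peon-clean.timer rather than peon-clean.service.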

Step 7: Create Warden Service

Warden is a component that allows monitoring the Peon status in the Valohai UI.

Create /etc/systemd/system/peon-warden.service:

Step 8: Grant Docker Permissions

The user running Peon needs permissions to control Docker:

Replace <User> with the user from your service files (e.g., valohai).
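On most distributions, Docker access is granted through membership in the docker group. The sketch below checks membership and prints the command to run if it is missing. The in_docker_group helper and the default account name are illustrative.

```shell
# Assumed account name; set PEON_USER to the user from your service files.
PEON_USER="${PEON_USER:-valohai}"

# Return 0 if the given user belongs to the docker group.
in_docker_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx docker
}

if in_docker_group "$PEON_USER"; then
  echo "$PEON_USER is already in the docker group"
else
  echo "Run: sudo usermod -aG docker $PEON_USER"
  echo "Then restart the Peon service so the new group membership applies."
fi
```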

Step 9: Start Services

Reload systemd to recognize the new service files:

Start the Peon service:

Check that services are running:

Step 10: Enable Auto-Start

Enable services to start automatically on boot:
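Steps 9 and 10 can be sketched as one sequence. The unit names match the files created in Steps 4 to 7. Drop peon-warden.service if you skipped Step 7. The run helper is an illustration that only prints each command unless APPLY=1 is set, so you can review the sequence first.

```shell
# Print each command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# Step 9: reload unit files, start the services, and check their state.
run sudo systemctl daemon-reload
run sudo systemctl start peon.service peon-clean.timer peon-warden.service
run systemctl --no-pager status peon.service peon-clean.timer peon-warden.service

# Step 10: start the same units automatically on boot.
run sudo systemctl enable peon.service peon-clean.timer peon-warden.service
```

Save the snippet as a script and run it with APPLY=1 to execute the commands, or type the printed commands by hand.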

Troubleshooting Service Start

For a global installation, if services fail to start, try using the full Python module path in ExecStart:

Use this in both peon.service and peon-clean.service files if needed.

Multi-GPU Configuration

If your server has multiple GPUs, you can configure Valohai to either:

  1. Use all GPUs for a single job

  2. Run multiple jobs in parallel, each with access to one GPU

Note: You can only choose one option at a time.

Setup Multiple Peon Instances

Follow these steps to run multiple Peon instances, one per GPU.

Prerequisites:

  • Valohai worker already installed (Ubuntu installer or manual installation)

  • Multiple GPUs available on the server

Steps:

1. Stop the original Peon service

2. Rename the service file

3. Edit the service file

Open /etc/systemd/system/peon@.service and add these lines in the [Service] section:

Add these after the EnvironmentFile=/etc/peon.config line.

Replace UUID with a generated UUID. You can generate one at uuidgenerator.net.
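If the server has no browser access, a UUID can also be generated locally. This sketch uses the Python standard library, which is already a prerequisite. NEW_UUID is just an illustrative variable name.

```shell
# Generate a random (version 4) UUID.
NEW_UUID="$(python3 -c 'import uuid; print(uuid.uuid4())')"
echo "$NEW_UUID"
```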

Complete example:

4. Reload systemd

5. Enable and start Peon instances

Start one instance per GPU. For a server with 4 GPUs:

The number (@0, @1, etc.) corresponds to the GPU index.

6. Verify instances are running

Each instance should show as active (running).

Important: Make sure you have disabled the original peon service. Otherwise, too many Peon instances will compete for resources: one trying to use all GPUs and the others using one GPU each.
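The whole procedure, except for the manual edit in step 3, can be sketched as a script. It assumes the original unit is /etc/systemd/system/peon.service and that the template is named peon@.service. GPU_COUNT defaults to the 4-GPU example above. The run and start_instances helpers are illustrative, and commands are only printed unless APPLY=1 is set.

```shell
GPU_COUNT="${GPU_COUNT:-4}"   # number of GPUs, one Peon instance each

# Print each command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# Steps 1-2: stop and disable the original service, then rename its unit
# file into a systemd template.
run sudo systemctl stop peon
run sudo systemctl disable peon
run sudo mv /etc/systemd/system/peon.service /etc/systemd/system/peon@.service

# Step 3 (editing the [Service] section) is done by hand before continuing.

# Step 4: make systemd pick up the renamed unit.
run sudo systemctl daemon-reload

# Step 5: enable and start instances peon@0 .. peon@(N-1).
start_instances() {
  i=0
  while [ "$i" -lt "$1" ]; do
    run sudo systemctl enable --now "peon@$i"
    i=$((i + 1))
  done
}
start_instances "$GPU_COUNT"

# Step 6: list the state of all instances.
run systemctl --no-pager status "peon@*"
```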

Troubleshooting

Worker Not Connecting

Check Peon service status:

View Peon logs:

Common issues:

  • Incorrect Redis password

  • Queue address unreachable

  • Network firewall blocking port 63790

  • Missing environment variables in configuration

Docker Permission Errors

If you see "permission denied" errors when running Docker:

NVIDIA GPU Issues

Verify NVIDIA drivers:

Test GPU access in Docker:

Check NVIDIA Container Toolkit installation:

If GPU access doesn't work, verify that the NVIDIA Container Toolkit is properly installed and configured (see Manual Installation Step 1).
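The three checks above can be run in order with a short script. It assumes the standard NVIDIA tools (nvidia-smi from the driver, nvidia-ctk from the Container Toolkit) and uses the public ubuntu image for the container test. The have helper is illustrative.

```shell
# Return 0 if a command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# 1. Driver: nvidia-smi should list every GPU.
if have nvidia-smi; then
  nvidia-smi
else
  echo "nvidia-smi not found: install the NVIDIA driver first"
fi

# 2. Container Toolkit: its CLI should be installed.
if have nvidia-ctk; then
  nvidia-ctk --version
else
  echo "nvidia-ctk not found: install the NVIDIA Container Toolkit"
fi

# 3. GPU access inside a container, attempted only if the pieces above exist.
#    Prefix with sudo if your user is not in the docker group.
if have docker && have nvidia-smi && have nvidia-ctk; then
  docker run --rm --gpus all ubuntu nvidia-smi
fi
```

If step 1 works but step 3 fails, the problem is usually in the Docker runtime configuration rather than in the driver.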

Jobs Not Starting

Check logs:

Look for errors related to:

  • Docker image pull failures

  • Network connectivity issues

  • Storage access problems

Verify Redis connection:

No Jobs Running or Service Stuck

Restart the Peon service:

Check for recent logs:

High Disk Usage

The Peon cleanup service should automatically remove old caches and Docker images.

Verify cleanup timer is running:

Manually trigger cleanup:

Check Docker disk usage:

Manual cleanup:

Collecting Logs for Support

If you need to contact Valohai support, collect logs:
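One way to produce the log file is to export the Peon unit's entries from the systemd journal. This is a sketch: the 24-hour window is an arbitrary choice, and on multi-GPU setups the unit pattern would be 'peon@*' instead of peon.

```shell
# Export recent Peon logs from the systemd journal into one file.
sudo journalctl -u peon --since "24 hours ago" --no-pager > peon-logs.txt 2>&1 \
  || echo "could not read the journal; check the unit name and permissions" >&2

# Quick size check before attaching the file.
wc -l peon-logs.txt
```

Review the file before sending it, in case job output contains anything sensitive.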

Send peon-logs.txt to Valohai support with:

  • Description of the issue

  • Queue name

  • Server specifications (CPU, RAM, GPU)

  • When the issue started

Getting Help

Valohai Support: [email protected]

Include in support requests:

  • Operating system and version

  • Python version

  • Docker version

  • GPU model (if applicable)

  • Peon logs (see "Collecting Logs for Support" above)

  • Description of the issue and when it started
