Linux Workers

Install Valohai workers on your on-premises Linux servers for ML workloads

Deploy Valohai workers on your on-premises Linux servers to run machine learning jobs on your own hardware.

Overview

The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:

Use your own on-premises machines to run machine learning jobs
Use your own cloud storage for storing training artifacts (trained models, preprocessed datasets, visualizations)
Mount local data to your on-premises workers
Access databases and data warehouses directly from workers inside your network

Valohai doesn't have direct access to the on-premises machines that execute ML jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that stores the job queue, job states, and short-term logs.

Prerequisites

System requirements:

Linux server (Ubuntu 24.04 recommended)
Python 3.10+ installed
For GPU workloads: NVIDIA drivers and NVIDIA Docker installed

From Valohai:

Contact Valohai support to receive:

queue-name - Name for this worker or group of workers
queue-address - Address of your job queue
redis-password - Password for the queue
url - Download URL for the Valohai worker installer

Network requirements:

Worker can connect to the queue machine on port 63790
Worker can access your object storage (S3, Azure Blob, GCS)
Optional: Outbound internet access for pulling Docker images
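The first network requirement can be checked from the worker before installing anything. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the worker; queue_reachable is a hypothetical helper name, not part of the Valohai tooling. It only confirms that a TCP connection can be opened. It does not verify TLS or the Redis password.

```shell
# Hypothetical helper: check that a TCP connection to the queue machine opens.
# Assumes bash (for /dev/tcp) and coreutils `timeout` exist on the worker.
queue_reachable() {
  local host port
  host="${1:-}"
  port="${2:-63790}"
  if [ -z "$host" ]; then
    echo "usage: queue_reachable <queue-address> [port]" >&2
    return 2
  fi
  # Give up after 5 seconds so a silently dropping firewall does not hang the check.
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port" >&2
    return 1
  fi
}
```

Call it as queue_reachable <queue-address> with the address you received from Valohai. A non-zero exit status suggests a routing or firewall problem between the worker and the queue machine.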

Understanding Queue Names

The queue name identifies this worker or group of workers in Valohai.

Examples:

myorg-onprem-1
myorg-onprem-machine-name
myorg-onprem-gpus
myorg-onprem-gpus-prod

Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.

Installation Methods

Choose your installation method based on your operating system and preferences.

Ubuntu Installer (Recommended)

Automated installer for Ubuntu systems.

What it installs:

Valohai agent (Peon)
Docker (if not already installed)
NVIDIA Docker (if needed for GPU workloads)
System service configuration

Warning: Only use on fresh, dedicated machines. This will reinstall Docker and nvidia-docker, breaking any existing container workloads. Follow the manual installation steps if you want more control.

Installation:

sudo su
apt-get update -y && apt-get install -y python3 python3-distutils

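TEMPDIR=$(mktemp -d)
pushd "$TEMPDIR"

# Single quotes keep characters such as & or $ in the values from being interpreted by the shell
export NAME='<queue-name>'
export QUEUE_ADDRESS='<queue-address>'
export PASSWORD='<redis-password>'
export URL='<url>'

curl "$URL" --output bup.pex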
chmod u+x bup.pex
env "CLOUD=none" "ALLOW_MOUNTS=true" "INSTALLATION_TYPE=private-worker" "REDIS_URL=rediss://:$PASSWORD@$QUEUE_ADDRESS:63790" 'PEON_EXTRA_CONFIG={"ALLOW_MOUNTS":"true"}' "QUEUES=$NAME" ./bup.pex

popd

Replace the placeholder values with the information from Valohai.

After installation, the Valohai agent will start automatically and begin pulling jobs from the queue.
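Before running the installer, it can help to confirm that every placeholder was actually replaced. The sketch below is an illustration only: missing_vars and build_redis_url are hypothetical helper names. The URL format mirrors the REDIS_URL value in the installer command above, and it assumes the password contains no characters that would need URL encoding.

```shell
# Hypothetical helpers for sanity-checking the installer inputs.

missing_vars() {
  # Print the name of every listed variable that is unset or empty.
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup that also works in plain sh
    [ -n "$value" ] || echo "$name"
  done
}

build_redis_url() {
  # Arguments: Redis password, queue address.
  # Prints the URL in the same shape the installer command uses.
  printf 'rediss://:%s@%s:63790\n' "$1" "$2"
}
```

For example, missing_vars NAME QUEUE_ADDRESS PASSWORD URL prints nothing when all four values are set, and build_redis_url "$PASSWORD" "$QUEUE_ADDRESS" shows the URL that Peon will be given.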

Manual Installation

For non-Ubuntu systems or custom configurations.

Manual Installation Steps

Step 1: Install Dependencies

Python 3.10+

Verify Python is installed:

python3 --version
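To compare the reported version against the 3.10 minimum in a script, a small helper like the one below may be useful. This is a sketch: version_at_least is a hypothetical name, and it assumes a sort command that supports the -V (version sort) option, as GNU coreutils does.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to version $2.
version_at_least() {
  local have want lowest
  have="$1"
  want="$2"
  # Version-sort both values; if the required version sorts first, the check passes.
  lowest="$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}
```

One way to use it: version_at_least "$(python3 --version | awk '{print $2}')" 3.10 && echo "Python is new enough".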

Docker

Install Docker for your Linux distribution. Visit the Docker installation guide and select your distribution.

NVIDIA Drivers (GPU only)

If using GPUs, install NVIDIA drivers appropriate for your GPU model.

Verify installation:

nvidia-smi

NVIDIA Docker (GPU only)

Install NVIDIA Container Toolkit to enable GPU access in containers.

Follow the NVIDIA documentation for your distribution.

nvidia-docker wrapper script:

Peon expects to call either docker or nvidia-docker without arguments. It doesn't natively support docker --runtime=nvidia.

Create a wrapper script:

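sudo curl -fsSL https://raw.githubusercontent.com/NVIDIA/nvidia-docker/master/nvidia-docker -o /usr/local/bin/nvidia-docker
# Make the script executable for all users, including the user that runs Peon
sudo chmod a+x /usr/local/bin/nvidia-docker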

Verify it works:

nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi

Step 2: Download and Install Peon

Download the Peon agent using the URL provided by Valohai:

wget -O peon.tar '<URL>'
mkdir peon
tar -C peon/ -xvf peon.tar
pip install peon/*.whl

Replace <URL> with the download URL from Valohai.

Step 3: Configure Peon

Create the configuration file /etc/peon.config:

CLOUD=none
DOCKER_COMMAND=nvidia-docker
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true

Configuration values:

Replace these placeholders:

<queue-name> - Your queue name from Valohai
<redis-password> - Redis password from Valohai (stored in your cloud Secret Manager)
<queue-address> - Queue address from Valohai

DOCKER_COMMAND:

Use nvidia-docker for GPU machines
Use docker for CPU-only machines
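The configuration file can also be generated from the values you received, which avoids typos in the Redis URL. The following is a sketch under stated assumptions: render_peon_config is a hypothetical helper, it prints the same six keys shown above, and it defaults DOCKER_COMMAND to docker when no fourth argument is given.

```shell
# Hypothetical helper: print a Peon configuration to stdout.
# Arguments: queue name, Redis password, queue address, Docker command (optional).
render_peon_config() {
  local queue password address docker_cmd
  queue="$1"
  password="$2"
  address="$3"
  docker_cmd="${4:-docker}"   # CPU-only default; pass nvidia-docker on GPU machines
  cat <<EOF
CLOUD=none
DOCKER_COMMAND=$docker_cmd
INSTALLATION_TYPE=private-worker
QUEUES=$queue
REDIS_URL=rediss://:$password@$address:63790
ALLOW_MOUNTS=true
EOF
}
```

For example, render_peon_config '<queue-name>' '<redis-password>' '<queue-address>' nvidia-docker | sudo tee /etc/peon.config > /dev/null writes the file. Because the file contains the Redis password, you may also want to restrict it with sudo chmod 600 /etc/peon.config; that step is a suggestion and not something the steps above require.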

Step 4: Create Systemd Service

Create the service file /etc/systemd/system/peon.service:

[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure

[Install]
WantedBy=multi-user.target

Important: Update these values:

ExecStart - Path to valohai-peon binary (use which valohai-peon to find it)
- Common locations: /home/valohai/.local/bin/valohai-peon or /usr/local/bin/valohai-peon
User - Linux user that will run the service
Group - Linux group for the user
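If which valohai-peon finds nothing for your current user, you can probe the common locations directly. This is a minimal sketch; first_executable is a hypothetical helper name.

```shell
# Hypothetical helper: print the first argument that is an executable regular file.
first_executable() {
  local candidate
  for candidate in "$@"; do
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1   # none of the candidates exist as executables
}
```

For example, first_executable /home/valohai/.local/bin/valohai-peon /usr/local/bin/valohai-peon prints the path to use in ExecStart, or exits with a non-zero status if neither location contains the binary.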

Step 5: Create Cleanup Service

Create /etc/systemd/system/peon-clean.service:

[Unit]
Description=Valohai Peon Cleanup
After=network.target

[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon clean
User=valohai
Group=valohai

[Install]
WantedBy=multi-user.target

Update ExecStart, User, and Group as needed.

Step 6: Create Cleanup Timer

Create /etc/systemd/system/peon-clean.timer:

[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service

[Timer]
# Every 10 minutes
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target

This runs the cleanup service every 10 minutes to remove stale caches and Docker images.

Step 7: Grant Docker Permissions

The user running Peon needs permissions to control Docker:

sudo usermod -aG docker <User>

Replace <User> with the user from your service files (e.g., valohai).

Step 8: Start Services

Reload systemd to recognize the new unit files:

sudo systemctl daemon-reload

Start the Peon service, run the cleanup once, and start the cleanup timer:

sudo systemctl start peon
sudo systemctl start peon-clean
sudo systemctl start peon-clean.timer

Check that services are running:

systemctl status peon
systemctl status peon-clean.timer
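For scripted checks, systemctl is-active is easier to parse than systemctl status. The sketch below wraps it; inactive_units is a hypothetical helper, and the SYSTEMCTL override exists only so the function can be exercised without systemd.

```shell
# Hypothetical helper: list every given unit whose state is not "active".
# Succeeds only when all units are active.
inactive_units() {
  local systemctl_cmd unit state failed
  systemctl_cmd="${SYSTEMCTL:-systemctl}"   # override is for testing only
  failed=0
  for unit in "$@"; do
    # is-active prints the state and exits non-zero unless the unit is active.
    state="$("$systemctl_cmd" is-active "$unit" 2>/dev/null || true)"
    if [ "$state" != "active" ]; then
      echo "$unit: ${state:-unknown}"
      failed=1
    fi
  done
  return "$failed"
}
```

Running inactive_units peon peon-clean.timer prints nothing and exits with status 0 when both units are active.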

Step 9: Enable Auto-Start

Enable the Peon service and the cleanup timer to start automatically on boot:

sudo systemctl enable peon
sudo systemctl enable peon-clean.timer

Troubleshooting Service Start

If services fail to start, try invoking Peon through the Python module path in ExecStart:

ExecStart=/usr/bin/env python3 -m peon.cli

Use this in both the peon.service and peon-clean.service files if needed. In peon-clean.service, keep the clean argument at the end of the command.

Multi-GPU Configuration

If your server has multiple GPUs, you can configure Valohai to either:

Use all GPUs for a single job
Run multiple jobs in parallel, each with access to one GPU

Note: You can only choose one option at a time.

Setup Multiple Peon Instances

Follow these steps to run multiple Peon instances, one per GPU.

Prerequisites:

Valohai worker already installed (Ubuntu installer or manual installation)
Multiple GPUs available on the server

Steps:

1. Stop the original Peon service

sudo systemctl stop peon
sudo systemctl disable peon

2. Rename the service file

sudo mv /etc/systemd/system/peon.service /etc/systemd/system/peon@.service

3. Edit the service file

Open /etc/systemd/system/peon@.service and add these lines in the [Service] section:

[Service]
Environment='EXTRA_ENVIRONMENT_VARIABLES={"NVIDIA_VISIBLE_DEVICES": "%I"}'
Environment="IDENTITY=UUID.%i"

Add these after the EnvironmentFile=/etc/peon.config line.

Replace UUID with a generated UUID. You can generate one at uuidgenerator.net.
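A UUID can also be generated locally on the server. The following sketch assumes either the uuidgen command or the Linux kernel's /proc/sys/kernel/random/uuid file is available; new_uuid is a hypothetical helper name.

```shell
# Hypothetical helper: print one lowercase UUID.
new_uuid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr 'A-Z' 'a-z'
  elif [ -r /proc/sys/kernel/random/uuid ]; then
    # The Linux kernel returns a fresh random UUID on every read.
    cat /proc/sys/kernel/random/uuid
  else
    echo "no UUID source found" >&2
    return 1
  fi
}
```

Generate the value once and reuse it for every instance on the server. The instance suffix keeps each identity distinct.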

Complete example:

[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
Environment='EXTRA_ENVIRONMENT_VARIABLES={"NVIDIA_VISIBLE_DEVICES": "%I"}'
Environment="IDENTITY=12345678-1234-1234-1234-123456789abc.%i"
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure

[Install]
WantedBy=multi-user.target

4. Reload systemd

sudo systemctl daemon-reload

5. Enable and start Peon instances

Start one instance per GPU. For a server with 4 GPUs:

sudo systemctl enable --now peon@0
sudo systemctl enable --now peon@1
sudo systemctl enable --now peon@2
sudo systemctl enable --now peon@3

The number (@0, @1, etc.) corresponds to the GPU index.
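On servers with many GPUs, the instance names can be derived from the GPU indices that nvidia-smi reports instead of being typed by hand. This is a sketch: peon_units_for_gpus is a hypothetical helper that turns one index per line into peon@N unit names and ignores anything that is not a plain number.

```shell
# Hypothetical helper: read GPU indices from stdin, print one peon@<index> per line.
peon_units_for_gpus() {
  local index
  while IFS= read -r index || [ -n "$index" ]; do
    index="$(printf '%s' "$index" | tr -d '[:space:]')"   # drop stray whitespace
    case "$index" in
      ''|*[!0-9]*) continue ;;                             # skip blank or non-numeric lines
    esac
    echo "peon@$index"
  done
}
```

One possible pipeline, assuming GNU xargs: nvidia-smi --query-gpu=index --format=csv,noheader | peon_units_for_gpus | xargs -r sudo systemctl enable --now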

6. Verify instances are running

systemctl status peon@0
systemctl status peon@1
systemctl status peon@2
systemctl status peon@3

Each instance should show as active (running).

Important: Make sure the original peon service is stopped and disabled. Otherwise, too many Peon instances will compete for resources: one trying to use all GPUs and the others using one GPU each.

Troubleshooting

Worker Not Connecting

Check Peon service status:

sudo systemctl status peon

View Peon logs:

sudo journalctl -u peon -f

Common issues:

Incorrect Redis password
Queue address unreachable
Network firewall blocking port 63790
Missing environment variables in configuration
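The last item can be checked mechanically. The sketch below reports keys that are absent or empty in a configuration file; missing_config_keys is a hypothetical helper, and which keys Peon strictly requires is an assumption based on the example configuration in Step 3.

```shell
# Hypothetical helper.
# Usage: missing_config_keys <config-file> KEY...
# Prints each KEY that has no non-empty KEY=value line in the file.
missing_config_keys() {
  local file key
  file="$1"
  shift
  for key in "$@"; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "$key"
    fi
  done
}
```

For example: sudo sh -c '. ./helpers.sh; missing_config_keys /etc/peon.config CLOUD DOCKER_COMMAND INSTALLATION_TYPE QUEUES REDIS_URL', where helpers.sh is a placeholder for wherever you saved the function. No output means every listed key has a value.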

Docker Permission Errors

If you see "permission denied" errors when running Docker:

# Add user to docker group
sudo usermod -aG docker <username>

# Log out and back in, or run:
newgrp docker

# Verify
docker ps

# Restart Peon so the service picks up the new group membership
sudo systemctl restart peon

NVIDIA Docker Issues

Verify NVIDIA drivers:

nvidia-smi

Test NVIDIA Docker:

nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi

Check NVIDIA Docker installation:

which nvidia-docker

If the nvidia-docker command doesn't exist, ensure the wrapper script is installed (see Manual Installation, Step 1).

Jobs Not Starting

Check logs:

sudo journalctl -u peon -r

Look for errors related to:

Docker image pull failures
Network connectivity issues
Storage access problems

Verify Redis connection:

# Install redis-tools if needed
sudo apt-get install -y redis-tools

# Test connection (replace with your values); a healthy queue replies PONG
redis-cli -h <queue-address> -p 63790 --tls -a '<redis-password>' PING

No Jobs Running or Service Stuck

Restart the Peon service:

sudo systemctl restart peon

Check for recent logs:

sudo journalctl --all --since "1 hour ago" -u peon

High Disk Usage

The Peon cleanup service should automatically remove old caches and Docker images.

Verify cleanup timer is running:

systemctl status peon-clean.timer

Manually trigger cleanup:

sudo systemctl start peon-clean

Check Docker disk usage:

docker system df

Manual cleanup:

# Remove unused Docker images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup (careful - removes all unused resources)
docker system prune -a
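To decide whether a manual cleanup is needed, it helps to read the disk usage as a plain number. The following is a sketch; disk_usage_percent is a hypothetical helper built on the POSIX output format of df.

```shell
# Hypothetical helper: print the used-space percentage (without the % sign)
# of the filesystem that holds the given path.
disk_usage_percent() {
  # -P forces the POSIX format, so each filesystem stays on a single line.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, [ "$(disk_usage_percent /var/lib/docker)" -ge 85 ] && sudo systemctl start peon-clean triggers a cleanup when the disk is at least 85% full. Both the 85% threshold and the /var/lib/docker path (Docker's default data directory) are assumptions; adjust them to your setup.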

Collecting Logs for Support

If you need to contact Valohai support, collect logs:

# Restart service to get fresh logs
sudo systemctl restart peon

# Collect last hour of logs
sudo journalctl --all --since "1 hour ago" -u peon > peon-logs.txt

Send peon-logs.txt to Valohai support with:

Description of the issue
Queue name
Server specifications (CPU, RAM, GPU)
When the issue started

Getting Help

Contact Valohai support.

Include in support requests:

Operating system and version
Python version
Docker version
GPU model (if applicable)
Peon logs (see "Collecting Logs for Support" above)
Description of the issue and when it started
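Most of the items above can be gathered with one function. This is a sketch only: collect_support_info is a hypothetical helper, and any tool that is not installed is reported as "not available" instead of causing an error.

```shell
# Hypothetical helper: write OS, Python, Docker, and GPU details to the file named by $1.
collect_support_info() {
  local out os py dk gpus
  out="$1"
  # Prefer the distribution name from /etc/os-release; fall back to the kernel name.
  os="$( (. /etc/os-release && printf '%s' "$PRETTY_NAME") 2>/dev/null || uname -sr )"
  py="$(python3 --version 2>/dev/null || true)"
  dk="$(docker --version 2>/dev/null || true)"
  # Join multiple GPU names into one comma-separated line.
  gpus="$( (nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || true) | paste -sd, - )"
  {
    echo "Operating system: $os"
    echo "Python version: ${py:-not available}"
    echo "Docker version: ${dk:-not available}"
    echo "GPU model: ${gpus:-not available}"
  } > "$out"
}
```

For example, collect_support_info system-info.txt creates a file that can be attached next to peon-logs.txt. The file name is only an example.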
