# Linux Workers

Deploy Valohai workers on your on-premises Linux servers to run machine learning jobs on your own hardware.

## Overview

The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:

* Use your own on-premises machines to run machine learning jobs
* Use your own cloud storage for storing training artifacts (trained models, preprocessed datasets, visualizations)
* Mount local data to your on-premises workers
* Access databases and data warehouses directly from workers inside your network

Valohai doesn't have direct access to the on-premises machines that execute ML jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that's responsible for storing the job queue, job states, and short-term logs.

<figure><img src="https://4109720758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Ff3mjTRQNkASbnMbJqzJ2%2Fuploads%2Fgit-blob-d36dbf8beb67553b503cf0aca37654fca66f5df3%2Fimage%20(49).png?alt=media" alt=""><figcaption></figcaption></figure>

## Prerequisites

**Hardware requirements:**

* Linux server (Ubuntu 24.04 recommended)
* Python 3.10+ installed
* For GPU workloads: NVIDIA drivers and NVIDIA Container Toolkit installed

**From Valohai:**

Contact **<support@valohai.com>** to receive:

* `queue-name` - Name for this worker or group of workers
* `queue-address` - Address of your job queue
* `redis-password` - Password for the queue
* `url` - Download URL for the Valohai worker installer

> :exclamation: Let us know if you wish to use this queue/worker in [Dispatch mode](#multi-gpu-configuration), as it requires special configuration.

**Network requirements:**

* Worker can connect to the queue machine on port 63790
* Worker can access your object storage (S3, Azure Blob, GCS)
* Optional: Outbound internet access for pulling Docker images
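
A quick way to sanity-check these requirements from a prospective worker is to probe the queue port and your storage endpoint. This is only an illustrative check; `<queue-address>` is the placeholder from above, the storage URL is an example, and it assumes `nc` (netcat) and `curl` are available:

```shell
# Check that the queue machine accepts connections on the Redis port
nc -zv <queue-address> 63790

# Check that your object storage endpoint is reachable (S3 shown as an example)
curl -sI https://s3.amazonaws.com | head -n 1
```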

## Understanding Queue Names

The queue name identifies this worker or group of workers in Valohai.

**Examples:**

* `myorg-onprem-1`
* `myorg-onprem-machine-name`
* `myorg-onprem-gpus`
* `myorg-onprem-gpus-prod`

Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.

## Installation Methods

Choose your installation method based on your operating system and preferences.

### Ubuntu Installer (Recommended)

Automated installer for Ubuntu systems.

**What it installs:**

* Valohai agent (Peon)
* Docker (if not already installed)
* NVIDIA Container Toolkit (if needed for GPU workloads)
* System service configuration

{% hint style="warning" %}
**Warning:** Only use on fresh, dedicated machines. This will reinstall Docker and NVIDIA Container Toolkit, breaking any existing container workloads. Follow the [manual installation steps](#manual-installation) if you want more control.
{% endhint %}

**Installation:**

```bash
sudo su
apt-get update -y && apt-get install -y python3 python3-distutils

TEMPDIR=$(mktemp -d)
pushd $TEMPDIR

export NAME=<queue-name>
export QUEUE_ADDRESS=<queue-address>
export PASSWORD=<redis-password>
export URL=<bup-url>

curl $URL --output bup.pex
chmod u+x bup.pex
env "CLOUD=none" "ALLOW_MOUNTS=true" "INSTALLATION_TYPE=private-worker" "REDIS_URL=rediss://:$PASSWORD@$QUEUE_ADDRESS:63790" "PEON_WARDEN_ENABLED=true" 'PEON_EXTRA_CONFIG={"ALLOW_MOUNTS":"true"}' "QUEUES=$NAME" ./bup.pex

popd
```

Replace the placeholder values with the information from Valohai.

After installation, the Valohai agent will start automatically and begin pulling jobs from the queue.

> :bulb: Setting `PEON_WARDEN_ENABLED=true` enables monitoring the status of the agent in the Valohai UI. If you want this disabled, set the value to `false`.
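
To confirm the agent came up after the installer finished, you can check the service and follow its logs. This assumes the installer registered the agent as the `peon` systemd service, as in the manual installation below:

```shell
# Check that the agent service is active
sudo systemctl status peon

# Follow the agent logs
sudo journalctl -u peon -f
```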

### Manual Installation

For non-Ubuntu systems or custom configurations, follow the manual installation steps below.

## Manual Installation Steps

### User Prerequisites

In the previous section, the agent was installed to operate as the **root** user. \
In this section, however, it runs as the `valohai` user from the `valohai` group. These are example user and group names; you can create them on your machine or use any other already available user and group.

When running as `valohai` (**non-root**), you must adjust the permissions of the output files generated by each execution so that they can be cached and cleaned. This means granting "write" access to "others" on the outputs directory, so the `valohai` user can modify the files.

Use the command `chmod o+w -R /valohai/outputs` as the last instruction in the `command` section of your step in `valohai.yaml`. \
This ensures that the Valohai agent (Peon) can move and/or remove the output files.
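
As an illustration, a step in your `valohai.yaml` could end with that permission change. This is only a sketch; the step name, image, and training command are placeholders:

```yaml
- step:
    name: train-model
    image: python:3.12
    command:
      - python train.py
      # Give "others" write access so the Peon agent can cache and clean the outputs
      - chmod o+w -R /valohai/outputs
```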

Alternatively, you can run the agent as **root**.

### Step 1: Install Dependencies

**Python 3.10+**

Verify Python is installed:

```shell
python3 --version
```

**Docker**

Install Docker for your Linux distribution. Visit the [Docker installation guide](https://docs.docker.com/engine/install/) and select your distribution.
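
After installing, verify that the Docker daemon is running, for example:

```shell
docker --version
sudo docker run --rm hello-world
```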

**NVIDIA Drivers (GPU only)**

If using GPUs, install NVIDIA drivers appropriate for your GPU model.

Verify installation:

```shell
nvidia-smi
```

**NVIDIA Container Toolkit (GPU only)**

Install NVIDIA Container Toolkit to enable GPU access in containers.

Follow the [NVIDIA documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) for your distribution.

Verify it works:

```shell
docker run --rm --gpus all ubuntu nvidia-smi
```

### Step 2: Download and Install Peon

Download the Peon agent using the URL provided by Valohai. We recommend installing the agent inside a virtual environment.

```shell
sudo mkdir -p /opt/valohai
sudo chown -R valohai /opt/valohai
## If Peon will be run as root
## sudo chown -R root /opt/valohai
python3 -m venv /opt/valohai/peon-venv
pushd $(mktemp -d)
wget <URL> -O peon.tar
tar xvf peon.tar
/opt/valohai/peon-venv/bin/python -m pip install *.whl
popd
```

Replace `<URL>` with the download URL from Valohai.

> :warning: The `distutils` module was removed in Python 3.12, the default Python on Ubuntu 24.04. As some Valohai components still depend on it, install `setuptools` inside the virtual environment to ensure everything works as expected:
>
> ```bash
> /opt/valohai/peon-venv/bin/python -m pip install setuptools
> ```
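
Before wiring up the services, you can confirm that the agent's entry points landed in the virtual environment (the exact list of files may vary between Peon versions):

```shell
# Should list valohai-peon and related entry points
ls /opt/valohai/peon-venv/bin/ | grep valohai
```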

### Step 3: Configure Peon

Create the configuration file `/etc/peon.config`:

```shell
CLOUD=none
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
```

**Configuration values:**

Replace these placeholders:

* `<queue-name>` - Your queue name from Valohai
* `<redis-password>` - Redis password from Valohai (stored in your cloud Secret Manager)
* `<queue-address>` - Queue address from Valohai

### Step 4: Create Systemd Service

Create the service file `/etc/systemd/system/peon.service`:

```ini
[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/opt/valohai/peon-venv/bin/valohai-peon
User=valohai
Group=valohai
## If Peon will be run as root
## User=root
## Group=root
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

**Important:** Update these values:

* `ExecStart` - Path to valohai-peon binary (use `sudo find / -iname "valohai-peon" -print` to find it)
  * If you installed Peon inside a virtual environment, the binary should be there
  * Common locations for global installation: `/home/valohai/.local/bin/valohai-peon` or `/usr/local/bin/valohai-peon`
* `User` - Linux user that will run the service
* `Group` - Linux group for the user
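
After editing the unit file, you can optionally ask systemd to lint it before starting anything:

```shell
systemd-analyze verify /etc/systemd/system/peon.service
```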

### Step 5: Create Cleanup Service

Create `/etc/systemd/system/peon-clean.service`:

```ini
[Unit]
Description=Valohai Peon Cleanup
After=network.target

[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/opt/valohai/peon-venv/bin/valohai-peon clean
User=valohai
Group=valohai
## If Peon cleaner will be run as root
## User=root
## Group=root

[Install]
WantedBy=multi-user.target
```

Update `ExecStart`, `User`, and `Group` as needed.

### Step 6: Create Cleanup Timer

Create `/etc/systemd/system/peon-clean.timer`:

```ini
[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service

[Timer]
# Every 10 minutes
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```

This runs the cleanup service every 10 minutes to remove stale caches and Docker images.
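
Once the timer has been started (see Step 9), you can confirm its schedule with:

```shell
systemctl list-timers peon-clean.timer
```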

### Step 7: Create Warden Service

Warden is a component that allows monitoring the Peon status in the Valohai UI.

Create `/etc/systemd/system/peon-warden.service`:

```ini
[Unit]
Description=Valohai Peon Warden Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=-/etc/peon.config
ExecStart=/opt/valohai/peon-venv/bin/valohai-peon-warden
User=valohai
Group=valohai
Restart=on-failure
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target
```

### Step 8: Grant Docker Permissions

The user running Peon needs permissions to control Docker:

```shell
sudo usermod -aG docker <User>
```

Replace `<User>` with the user from your service files (e.g., `valohai`).
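
You can verify that the user can reach the Docker daemon, for example (using the `valohai` user from above):

```shell
sudo -u valohai docker ps
```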

### Step 9: Start Services

Reload systemd to recognize the new service files:

```shell
sudo systemctl daemon-reload
```

Start the Peon service:

```shell
sudo systemctl start peon
sudo systemctl start peon-clean
sudo systemctl start peon-clean.timer
sudo systemctl start peon-warden
```

Check that services are running:

```shell
systemctl status peon
systemctl status peon-clean.timer
systemctl status peon-warden
```

### Step 10: Enable Auto-Start

Enable services to start automatically on boot:

```shell
sudo systemctl enable peon
sudo systemctl enable peon-clean
sudo systemctl enable peon-clean.timer
sudo systemctl enable peon-warden
```

### Troubleshooting Service Start

For a global installation, if services fail to start, try using the full Python module path in `ExecStart`:

```ini
ExecStart=/usr/bin/env python3 -m peon.cli
```

Use this in both `peon.service` and `peon-clean.service` files if needed.

## Multi-GPU Configuration

If your server has multiple GPUs, you can configure Peon so that each execution specifies how many GPUs it should use.

If this value is less than the number of available GPUs, multiple executions can run at the same time on the same machine, giving much better resource utilization.

This kind of Peon configuration is called **Dispatch mode**.

For example, on a single machine with 4 GPUs, you could achieve any of the following utilization patterns:

* 4 executions, each using 1 GPU
* 1 execution using 1 GPU, 1 execution using 3 GPUs
* 1 execution using 4 GPUs

### Configure Dispatch mode

**Prerequisites:**

* Valohai worker already installed (Ubuntu installer or manual installation)
* Multiple GPUs available on the server

**Steps:**

1. **Stop the running Peon service**

> :bulb: Take a look at [inhibition mode](https://docs.valohai.com/installation-and-setup/update-the-valohai-agent#step-1-enable-inhibit-mode), which allows you to safely stop the running Peon service without losing any information or data.

```shell
sudo systemctl stop peon
```

2. **Add `DISPATCH_MODE=true` to the `peon.config` created [above](#step-3-configure-peon), so that it looks like this:**

```shellscript
CLOUD=none
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
DISPATCH_MODE=true
```

3. **Start the Peon service again**

```shellscript
sudo systemctl start peon
```

> :exclamation: **When contacting Valohai support (per the [Prerequisites](#prerequisites)), don't forget to mention that the queue you wish to create will be used in Dispatch mode.**

### Create an execution that uses Dispatch mode

When creating an execution, use the **`VH_GPUS`** environment variable to specify how many GPUs the execution requires.

> :bulb: Take a look at how you can [add an environment variable](https://docs.valohai.com/user-and-organization-management/getting-started/environment-variables) to your execution.

**`VH_GPUS`** takes an integer value greater than 0; if not specified, it defaults to 1 and the execution will occupy 1 GPU. \
Be careful not to assign a value greater than the number of GPUs available on any of the machines; in that case, the execution will stay in the queue indefinitely.
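
For example, a step in `valohai.yaml` could expose `VH_GPUS` so it can be overridden per execution. This is only a sketch; the step name, image, command, and default value are placeholders:

```yaml
- step:
    name: train-model
    image: python:3.12
    command:
      - python train.py
    environment-variables:
      # Number of GPUs this execution should occupy (Dispatch mode)
      - name: VH_GPUS
        default: "2"
```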

## Troubleshooting

### Worker Not Connecting

**Check Peon service status:**

```shell
sudo systemctl status peon
```

**View Peon logs:**

```shell
sudo journalctl -u peon -f
```

**Common issues:**

* Incorrect Redis password
* Queue address unreachable
* Network firewall blocking port 63790
* Missing environment variables in configuration

### Docker Permission Errors

If you see "permission denied" errors when running Docker:

```shell
# Add user to docker group
sudo usermod -aG docker <username>

# Log out and back in, or run:
newgrp docker

# Verify
docker ps
```

### NVIDIA GPU Issues

**Verify NVIDIA drivers:**

```bash
nvidia-smi
```

**Test GPU access in Docker:**

```shell
docker run --rm --gpus all ubuntu nvidia-smi
```

**Check NVIDIA Container Toolkit installation:**

```shell
docker info | grep -i nvidia
```

If GPU access doesn't work, verify that the NVIDIA Container Toolkit is properly installed and configured (see Manual Installation Step 1).

### Jobs Not Starting

**Check logs:**

```shell
sudo journalctl -u peon -r
```

Look for errors related to:

* Docker image pull failures
* Network connectivity issues
* Storage access problems

**Verify Redis connection:**

```shell
# Install redis-tools if needed
sudo apt-get install -y redis-tools

# Test connection (replace with your values)
redis-cli -h <queue-address> -p 63790 --tls -a '<redis-password>' PING
```

### No Jobs Running or Service Stuck

**Restart the Peon service:**

```shell
sudo systemctl restart peon
```

**Check for recent logs:**

```shell
sudo journalctl --all --since "1 hour ago" -u peon
```

### High Disk Usage

The Peon cleanup service should automatically remove old caches and Docker images.

**Verify cleanup timer is running:**

```shell
systemctl status peon-clean.timer
```

**Manually trigger cleanup:**

```shell
sudo systemctl start peon-clean
```

**Check Docker disk usage:**

```shell
docker system df
```

**Manual cleanup:**

```shell
# Remove unused Docker images
docker image prune -a

# Remove unused volumes
docker volume prune

# Full cleanup (careful - removes all unused resources)
docker system prune -a
```

## Collecting Logs for Support

If you need to contact Valohai support, collect logs:

```shell
# Restart service to get fresh logs
sudo systemctl restart peon

# Collect last hour of logs
sudo journalctl --all --since "1 hour ago" -u peon > peon-logs.txt
```

Send `peon-logs.txt` to <support@valohai.com> with:

* Description of the issue
* Queue name
* Server specifications (CPU, RAM, GPU)
* When the issue started

## Getting Help

**Valohai Support:** <support@valohai.com>

**Include in support requests:**

* Operating system and version
* Python version
* Docker version
* GPU model (if applicable)
* Peon logs (see "Collecting Logs for Support" above)
* Description of the issue and when it started
