Deploying the Valohai Compute and Data Layer
The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:
- Use your own on-premises machines to run machine learning jobs.
- Use your own cloud storage for storing training artifacts, like trained models, preprocessed datasets, visualizations, etc.
- Mount local data to your on-premises workers.
- Access databases and data warehouses directly from the workers, which are inside your network.
Valohai doesn’t have direct access to the on-premises machine that executes the machine learning jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that’s responsible for storing the job queue, job states, and short-term logs.
Installing the Valohai Worker (Peon)
The Valohai agent (Peon) is responsible for fetching new jobs, writing logs, and updating the job states for Valohai. If your server runs Ubuntu, you can simply use the peon-bringup script, also known as bup, to install all the required dependencies. For other Linux distributions, you’ll need to install the dependencies manually.
If you want to see a full list of the dependencies that bup will install, you can obtain that list from your Valohai contact. Moreover, if you already have some of these dependencies installed, you can use the --only flag to install only the missing ones. For example, --only=*peon* will install only the Valohai agent and no other dependencies.
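As an illustration, a run of bup restricted to the agent could look like the following; the file name and location of bup are assumptions here, so use whatever your Valohai contact provides:
chmod +x ./bup                # assumed file name for the downloaded peon-bringup
sudo ./bup                    # installs all required dependencies
sudo ./bup --only='*peon*'    # or: install only the Valohai agent (quote the glob so the shell doesn’t expand it)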
Requirements
Before starting the installation, you’ll need the following information from Valohai:
- name: the queue name that this on-premises machine will use.
- queue-address: the address that will be assigned to the queue in your subscription.
- redis-password: the password your queue uses. It is usually stored in your cloud provider’s Secret Manager.
- url: the download URL for the Valohai worker agent.
What’s a queue name?
The queue name is a name that you define to add that instance to a queue group. For example:
- myorg-onprem-1
- myorg-onprem-machine-name
- myorg-onprem-gpus
- myorg-onprem-gpus-prod
Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.
Environment Requirements
- Python 3.8+
- NVIDIA drivers
- Docker
  - Remember to choose the correct distribution by visiting the Docker installation guide.
- nvidia-docker
  - Check the NVIDIA documentation for installation instructions for nvidia-docker.
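As a quick sanity check before installing Peon, you can confirm that the tooling is in place (standard commands; version output will vary by installation):
python3 --version   # should report 3.8 or newer
docker --version
nvidia-smi          # GPU machines only; verifies the NVIDIA drivers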
NVIDIA drivers
NVIDIA drivers and nvidia-docker are only needed if you plan to use the GPU on the machine. Verify that they work by launching a GPU-enabled container locally and running nvidia-smi inside the container.
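For example, a minimal smoke test could look like the following; the CUDA image tag is only an example, so substitute any GPU-enabled image you have at hand:
# Assumes Docker 19.03+ with the NVIDIA container toolkit installed.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi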
nvidia-docker
Peon (the Valohai agent) expects to call either docker or nvidia-docker, both without arguments. It doesn’t support docker --runtime=nvidia natively yet. To fix that, you should install a wrapper script for nvidia-docker:
cd /usr/local/bin
# Download the nvidia-docker wrapper script from the NVIDIA repository.
curl -fsSL https://raw.githubusercontent.com/NVIDIA/nvidia-docker/master/nvidia-docker > nvidia-docker
# Make the script executable.
chmod u+x nvidia-docker
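You can then check that the wrapper resolves and can launch a GPU container; again, the image tag is just an example:
command -v nvidia-docker    # should print /usr/local/bin/nvidia-docker
nvidia-docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi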
Download and Install Peon Manually
Make sure you have wget and tar installed on the machine. You can get the <URL> from your Valohai contact.
# Download the Peon release archive (get the actual URL from Valohai).
wget <URL>
mkdir peon
# Extract the archive into the peon/ directory.
tar -C peon/ -xvf peon.tar
# Install the bundled Python wheels.
pip install peon/*.whl
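The installation should place a valohai-peon entry point on your PATH. It’s worth confirming where it landed, since the systemd units below need its absolute path:
command -v valohai-peon
# Typically /home/<user>/.local/bin/valohai-peon for a user install,
# or /usr/local/bin/valohai-peon for a system-wide one.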
Next, create a Peon configuration in /etc/peon.config. Make sure you replace the QUEUES field with the queue-name, and fill in REDIS_URL with your redis-password and the queue-address. The password should be stored in the Secret Manager/Key Vault of your cloud account.
Note: The DOCKER_COMMAND is either docker or nvidia-docker, depending on your installation.
CLOUD=none
DOCKER_COMMAND=nvidia-docker
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
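For illustration, filled-in QUEUES and REDIS_URL values could look like the following; the queue name, hostname, and password are hypothetical:
QUEUES=myorg-onprem-gpus
REDIS_URL=rediss://:s3cr3t-example@queue.myorg.example.com:63790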
You will also need to create the service file /etc/systemd/system/peon.service for the Valohai agent. The ExecStart should point to the local installation, such as /home/valohai/.local/bin/valohai-peon or /usr/local/bin/valohai-peon. In addition, the User and Group should be replaced by those that are relevant in your case.
[Unit]
Description=Valohai Peon Service
After=network.target
[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure
[Install]
WantedBy=multi-user.target
Lastly, you should also create the Peon cleanup service and a timer for it. This service will take care of cleaning the cached inputs and Docker images to avoid running out of disk space on the machine.
Create the file /etc/systemd/system/peon-clean.service. Remember that ExecStart should point to the local installation, and that User and Group should be replaced by the relevant ones here as well.
[Unit]
Description=Valohai Peon Cleanup
After=network.target
[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon clean
User=valohai
Group=valohai
[Install]
WantedBy=multi-user.target
The cleaning service will also need a timer. Copy-paste the following into /etc/systemd/system/peon-clean.timer.
[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service
[Timer]
# Every 10 minutes.
OnCalendar=*:0/10
Persistent=true
[Install]
WantedBy=timers.target
Make sure that the User defined in the files (here valohai) has Docker control rights. You can add them by running the following command:
sudo usermod -aG docker <User>
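Note that the new group membership only takes effect in a new login session. You can verify it with id (valohai is the example user here):
id valohai   # the output should include the docker group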
Now you can reload the unit files and start the services.
systemctl daemon-reload
systemctl start peon
systemctl start peon-clean
systemctl start peon-clean.timer
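To confirm that everything came up, check the unit states and follow the worker logs with the standard systemd tools:
systemctl status peon peon-clean.timer
journalctl -u peon -f   # follow the Peon logs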
If the services fail to start, try using /usr/bin/env python3 -m peon.cli in the ExecStart field of both the peon.service and the peon-clean.service files.
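In other words, the ExecStart lines would become the following; presumably the clean argument stays in place for the cleanup service. Remember to run systemctl daemon-reload after editing unit files.
# In peon.service:
ExecStart=/usr/bin/env python3 -m peon.cli
# In peon-clean.service:
ExecStart=/usr/bin/env python3 -m peon.cli clean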
Once everything works as expected, enable the services so that they start automatically at boot.
systemctl enable peon
systemctl enable peon-clean
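If you also want the cleanup schedule itself to survive reboots, the timer unit likely needs to be enabled as well:
systemctl enable peon-clean.timer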