Deploying the Valohai Compute and Data Layer
The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:
- Use your own on-premises machines to run machine learning jobs.
- Use your own cloud storage for storing training artifacts, like trained models, preprocessed datasets, visualizations, etc.
- Mount local data to your on-premises workers.
- Access databases and data warehouses directly from the workers inside your network.
Valohai doesn’t have direct access to the on-premises machines that execute the machine learning jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that’s responsible for storing the job queue, job states, and short-term logs.
Installing the Valohai Worker (Peon)
The Valohai agent (Peon) is responsible for fetching new jobs, writing logs, and updating job states for Valohai. If your server is running Ubuntu, you can simply use the peon-bringup script, also known as `bup`, to install all the required dependencies. For other Linux distributions, you’ll need to install the dependencies manually.
If you want to see the full list of dependencies that `bup` will install, you can obtain it from your Valohai contact. If you already have some of these dependencies installed, you can use the `--only` flag to install only the missing ones. For example, `--only=*peon*` will install only the Valohai agent and no other dependencies.
Before running the template, you’ll need the following information from Valohai:
- `name`: the queue name that this on-premises machine will use.
- `queue-address`: the address assigned to the queue in your subscription.
- `redis-password`: the password your queue uses. This is usually stored in your cloud provider’s secret manager.
- `url`: the download URL for the Valohai worker agent.
What’s a queue name?
The queue name is a name that you define to add an instance to a queue group. Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.
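As a hypothetical illustration, two identically configured GPU machines could share one queue while a CPU-only machine gets its own (the queue names below are made up):

```
# /etc/peon.config on gpu-machine-1 and gpu-machine-2
QUEUES=onprem-gpu

# /etc/peon.config on cpu-machine-1
QUEUES=onprem-cpu
```

Jobs submitted to the `onprem-gpu` queue can then be picked up by whichever of the two GPU machines is free.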
Installing the Dependencies Manually
On other Linux distributions, install the following dependencies yourself:
- Python 3.8+
- Docker: remember to choose the correct distribution by visiting the Docker installation guide.
- Nvidia drivers
- nvidia-docker: check the NVIDIA documentation for installation instructions.
Nvidia drivers and nvidia-docker are only needed if you plan to use the GPU on the machine. Verify that they work by launching a GPU-enabled container locally and running `nvidia-smi` inside the container.
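One way to run that check (a sketch; the CUDA image tag is only an example, and any GPU-enabled image works):

```shell
# Run nvidia-smi inside a disposable GPU container.
# With the nvidia-docker wrapper installed, this could also be:
#   nvidia-docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If the machine’s GPUs are listed in the output, the driver and container runtime are wired up correctly.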
Peon (the Valohai agent) expects to call either `docker` or `nvidia-docker`, both without arguments. It doesn’t yet support `docker --runtime=nvidia` natively. As a workaround, install the `nvidia-docker` wrapper script:

```shell
cd /usr/local/bin
curl -fsSL https://raw.githubusercontent.com/NVIDIA/nvidia-docker/master/nvidia-docker > nvidia-docker
chmod u+x nvidia-docker
```
Download and Install Peon Manually
Make sure you have `tar` installed on the machine. You can get the `<URL>` from your Valohai contact.

```shell
wget <URL>
mkdir peon
tar -C peon/ -xvf peon.tar
pip install peon/*.whl
```
Next, create a Peon configuration in `/etc/peon.config`. Make sure you replace `QUEUES` with your `queue-name`, and build `REDIS_URL` from your `redis-password` and `queue-address`. The password should be stored in the secret manager/key vault of your cloud account. `DOCKER_COMMAND` is either `docker` or `nvidia-docker`, depending on your installation.
```shell
CLOUD=none
DOCKER_COMMAND=nvidia-docker
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
```
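The `REDIS_URL` is simply the Redis password and queue address composed into a single URL (`rediss://` denoting Redis over TLS). A quick sketch with made-up placeholder values:

```shell
# Hypothetical values; use the ones provided by your Valohai contact.
REDIS_PASSWORD='s3cret-password'
QUEUE_ADDRESS='queue.example.com'

# Empty username, password after the colon, TLS Redis on port 63790.
REDIS_URL="rediss://:${REDIS_PASSWORD}@${QUEUE_ADDRESS}:63790"
echo "$REDIS_URL"
```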
You will also need to create the service file `/etc/systemd/system/peon.service` for the Valohai agent. The `ExecStart` should point to the local installation, such as `/home/valohai/.local/bin/valohai-peon`. In addition, the `User` and `Group` should be replaced by those that are relevant in your case.
```ini
[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
Lastly, you should also create the Peon cleanup service and a timer for it. This service will take care of cleaning the cached inputs and Docker images to avoid running out of disk space on the machine.
Create the file `/etc/systemd/system/peon-clean.service`. Here too, `ExecStart` should point to the local installation, and `User` and `Group` should be replaced with the relevant values.
```ini
[Unit]
Description=Valohai Peon Cleanup
After=network.target

[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon clean
User=valohai
Group=valohai

[Install]
WantedBy=multi-user.target
```
The cleanup service will also need a timer. Copy-paste the following into `/etc/systemd/system/peon-clean.timer`:
```ini
[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service

[Timer]
# Every 10 minutes.
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```
Make sure that the `User` defined in the files (here `valohai`) has Docker control rights. You can add them by running the following command:

```shell
sudo usermod -aG docker <User>
```
Now you can reload the unit files and start the service.
```shell
systemctl daemon-reload
systemctl start peon
systemctl start peon-clean
systemctl start peon-clean.timer
```
If the services fail to start, try using `/usr/bin/env python3 -m peon.cli` in the `ExecStart` field of both the `peon.service` and the `peon-clean.service` files.
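If they still fail, the systemd journal usually shows the underlying error (unit names as defined above):

```shell
# Show the last 100 log lines from the agent and cleanup units.
journalctl -u peon -u peon-clean --no-pager -n 100
```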
Once everything works as expected, set the services to start automatically at boot.
```shell
systemctl enable peon
systemctl enable peon-clean
systemctl enable peon-clean.timer
```