SLURM
Connect your SLURM cluster to Valohai for high-performance computing workloads
Connect your SLURM cluster to Valohai to leverage high-performance computing (HPC) environments for machine learning workloads.
What is SLURM?
SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
You can connect SLURM clusters to Valohai to leverage HPC environments in your machine learning workloads.
Note: Cluster configurations vary between organizations. If you have specific requirements or need help, contact your Valohai representative.
Requirements
From your SLURM cluster:
Valohai agent available on the cluster
Either:
SLURM REST API enabled (it may not be enabled by default)
OR SSH access to a node that can run SLURM control tools (e.g., sbatch)
Important considerations:
Compared to virtual machines or Kubernetes, where Valohai runs workloads in Docker or Podman containers, Valohai may have less control over the execution environment on a SLURM cluster.
Container runtime:
If your SLURM environment is configured for Docker or Podman, workloads remain containerized in the familiar way
The current implementation has been built and tested with Singularity containers
If you want to use another solution (e.g., Podman HPC), discuss this with your Valohai contact first
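As a quick sanity check of the container runtime, you can try running a container through SLURM on a compute node. This is a minimal sketch that assumes Singularity is installed on the compute nodes; the partition name is a placeholder:
srun --partition=<partition> singularity exec docker://python:3.9 python3 --version
If this prints a Python version, the compute node can pull and run Docker images through Singularity.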
Install Valohai Agent on the Cluster
Download the Agent
Download the Valohai agent (Peon) TAR file on the cluster. Ask for the link from your Valohai contact at [email protected].
Extract the Agent
tar -xf peon.tar
The TAR file contains a single standard Python .whl file.
Set Up Python Environment
Set up a Python 3.9+ virtual environment in a directory accessible from the SLURM compute nodes.
You can use venv, virtualenv, or uv:
Using venv:
python3 -m venv /path/to/valohai-env
Using virtualenv:
virtualenv /path/to/valohai-env
Using uv:
uv venv /path/to/valohai-env
Choose a path that is accessible from all SLURM compute nodes (e.g., shared network storage).
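To confirm that the chosen path is visible from the compute nodes, you can, for example, list it through SLURM (the partition name is a placeholder):
srun --partition=<partition> ls -ld /path/to/valohai-env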
Install the Agent
Install the wheel within the virtual environment:
/path/to/valohai-env/bin/pip install /path/to/peon-*.whl
Replace /path/to/ with your actual paths.
Verify Installation
Confirm the agent works:
/path/to/valohai-env/bin/valohai-peon --help
This should display the help information for the Valohai agent.
Set Up the Valohai Environment
Prepare Configuration Information
Gather the following information to provide to Valohai:
If using SSH access:
SSH private key for the node that can run SLURM control tools
SSH connection details (hostname, port, username)
If using SLURM REST API:
REST API endpoint URL
Authentication credentials (username, JWT token)
SLURM configuration:
Create a JSON configuration file with the following structure:
{
"job": {
"account": "<account_name>",
"partition": "<partition_name>",
"time_limit_minutes": <time_limit_in_minutes>
},
"worker": {
"working_directory": "<working_dir_in_vm>",
"peon_command": "<peon_command>"
},
"ssh": {
"hostname": "<host_name>",
"username": "<username>",
"port": <ssh_port>,
"password": "<password>",
"allowed_host_keys": [
"<key_name>"
]
},
"slurmrestd": {
"username": "<username>",
"jwt": "<jwt_token>",
"endpoint": "<url_endpoint>",
"api_version": "v0.0.39"
}
}
Field descriptions:
job section:
account - SLURM account name (or null if not needed)
partition - SLURM partition name (or null if not needed)
time_limit_minutes - Time limit for jobs in minutes (or null for no limit)
worker section:
working_directory - Working directory path on compute nodes
peon_command - Full path to the valohai-peon executable (e.g., /path/to/valohai-env/bin/valohai-peon)
ssh section (provide if sending jobs via SSH):
hostname - SSH hostname or IP address
username - SSH username
port - SSH port (typically 22)
password - SSH password (or null if using key-based auth)
allowed_host_keys - List of allowed SSH host key names
slurmrestd section (provide if sending jobs via REST API):
username - REST API username
jwt - JWT token for authentication
endpoint - REST API endpoint URL
api_version - API version (e.g., v0.0.39)
Example Configuration
{
"job": {
"account": "ml_research",
"partition": "gpu",
"time_limit_minutes": 720
},
"worker": {
"working_directory": "/scratch/valohai",
"peon_command": "/shared/valohai-env/bin/valohai-peon"
},
"ssh": {
"hostname": "slurm-head.example.com",
"username": "valohai",
"port": 22,
"password": null,
"allowed_host_keys": [
"ssh-rsa AAAAB3NzaC1..."
]
}
}
Queue DNS Configuration
If you need a different DNS name for the queue from the perspective of the cluster, let your Valohai contact know.
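If you want to verify connectivity yourself, a basic check from the cluster is to resolve the queue hostname and open a TCP connection to it. The hostname and port below are placeholders that your Valohai contact will provide, and nc may not be installed on every node:
getent hosts <queue-hostname>
nc -zv <queue-hostname> <queue-port>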
Environment Setup
Self-Hosted Valohai
If you have a self-hosted Valohai installation:
1. Create environment
Navigate to the Valohai app admin site and create a new environment with type "SLURM".
2. Add SSH private key
Upload the SSH private key that Valohai will use to connect to your SLURM cluster.
3. Add configuration
Fill in the configuration JSON and add it under "Slurm config".
4. Configure queue (optional)
If you need a different DNS name for the queue from the cluster's perspective, set it up in "Worker queue host" under "Queue Configuration".
Managed Valohai
If you use Valohai's managed service (app.valohai.com):
A Valohai engineer will create the environment for you. Provide them with:
SSH private key (securely)
Configuration JSON
Any custom queue DNS requirements
Verify the Setup
After the environment is configured by Valohai:
1. Log in to Valohai
Navigate to app.valohai.com (or your self-hosted instance)
Check that the SLURM environment appears in your organization
2. Run a test execution
Create a test project
Select the SLURM environment
Run a simple execution (a minimal example step definition is sketched after this list)
3. Monitor the job
Check that the job appears in your SLURM queue
Monitor execution logs in Valohai
Verify the job completes successfully
4. Verify outputs
Check that outputs are saved correctly
Verify data is accessible from Valohai
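For step 2 above, a minimal valohai.yaml step is usually enough for a smoke test. This is only a sketch: the step name, image, and command are arbitrary examples, and the SLURM environment is selected when you create the execution:
- step:
    name: slurm-smoke-test
    image: python:3.9
    command: python3 -c "print('Hello from SLURM')"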
Troubleshooting
Jobs not appearing in SLURM queue
Check SSH connection:
If using SSH, verify Valohai can connect:
ssh -i /path/to/key username@hostname
Check SLURM commands:
Verify you can run SLURM commands from the SSH node:
sinfo
squeue
Check REST API:
If using REST API, test the endpoint:
curl -H "X-SLURM-USER-NAME: username" \
-H "X-SLURM-USER-TOKEN: jwt_token" \
https://your-endpoint/slurm/v0.0.39/jobs
Peon command not found
Verify installation path:
Check that the peon command path in your configuration is correct:
/path/to/valohai-env/bin/valohai-peon --version
Check accessibility from compute nodes:
SSH to a compute node and verify:
/path/to/valohai-env/bin/valohai-peon --help
Jobs failing immediately
Check working directory:
Ensure the working directory exists and is writable:
ls -ld /path/to/working/directory
Check permissions:
Verify the SLURM user has access to:
Working directory
Python virtual environment
Any shared storage
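For example, you can spot-check access from a compute node as the same user that Valohai uses to submit jobs. The paths below are the example values from the configuration above and the partition is a placeholder; replace them with your own:
srun --partition=<partition> bash -c 'ls -ld /scratch/valohai /shared/valohai-env && touch /scratch/valohai/.write-test && rm /scratch/valohai/.write-test'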
Container runtime issues
Verify Singularity:
If using Singularity, check it's available:
singularity --version
Test container:
singularity run docker://hello-world
Alternative runtimes:
If using Docker or Podman, discuss configuration with your Valohai contact.
Authentication failures
SSH key issues:
Verify the private key matches the public key on the cluster
Check key permissions (should be 600)
Ensure key is not password-protected (or provide password)
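For example (the key path is a placeholder):
chmod 600 /path/to/key
ssh-keygen -y -f /path/to/key
The second command prints the public key derived from the private key; compare it to the corresponding line in ~/.ssh/authorized_keys on the cluster.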
REST API issues:
Verify JWT token is valid and not expired
Check username is correct
Ensure API endpoint is accessible
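If your cluster uses SLURM's built-in JWT authentication, you can typically generate a fresh token with scontrol on the cluster; availability depends on how slurmrestd was configured, and the lifespan (in seconds) is only an example:
scontrol token lifespan=7200
The printed SLURM_JWT value can be used as the jwt field in the configuration.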
Getting Help
Valohai Support: [email protected]
Include in support requests:
SLURM version (sinfo --version)
Configuration JSON (sanitized)
Error messages from Valohai or SLURM logs
Output of squeue and sinfo commands
Whether using SSH or REST API
Container runtime in use (Singularity, Docker, Podman)
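To sanitize the configuration JSON before sending it, you can, for example, redact the secret fields with jq (assuming jq is available and the file is named slurm-config.json):
jq '.ssh.password = "REDACTED" | .slurmrestd.jwt = "REDACTED"' slurm-config.json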
SLURM logs:
Collect relevant logs from your SLURM cluster:
# Check SLURM controller logs
sudo journalctl -u slurmctld -n 100
# Check compute node logs
sudo journalctl -u slurmd -n 100
# View specific job logs
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode