SLURM

Connect your SLURM cluster to Valohai for high-performance computing workloads

Connect your SLURM cluster to Valohai to leverage high-performance computing (HPC) environments for machine learning workloads.

What is SLURM?

SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Connecting a SLURM cluster to Valohai lets you run your machine learning workloads as jobs on your existing HPC infrastructure.

Note: Cluster configurations vary between organizations. If you have specific requirements or need help, contact your Valohai representative.

Requirements

On the SLURM cluster side, you need:

  • Valohai agent available on the cluster

  • Either:

    • SLURM REST API (slurmrestd) available (it is not necessarily enabled by default)

    • OR SSH access to a node that can run SLURM control tools (e.g., sbatch)
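A quick way to check which option your cluster supports is to run the control tools on the node you plan to use and, if applicable, probe the REST API. The endpoint URL and credentials below are placeholders; see the configuration section for how they are obtained.

# On the node you plan to reach over SSH, confirm the SLURM control tools work
sbatch --version
sinfo

# If you plan to use the REST API, confirm that slurmrestd responds
# (placeholder endpoint and credentials)
curl -H "X-SLURM-USER-NAME: username" \
     -H "X-SLURM-USER-TOKEN: jwt_token" \
     https://your-endpoint/slurm/v0.0.39/ping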

Important considerations:

Compared to jobs on virtual machines or Kubernetes, where Valohai runs workloads in Docker or Podman containers that it manages, Valohai may have less control over the execution environment on a SLURM cluster.

Container runtime:

  • If your SLURM environment is configured for Docker or Podman, workloads remain containerized in the familiar way

  • The current implementation has been built and tested with Singularity containers

  • If you want to use another solution (e.g., Podman HPC), discuss this with your Valohai contact first

Install Valohai Agent on the Cluster

Download the Agent

Download the Valohai agent (Peon) TAR file onto the cluster. Ask your Valohai contact at [email protected] for the download link.

Extract the Agent

tar -xf peon.tar

The TAR file contains a single standard Python .whl file.

Set Up Python Environment

Set up a Python 3.9+ virtual environment in a directory accessible from the SLURM compute nodes.

You can use venv, virtualenv, or uv:

Using venv:

python3 -m venv /path/to/valohai-env

Using virtualenv:

virtualenv /path/to/valohai-env

Using uv:

uv venv /path/to/valohai-env

Choose a path that is accessible from all SLURM compute nodes (e.g., shared network storage).
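To confirm that the path is actually visible from the compute nodes, you can list it through SLURM itself; the partition name and path below are placeholders:

srun --partition=<partition_name> ls /path/to/valohai-env/bin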

Install the Agent

Install the wheel within the virtual environment:

/path/to/valohai-env/bin/pip install /path/to/peon-*.whl

Replace /path/to/ with your actual paths.

Verify Installation

Confirm the agent works:

/path/to/valohai-env/bin/valohai-peon --help

This should display the help information for the Valohai agent.

Set Up the Valohai Environment

Prepare Configuration Information

Gather the following information to provide to Valohai:

If using SSH access:

  • SSH private key to the node that can run SLURM control tools

  • SSH connection details (hostname, port, username)

If using SLURM REST API:

  • REST API endpoint URL

  • Authentication credentials (username, JWT token)
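If your cluster uses SLURM's built-in JWT authentication, a token can typically be generated on the cluster with scontrol. The lifespan below (in seconds) is only an example value:

scontrol token username=<username> lifespan=86400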

SLURM configuration:

Create a JSON configuration file with the following structure:

{
  "job": {
    "account": "<account_name>",
    "partition": "<partition_name>",
    "time_limit_minutes": <time_limit_in_minutes>
  },
  "worker": {
    "working_directory": "<working_dir_in_vm>",
    "peon_command": "<peon_command>"
  },
  "ssh": {
    "hostname": "<host_name>",
    "username": "<username>",
    "port": <ssh_port>,
    "password": "<password>",
    "allowed_host_keys": [
      "<key_name>"
    ]
  },
  "slurmrestd": {
    "username": "<username>",
    "jwt": "<jwt_token>",
    "endpoint": "<url_endpoint>",
    "api_version": "v0.0.39"
  }
}

Field descriptions (provide either the ssh section or the slurmrestd section, depending on how jobs are submitted):

job section:

  • account - SLURM account name (or null if not needed)

  • partition - SLURM partition name (or null if not needed)

  • time_limit_minutes - Time limit for jobs in minutes (or null for no limit)

worker section:

  • working_directory - Working directory path on compute nodes

  • peon_command - Full path to the valohai-peon executable (e.g., /path/to/valohai-env/bin/valohai-peon)

ssh section (provide if sending jobs via SSH):

  • hostname - SSH hostname or IP address

  • username - SSH username

  • port - SSH port (typically 22)

  • password - SSH password (or null if using key-based auth)

  • allowed_host_keys - List of accepted SSH host public keys for the target host

slurmrestd section (provide if sending jobs via REST API):

  • username - REST API username

  • jwt - JWT token for authentication

  • endpoint - REST API endpoint URL

  • api_version - API version (e.g., v0.0.39)

Example Configuration

{
  "job": {
    "account": "ml_research",
    "partition": "gpu",
    "time_limit_minutes": 720
  },
  "worker": {
    "working_directory": "/scratch/valohai",
    "peon_command": "/shared/valohai-env/bin/valohai-peon"
  },
  "ssh": {
    "hostname": "slurm-head.example.com",
    "username": "valohai",
    "port": 22,
    "password": null,
    "allowed_host_keys": [
      "ssh-rsa AAAAB3NzaC1..."
    ]
  }
}
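If you submit jobs through the REST API instead of SSH, the same configuration carries a slurmrestd section in place of the ssh section. The values below are illustrative placeholders:

{
  "job": {
    "account": "ml_research",
    "partition": "gpu",
    "time_limit_minutes": 720
  },
  "worker": {
    "working_directory": "/scratch/valohai",
    "peon_command": "/shared/valohai-env/bin/valohai-peon"
  },
  "slurmrestd": {
    "username": "valohai",
    "jwt": "<jwt_token>",
    "endpoint": "https://slurm-head.example.com:6820",
    "api_version": "v0.0.39"
  }
}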

Queue DNS Configuration

If the queue needs to be reached under a different DNS name from inside the cluster, let your Valohai contact know.

Environment Setup

Self-Hosted Valohai

If you have a self-hosted Valohai installation:

1. Create environment

Navigate to the Valohai app admin site and create a new environment with type "SLURM".

2. Add SSH private key

Upload the SSH private key that Valohai will use to connect to your SLURM cluster.

3. Add configuration

Fill in the configuration JSON and add it under "Slurm config".

4. Configure queue (optional)

If you need a different DNS name for the queue from the cluster's perspective, set it up in "Worker queue host" under "Queue Configuration".

Managed Valohai

If you use Valohai's managed service (app.valohai.com):

A Valohai engineer will create the environment for you. Provide them with:

  • SSH private key (securely)

  • Configuration JSON

  • Any custom queue DNS requirements

Verify the Setup

After the environment is configured by Valohai:

1. Log in to Valohai

  • Navigate to app.valohai.com (or your self-hosted instance)

  • Check that the SLURM environment appears in your organization

2. Run a test execution

  • Create a test project

  • Select the SLURM environment

  • Run a simple execution (a minimal valohai.yaml sketch follows after this list)

3. Monitor the job

  • Check that the job appears in your SLURM queue

  • Monitor execution logs in Valohai

  • Verify the job completes successfully

4. Verify outputs

  • Check that outputs are saved correctly

  • Verify data is accessible from Valohai
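For the test execution, a minimal valohai.yaml along these lines is usually enough. The step name, image, and commands are placeholders and are not specific to SLURM:

- step:
    name: slurm-smoke-test
    image: python:3.11-slim
    command:
      - python --version
      - echo "Hello from SLURM"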

Troubleshooting

Jobs not appearing in SLURM queue

Check SSH connection:

If using SSH, verify Valohai can connect:

ssh -i /path/to/key username@hostname

Check SLURM commands:

Verify you can run SLURM commands from the SSH node:

sinfo
squeue

Check REST API:

If using REST API, test the endpoint:

curl -H "X-SLURM-USER-NAME: username" \
     -H "X-SLURM-USER-TOKEN: jwt_token" \
     https://your-endpoint/slurm/v0.0.39/jobs

Peon command not found

Verify installation path:

Check that the peon command path in your configuration is correct:

/path/to/valohai-env/bin/valohai-peon --version

Check accessibility from compute nodes:

SSH to a compute node and verify:

/path/to/valohai-env/bin/valohai-peon --help

Jobs failing immediately

Check working directory:

Ensure the working directory exists and is writable:

ls -ld /path/to/working/directory

Check permissions:

Verify the SLURM user has access to:

  • Working directory

  • Python virtual environment

  • Any shared storage
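A quick way to test write access from an actual compute node, using the working directory from the example configuration and a placeholder partition name:

srun --partition=<partition_name> touch /scratch/valohai/.valohai_write_test
srun --partition=<partition_name> rm /scratch/valohai/.valohai_write_test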

Container runtime issues

Verify Singularity:

If using Singularity, check it's available:

singularity --version

Test container:

singularity run docker://hello-world

Alternative runtimes:

If using Docker or Podman, discuss configuration with your Valohai contact.

Authentication failures

SSH key issues:

  • Verify the private key matches the public key on the cluster

  • Check key permissions (should be 600)

  • Ensure key is not password-protected (or provide password)
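For example, to tighten the key permissions and retry the connection non-interactively:

chmod 600 /path/to/key
ssh -i /path/to/key -o BatchMode=yes username@hostname 'sinfo --version'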

REST API issues:

  • Verify JWT token is valid and not expired

  • Check username is correct

  • Ensure API endpoint is accessible

Getting Help

Valohai Support: [email protected]

Include in support requests:

  • SLURM version (sinfo --version)

  • Configuration JSON (sanitized)

  • Error messages from Valohai or SLURM logs

  • Output of squeue and sinfo commands

  • Whether using SSH or REST API

  • Container runtime in use (Singularity, Docker, Podman)

SLURM logs:

Collect relevant logs from your SLURM cluster:

# Check SLURM controller logs
sudo journalctl -u slurmctld -n 100

# Check compute node logs
sudo journalctl -u slurmd -n 100

# View specific job logs
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode
