SLURM

Connect your SLURM cluster to Valohai to leverage high-performance computing (HPC) environments for machine learning workloads.

What is SLURM?

SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Connecting a SLURM cluster to Valohai lets your machine learning workloads run on the HPC resources that the cluster manages.

Note: Cluster configurations vary between organizations. If you have specific requirements or need help, contact your Valohai representative.

Requirements

From your SLURM cluster:

  • Valohai agent available on the cluster

  • Either:

    • SLURM REST API (slurmrestd) enabled; it is often not enabled by default

    • OR SSH access to a node that can run SLURM control tools (e.g., sbatch)

Important considerations:

On virtual machines and Kubernetes, Valohai runs workloads in Docker or Podman containers. In a SLURM environment, Valohai may have less control over how workloads are run.

Container runtime:

  • If your SLURM environment is configured for Docker or Podman, workloads stay containerized in the same way as on other Valohai environments

  • The current implementation has been built and tested with Singularity containers

  • If you want to use another solution (e.g., Podman HPC), discuss this with your Valohai contact first

Install Valohai Agent on the Cluster

Download the Agent

Download the Valohai agent (Peon) TAR file onto the cluster. Ask your Valohai contact for the download link.

Extract the Agent

The TAR file contains a single standard Python .whl file.
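
A minimal sketch of the extraction step. The archive name valohai-peon.tar and the target directory are placeholders, not names confirmed by Valohai; use the file you actually downloaded.

```shell
# Placeholder names: replace with the archive you downloaded and a directory of your choice.
PEON_TAR="${PEON_TAR:-valohai-peon.tar}"
EXTRACT_DIR="${EXTRACT_DIR:-valohai-peon}"

mkdir -p "$EXTRACT_DIR"
if [ -f "$PEON_TAR" ]; then
    # Unpack the archive, then list the wheel file it contains.
    tar -xf "$PEON_TAR" -C "$EXTRACT_DIR"
    ls "$EXTRACT_DIR"/*.whl
else
    echo "Archive not found: $PEON_TAR" >&2
fi
```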

Set Up Python Environment

Set up a Python 3.9+ virtual environment in a directory accessible from the SLURM compute nodes.

You can use venv, virtualenv, or uv:

Using venv:
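
A minimal sketch, assuming python3 on the cluster is version 3.9 or newer. The path is a placeholder for a directory on shared storage.

```shell
# Placeholder path: replace with a directory visible to all compute nodes.
VENV_DIR="${VENV_DIR:-/path/to/valohai-env}"

# Create the virtual environment with the standard-library venv module.
python3 -m venv "$VENV_DIR" || echo "Could not create $VENV_DIR" >&2
```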

Using virtualenv:
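
A minimal sketch, assuming virtualenv is already installed on the cluster and python3 is version 3.9 or newer. The path is a placeholder.

```shell
# Placeholder path: replace with a directory visible to all compute nodes.
VENV_DIR="${VENV_DIR:-/path/to/valohai-env}"

# --python selects the interpreter that the environment is built on.
virtualenv --python python3 "$VENV_DIR" || echo "Could not create $VENV_DIR" >&2
```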

Using uv:
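
A minimal sketch, assuming uv is installed on the cluster. The path is a placeholder. The --seed flag is included because uv does not install pip into new environments by default, and the install step relies on pip.

```shell
# Placeholder path: replace with a directory visible to all compute nodes.
VENV_DIR="${VENV_DIR:-/path/to/valohai-env}"

# --seed installs pip into the environment so a later "pip install" works.
uv venv --seed "$VENV_DIR" || echo "Could not create $VENV_DIR" >&2
```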

Choose a path that is accessible from all SLURM compute nodes (e.g., shared network storage).

Install the Agent

Install the wheel within the virtual environment:
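
A minimal sketch of the install command. Both paths are placeholders, and the wheel is matched with a glob because its exact file name is not stated here.

```shell
# Placeholder paths: the virtual environment and the directory holding the extracted wheel.
VENV_DIR="${VENV_DIR:-/path/to/valohai-env}"
WHEEL_DIR="${WHEEL_DIR:-/path/to/valohai-peon}"

# Run pip through the environment's own interpreter so the agent lands inside it.
"$VENV_DIR/bin/python" -m pip install "$WHEEL_DIR"/*.whl \
    || echo "Install failed; check both paths" >&2
```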

Replace /path/to/ with your actual paths.

Verify Installation

Confirm the agent works:
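
A minimal sketch, assuming the agent accepts the conventional --help flag. The path is a placeholder for your virtual environment.

```shell
# Placeholder path to the agent executable inside the virtual environment.
PEON="${PEON:-/path/to/valohai-env/bin/valohai-peon}"

"$PEON" --help || echo "valohai-peon did not run from $PEON" >&2
```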

This should display the help information for the Valohai agent.

Set Up the Valohai Environment

Prepare Configuration Information

Gather the following information to provide to Valohai:

If using SSH access:

  • SSH private key to the node that can run SLURM control tools

  • SSH connection details (hostname, port, username)

If using SLURM REST API:

  • REST API endpoint URL

  • Authentication credentials (username, JWT token)

SLURM configuration:

Create a JSON configuration file with the following structure:
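
The skeleton below is a sketch assembled from the field descriptions that follow. The field names come from those descriptions; arranging them as four top-level objects is an assumption, and every value is a placeholder. Include only one of ssh or slurmrestd, depending on how jobs are sent, and confirm the exact layout with your Valohai contact.

```json
{
  "job": {
    "account": null,
    "partition": null,
    "time_limit_minutes": null
  },
  "worker": {
    "working_directory": "/path/to/working-directory",
    "peon_command": "/path/to/valohai-env/bin/valohai-peon"
  },
  "ssh": {
    "hostname": "<ssh-hostname>",
    "username": "<ssh-username>",
    "port": 22,
    "password": null,
    "allowed_host_keys": []
  },
  "slurmrestd": {
    "username": "<rest-api-username>",
    "jwt": "<jwt-token>",
    "endpoint": "<rest-api-endpoint-url>",
    "api_version": "v0.0.39"
  }
}
```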

Field descriptions:

job section:

  • account - SLURM account name (or null if not needed)

  • partition - SLURM partition name (or null if not needed)

  • time_limit_minutes - Time limit for jobs in minutes (or null for no limit)

worker section:

  • working_directory - Working directory path on compute nodes

  • peon_command - Full path to the valohai-peon executable (e.g., /path/to/valohai-env/bin/valohai-peon)

ssh section (provide if sending jobs via SSH):

  • hostname - SSH hostname or IP address

  • username - SSH username

  • port - SSH port (typically 22)

  • password - SSH password (or null if using key-based auth)

  • allowed_host_keys - List of allowed SSH host key names

slurmrestd section (provide if sending jobs via REST API):

  • username - REST API username

  • jwt - JWT token for authentication

  • endpoint - REST API endpoint URL

  • api_version - API version (e.g., v0.0.39)

Example Configuration
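
The following sketch writes an SSH-based configuration to a file and checks that it is well-formed JSON. Every value is hypothetical (account, partition, paths, hostname, username), the file name slurm-config.json is arbitrary, and omitting the unused slurmrestd section is an assumption. The allowed_host_keys entry is left as a placeholder because the expected format is not specified here.

```shell
# All values below are hypothetical; replace them with your cluster's details.
cat > slurm-config.json <<'EOF'
{
  "job": {
    "account": "ml-research",
    "partition": "gpu",
    "time_limit_minutes": 240
  },
  "worker": {
    "working_directory": "/shared/valohai/work",
    "peon_command": "/shared/valohai/valohai-env/bin/valohai-peon"
  },
  "ssh": {
    "hostname": "login.hpc.example.com",
    "username": "valohai",
    "port": 22,
    "password": null,
    "allowed_host_keys": ["<host-key-name>"]
  }
}
EOF

# Catch syntax errors (trailing commas, missing quotes) before sharing the file.
python3 -m json.tool slurm-config.json > /dev/null \
    && echo "slurm-config.json is valid JSON" \
    || echo "slurm-config.json could not be validated" >&2
```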

Queue DNS Configuration

If you need a different DNS name for the queue from the perspective of the cluster, let your Valohai contact know.

Environment Setup

Self-Hosted Valohai

If you have a self-hosted Valohai installation:

1. Create environment

Navigate to the Valohai app admin site and create a new environment with type "SLURM".

2. Add SSH private key

Upload the SSH private key that Valohai will use to connect to your SLURM cluster.

3. Add configuration

Fill in the configuration JSON and add it under "Slurm config".

4. Configure queue (optional)

If you need a different DNS name for the queue from the cluster's perspective, set it up in "Worker queue host" under "Queue Configuration".

Managed Valohai

If you use Valohai's managed service (app.valohai.com):

A Valohai engineer will create the environment for you. Provide them with:

  • SSH private key (securely)

  • Configuration JSON

  • Any custom queue DNS requirements

Verify the Setup

After the environment is configured by Valohai:

1. Log in to Valohai

  • Navigate to app.valohai.com (or your self-hosted instance)

  • Check that the SLURM environment appears in your organization

2. Run a test execution

  • Create a test project

  • Select the SLURM environment

  • Run a simple execution

3. Monitor the job

  • Check that the job appears in your SLURM queue

  • Monitor execution logs in Valohai

  • Verify the job completes successfully

4. Verify outputs

  • Check that outputs are saved correctly

  • Verify data is accessible from Valohai

Troubleshooting

Jobs not appearing in SLURM queue

Check SSH connection:

If using SSH, verify Valohai can connect:
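
A minimal sketch of a connection test using the same key and connection details you gave to Valohai. The key path, username, and hostname are placeholders.

```shell
# Placeholders: use the key, user, host, and port from your configuration.
SSH_KEY="${SSH_KEY:-/path/to/private_key}"
SSH_TARGET="${SSH_TARGET:-username@hostname}"
SSH_PORT="${SSH_PORT:-22}"

# BatchMode disables password prompts, so a key problem fails instead of hanging.
ssh -i "$SSH_KEY" -p "$SSH_PORT" -o BatchMode=yes -o ConnectTimeout=10 \
    "$SSH_TARGET" 'echo "SSH connection OK"' \
    || echo "SSH connection failed" >&2
```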

Check SLURM commands:

Verify you can run SLURM commands from the SSH node:
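
A minimal sketch to run on the SSH node as the user Valohai connects with. It checks that the common SLURM client tools are on the PATH and prints their versions.

```shell
missing=""
for tool in sinfo squeue sbatch; do
    if command -v "$tool" >/dev/null 2>&1; then
        # Each of these tools prints its version and exits.
        "$tool" --version
    else
        missing="$missing $tool"
    fi
done

if [ -n "$missing" ]; then
    echo "Not found on PATH:$missing" >&2
fi
```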

Check REST API:

If using REST API, test the endpoint:
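
A minimal sketch that calls the slurmrestd ping endpoint with JWT authentication headers. The endpoint URL, username, and token are placeholders, and the API version should match the api_version in your configuration.

```shell
# Placeholders: use the values from the slurmrestd section of your configuration.
SLURM_ENDPOINT="${SLURM_ENDPOINT:-https://slurm.example.com:6820}"
SLURM_API_VERSION="${SLURM_API_VERSION:-v0.0.39}"
SLURM_USER="${SLURM_USER:-username}"
SLURM_JWT="${SLURM_JWT:-replace-with-jwt}"

PING_URL="$SLURM_ENDPOINT/slurm/$SLURM_API_VERSION/ping"

# --fail makes curl return a non-zero status on HTTP errors such as 401.
curl -sS --fail --max-time 10 \
    -H "X-SLURM-USER-NAME: $SLURM_USER" \
    -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
    "$PING_URL" \
    || echo "REST API check failed for $PING_URL" >&2
```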

Peon command not found

Verify installation path:

Check that the peon command path in your configuration is correct:
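
A minimal sketch to run on the cluster. The path is a placeholder; use the peon_command value from your configuration.

```shell
# Placeholder: the peon_command value from your configuration.
PEON="${PEON:-/path/to/valohai-env/bin/valohai-peon}"

if [ -x "$PEON" ]; then
    echo "Found executable: $PEON"
else
    echo "Not found or not executable: $PEON" >&2
fi
```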

Check accessibility from compute nodes:

SSH to a compute node and verify:
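
A minimal sketch, assuming compute nodes accept SSH from where you run it. The node name is hypothetical, the path is a placeholder, and --help is assumed to be a supported flag.

```shell
# Hypothetical node name and placeholder path.
COMPUTE_NODE="${COMPUTE_NODE:-compute-node-01}"
PEON="${PEON:-/path/to/valohai-env/bin/valohai-peon}"

# Run the check on the compute node itself, where jobs will execute.
ssh -o BatchMode=yes -o ConnectTimeout=10 "$COMPUTE_NODE" \
    "test -x '$PEON' && '$PEON' --help" \
    || echo "valohai-peon is not usable on $COMPUTE_NODE" >&2
```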

Jobs failing immediately

Check working directory:

Ensure the working directory exists and is writable:
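
A minimal sketch to run on a compute node as the user that runs SLURM jobs. The path is a placeholder for the working_directory value in your configuration.

```shell
# Placeholder: the working_directory value from your configuration.
WORK_DIR="${WORK_DIR:-/path/to/working-directory}"

if [ -d "$WORK_DIR" ] && [ -w "$WORK_DIR" ]; then
    WORK_DIR_OK="yes"
    echo "Working directory exists and is writable: $WORK_DIR"
else
    WORK_DIR_OK="no"
    echo "Working directory is missing or not writable: $WORK_DIR" >&2
fi
```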

Check permissions:

Verify the SLURM user has access to:

  • Working directory

  • Python virtual environment

  • Any shared storage

Container runtime issues

Verify Singularity:

If using Singularity, check it's available:
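
A minimal sketch to run on a compute node. It only checks that the singularity command is on the PATH and prints its version.

```shell
if command -v singularity >/dev/null 2>&1; then
    has_singularity="yes"
    singularity --version
else
    has_singularity="no"
    echo "singularity not found on PATH" >&2
fi
```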

Test container:
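
A minimal sketch that runs one command inside a small public image. The image is an arbitrary choice for illustration, and the node needs network access to pull it; substitute an image from your own registry if the cluster is isolated.

```shell
# Arbitrary test image; replace with one your cluster can reach.
TEST_IMAGE="${TEST_IMAGE:-docker://alpine:latest}"

# Pull the image if needed, then run a single command inside the container.
singularity exec "$TEST_IMAGE" echo "Container runtime OK" \
    || echo "Could not run a container from $TEST_IMAGE" >&2
```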

Alternative runtimes:

If using Docker or Podman, discuss configuration with your Valohai contact.

Authentication failures

SSH key issues:

  • Verify the private key matches the public key on the cluster

  • Check key permissions (should be 600)

  • Ensure the key is not passphrase-protected (or provide the passphrase)

REST API issues:

  • Verify JWT token is valid and not expired

  • Check username is correct

  • Ensure API endpoint is accessible

Getting Help

Valohai Support: [email protected]

Include in support requests:

  • SLURM version (sinfo --version)

  • Configuration JSON (sanitized)

  • Error messages from Valohai or SLURM logs

  • Output of squeue and sinfo commands

  • Whether using SSH or REST API

  • Container runtime in use (Singularity, Docker, Podman)

SLURM logs:

Collect relevant logs from your SLURM cluster:
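
A minimal sketch of commands that gather job details and locate the SLURM log files. The job ID is hypothetical, sacct requires SLURM accounting to be enabled, and reading the log files themselves usually requires administrator access.

```shell
# Hypothetical job ID: replace with the ID of the failing job.
JOB_ID="${JOB_ID:-12345}"

# Current state of the job as seen by the controller (recent jobs only).
scontrol show job "$JOB_ID" || echo "scontrol could not show job $JOB_ID" >&2

# Accounting record with final state and exit code.
sacct -j "$JOB_ID" --format=JobID,JobName,State,ExitCode,Elapsed \
    || echo "sacct could not query job $JOB_ID" >&2

# Print the configured log file locations (for example SlurmctldLogFile).
scontrol show config | grep -i 'LogFile' \
    || echo "Could not read log file locations" >&2
```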
