SLURM

Connect your SLURM cluster to Valohai for high-performance computing workloads

Connect your SLURM cluster to Valohai to leverage high-performance computing (HPC) environments for machine learning workloads.

What is SLURM?

SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Connecting a SLURM cluster to Valohai lets you run your machine learning workloads as jobs on your existing HPC infrastructure.

Note: Cluster configurations vary between organizations. If you have specific requirements or need help, contact your Valohai representative.

Requirements

On the SLURM cluster side, you need:

  • Valohai agent available on the cluster

  • Either:

    • SLURM REST API (slurmrestd) available (it is not necessarily enabled by default)

    • OR SSH access to a node that can run SLURM control tools (e.g., sbatch)
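A quick way to check which option your cluster supports is to run the control tools on the node you plan to use and, if applicable, probe the REST API. The endpoint URL and credentials below are placeholders; see the configuration section for how they are obtained.

# On the node you plan to reach over SSH, confirm the SLURM control tools work
sbatch --version
sinfo

# If you plan to use the REST API, confirm that slurmrestd responds
# (placeholder endpoint and credentials)
curl -H "X-SLURM-USER-NAME: username" \
     -H "X-SLURM-USER-TOKEN: jwt_token" \
     https://your-endpoint/slurm/v0.0.39/ping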

Important considerations:

Compared to jobs on virtual machines or Kubernetes, where Valohai runs workloads in Docker or Podman containers that it manages, Valohai may have less control over the execution environment on a SLURM cluster.

Container runtime:

  • If your SLURM environment is configured for Docker or Podman, workloads remain containerized in the familiar way

  • The current implementation has been built and tested with Singularity containers

  • If you want to use another solution (e.g., Podman HPC), discuss this with your Valohai contact first

Install Valohai Agent on the Cluster

Download the Agent

Download the Valohai agent (Peon) TAR file onto the cluster. Ask your Valohai contact at [email protected] for the download link.

Extract the Agent

tar -xf peon.tar

The TAR file contains a single standard Python .whl file.

Set Up Python Environment

Set up a Python 3.9+ virtual environment in a directory accessible from the SLURM compute nodes.

You can use venv, virtualenv, or uv:

Using venv:

python3 -m venv /path/to/valohai-env

Using virtualenv:

virtualenv /path/to/valohai-env

Using uv:

uv venv /path/to/valohai-env

Choose a path that is accessible from all SLURM compute nodes (e.g., shared network storage).
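To confirm that the path is actually visible from the compute nodes, you can list it through SLURM itself; the partition name and path below are placeholders:

srun --partition=<partition_name> ls /path/to/valohai-env/bin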

Install the Agent

Install the wheel within the virtual environment:

/path/to/valohai-env/bin/pip install /path/to/peon-*.whl

Replace /path/to/ with your actual paths.

Verify Installation

Confirm the agent works:

/path/to/valohai-env/bin/valohai-peon --help

This should display the help information for the Valohai agent.

Set Up the Valohai Environment

Prepare Configuration Information

Gather the following information to provide to Valohai:

If using SSH access:

  • SSH private key to the node that can run SLURM control tools

  • SSH connection details (hostname, port, username)

If using SLURM REST API:

  • REST API endpoint URL

  • Authentication credentials (username, JWT token)
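If your cluster uses SLURM's built-in JWT authentication, a token can typically be generated on the cluster with scontrol. The lifespan below (in seconds) is only an example value:

scontrol token username=<username> lifespan=86400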

SLURM configuration:

Create a JSON configuration file with the following structure:

{
  "job": {
    "account": "<account_name>",
    "partition": "<partition_name>",
    "time_limit_minutes": <time_limit_in_minutes>
  },
  "worker": {
    "working_directory": "<working_dir_in_vm>",
    "peon_command": "<peon_command>"
  },
  "ssh": {
    "hostname": "<host_name>",
    "username": "<username>",
    "port": <ssh_port>,
    "password": "<password>",
    "allowed_host_keys": [
      "<key_name>"
    ]
  },
  "slurmrestd": {
    "username": "<username>",
    "jwt": "<jwt_token>",
    "endpoint": "<url_endpoint>",
    "api_version": "v0.0.39"
  }
}

Field descriptions (provide either the ssh section or the slurmrestd section, depending on how jobs are submitted):

job section:

  • account - SLURM account name (or null if not needed)

  • partition - SLURM partition name (or null if not needed)

  • time_limit_minutes - Time limit for jobs in minutes (or null for no limit)

worker section:

  • working_directory - Working directory path on compute nodes

  • peon_command - Full path to the valohai-peon executable (e.g., /path/to/valohai-env/bin/valohai-peon)

ssh section (provide if sending jobs via SSH):

  • hostname - SSH hostname or IP address

  • username - SSH username

  • port - SSH port (typically 22)

  • password - SSH password (or null if using key-based auth)

  • allowed_host_keys - List of accepted SSH host public keys for the target host

slurmrestd section (provide if sending jobs via REST API):

  • username - REST API username

  • jwt - JWT token for authentication

  • endpoint - REST API endpoint URL

  • api_version - API version (e.g., v0.0.39)

Example Configuration

{
  "job": {
    "account": "ml_research",
    "partition": "gpu",
    "time_limit_minutes": 720
  },
  "worker": {
    "working_directory": "/scratch/valohai",
    "peon_command": "/shared/valohai-env/bin/valohai-peon"
  },
  "ssh": {
    "hostname": "slurm-head.example.com",
    "username": "valohai",
    "port": 22,
    "password": null,
    "allowed_host_keys": [
      "ssh-rsa AAAAB3NzaC1..."
    ]
  }
}
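If you submit jobs through the REST API instead of SSH, the same configuration carries a slurmrestd section in place of the ssh section. The values below are illustrative placeholders:

{
  "job": {
    "account": "ml_research",
    "partition": "gpu",
    "time_limit_minutes": 720
  },
  "worker": {
    "working_directory": "/scratch/valohai",
    "peon_command": "/shared/valohai-env/bin/valohai-peon"
  },
  "slurmrestd": {
    "username": "valohai",
    "jwt": "<jwt_token>",
    "endpoint": "https://slurm-head.example.com:6820",
    "api_version": "v0.0.39"
  }
}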

Queue DNS Configuration

If the queue needs to be reached under a different DNS name from inside the cluster, let your Valohai contact know.

Environment Setup

Self-Hosted Valohai

If you have a self-hosted Valohai installation:

1. Create environment

Navigate to the Valohai app admin site and create a new environment with type "SLURM".

2. Add SSH private key

Upload the SSH private key that Valohai will use to connect to your SLURM cluster.

3. Add configuration

Fill in the configuration JSON and add it under "Slurm config".

4. Configure queue (optional)

If you need a different DNS name for the queue from the cluster's perspective, set it up in "Worker queue host" under "Queue Configuration".

Managed Valohai

If you use Valohai's managed service (app.valohai.com):

A Valohai engineer will create the environment for you. Provide them with:

  • SSH private key (securely)

  • Configuration JSON

  • Any custom queue DNS requirements

Verify the Setup

After the environment is configured by Valohai:

1. Log in to Valohai

  • Navigate to app.valohai.com (or your self-hosted instance)

  • Check that the SLURM environment appears in your organization

2. Run a test execution

  • Create a test project

  • Select the SLURM environment

  • Run a simple execution (a minimal valohai.yaml sketch follows after this list)

3. Monitor the job

  • Check that the job appears in your SLURM queue

  • Monitor execution logs in Valohai

  • Verify the job completes successfully

4. Verify outputs

  • Check that outputs are saved correctly

  • Verify data is accessible from Valohai
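For the test execution, a minimal valohai.yaml along these lines is usually enough. The step name, image, and commands are placeholders and are not specific to SLURM:

- step:
    name: slurm-smoke-test
    image: python:3.11-slim
    command:
      - python --version
      - echo "Hello from SLURM"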

Troubleshooting

Jobs not appearing in SLURM queue

Check SSH connection:

If using SSH, verify Valohai can connect:

ssh -i /path/to/key username@hostname

Check SLURM commands:

Verify you can run SLURM commands from the SSH node:

sinfo
squeue

Check REST API:

If using REST API, test the endpoint:

curl -H "X-SLURM-USER-NAME: username" \
     -H "X-SLURM-USER-TOKEN: jwt_token" \
     https://your-endpoint/slurm/v0.0.39/jobs

Peon command not found

Verify installation path:

Check that the peon command path in your configuration is correct:

/path/to/valohai-env/bin/valohai-peon --version

Check accessibility from compute nodes:

SSH to a compute node and verify:

/path/to/valohai-env/bin/valohai-peon --help

Jobs failing immediately

Check working directory:

Ensure the working directory exists and is writable:

ls -ld /path/to/working/directory

Check permissions:

Verify the SLURM user has access to:

  • Working directory

  • Python virtual environment

  • Any shared storage
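A quick way to test write access from an actual compute node, using the working directory from the example configuration and a placeholder partition name:

srun --partition=<partition_name> touch /scratch/valohai/.valohai_write_test
srun --partition=<partition_name> rm /scratch/valohai/.valohai_write_test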

Container runtime issues

Verify Singularity:

If using Singularity, check it's available:

singularity --version

Test container:

singularity run docker://hello-world

Alternative runtimes:

If using Docker or Podman, discuss configuration with your Valohai contact.

Authentication failures

SSH key issues:

  • Verify the private key matches the public key on the cluster

  • Check key permissions (should be 600)

  • Ensure key is not password-protected (or provide password)
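For example, to tighten the key permissions and retry the connection non-interactively:

chmod 600 /path/to/key
ssh -i /path/to/key -o BatchMode=yes username@hostname 'sinfo --version'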

REST API issues:

  • Verify JWT token is valid and not expired

  • Check username is correct

  • Ensure API endpoint is accessible

Getting Help

Valohai Support: [email protected]

Include in support requests:

  • SLURM version (sinfo --version)

  • Configuration JSON (sanitized)

  • Error messages from Valohai or SLURM logs

  • Output of squeue and sinfo commands

  • Whether using SSH or REST API

  • Container runtime in use (Singularity, Docker, Podman)

SLURM logs:

Collect relevant logs from your SLURM cluster:

# Check SLURM controller logs
sudo journalctl -u slurmctld -n 100

# Check compute node logs
sudo journalctl -u slurmd -n 100

# View specific job logs
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode
