# SLURM

Connect your SLURM cluster to Valohai to leverage high-performance computing (HPC) environments for machine learning workloads.

## What is SLURM?

[SLURM](https://slurm.schedmd.com/overview.html) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Once a cluster is connected, Valohai submits executions to it as SLURM jobs, either over SSH or through the SLURM REST API.

> **Note:** Cluster configurations vary between organizations. If you have specific requirements or need help, contact your Valohai representative.

## Requirements

**From your SLURM cluster:**

* Valohai agent available on the cluster
* Either:
  * SLURM REST API (`slurmrestd`) enabled; it is not always enabled by default
  * OR SSH access to a node that can run SLURM control tools (e.g., `sbatch`)
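
For the SSH route, one quick way to confirm that a candidate node has the SLURM control tools is to look them up on `PATH`. The following is a minimal sketch: the `have` helper is a name introduced here, and checking `squeue` and `sinfo` alongside `sbatch` is an assumption about what is useful, not a Valohai requirement.

```shell
# Hypothetical helper: succeeds if the named command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# sbatch is the tool named above; squeue and sinfo are extra checks.
for tool in sbatch squeue sinfo; do
  if have "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```

Run it on the node you intend to give Valohai SSH access to.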

**Important considerations:**

On virtual machines and Kubernetes, Valohai runs workloads in Docker or Podman containers. A SLURM environment may give Valohai less control over how jobs run.

**Container runtime:**

* If your SLURM environment is configured for Docker or Podman, executions stay containerized in the way you are used to
* The current implementation has been built and tested with **Singularity containers**
* If you want to use another solution (e.g., Podman HPC), discuss this with your Valohai contact first

## Install Valohai Agent on the Cluster

### Download the Agent

Download the Valohai agent (Peon) TAR file to the cluster. To get the download link, contact Valohai at **<support@valohai.com>**.

### Extract the Agent

```shell
tar -xf peon.tar
```

The TAR file contains a single standard Python `.whl` file.
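
If you want to see the wheel's exact filename before or after extracting, you can list the archive contents. This is a small sketch; `list_wheels` is a hypothetical helper name, and `peon.tar` is the archive name used in the command above.

```shell
# Hypothetical helper: print the names of any .whl files in a TAR archive
# without extracting it.
list_wheels() { tar -tf "$1" | grep '\.whl$'; }

# Only runs if the archive is in the current directory.
if [ -f peon.tar ]; then
  list_wheels peon.tar
fi
```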

### Set Up Python Environment

Set up a Python 3.9+ virtual environment in a directory accessible from the SLURM compute nodes.

You can use `venv`, `virtualenv`, or `uv`:

**Using venv:**

```shell
python3 -m venv /path/to/valohai-env
```

**Using virtualenv:**

```shell
virtualenv /path/to/valohai-env
```

**Using uv:**

```shell
uv venv /path/to/valohai-env
```

Whichever tool you use, the path must be reachable from every SLURM compute node, for example on shared network storage.
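
Before creating the environment, it can help to confirm that the interpreter meets the Python 3.9+ requirement. The sketch below assumes `python3` is the interpreter you plan to use and that `sort -V` (GNU coreutils) is available; `version_ge` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on `sort -V` from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v python3 >/dev/null 2>&1; then
  found=$(python3 -c 'import platform; print(platform.python_version())')
  if version_ge "$found" 3.9; then
    echo "python3 $found is new enough"
  else
    echo "python3 $found is older than 3.9"
  fi
fi
```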

### Install the Agent

Install the wheel within the virtual environment:

```shell
/path/to/valohai-env/bin/pip install /path/to/peon-*.whl
```

Replace `/path/to/` with your actual paths.

### Verify Installation

Confirm the agent works:

```shell
/path/to/valohai-env/bin/valohai-peon --help
```

This should display the help information for the Valohai agent.

## Set Up the Valohai Environment

### Prepare Configuration Information

Gather the following information to provide to Valohai:

**If using SSH access:**

* SSH private key to the node that can run SLURM control tools
* SSH connection details (hostname, port, username)

**If using SLURM REST API:**

* REST API endpoint URL
* Authentication credentials (username, JWT token)

**SLURM configuration:**

Create a JSON configuration file with the following structure:

```json
{
  "job": {
    "account": "<account_name>",
    "partition": "<partition_name>",
    "time_limit_minutes": <time_limit_in_minutes>
  },
  "worker": {
    "working_directory": "<working_dir_on_compute_nodes>",
    "peon_command": "<peon_command>"
  },
  "ssh": {
    "hostname": "<host_name>",
    "username": "<username>",
    "port": <ssh_port>,
    "password": "<password>",
    "allowed_host_keys": [
      "<key_name>"
    ]
  },
  "slurmrestd": {
    "username": "<username>",
    "jwt": "<jwt_token>",
    "endpoint": "<url_endpoint>",
    "api_version": "v0.0.39"
  }
}
```

**Field descriptions:**

**job section:**

* `account` - SLURM account name (or `null` if not needed)
* `partition` - SLURM partition name (or `null` if not needed)
* `time_limit_minutes` - Time limit for jobs in minutes (or `null` for no limit)

**worker section:**

* `working_directory` - Working directory path on compute nodes
* `peon_command` - Full path to the valohai-peon executable (e.g., `/path/to/valohai-env/bin/valohai-peon`)

**ssh section** (provide if sending jobs via SSH):

* `hostname` - SSH hostname or IP address
* `username` - SSH username
* `port` - SSH port (typically 22)
* `password` - SSH password (or `null` if using key-based auth)
* `allowed_host_keys` - List of allowed SSH host keys (the example configuration uses an `ssh-rsa ...` entry)

**slurmrestd section** (provide if sending jobs via REST API):

* `username` - REST API username
* `jwt` - JWT token for authentication
* `endpoint` - REST API endpoint URL
* `api_version` - API version (e.g., `v0.0.39`)

### Example Configuration

```json
{
  "job": {
    "account": "ml_research",
    "partition": "gpu",
    "time_limit_minutes": 720
  },
  "worker": {
    "working_directory": "/scratch/valohai",
    "peon_command": "/shared/valohai-env/bin/valohai-peon"
  },
  "ssh": {
    "hostname": "slurm-head.example.com",
    "username": "valohai",
    "port": 22,
    "password": null,
    "allowed_host_keys": [
      "ssh-rsa AAAAB3NzaC1..."
    ]
  }
}
```
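
The template earlier in this section is not valid JSON until unquoted placeholders such as `<ssh_port>` are replaced, so it is worth checking that your finished file parses. The following sketch uses Python's standard-library `json.tool` module; `is_valid_json` and the filename `slurm-config.json` are placeholders introduced here. It checks syntax only, not whether Valohai accepts the values.

```shell
# Hypothetical helper: succeeds only if the file parses as JSON.
is_valid_json() { python3 -m json.tool "$1" >/dev/null 2>&1; }

# slurm-config.json is a placeholder filename.
if [ -f slurm-config.json ]; then
  if is_valid_json slurm-config.json; then
    echo "slurm-config.json: valid JSON"
  else
    echo "slurm-config.json: not valid JSON"
  fi
fi
```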

### Queue DNS Configuration

If you need a different DNS name for the queue from the perspective of the cluster, let your Valohai contact know.

## Environment Setup

### Self-Hosted Valohai

If you have a self-hosted Valohai installation:

**1. Create environment**

Navigate to the Valohai app admin site and create a new environment with type "SLURM".

**2. Add SSH private key**

Upload the SSH private key that Valohai will use to connect to your SLURM cluster.

**3. Add configuration**

Fill in the configuration JSON and add it under "Slurm config".

**4. Configure queue (optional)**

If you need a different DNS name for the queue from the cluster's perspective, set it up in "Worker queue host" under "Queue Configuration".

### Managed Valohai

If you use Valohai's managed service (app.valohai.com), a Valohai engineer will create the environment for you. Provide them with:

* SSH private key (securely)
* Configuration JSON
* Any custom queue DNS requirements

## Verify the Setup

After the environment is configured:

**1. Log in to Valohai**

* Navigate to app.valohai.com (or your self-hosted instance)
* Check that the SLURM environment appears in your organization

**2. Run a test execution**

* Create a test project
* Select the SLURM environment
* Run a simple execution

**3. Monitor the job**

* Check that the job appears in your SLURM queue
* Monitor execution logs in Valohai
* Verify the job completes successfully

**4. Verify outputs**

* Check that outputs are saved correctly
* Verify data is accessible from Valohai

## Troubleshooting

### Jobs not appearing in SLURM queue

**Check SSH connection:**

If using SSH, verify Valohai can connect:

```shell
ssh -i /path/to/key username@hostname
```
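
Valohai connects without a person at the keyboard, so a non-interactive test is closer to what it will experience. The sketch below wraps `ssh` with `BatchMode=yes`, which makes the connection fail instead of prompting for a password or passphrase. The helper name `check_remote_sbatch` is hypothetical, and the key path and `username@hostname` are the same placeholders as above.

```shell
# Hypothetical helper: run `sbatch --version` on the remote node without
# prompting for a password or passphrase.
#   $1 = path to the private key, $2 = username@hostname
check_remote_sbatch() {
  ssh -i "$1" -o BatchMode=yes -o ConnectTimeout=10 "$2" 'sbatch --version'
}

# Example call (placeholders):
# check_remote_sbatch /path/to/key username@hostname
```

If your SSH server listens on a non-default port, add `-p <port>` to the `ssh` call.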

**Check SLURM commands:**

Verify you can run SLURM commands from the SSH node:

```shell
sinfo
squeue
```

**Check REST API:**

If using REST API, test the endpoint:

```shell
curl -H "X-SLURM-USER-NAME: username" -H "X-SLURM-USER-TOKEN: jwt_token" https://your-endpoint/slurm/v0.0.39/jobs
```

### Peon command not found

**Verify installation path:**

Check that the peon command path in your configuration is correct:

```shell
/path/to/valohai-env/bin/valohai-peon --version
```

**Check accessibility from compute nodes:**

SSH to a compute node and verify:

```shell
/path/to/valohai-env/bin/valohai-peon --help
```
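
On clusters where direct SSH to compute nodes is not allowed, running the same check through `srun` from a login node may be an alternative. This is a sketch under assumptions: `peon_on_compute_node` is a hypothetical helper, and your site may also require flags such as `--partition` or `--account` that are not shown here.

```shell
# Hypothetical helper: run the agent's --help on one compute node via srun.
#   $1 = full path to the valohai-peon executable
peon_on_compute_node() {
  srun --nodes=1 --ntasks=1 "$1" --help
}

# Example call (placeholder path):
# peon_on_compute_node /path/to/valohai-env/bin/valohai-peon
```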

### Jobs failing immediately

**Check working directory:**

Ensure the working directory exists and is writable:

```shell
ls -ld /path/to/working/directory
```

**Check permissions:**

Verify the SLURM user has access to:

* Working directory
* Python virtual environment
* Any shared storage
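
The checks above can be sketched as a small helper that you run as the SLURM user on a compute node. `check_dir` is a hypothetical name, and the paths in the example call are placeholders.

```shell
# Hypothetical helper: report whether a directory exists and whether the
# current user can both read and write it.
check_dir() {
  if [ ! -d "$1" ]; then
    echo "$1: missing"
  elif [ -r "$1" ] && [ -w "$1" ]; then
    echo "$1: ok"
  else
    echo "$1: not readable and writable"
  fi
}

# Example calls (placeholder paths):
# check_dir /path/to/working/directory
# check_dir /path/to/valohai-env
```

The virtual environment usually only needs to be readable, so a "not readable and writable" result there is not necessarily a problem.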

### Container runtime issues

**Verify Singularity:**

If using Singularity, check it's available:

```shell
singularity --version
```

**Test container:**

```shell
singularity run docker://hello-world
```

**Alternative runtimes:**

If using Docker or Podman, discuss configuration with your Valohai contact.

### Authentication failures

**SSH key issues:**

* Verify the private key matches the public key on the cluster
* Check key file permissions (should be `600`)
* Ensure the key has no passphrase (or provide the passphrase)
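
If you keep a local copy of the key, the permission check can be scripted. The sketch below assumes GNU `stat` (the `-c` flag differs on BSD and macOS); `key_mode` and `fix_key_mode` are hypothetical helper names.

```shell
# Hypothetical helper: print a file's permission bits in octal.
# `stat -c` is the GNU coreutils form.
key_mode() { stat -c '%a' "$1"; }

# Hypothetical helper: restrict a key file to owner read/write if needed.
fix_key_mode() {
  if [ "$(key_mode "$1")" != "600" ]; then
    chmod 600 "$1"
  fi
}

# Example call (placeholder path):
# fix_key_mode /path/to/key
```

To compare the key pair, `ssh-keygen -y -f /path/to/key` prints the public key derived from a private key, which you can match against the entry installed on the cluster.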

**REST API issues:**

* Verify JWT token is valid and not expired
* Check username is correct
* Ensure API endpoint is accessible
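
One way to check for an expired token is to decode its payload and read the expiry claim. The sketch below assumes a standard three-part JWT that carries an `exp` claim and that GNU `base64 -d` is available. `jwt_payload` is a hypothetical helper; it only decodes the token and does not verify its signature.

```shell
# Hypothetical helper: print the decoded payload (second segment) of a JWT.
jwt_payload() {
  # Take the middle segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example call (placeholder token):
# jwt_payload "<jwt_token>"
```

Compare the `exp` value in the output with the current time from `date +%s`; if `exp` is smaller, the token has expired.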

## Getting Help

**Valohai Support:** <support@valohai.com>

**Include in support requests:**

* SLURM version (`sinfo --version`)
* Configuration JSON (sanitized)
* Error messages from Valohai or SLURM logs
* Output of `squeue` and `sinfo` commands
* Whether using SSH or REST API
* Container runtime in use (Singularity, Docker, Podman)
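
To sanitize the configuration before sharing it, the secret values can be masked with `sed`. This is a minimal sketch that assumes one key per line, as in the example configuration, and values without escaped quotes. `redact_config` and the filenames are placeholders introduced here.

```shell
# Hypothetical helper: replace quoted "password" and "jwt" values with a
# placeholder. Reads JSON on stdin, writes the redacted copy to stdout.
redact_config() {
  sed -E 's/("(password|jwt)"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"<redacted>"/'
}

# Example call (placeholder filenames):
# redact_config < slurm-config.json > slurm-config.sanitized.json
```

Review the output before sending it, since other fields may also be sensitive in your organization.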

**SLURM logs:**

Collect relevant logs from your SLURM cluster:

```shell
# Check SLURM controller logs
sudo journalctl -u slurmctld -n 100

# Check compute node logs
sudo journalctl -u slurmd -n 100

# View the accounting record for a specific job
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode
```


