Dynamic GPU Allocation
Split GPU resources on multi-GPU machines to run more jobs concurrently. Instead of dedicating entire machines to single executions, allocate only the GPUs each job needs.
This feature is particularly effective on on-premises servers with multiple GPUs. It's not enabled by default; your organization administrator must configure it first.
When to Use Dynamic Allocation
On-premises multi-GPU servers: Run multiple 1-GPU jobs simultaneously on an 8-GPU machine instead of queuing them sequentially.
When NOT needed:
Cloud auto-scaling: Select instance types that match your needs exactly (e.g., p3.2xlarge for 1 GPU, p3.8xlarge for 4 GPUs)
Kubernetes environments: Resource allocation is handled through runtime configuration
Dynamic allocation is only available for Virtual Machine (Dispatch) environments running dispatch mode workers.
Configure GPU Allocation
Set the VH_GPUS environment variable to specify how many GPUs your execution needs:
VH_GPUS=2
Your execution will wait in the queue until 2 GPUs become available on any machine in the environment.
Set via Web UI
Add the environment variable in the execution configuration:

Set via valohai.yaml
- step:
    name: distributed-training
    image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
    command:
      - python train_distributed.py
    environment-variables:
      - name: VH_GPUS
        default: 4
Be careful with GPU requests. If you request more GPUs than any single machine has, your execution will remain queued indefinitely.
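Once the execution starts, you can sanity-check that the allocation matches what you configured. This is a minimal sketch, assuming a PyTorch image and that the container sees only the GPUs allocated to it; the VH_GPUS value comes straight from the configuration above.
import os
import torch

# VH_GPUS is the value configured for this step; "1" is only a
# fallback for the sake of this sketch.
requested = int(os.environ.get("VH_GPUS", "1"))
visible = torch.cuda.device_count()

print(f"Requested {requested} GPU(s), {visible} visible to this execution")
assert visible >= requested, "Fewer GPUs visible than requested"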
How GPU Scheduling Works
Valohai uses a first-come, first-served queue with intelligent prioritization:
Priority Rules
Small jobs run first: If two executions are queued, the one requesting fewer GPUs gets priority
Escalation after 1 hour: Executions waiting longer than 1 hour get elevated priority, preventing indefinite starvation of large multi-GPU jobs (both rules are sketched below)
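These two rules can be illustrated with a toy model. This is not Valohai's actual scheduler, and the tie-breaks within each group are assumptions; it simply orders a queue so that escalated executions come first, then smaller requests, then earlier submissions.
from dataclasses import dataclass, field
import time

ESCALATION_SECONDS = 60 * 60  # waiting longer than 1 hour elevates priority

@dataclass
class QueuedExecution:
    name: str
    gpus_requested: int
    submitted_at: float = field(default_factory=time.time)

def scheduling_order(queue, now=None):
    # Order: escalated jobs first, then fewer GPUs requested, then
    # submission time. The tie-breaks are illustrative assumptions.
    now = time.time() if now is None else now
    def priority(job):
        escalated = (now - job.submitted_at) > ESCALATION_SECONDS
        return (not escalated, job.gpus_requested, job.submitted_at)
    return sorted(queue, key=priority)
In this model, a 1-GPU job submitted after a 4-GPU job still runs first; once the 4-GPU job has waited an hour it jumps ahead, matching the mixed-workload example later on this page.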
GPU Assignment
GPUs are allocated in device index order — the same order tools like nvidia-smi display them:
# Your 2-GPU execution gets devices 0 and 1
nvidia-smi
# +-------------------------------+----------------------+----------------------+
# | NVIDIA-SMI 525.60.13   Driver Version: 525.60.13     CUDA Version: 12.0     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name       Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
# |   0  Tesla V100-SXM2      On  | 00000000:00:1E.0 Off |                    0 |  ← First
# |   1  Tesla V100-SXM2      On  | 00000000:00:1F.0 Off |                    0 |  ← Second
# |   2  Tesla V100-SXM2      On  | 00000000:00:20.0 Off |                    0 |
# |   3  Tesla V100-SXM2      On  | 00000000:00:21.0 Off |                    0 |
# +-------------------------------+----------------------+----------------------+
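You can list the same devices from inside your code to confirm what the execution received. A minimal sketch, assuming a PyTorch image; the printed indices follow the same device ordering as nvidia-smi above.
import torch

# Enumerate the allocated devices in index order (assumes PyTorch).
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
for device in devices:
    props = torch.cuda.get_device_properties(device)
    print(f"{device}: {props.name}, {props.total_memory / 2**30:.0f} GiB")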
Example Use Cases
Single-GPU Training on Multi-GPU Server
Run 8 experiments simultaneously on an 8-GPU machine:
- step:
    name: single-gpu-experiment
    image: tensorflow/tensorflow:latest-gpu
    command:
      - python train.py
    environment-variables:
      - name: VH_GPUS
        default: 1
Launch 8 executions — they'll all run in parallel instead of queuing.
Multi-GPU Distributed Training
Reserve 4 GPUs for a single distributed training job:
- step:
    name: distributed-training
    image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
    command:
      - torchrun --nproc_per_node=4 train_distributed.py
    environment-variables:
      - name: VH_GPUS
        default: 4
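On the consuming side, torchrun starts four processes, one per reserved GPU, and each process binds to its local rank. The following train_distributed.py body is a minimal sketch for illustration, not part of the Valohai configuration.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each of the 4 processes it launches,
    # one per GPU reserved via VH_GPUS.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    device = torch.device(f"cuda:{local_rank}")
    model = torch.nn.Linear(128, 10).to(device)  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # ... training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()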
Mixed Workload Scheduling
Queue both small and large jobs efficiently:
# Small job (1 GPU) - runs immediately
vh exec run quick-test --adhoc VH_GPUS=1
# Large job (4 GPUs) - waits for 4 GPUs to free up
vh exec run full-training --adhoc VH_GPUS=4
# Another small job (1 GPU) - runs before large job if submitted within 1 hour
vh exec run another-test --adhoc VH_GPUS=1
After 1 hour, the 4-GPU job escalates in priority and will run next, even if more 1-GPU jobs are queued.
Monitoring GPU Utilization
Track how effectively you're using GPU resources:
Hardware Statistics — Real-time GPU utilization during execution
Visualize Utilization — Historical GPU usage patterns
Track Underutilization — Identify over-allocated GPUs
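You can also sample utilization from inside the job and print it as JSON so Valohai can pick it up as execution metadata. A sketch assuming the nvidia-ml-py (pynvml) package is available in your image; persistently low values suggest the execution is over-allocated and could request fewer GPUs.
import json
import pynvml

pynvml.nvmlInit()
metrics = {}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    metrics[f"gpu{i}_util_percent"] = util.gpu
    metrics[f"gpu{i}_mem_used_gib"] = round(mem.used / 2**30, 2)
pynvml.nvmlShutdown()

# One JSON object per line, printed to stdout.
print(json.dumps(metrics))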
Related Topics
Tasks & Parallel Execution — Run hyperparameter sweeps with dynamic GPU allocation
Distributed Training — Coordinate multi-GPU training across executions
Team Quotas — Limit concurrent GPU usage per team