Distributed Training

Distributed Tasks split a single job across multiple machines that communicate with each other during execution.

Unlike standard Tasks (where executions run independently in parallel), Distributed Tasks enable inter-worker communication, making them suitable for workloads like distributed deep learning, reinforcement learning, and model parallelism.

What makes Distributed Tasks different?

Standard Tasks run many independent executions in parallel. Each execution trains a model, processes data, or runs an experiment without talking to other executions.

Distributed Tasks run executions (called "workers") that communicate during runtime. They share gradients, synchronize weights, or coordinate actions—treating multiple machines as one unified system.

Core use cases

Data parallel distributed training

Split your dataset across multiple machines. Each machine trains on its subset and shares gradients with others to update the model.

Why: Speed up training by parallelizing computation. A job that takes hours on a single GPU can finish in a fraction of that time when spread across multiple GPUs or machines.

Frameworks: PyTorch Distributed, Horovod, TensorFlow Distributed
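
For example, a minimal data-parallel sketch with PyTorch's built-in DistributedDataParallel might look like the following. The rank, world size, and master address are placeholders here; on Valohai they would come from the distributed configuration described under "How Distributed Tasks work" below.

```python
# Minimal data-parallel sketch with torch.distributed.
# rank, world_size, and master_addr are placeholders; on Valohai they would be
# derived from /valohai/config/distributed.json.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train(rank: int, world_size: int, master_addr: str, master_port: int = 29500):
    dist.init_process_group(
        backend="gloo",  # use "nccl" on GPU workers
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )

    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # DistributedSampler gives each worker a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()  # DDP averages gradients across workers here
            optimizer.step()

    dist.destroy_process_group()
```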

Model parallel distributed training

Split a large model across multiple machines when it doesn't fit on a single GPU or machine.

Why: Train models that are too large for a single device (e.g., large language models with billions of parameters).

Frameworks: DeepSpeed, Megatron-LM, PyTorch FSDP
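
As a sketch, sharding a model with PyTorch FSDP (one of the frameworks above) can be as simple as wrapping it. This assumes the process group has already been initialized as in the data-parallel sketch and that each worker has a GPU.

```python
# Sketch: sharding a model with PyTorch FSDP so its parameters, gradients, and
# optimizer state are split across workers instead of living on one device.
# Assumes torch.distributed.init_process_group(...) has already run and that
# each worker has a GPU.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

sharded_model = FSDP(model)  # each worker now holds only a shard of the parameters
```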

Reinforcement learning

Use one worker as the environment and others as agents. Workers coordinate actions and share experiences during training.

Why: Parallelize agent training and environment interaction for faster convergence.

Frameworks: Ray RLlib, Stable Baselines3
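
A rough sketch with Ray RLlib, assuming a Ray cluster has already been started across the workers. RLlib's configuration API changes between Ray versions, so treat the method names as illustrative rather than exact.

```python
# Sketch: joining a Ray cluster that spans the Valohai workers and training a
# PPO agent on it. Method names in RLlib's config API vary between Ray
# versions; treat this as illustrative.
import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init(address="auto")  # connect to the existing Ray cluster

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=4)  # environment interaction runs in parallel
)
algo = config.build()

for _ in range(10):
    result = algo.train()  # one training iteration; result holds episode stats
```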

How Distributed Tasks work

When you create a Distributed Task, Valohai:

  1. Creates identical executions (workers) based on your specified count

  2. Queues all workers and waits for them to come online

  3. Shares connection information once all workers are ready (IP addresses, ports, ranks)

  4. Writes configuration files to /valohai/config/distributed.json and /valohai/config/distributed.yaml on each worker

  5. Runs your code, which reads the configuration and establishes connections between workers

Each worker downloads its own inputs, saves its own outputs, and logs its own metadata—just like standard executions.
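
As a starting point, each worker can simply load that configuration file and inspect it. The field names used below (such as members) are assumptions for illustration; check the actual file on your workers for the exact schema.

```python
# Sketch: each worker loads the configuration Valohai writes before your code
# starts. The "members" field name is an assumption for illustration; inspect
# the actual file on a worker to confirm the schema.
import json

with open("/valohai/config/distributed.json") as f:
    distributed = json.load(f)

members = distributed.get("members", [])
print(f"{len(members)} workers in this Distributed Task")
print(json.dumps(distributed, indent=2))  # inspect the full structure

# From here you would derive each worker's rank, the world size, and the
# master address, then pass them to your framework's init call.
```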

Framework flexibility

Valohai Distributed Tasks don't lock you into a specific framework. You can use:

  • PyTorch Distributed with torch.distributed

  • Horovod with OpenMPI

  • TensorFlow Distributed with tf.distribute

  • DeepSpeed for large model training

  • Accelerate from Hugging Face

  • Custom frameworks that need inter-worker communication

The platform provides connection information; your code handles the framework-specific setup.
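
For example, with tf.distribute you would translate Valohai's connection information into TensorFlow's standard TF_CONFIG environment variable before creating the strategy. The worker addresses and index below are placeholders that you would fill in from distributed.json.

```python
# Sketch: wiring Valohai's connection info into tf.distribute via TF_CONFIG.
# The worker addresses and this worker's index are placeholders; in practice
# they would be read from /valohai/config/distributed.json.
import json
import os

import tensorflow as tf

worker_addresses = ["10.0.0.4:23456", "10.0.0.5:23456"]  # placeholder values
my_index = 0                                             # placeholder value

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": worker_addresses},
    "task": {"type": "worker", "index": my_index},
})

# MultiWorkerMirroredStrategy reads TF_CONFIG and connects the workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```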

Requirements

To use Distributed Tasks, you need:

  • Private workers or a full private Valohai installation (not available on shared cloud infrastructure)

  • Network configuration that allows free communication between workers in the same subnet

  • Framework-specific code to read Valohai's configuration and establish connections

Contact Valohai support to enable distributed features on your account.

💡 Tip: Check the distributed examples repository for code samples with PyTorch, TensorFlow, Horovod, and more.

When to use Distributed Tasks

Use Distributed Tasks when:

  • Your training job benefits from parallelizing across multiple machines

  • Your model is too large for a single GPU or machine

  • You're running reinforcement learning with multiple agents

  • Your workload requires inter-worker communication

Use standard Tasks when:

  • You're running independent experiments (hyperparameter tuning, benchmarking)

  • Each execution can run without talking to others

  • You want simplicity and don't need distributed frameworks
