Distributed Training
Distributed Tasks split a single job across multiple machines that communicate with each other during execution.
Unlike standard Tasks (where executions run independently in parallel), Distributed Tasks enable inter-worker communication, making them suitable for workloads like distributed deep learning, reinforcement learning, and model parallelism.
What makes Distributed Tasks different?
Standard Tasks run many independent executions in parallel. Each execution trains a model, processes data, or runs an experiment without talking to other executions.
Distributed Tasks run executions (called "workers") that communicate during runtime. They share gradients, synchronize weights, or coordinate actions—treating multiple machines as one unified system.
Core use cases
Data parallel distributed training
Split your dataset across multiple machines. Each machine trains on its subset and shares gradients with others to update the model.
Why: Speed up training by parallelizing computation. What takes hours on one GPU becomes minutes on multiple GPUs or machines.
Frameworks: PyTorch Distributed, Horovod, TensorFlow Distributed
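Below is a minimal data-parallel sketch using PyTorch's DistributedDataParallel. It assumes the standard process-group environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) have already been set on each worker, for example derived from Valohai's distributed configuration file. The tiny linear model and random dataset are placeholders for your real training code.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set in the
# environment (e.g. derived from Valohai's distributed configuration file).
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")

device = (
    torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    if torch.cuda.is_available()
    else torch.device("cpu")
)

# Toy model and dataset stand in for your real training code.
model = DDP(torch.nn.Linear(20, 2).to(device))
dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each worker a distinct shard of the dataset.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for features, labels in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(features.to(device)), labels.to(device))
        loss.backward()  # gradients are averaged across workers here
        optimizer.step()

dist.destroy_process_group()
```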
Model parallel distributed training
Split a large model across multiple machines when it doesn't fit on a single GPU or machine.
Why: Train models that are too large for a single device (e.g., large language models with billions of parameters).
Frameworks: DeepSpeed, Megatron-LM, PyTorch FSDP
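As a rough illustration of the model-parallel approach, the sketch below wraps a model with PyTorch FSDP, which shards parameters, gradients, and optimizer state across workers. It assumes GPUs are available and that the connection environment variables are already set; the toy transformer stands in for a model that would not fit on one device.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set
# from Valohai's connection information, and that each worker has a GPU.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A toy transformer stands in for a model too large for one GPU. FSDP shards
# its parameters, gradients and optimizer state across all workers.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=12
).cuda()
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```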
Reinforcement learning
Use one worker as the environment and others as agents. Workers coordinate actions and share experiences during training.
Why: Parallelize agent training and environment interaction for faster convergence.
Frameworks: Ray RLlib, Stable Baselines3
How Distributed Tasks work
When you create a Distributed Task, Valohai:
Creates identical executions (workers) based on your specified count
Queues all workers and waits for them to come online
Shares connection information once all workers are ready (IP addresses, ports, ranks)
Writes configuration files to /valohai/config/distributed.json and /valohai/config/distributed.yaml on each worker
Runs your code, which reads the configuration and establishes connections between workers
Each worker downloads its own inputs, saves its own outputs, and logs its own metadata—just like standard executions.
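A minimal sketch of reading the per-worker configuration file and turning it into the environment variables that torch.distributed (and many other frameworks) expect is shown below. The field names used here (members, rank, is_self, local_ips) are illustrative assumptions, not a documented schema; inspect /valohai/config/distributed.json on your own workers for the actual contents.

```python
import json
import os

import torch.distributed as dist

# Sketch of reading Valohai's per-worker configuration file. The field names
# below ("members", "rank", "is_self", "local_ips") are illustrative
# assumptions; check the actual file on a worker for the real schema.
with open("/valohai/config/distributed.json") as f:
    config = json.load(f)

members = config["members"]                    # assumed: one entry per worker
me = next(m for m in members if m["is_self"])  # assumed: flag marking this worker
master = members[0]                            # assumed: rank 0 acts as master

# Translate the connection info into standard environment variables.
os.environ["MASTER_ADDR"] = master["local_ips"][0]
os.environ["MASTER_PORT"] = "29500"            # any port open between workers
os.environ["RANK"] = str(me["rank"])
os.environ["WORLD_SIZE"] = str(len(members))

dist.init_process_group(backend="gloo")
print(f"Worker {dist.get_rank()} of {dist.get_world_size()} is connected.")
```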
Framework flexibility
Valohai Distributed Tasks don't lock you into a specific framework. You can use:
PyTorch Distributed with torch.distributed
Horovod with OpenMPI
TensorFlow Distributed with tf.distribute
DeepSpeed for large model training
Accelerate from Hugging Face
Custom frameworks that need inter-worker communication
The platform provides connection information; your code handles the framework-specific setup.
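For example, TensorFlow's multi-worker strategy reads cluster topology from a TF_CONFIG environment variable rather than MASTER_ADDR and RANK. The sketch below builds TF_CONFIG from a worker address list and rank; the hard-coded addresses and rank are placeholder values assumed to come from Valohai's configuration file, as in the earlier sketch.

```python
import json
import os

import tensorflow as tf

# Placeholder values; in practice derive these from
# /valohai/config/distributed.json as in the earlier sketch.
worker_addresses = ["10.0.0.4:29500", "10.0.0.5:29500"]
my_rank = 0

# tf.distribute reads the cluster topology from the TF_CONFIG variable.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": worker_addresses},
    "task": {"type": "worker", "index": my_rank},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    # Variables created here are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(20,))])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```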
Requirements
To use Distributed Tasks, you need:
Private workers or a full private Valohai installation (not available on shared cloud infrastructure)
Network configuration that allows free communication between workers in the same subnet
Framework-specific code to read Valohai's configuration and establish connections
Contact Valohai support to enable distributed features on your account.
💡 Tip: Check the distributed examples repository for code samples with PyTorch, TensorFlow, Horovod, and more.
When to use Distributed Tasks
Use Distributed Tasks when:
Your training job benefits from parallelizing across multiple machines
Your model is too large for a single GPU or machine
You're running reinforcement learning with multiple agents
Your workload requires inter-worker communication
Use standard Tasks when:
You're running independent experiments (hyperparameter tuning, benchmarking)
Each execution can run without talking to others
You want simplicity and don't need distributed frameworks