Distributed Training

Distributed Tasks split a single job across multiple machines that communicate with each other during execution.

Unlike standard Tasks (where executions run independently in parallel), Distributed Tasks enable inter-worker communication, making them suitable for workloads like distributed deep learning, reinforcement learning, and model parallelism.

What makes Distributed Tasks different?

Standard Tasks run many independent executions in parallel. Each execution trains a model, processes data, or runs an experiment without talking to other executions.

Distributed Tasks run executions (called "workers") that communicate during runtime. They share gradients, synchronize weights, or coordinate actions—treating multiple machines as one unified system.

Core use cases

Data parallel distributed training

Split your dataset across multiple machines. Each machine trains on its subset and shares gradients with others to update the model.

Why: Speed up training by parallelizing computation. A job that takes hours on a single GPU can finish in a fraction of that time when spread across multiple GPUs or machines.

Frameworks: PyTorch Distributed, Horovod, TensorFlow Distributed
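
For example, a minimal data-parallel sketch with PyTorch's built-in DistributedDataParallel might look like the following. The rank, world size, and master address are placeholders here; on Valohai they would come from the distributed configuration described under "How Distributed Tasks work" below.

```python
# Minimal data-parallel sketch with torch.distributed.
# rank, world_size, and master_addr are placeholders; on Valohai they would be
# derived from /valohai/config/distributed.json.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train(rank: int, world_size: int, master_addr: str, master_port: int = 29500):
    dist.init_process_group(
        backend="gloo",  # use "nccl" on GPU workers
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )

    model = DDP(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # DistributedSampler gives each worker a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()  # DDP averages gradients across workers here
            optimizer.step()

    dist.destroy_process_group()
```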

Model parallel distributed training

Split a large model across multiple machines when it doesn't fit on a single GPU or machine.

Why: Train models that are too large for a single device (e.g., large language models with billions of parameters).

Frameworks: DeepSpeed, Megatron-LM, PyTorch FSDP
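
As a sketch, sharding a model with PyTorch FSDP (one of the frameworks above) can be as simple as wrapping it. This assumes the process group has already been initialized as in the data-parallel sketch and that each worker has a GPU.

```python
# Sketch: sharding a model with PyTorch FSDP so its parameters, gradients, and
# optimizer state are split across workers instead of living on one device.
# Assumes torch.distributed.init_process_group(...) has already run and that
# each worker has a GPU.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

sharded_model = FSDP(model)  # each worker now holds only a shard of the parameters
```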

Reinforcement learning

Use one worker as the environment and others as agents. Workers coordinate actions and share experiences during training.

Why: Parallelize agent training and environment interaction for faster convergence.

Frameworks: Ray RLlib, Stable Baselines3
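
A rough sketch with Ray RLlib, assuming a Ray cluster has already been started across the workers. RLlib's configuration API changes between Ray versions, so treat the method names as illustrative rather than exact.

```python
# Sketch: joining a Ray cluster that spans the Valohai workers and training a
# PPO agent on it. Method names in RLlib's config API vary between Ray
# versions; treat this as illustrative.
import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init(address="auto")  # connect to the existing Ray cluster

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=4)  # environment interaction runs in parallel
)
algo = config.build()

for _ in range(10):
    result = algo.train()  # one training iteration; result holds episode stats
```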

How Distributed Tasks work

When you create a Distributed Task, Valohai:

  1. Creates identical executions (workers) based on your specified count

  2. Queues all workers and waits for them to come online

  3. Shares connection information once all workers are ready (IP addresses, ports, ranks)

  4. Writes configuration files to /valohai/config/distributed.json and /valohai/config/distributed.yaml on each worker

  5. Runs your code, which reads the configuration and establishes connections between workers

Each worker downloads its own inputs, saves its own outputs, and logs its own metadata—just like standard executions.
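
As a starting point, each worker can simply load that configuration file and inspect it. The field names used below (such as members) are assumptions for illustration; check the actual file on your workers for the exact schema.

```python
# Sketch: each worker loads the configuration Valohai writes before your code
# starts. The "members" field name is an assumption for illustration; inspect
# the actual file on a worker to confirm the schema.
import json

with open("/valohai/config/distributed.json") as f:
    distributed = json.load(f)

members = distributed.get("members", [])
print(f"{len(members)} workers in this Distributed Task")
print(json.dumps(distributed, indent=2))  # inspect the full structure

# From here you would derive each worker's rank, the world size, and the
# master address, then pass them to your framework's init call.
```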

Framework flexibility

Valohai Distributed Tasks don't lock you into a specific framework. You can use:

  • PyTorch Distributed with torch.distributed

  • Horovod with OpenMPI

  • TensorFlow Distributed with tf.distribute

  • DeepSpeed for large model training

  • Accelerate from Hugging Face

  • Custom frameworks that need inter-worker communication

The platform provides connection information; your code handles the framework-specific setup.
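
For example, with tf.distribute you would translate Valohai's connection information into TensorFlow's standard TF_CONFIG environment variable before creating the strategy. The worker addresses and index below are placeholders that you would fill in from distributed.json.

```python
# Sketch: wiring Valohai's connection info into tf.distribute via TF_CONFIG.
# The worker addresses and this worker's index are placeholders; in practice
# they would be read from /valohai/config/distributed.json.
import json
import os

import tensorflow as tf

worker_addresses = ["10.0.0.4:23456", "10.0.0.5:23456"]  # placeholder values
my_index = 0                                             # placeholder value

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": worker_addresses},
    "task": {"type": "worker", "index": my_index},
})

# MultiWorkerMirroredStrategy reads TF_CONFIG and connects the workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```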

Requirements

To use Distributed Tasks, you need:

  • Private workers or a full private Valohai installation (not available on shared cloud infrastructure)

  • Network configuration that allows free communication between workers in the same subnet

  • Framework-specific code to read Valohai's configuration and establish connections

Contact Valohai support to enable distributed features on your account.

💡 Tip: Check the distributed examples repository for code samples with PyTorch, TensorFlow, Horovod, and more.

When to use Distributed Tasks

Use Distributed Tasks when:

  • Your training job benefits from parallelizing across multiple machines

  • Your model is too large for a single GPU or machine

  • You're running reinforcement learning with multiple agents

  • Your workload requires inter-worker communication

Use standard Tasks when:

  • You're running independent experiments (hyperparameter tuning, benchmarking)

  • Each execution can run without talking to others

  • You want simplicity and don't need distributed frameworks
