# Distributed Training

Distributed Tasks split a single job across multiple machines that communicate with each other during execution.

Unlike standard Tasks (where executions run independently in parallel), Distributed Tasks enable **inter-worker communication**, making them suitable for workloads like distributed deep learning, reinforcement learning, and model parallelism.

### What makes Distributed Tasks different?

**Standard Tasks** run many independent executions in parallel. Each execution trains a model, processes data, or runs an experiment without talking to other executions.

**Distributed Tasks** run executions (called "workers") that communicate during runtime. They share gradients, synchronize weights, or coordinate actions—treating multiple machines as one unified system.

### Core use cases

#### Data parallel distributed training

Split your dataset across multiple machines. Each machine trains on its subset and shares gradients with others to update the model.

**Why:** Speed up training by parallelizing computation. What takes hours on one GPU becomes minutes on multiple GPUs or machines.

**Frameworks:** PyTorch Distributed, Horovod, TensorFlow Distributed

#### Model parallel distributed training

Split a large model across multiple machines when it doesn't fit on a single GPU or machine.

**Why:** Train models that are too large for a single device (e.g., large language models with billions of parameters).

**Frameworks:** DeepSpeed, Megatron-LM, PyTorch FSDP

#### Reinforcement learning

Use one worker as the environment and others as agents. Workers coordinate actions and share experiences during training.

**Why:** Parallelize agent training and environment interaction for faster convergence.

**Frameworks:** Ray RLlib, Stable Baselines3

### How Distributed Tasks work

When you create a Distributed Task, Valohai:

1. **Creates identical executions** (workers) based on your specified count
2. **Queues all workers** and waits for them to come online
3. **Shares connection information** once all workers are ready (IP addresses, ports, ranks)
4. **Writes configuration files** to `/valohai/config/distributed.json` and `/valohai/config/distributed.yaml` on each worker
5. **Runs your code**, which reads the configuration and establishes connections between workers

Each worker downloads its own inputs, saves its own outputs, and logs its own metadata—just like standard executions.

### Framework flexibility

Valohai Distributed Tasks don't lock you into a specific framework. You can use:

* **PyTorch Distributed** with `torch.distributed`
* **Horovod** with OpenMPI
* **TensorFlow Distributed** with `tf.distribute`
* **DeepSpeed** for large model training
* **Accelerate** from Hugging Face
* **Custom frameworks** that need inter-worker communication

The platform provides connection information; your code handles the framework-specific setup.

### Requirements

To use Distributed Tasks, you need:

* **Private workers** or a full private Valohai installation (not available on shared cloud infrastructure)
* **Network configuration** that allows free communication between workers in the same subnet
* **Framework-specific code** to read Valohai's configuration and establish connections

Contact Valohai support to enable distributed features on your account.

> 💡 **Tip:** Check the [distributed examples repository](https://github.com/valohai/distributed-examples) for code samples with PyTorch, TensorFlow, Horovod, and more.

### When to use Distributed Tasks

**Use Distributed Tasks when:**

* Your training job benefits from parallelizing across multiple machines
* Your model is too large for a single GPU or machine
* You're running reinforcement learning with multiple agents
* Your workload requires inter-worker communication

**Use standard Tasks when:**

* You're running independent experiments (hyperparameter tuning, benchmarking)
* Each execution can run without talking to others
* You want simplicity and don't need distributed frameworks

### Next steps

* [Running Distributed Tasks](/distributed-training/distributed-tasks.md)
* [Distributed Configuration Files](/distributed-training/distributed-config.md)
* [Distributed Examples Repository](https://github.com/valohai/distributed-examples)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.valohai.com/distributed-training.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
