Use distributed tasks in Valohai to distribute a single job across multiple machines.
Distributed tasks are collections of executions that can, and are expected to, communicate with each other to accomplish their designated work. Distributed tasks are frequently used for:
- Data parallel distributed training, by running different hyperparameter sets on separate machines, each using a different subset of your dataset, and then sharing and combining the results according to logic you define.
- Model parallel distributed training, by splitting the model across separate machines, which makes it possible to train models that cannot, for one reason or another, be trained on a single machine.
- Reinforcement learning, by using one of the workers as the action environment and the rest as intelligent agents, with the added limitation that the total number of workers stays constant.
Valohai distributed tasks don’t limit you to these use cases or to a particular framework. They can be used for any job where inter-worker communication is a key requirement. However, by far the most common use case is data parallel distributed training to speed up hyperparameter optimization.
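As a rough illustration of the data parallel pattern, the sketch below splits a dataset into disjoint subsets, one per worker, and combines per-worker results with a simple rule. It is framework-agnostic and does not use any Valohai API: `shard_for_worker`, `combine_results`, `rank`, and `world_size` are hypothetical names chosen for this example, and how each worker learns its own rank and the total worker count is not shown here.

```python
from statistics import mean


def shard_for_worker(items, rank, world_size):
    """Return the subset of `items` that worker number `rank` should process.

    `rank` and `world_size` are assumed inputs: the zero-based index of this
    worker and the total number of workers in the distributed task.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided slicing gives every worker a disjoint, near-equal subset.
    return items[rank::world_size]


def combine_results(per_worker_scores):
    """Combine per-worker results; this sketch simply averages them."""
    return mean(per_worker_scores)


if __name__ == "__main__":
    dataset = list(range(10))
    world_size = 3  # assumed number of workers
    for rank in range(world_size):
        print(rank, shard_for_worker(dataset, rank, world_size))
```

Averaging is only one possible way to combine results; in practice the combining logic depends on the training setup and on what the workers exchange with each other.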
See examples in the repository
This section of the documentation covers the basics of using distributed tasks. You can find concrete code examples at https://github.com/valohai/distributed-examples