When a Valohai Task is created as a Distributed one, the following happens:
- Valohai creates and queues as many identical executions as described by the task.
- Each of these executions will wait for the rest the group to become online.
- After the whole group is ready, the members will share connection information between each other. This describes such things as local network addresses or how to connect to each other.
- This collective communication information is written to
/valohai/config/distributed.json
and/valohai/config/distributed.yaml
on each worker (instance). - Project code is expected to read this information and establish connections as their tooling needs.
- Then each execution will keep on running until:
- somebody manually stops the execution or the task
- the execution code quits
- the task automatically stops the execution e.g. because of a task “On Error” rule like “if one worker gets errors, stop all workers”
Each execution in a distributed task has their separate input downloading, output uploading and metadata recording like normal. It’s common to gather and upload final results from one of the workers (like the master discussed earlier) but this depends on the workload being ran.
Using the valohai-utils helper library
The publicly available valohai-utils
Python package contains helpers under valohai.distributed
to use the shared information and to assign unique ranks to workers, but you are also free to parse and use the configuration files how you see fit.
One of the workers will be marked as the master for convenience if valohai-utils
distributed helpers are used but that abstraction is optional. It’s simply the first worker that announced itself.