Two configuration files are generated during distributed task executions: /valohai/config/distributed.json and /valohai/config/distributed.yaml. Both files contain identical information, provided in two different formats for convenience. These configuration files are only present in executions that are part of a distributed task.
The publicly available valohai-utils Python package includes helpers under valohai.distributed to facilitate the use of distributed task configurations. However, you also have the freedom to parse and utilize the configuration files on your own.
If you opt to interpret the configuration manually, here are brief descriptions of the key values:
config.group_name: The distributed group name, usually derived from the task identifier.
config.member_id: An identifier for the member running on the machine where this configuration is read.
config.required_count: The number of workers expected to be in this group.
members: A list of all the members, which are the individual executions.
member.announce_time: The timestamp when the member joined the group.
member.identity: The machine identifier of the member; its form depends on the infrastructure used.
member.job_id: The execution identifier of the member, used for queuing.
member.member_id: A unique member identifier. It is an arbitrary string, often a simple number in string form, such as "0".
member.network.exposed_ports: A mapping from host port to container port for the ports that are exposed. If all ports are exposed, e.g., by VH_DOCKER_NETWORK=host, this could be empty.
member.network.local_ips: A list of known local IP addresses to access this member.
member.network.public_ips: A list of known public IP addresses to access this member, if available.
self: A duplicate helper object that represents the member object of the currently running machine.
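As a minimal sketch of manual parsing, the JSON variant can be read with the Python standard library alone. The function names load_distributed_config and describe are hypothetical and not part of valohai-utils; the only things taken from the description above are the file path and the key names. Because the file exists only in executions that belong to a distributed task, the loader returns None when it is missing.

```python
import json
from pathlib import Path
from typing import Optional

# Path as given in the description above; overridable so the code can be tried locally.
DEFAULT_CONFIG_PATH = Path("/valohai/config/distributed.json")


def load_distributed_config(path: Path = DEFAULT_CONFIG_PATH) -> Optional[dict]:
    """Return the parsed configuration, or None when the file is absent.

    Hypothetical helper: the file is only present in executions that are
    part of a distributed task, so a missing file is not treated as an error.
    """
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def describe(config: dict) -> str:
    """Summarize this member using only the keys listed above."""
    group = config["config"]
    me = config["self"]
    return (
        f"member {group['member_id']} of {group['required_count']} "
        f"in {group['group_name']}, local IPs: {me['network']['local_ips']}"
    )


if __name__ == "__main__":
    cfg = load_distributed_config()
    print(describe(cfg) if cfg is not None else "not part of a distributed task")
```

Returning None instead of raising keeps the same script usable in both regular and distributed executions.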
Example config file
{
  "config": {
    "group_name": "task-0180f5a9-9ffe-4e09-d5a7-9a0a507019d4",
    "member_id": "0",
    "required_count": 3
  },
  "members": [
    {
      "announce_time": "2022-05-24T10:42:57",
      "identity": "happy-yjaqaqlx",
      "job_id": "exec-0180f5a9-a002-45a0-f0e6-8e98720eeaad",
      "member_id": "0",
      "network": {
        "exposed_ports": {
          "1234": "1234"
        },
        "local_ips": [
          "10.0.16.61"
        ],
        "public_ips": [
          "34.121.32.110"
        ]
      }
    },
    {
      "announce_time": "2022-05-24T10:42:58",
      "identity": "happy-kwfncqxe",
      "job_id": "exec-0180f5a9-a007-633b-8af3-e11593482653",
      "member_id": "2",
      "network": {
        "exposed_ports": {
          "1234": "1234"
        },
        "local_ips": [
          "10.0.16.60"
        ],
        "public_ips": [
          "34.134.18.149"
        ]
      }
    },
    {
      "announce_time": "2022-05-24T10:42:57",
      "identity": "happy-tcuaezxm",
      "job_id": "exec-0180f5a9-a005-f2ef-693a-3b4c4c115ed8",
      "member_id": "1",
      "network": {
        "exposed_ports": {
          "1234": "1234"
        },
        "local_ips": [
          "10.0.16.59"
        ],
        "public_ips": [
          "35.194.55.255"
        ]
      }
    }
  ],
  "self": {
    "announce_time": "2022-05-24T10:42:57",
    "identity": "happy-yjaqaqlx",
    "job_id": "exec-0180f5a9-a002-45a0-f0e6-8e98720eeaad",
    "member_id": "0",
    "network": {
      "exposed_ports": {
        "1234": "1234"
      },
      "local_ips": [
        "10.0.16.61"
      ],
      "public_ips": [
        "34.121.32.110"
      ]
    }
  }
}
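One possible use of a configuration like the example is to derive a stable rank for the current member and an address that all members agree on as a coordinator. The sketch below rests on several assumptions that the configuration format itself does not state: member_id values are numeric strings (as in the example), the member with the lowest member_id acts as coordinator, and its first local IP is reachable from the other members. The function name rank_and_coordinator is hypothetical.

```python
from typing import Tuple


def rank_and_coordinator(config: dict) -> Tuple[int, str]:
    """Return (rank of this member, first local IP of the coordinator).

    Assumptions, not guarantees of the format: member_id parses as an
    integer, and the lowest member_id is chosen as the coordinator.
    """
    # The members list is not necessarily ordered (the example lists 0, 2, 1),
    # so sort it to get the same ordering on every machine.
    ordered = sorted(config["members"], key=lambda m: int(m["member_id"]))
    ordered_ids = [m["member_id"] for m in ordered]

    # The "self" object duplicates this machine's entry in "members".
    rank = ordered_ids.index(config["self"]["member_id"])

    # Use the coordinator's first known local IP address.
    coordinator_ip = ordered[0]["network"]["local_ips"][0]
    return rank, coordinator_ip
```

With the example configuration, self has member_id "0", so this member would get rank 0 and the coordinator address would be its own local IP, 10.0.16.61. If member identifiers are not numeric in your environment, a different ordering key would be needed.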