Valohai automatically tracks the utilization of your hardware resources:
- Real-Time Monitoring: Track your CPU, memory, and GPU usage in real time.
- Automatic Visualizations: Visualize your resource utilization in the Valohai UI.
- Historical Data Access: Access and further analyze historical data to identify trends and optimize resource allocation.
- Alerts and Notifications: Receive alerts and notifications for resource usage.
  - Alerts are displayed on the execution page and are less intrusive.
  - Notifications are more configurable and can be sent as in-app, email, or Slack messages.
Utilization summary as displayed on the execution page; click the ▼ to expand.
Technical Overview
Valohai records the following hardware metrics out-of-the-box:
- CPU: Utilization of the CPU.
- Memory: Utilization of the memory.
- GPU Processor: Utilization of the GPU computational resources.
- GPU Memory: Utilization of the GPU memory resources.
The usage metrics are collected from the host machine where the execution is running.
These statistics are collected at 2-minute intervals by default; each aggregate statistics entry includes the average, maximum, and minimum values for each metric over that span.
Real-time statistics are of finer granularity.
The runtime environment determines which of these metrics are available. For example, if the execution is running on a CPU-only machine, the GPU metrics will not be available.
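The aggregation described above can be illustrated with a minimal Python sketch. It is not Valohai's actual implementation: the `aggregate` function, the `time` key, and the shape of the raw samples are assumptions made for illustration. It collapses a window of raw samples into one entry with minimum, maximum, and average values, and skips any metric that is not present in every sample, as GPU metrics would be on a CPU-only machine.

```python
from statistics import mean


def aggregate(samples):
    """Collapse raw samples into one min/max/avg entry (hypothetical helper).

    Each sample is assumed to be a dict with a "time" key (Unix seconds)
    plus one key per metric that was available when it was taken.
    """
    if not samples:
        raise ValueError("need at least one sample")

    # Keep only metrics present in every sample; GPU keys, for instance,
    # would simply be missing on a CPU-only machine.
    keys = set(samples[0]) - {"time"}
    for sample in samples[1:]:
        keys &= set(sample)

    entry = {
        "start_time": min(s["time"] for s in samples),
        "end_time": max(s["time"] for s in samples),
        "n_entries": len(samples),
        "min": {},
        "max": {},
        "avg": {},
    }
    for key in sorted(keys):
        values = [s[key] for s in samples]
        entry["min"][key] = min(values)
        entry["max"][key] = max(values)
        entry["avg"][key] = mean(values)
    return entry


# Made-up samples from a CPU-only machine, two seconds apart.
samples = [
    {"time": 0.0, "cpu_usage": 0.2, "memory_usage_kb": 1000},
    {"time": 2.0, "cpu_usage": 0.4, "memory_usage_kb": 3000},
    {"time": 4.0, "cpu_usage": 0.6, "memory_usage_kb": 2000},
]
entry = aggregate(samples)
print(entry["n_entries"], entry["max"]["memory_usage_kb"])
```

Because the three summary values are computed per metric, a short spike in GPU memory stays visible in `max` even when `avg` over the window looks moderate.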
The statistics entries look something like this:
{
"version": 2,
"start_time": 1715682773.4884605,
"end_time": 1715682894.42783,
"n_entries": 61,
"min": {
"cpu_usage": 0.00001407035175879397,
"num_cpus": 20,
"memory_usage_kb": 9632,
"memory_total_kb": 65272740,
"network_rx_kb": 1330,
"network_tx_kb": 27,
"num_gpus": 1,
"gpu_usage": 0.10007844033914715,
"gpu_memory_usage_kb": 808730,
"gpu_memory_total_kb": 8072192
},
"max": {
"cpu_usage": 0.00006872157655381506,
"num_cpus": 20,
"memory_usage_kb": 9632,
"memory_total_kb": 65272740,
"network_rx_kb": 1332,
"network_tx_kb": 27,
"num_gpus": 1,
"gpu_usage": 0.8991820020962256,
"gpu_memory_usage_kb": 7260324,
"gpu_memory_total_kb": 8072192
},
"avg": {
"cpu_usage": 0.00004246493545678705,
"num_cpus": 20,
"memory_usage_kb": 9632,
"memory_total_kb": 65272740,
"network_rx_kb": 1331,
"network_tx_kb": 27,
"num_gpus": 1,
"gpu_usage": 0.4994439844256729,
"gpu_memory_usage_kb": 3971050.295081967,
"gpu_memory_total_kb": 8072192
}
}
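As a rough illustration of how such an entry could be read, the following Python sketch derives a few summary figures from it. The `summarize` helper is hypothetical, and treating `cpu_usage` and `gpu_usage` as fractions between 0 and 1 is an assumption inferred from the sample values, not a documented guarantee. The entry is trimmed to the fields the sketch uses.

```python
def summarize(entry):
    """Derive a few headline numbers from one statistics entry (hypothetical helper)."""
    duration_s = entry["end_time"] - entry["start_time"]
    avg, peak = entry["avg"], entry["max"]
    summary = {
        "duration_s": duration_s,
        # n_entries samples span n_entries - 1 gaps.
        "approx_sample_interval_s": duration_s / max(entry["n_entries"] - 1, 1),
        "avg_memory_fraction": avg["memory_usage_kb"] / avg["memory_total_kb"],
    }
    # GPU fields are only present when the machine has a GPU.
    if avg.get("num_gpus"):
        summary["avg_gpu_usage"] = avg["gpu_usage"]
        summary["peak_gpu_memory_fraction"] = (
            peak["gpu_memory_usage_kb"] / peak["gpu_memory_total_kb"]
        )
    return summary


# Values copied from the example entry above, trimmed for brevity.
entry = {
    "version": 2,
    "start_time": 1715682773.4884605,
    "end_time": 1715682894.42783,
    "n_entries": 61,
    "max": {
        "num_gpus": 1,
        "gpu_memory_usage_kb": 7260324,
        "gpu_memory_total_kb": 8072192,
    },
    "avg": {
        "memory_usage_kb": 9632,
        "memory_total_kb": 65272740,
        "num_gpus": 1,
        "gpu_usage": 0.4994439844256729,
    },
}
summary = summarize(entry)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For this entry the window spans roughly two minutes, which matches the default collection interval, and the 61 samples suggest a raw sampling period of about two seconds.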