Hardware Statistics

Automatically track CPU, memory, and GPU utilization to optimize resource allocation and reduce costs.

Why hardware statistics matter

GPU hours are expensive. If you're requesting 8 GPUs but only using 40% of their capacity, you're burning money on idle hardware.

Hardware statistics give you real-time visibility into whether you're actually using the resources you're requesting. This lets data scientists optimize their compute configurations and helps platform engineers identify systemic inefficiencies.

What Valohai tracks

Valohai records the following metrics out of the box for every execution:

  • CPU: Utilization of the processor.

  • Memory: Utilization of system memory.

  • GPU Processor: Utilization of GPU computational resources.

  • GPU Memory: Utilization of GPU memory resources.

The availability of these metrics depends on the runtime environment. For example, executions on CPU-only machines won't have GPU metrics.

Workload resource utilization as shown on the execution page. Click the ▼ to expand.

How statistics are collected

Usage metrics are collected from the host machine where the execution runs.

Default collection interval: 2 minutes.

Each statistics entry includes average, maximum, and minimum values for each metric over that interval.

Real-time statistics are collected at finer granularity for live monitoring during execution.
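To make the relationship between raw samples and a statistics entry concrete, here is a minimal sketch of reducing a list of samples to minimum, maximum, and average values. The `aggregate` function and the sample values are hypothetical and for illustration only; this is not Valohai's collector code.

```python
from statistics import fmean


def aggregate(samples: list[dict]) -> dict:
    """Reduce raw samples to one entry with min, max, and avg per metric.

    Hypothetical helper: assumes every sample carries the same metric keys.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    keys = list(samples[0].keys())
    return {
        "n_entries": len(samples),
        "min": {k: min(s[k] for s in samples) for k in keys},
        "max": {k: max(s[k] for s in samples) for k in keys},
        "avg": {k: fmean(s[k] for s in samples) for k in keys},
    }


# Made-up samples, expressed as fractions between 0 and 1.
samples = [
    {"cpu_usage": 0.20, "gpu_usage": 0.10},
    {"cpu_usage": 0.40, "gpu_usage": 0.90},
    {"cpu_usage": 0.60, "gpu_usage": 0.50},
]

entry = aggregate(samples)
print(entry)
```

A short burst of high utilization shows up in the maximum even when the average stays low, which is why an entry keeps all three values.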

Statistics data structure

{
  "version": 2,
  "start_time": 1715682773.4884605,
  "end_time": 1715682894.42783,
  "n_entries": 61,
  "min": {
    "cpu_usage": 0.00001407035175879397,
    "num_cpus": 20,
    "memory_usage_kb": 9632,
    "memory_total_kb": 65272740,
    "network_rx_kb": 1330,
    "network_tx_kb": 27,
    "num_gpus": 1,
    "gpu_usage": 0.10007844033914715,
    "gpu_memory_usage_kb": 808730,
    "gpu_memory_total_kb": 8072192
  },
  "max": {
    "cpu_usage": 0.00006872157655381506,
    "num_cpus": 20,
    "memory_usage_kb": 9632,
    "memory_total_kb": 65272740,
    "network_rx_kb": 1332,
    "network_tx_kb": 27,
    "num_gpus": 1,
    "gpu_usage": 0.8991820020962256,
    "gpu_memory_usage_kb": 7260324,
    "gpu_memory_total_kb": 8072192
  },
  "avg": {
    "cpu_usage": 0.00004246493545678705,
    "num_cpus": 20,
    "memory_usage_kb": 9632,
    "memory_total_kb": 65272740,
    "network_rx_kb": 1331,
    "network_tx_kb": 27,
    "num_gpus": 1,
    "gpu_usage": 0.4994439844256729,
    "gpu_memory_usage_kb": 3971050.295081967,
    "gpu_memory_total_kb": 8072192
  }
}
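The following sketch shows one way to read an entry like the one above. It assumes that `cpu_usage` and `gpu_usage` are fractions between 0 and 1 (the sample values are consistent with that, but the format is not formally specified here) and that samples are evenly spaced across the interval. The `summarize` function and its output keys are hypothetical names, and the entry is trimmed to the fields the sketch uses.

```python
import json

# Trimmed copy of the sample entry; only the fields used below are kept.
RAW_ENTRY = """
{
  "version": 2,
  "start_time": 1715682773.4884605,
  "end_time": 1715682894.42783,
  "n_entries": 61,
  "max": {
    "gpu_usage": 0.8991820020962256,
    "gpu_memory_usage_kb": 7260324,
    "gpu_memory_total_kb": 8072192
  },
  "avg": {
    "gpu_usage": 0.4994439844256729,
    "gpu_memory_usage_kb": 3971050.295081967,
    "gpu_memory_total_kb": 8072192
  }
}
"""


def summarize(entry: dict) -> dict:
    """Turn one statistics entry into a few human-readable numbers."""
    duration = entry["end_time"] - entry["start_time"]
    # Assumption: samples are evenly spaced, so N samples span N - 1 gaps.
    gaps = max(entry["n_entries"] - 1, 1)
    peak = entry["max"]
    return {
        "duration_s": round(duration, 1),
        "approx_sample_period_s": round(duration / gaps, 2),
        # Assumption: gpu_usage is a fraction in [0, 1].
        "avg_gpu_pct": round(entry["avg"]["gpu_usage"] * 100, 1),
        "peak_gpu_memory_pct": round(
            peak["gpu_memory_usage_kb"] / peak["gpu_memory_total_kb"] * 100, 1
        ),
    }


summary = summarize(json.loads(RAW_ENTRY))
print(summary)
```

For this entry, the interval spans roughly two minutes, which matches the default collection interval.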

Features

Real-time monitoring

Track your CPU, memory, and GPU utilization in real-time while executions are running.

Automatic visualizations

Visualize your resource utilization directly in the Valohai UI without additional configuration.

Historical data access

Access and analyze historical data to identify trends and optimize resource allocation over time.

Alerts and notifications

Receive notifications when resources are underutilized:

  • Alerts are displayed on the execution page and are the less intrusive option.

  • Notifications are more configurable and can be delivered in-app, by email, or as Slack messages.
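If you export statistics and want to run a similar check yourself, an underutilization test could look roughly like the sketch below. This is not Valohai's alerting logic: the `underutilized_metrics` function and the threshold values are hypothetical, and the sketch assumes that usage values are fractions between 0 and 1, with field names following the statistics data structure.

```python
def underutilized_metrics(avg: dict, threshold: float = 0.25) -> list[str]:
    """Return the resources whose average utilization is below `threshold`.

    Hypothetical helper; the threshold is an arbitrary illustration.
    """
    flagged = []
    if avg["cpu_usage"] < threshold:
        flagged.append("cpu_usage")
    if avg["memory_usage_kb"] / avg["memory_total_kb"] < threshold:
        flagged.append("memory")
    # GPU metrics are absent on CPU-only machines, so check before reading.
    if avg.get("num_gpus", 0) > 0:
        if avg["gpu_usage"] < threshold:
            flagged.append("gpu_usage")
        if avg["gpu_memory_usage_kb"] / avg["gpu_memory_total_kb"] < threshold:
            flagged.append("gpu_memory")
    return flagged


# The "avg" block of a statistics entry.
avg = {
    "cpu_usage": 0.00004246493545678705,
    "memory_usage_kb": 9632,
    "memory_total_kb": 65272740,
    "num_gpus": 1,
    "gpu_usage": 0.4994439844256729,
    "gpu_memory_usage_kb": 3971050.295081967,
    "gpu_memory_total_kb": 8072192,
}

print(underutilized_metrics(avg))
print(underutilized_metrics(avg, threshold=0.5))
```

Checking averages alone can hide short bursts of heavy use, so a stricter check might also compare the maximum values before flagging a resource.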

What's next

Now that you understand what Valohai tracks, learn how to:

  • Visualize utilization to interpret resource usage charts.

  • Track underutilization to identify and fix inefficient resource allocation.
