
Load Average versus CPU Utilization explained

We have customers that we support on both Unix- and Windows-based systems, and we like to put metrics from those systems into our Cacti monitoring system, especially performance values. Here is an explanation I provided to a customer after we recently deployed a Linux-based system for their MySQL database alongside their ASP.NET-based web app:

On Unix-based systems, the Load Average metric is based on the number of processes that are either running on a CPU or sitting in the run queue waiting for the kernel to give them CPU time.

Ideally, you want your hardware to be powerful enough that your Load Average stays below 1.0, meaning that when a process asks for CPU time it gets it without having to wait. It's called an average because the kernel reports these figures averaged over the past 1 minute, the past 5 minutes and the past 15 minutes, and that is what you are seeing in that graph.
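
Those three numbers come straight from the kernel. On Linux they are exposed in /proc/loadavg, and Python's os.getloadavg() reads them on any Unix-like system; the short sketch below is just an illustration (the load_averages() helper is a made-up name, not something from our Cacti setup):

import os

def load_averages():
    """Return the 1-, 5- and 15-minute load averages as floats."""
    try:
        return os.getloadavg()                       # available on Unix-like systems
    except (AttributeError, OSError):
        # Linux fallback: /proc/loadavg looks like "0.42 0.35 0.30 1/213 4567"
        with open("/proc/loadavg") as f:
            one, five, fifteen = f.read().split()[:3]
        return float(one), float(five), float(fifteen)

if __name__ == "__main__":
    one, five, fifteen = load_averages()
    print(f"load average: {one:.2f} (1m)  {five:.2f} (5m)  {fifteen:.2f} (15m)")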

This is a completely different perspective from the graph of percent CPU utilization that we have set up for the Windows system.

It is possible to have 100% CPU utilization on a system but a load average of only about 1.0. In this scenario there is a single process asking for CPU time, and it is using all the CPU it can get. The Load Average stays around 1.0 or less because there aren't other threads/processes that need the CPU as well during that period.
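
To make the contrast concrete, here is a rough sketch of how a percent-utilization number can be derived on Linux, by sampling the aggregate counters in /proc/stat twice and measuring how much of the elapsed time was not idle. This is only an illustration of the idea (cpu_utilization_percent() is an invented helper), not how Cacti collects the value for our graphs:

import time

def read_cpu_times():
    # First line of /proc/stat is the aggregate "cpu" row:
    # user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def cpu_utilization_percent(interval=1.0):
    t1 = read_cpu_times()
    time.sleep(interval)
    t2 = read_cpu_times()
    deltas = [b - a for a, b in zip(t1, t2)]
    total = sum(deltas)
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)   # idle + iowait
    return 100.0 * (total - idle) / total if total else 0.0

if __name__ == "__main__":
    print(f"CPU utilization over the last second: {cpu_utilization_percent():.1f}%")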

The more CPUs a system has, the more processes can have concurrent access to CPU time before anything has to queue, so the Load Average can climb to roughly the number of CPUs before processes start waiting, even if each of those processes is maxing out its own CPU.
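
Because of that, a handy rule of thumb is to look at load per CPU rather than the raw number. A minimal sketch, assuming Python's os.cpu_count() reflects the CPUs the scheduler can actually use:

import os

cpus = os.cpu_count() or 1            # logical CPUs; hyper-threads count too
one_min = os.getloadavg()[0]          # 1-minute load average
print(f"1-minute load {one_min:.2f} / {cpus} CPUs = {one_min / cpus:.2f} per CPU")

A per-CPU value creeping above 1.0 is the point where processes start having to wait for a CPU.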

With more powerful CPUs, a process completes its task more quickly before the next task gets its turn, so the “run queue” empties faster, which keeps the Load Average lower.
