BigV load averages


#1

I’ve set up a 16-core BigV VM to stress test a web server. uptime gives load averages of over 18. I understood the load average had a maximum of 1.0 per processor? If that’s not the case, can anyone tell me what the upper limit of the load average actually is?

$ uptime
12:03:53 up 4:16, 1 user, load average: 18.16, 18.27, 17.99

Cheers
John


#2

Hi John,

There is no upper limit on your load average. As I understand it, the load average is a moving average of the load number, and the load number is the number of runnable processes at a given point in time.

Your kernel has to share the CPU cores between all of the processes running on that machine. Processes spend a lot of their time blocked, waiting for some event to happen, and the kernel only gives CPU time to the processes that need it, so if I understand correctly the number of runnable processes is the number of processes that actually need time on the CPU right now. It makes sense to keep that number below 1.0 per core if you want to make sure nothing ever has to wait long for CPU time, but if you want to squeeze as many clock cycles out of a machine as possible I would personally aim for (2 * number of CPUs) + 1.
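
If you want to see the raw numbers the kernel works from, something like this should show them on a Linux guest (the second command is only a rough check, since it counts processes rather than individual threads):

# the three load averages, plus currently-runnable tasks / total tasks
$ cat /proc/loadavg

# rough count of tasks that are running/runnable (R) or in
# uninterruptible sleep (D) - on Linux both states count towards
# the load average
$ ps -eo stat= | grep -c '^[RD]'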

If the end goal is stress testing, I would concentrate on the metrics that users actually see (response times, requests per second, errors) rather than an internal metric like the load average; internal metrics are much more useful when you’re trying to understand what’s going on under the hood. The last time I wanted to stress test a web server I used siege - it’s not perfect, but it’s easy to use.
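
For what it’s worth, a basic siege run looks something like this - the URL and the numbers are just placeholders, so point it at your real entry page and pick a concurrency close to the traffic you expect:

# 50 concurrent simulated users for one minute; -b (benchmark mode)
# removes the default pause between requests
$ siege -b -c 50 -t 1M http://localhost/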

Cheers


#3

Ah, thanks for the explanation. So would a load average of at least 1.0/core (assuming processes are distributed evenly across the cores) mean 100% CPU usage? I’m hoping so because that’s how I’ve explained it to my customer 🙂

Cheers


#4

Ignoring details like hyper-threading and caching then yes: a load of 1.0/core means there is, on average, one runnable process per core over the time-frame of that average. You might get more throughput at a load of around (2 * number of CPUs) + 1, which I know the Gentoo wiki used to recommend for minimising compile times because it makes the most of hyper-threading and keeps the cores busy while jobs wait on I/O. At that load you’ll also start to see a loss of responsiveness, though, and that trade-off is how I’ve always chosen how many CPU-intensive processes to run at the same time.
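
For the compile-time case that advice translates into something like this (nproc is just one way to count the cores - substitute the number by hand if it isn’t available):

# run up to (2 x cores) + 1 compile jobs in parallel, so there is
# always another runnable job ready whenever one blocks on I/O
$ make -j$(( $(nproc) * 2 + 1 ))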

The idea is to make sure every processor always has something to do without creating unnecessary contention for the cache space inside the processor. Modern processors and kernels have ways of mitigating cache contention, though, and those strategies are a bit beyond my knowledge, so my advice on exactly where to aim the load average may be outdated.

It would be very interesting to graph load average against actual throughput in something like siege (run locally to avoid measuring network speed).
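
A rough sketch of how I’d collect the data (the URL, concurrency and duration are placeholders again):

# start siege in the background, then record the 1-minute load average
# every 5 seconds until it finishes; siege.log will contain the
# transaction rate to plot loadavg.csv against
$ siege -b -c 50 -t 5M http://localhost/ > siege.log 2>&1 &
$ while kill -0 $! 2>/dev/null; do
>     echo "$(date +%s),$(cut -d' ' -f1 /proc/loadavg)" >> loadavg.csv
>     sleep 5
> done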