jeesshnasree
Flight Engineer

Getting high CPU utilisation (90%) alerts on Linux (RHEL 4)


Hi All,

 

I configured a monitoring tool to send alerts on Java process CPU utilisation.

 

The alert condition is CPU usage above 80%.

 

 

I am getting 90% CPU usage alerts every day on RHEL 4 (32-bit OS), when I have 5 JVMs started and running with around 500 threads.

When I check manually at the OS level over PuTTY, one of the process IDs shows memory at 9.4% and CPU usage at 3.6%.

 

The load average is 0.37, 0.40, 0.43 and the box has 4 CPU cores.

 

Checking over PuTTY with the "top" command shows CPU usage at 3.6% and memory at 9.4%, but my monitoring tool keeps raising 90% CPU usage alerts.
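For reference, this is roughly how a per-process CPU percentage like the one top reports can be sampled from /proc (a minimal Python sketch for cross-checking; the PID value is a placeholder for one of the JVM process IDs, and the result is relative to a single core):

import os
import time

def cpu_ticks(pid):
    # utime + stime are fields 14 and 15 of /proc/<pid>/stat;
    # strip the "pid (comm)" prefix first because comm may contain spaces
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])

def cpu_percent(pid, interval=5.0):
    hz = os.sysconf("SC_CLK_TCK")            # kernel clock ticks per second
    before = cpu_ticks(pid)
    time.sleep(interval)
    after = cpu_ticks(pid)
    return 100.0 * (after - before) / (hz * interval)

pid = 1234                                   # placeholder JVM PID
print("CPU%% over 5 s: %.1f" % cpu_percent(pid))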

 

I originally configured the monitoring tool based on information from a blog post, but I am no longer sure about that configuration.

 

Finally, could you please help me avoid these alerts and suggest a solution to sort out this issue? I think I have explained my issue clearly here.

 

Could you please help me avoid these CPU usage alerts on my Linux box (RHEL 4)?

I am looking forward to your valuable reply.

4 Replies
Scott
Starfighter

@jeesshnasree 

One cause for your frustration is that Load Average != CPU consumption.

Here is the load average for my primary workstation:

load average: 2.17, 1.31, 1.18

One might look at this and freak out because they have been told that load average is a measurement of CPU utilization; reading the data above that way, they would INCORRECTLY conclude that this box is at 217%, 131% and 118% utilization.

What this data actually depicts is that, on average over the last 1 minute, the CPU run queue has held 2.17 runnable processes; over the last 5 minutes, 1.31 runnable processes; and over the last 15 minutes, 1.18 runnable processes.  **

Five minutes later on the same system, the load average now reads:

0.64, 0.91, 1.05

To get to the actual cause, one must first understand what a "runnable process" is: any process that is not yet scheduled and is waiting in the CPU run queue, flagged with the 'runnable' [R] state. When you look at top, these are the processes that show an R in the "S" (state) column. Most of the time they are what you would expect: processes that need CPU time and are waiting for their turn on the CPU. The traditional thinking is that if there are lots of these waiting jobs, the system is CPU-bound because it cannot run the requested jobs quickly enough to keep the queue small.

However, sometimes processes in a runnable state are not actually waiting for CPU time. A runnable job may be waiting for a disk I/O operation to return or for a network connection to be established, and it runs periodically to check the status of that activity before going back into the queue.

In my specific case on my workstation, I have a lot of open browser tabs running dynamic content, so sometimes a batch of them all become active at the same time, though not always at the same frequency, which is why five minutes later the load average is so much lower.

For your RHEL 4 application server, my guess is that those Java threads are often waiting on data from a database or some other I/O activity, so an alert threshold of 0.9 on the load average may be far below the level where you should actually be concerned. Maybe 3 or 5 is a more appropriate value to alert on for this machine? I am not advocating for 3 or 5 specifically; I would spend some time watching the machine while it is operating without issues to figure out what value is appropriate for this system.

Additionally, I would guess that your monitoring software is looking at the 1-minute load average? That figure is highly prone to short spikes. If possible, you should either raise the threshold on the 1-minute load average to a much larger number, or look only at the 5- or 15-minute load averages.
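As a rough illustration of that kind of check (a minimal Python sketch, not your monitoring tool's actual configuration), this reads /proc/loadavg the same way uptime and top do and only flags the box when both the 5- and 15-minute figures stay above a per-core threshold; the 1.0 threshold is a placeholder to be tuned while the machine is known to be healthy:

import os

THRESHOLD_PER_CORE = 1.0                     # placeholder; tune per machine

with open("/proc/loadavg") as f:
    one, five, fifteen = [float(x) for x in f.read().split()[:3]]
cores = os.sysconf("SC_NPROCESSORS_ONLN")    # online CPU cores

if five / cores > THRESHOLD_PER_CORE and fifteen / cores > THRESHOLD_PER_CORE:
    print("ALERT: sustained load %.2f / %.2f on %d cores" % (five, fifteen, cores))
else:
    print("OK: load %.2f / %.2f on %d cores" % (five, fifteen, cores))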

** To make this measurement easier to comprehend, I simplified it a bit. The load average is actually an exponentially damped moving average of the run-queue length over each window (1, 5 and 15 minutes), not a simple arithmetic mean of the sample period.
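For the curious, a tiny sketch of that damping (illustrative only; the real kernel code does this in fixed-point arithmetic, updating the three figures roughly every 5 seconds):

import math

def update_load(previous, runnable_now, window_seconds):
    # exponentially damped moving average with a 5-second sample interval
    decay = math.exp(-5.0 / window_seconds)
    return previous * decay + runnable_now * (1.0 - decay)

# one 5-second step of the 1-minute figure with 3 runnable tasks
print("%.2f" % update_load(0.37, 3, 60.0))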

-STM

--
Manager, Technical Marketing
Red Hat Enterprise Linux
Red Hat Certified Engineer (100-000-264)
jeesshnasree
Flight Engineer

Hi @Scott,

Thank you for sharing this valuable information.

flozano
Moderator

If I remember correctly, the traditional Unix load average considers only runnable processes, as @Scott described, but Linux also counts processes that are waiting for I/O, so a system that is starved for disk or network capacity would also show a high load average despite having its CPUs mostly idle.

http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
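A quick way to see whether that is what is happening (a minimal Python sketch, assuming a Linux /proc filesystem): count the tasks in the runnable (R) state and in uninterruptible sleep (D), since on Linux both feed the load average. A noticeable D count alongside the low CPU% seen in top would point at disk or network wait rather than a CPU problem.

import os

running, waiting_io = 0, 0
for entry in os.listdir("/proc"):
    if not entry.isdigit():                  # only numeric entries are PIDs
        continue
    try:
        with open("/proc/%s/stat" % entry) as f:
            # the state letter is the first field after the "pid (comm)" prefix
            state = f.read().rsplit(")", 1)[1].split()[0]
    except IOError:                          # process exited while scanning
        continue
    if state == "R":
        running += 1
    elif state == "D":
        waiting_io += 1

print("runnable (R): %d, uninterruptible I/O wait (D): %d" % (running, waiting_io))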

 

jeesshnasree
Flight Engineer

Hi @flozano,

Thank you for sharing this valuable information.
