CPU usage is incorrect on server-status page

Bug #710319 reported by tinodj on 2011-01-30
This bug affects 1 person
Affects            Importance   Assigned to
apache2 (Ubuntu)   Medium       Unassigned
linux (Ubuntu)     Medium       Unassigned

Bug Description

Binary package hint: apache2

1. lsb_release -rd
Description: Ubuntu 10.04.1 LTS
Release: 10.04

2. Version of package: 2.2.14

3. I expect CPU Usage and CPU Load to be correct on the apache2 server-status page.

4. What happened instead? - See below:

CPU usage on the apache2 server-status page is incorrect, and its value sometimes decreases.

The following two statuses were obtained 5 seconds apart:

Server Version: Apache/2.2.14 (Ubuntu) DAV/2 SVN/1.6.6 PHP/5.3.2-1ubuntu4.7 with Suhosin-Patch mod_ssl/2.2.14 OpenSSL/0.9.8k
Server Built: Nov 18 2010 21:19:09

Current Time: Sunday, 30-Jan-2011 21:06:49 CET
Restart Time: Sunday, 30-Jan-2011 19:27:27 CET
Parent Server Generation: 0
Server uptime: 1 hour 39 minutes 22 seconds
Total accesses: 349255 - Total Traffic: 5.9 GB
CPU Usage: u388.82 s69.29 cu5.11 cs0 - 7.77% CPU load
58.6 requests/sec - 1.0 MB/second - 17.6 kB/request
141 requests currently being processed, 12 idle workers

Server Version: Apache/2.2.14 (Ubuntu) DAV/2 SVN/1.6.6 PHP/5.3.2-1ubuntu4.7 with Suhosin-Patch mod_ssl/2.2.14 OpenSSL/0.9.8k
Server Built: Nov 18 2010 21:19:09

Current Time: Sunday, 30-Jan-2011 21:06:54 CET
Restart Time: Sunday, 30-Jan-2011 19:27:27 CET
Parent Server Generation: 0
Server uptime: 1 hour 39 minutes 27 seconds
Total accesses: 349480 - Total Traffic: 5.9 GB
CPU Usage: u384.43 s67.96 cu5.1 cs0 - 7.67% CPU load
58.6 requests/sec - 1.0 MB/second - 17.6 kB/request
125 requests currently being processed, 32 idle workers

CPU Usage in the second status is smaller than in the first, which was obtained 5 seconds earlier (u384.43 vs u388.82).

This leads to an incorrect CPU load figure, among other things.
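For reference, both percentages above are consistent with adding the four CPU times and dividing by the uptime in seconds. The following is a minimal sketch of that arithmetic in Python; the helper name `cpu_load_percent` is hypothetical, and the formula is inferred from the numbers in the two snapshots rather than quoted from mod_status.

```python
def cpu_load_percent(user, system, child_user, child_system, uptime_seconds):
    """Total CPU seconds as a percentage of wall-clock uptime."""
    total_cpu = user + system + child_user + child_system
    return 100.0 * total_cpu / uptime_seconds


# Uptimes converted from "1 hour 39 minutes 22 seconds" and "... 27 seconds".
first = cpu_load_percent(388.82, 69.29, 5.11, 0.0, 1 * 3600 + 39 * 60 + 22)
second = cpu_load_percent(384.43, 67.96, 5.10, 0.0, 1 * 3600 + 39 * 60 + 27)

print(f"{first:.2f}% {second:.2f}%")
```

Since the numerator is meant to be a cumulative total, it should never shrink between snapshots. Here it falls from 463.22 to 457.49 CPU seconds within five seconds, which is why the reported load drops.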

Scott Moser (smoser) wrote :

I'm fairly sure that the bug that was reported here is due to a kernel bug where utime can decrease.
http://kerneltrap.org/mailarchive/linux-kernel/2009/12/2/4514350

I've verified the test program shown in that thread reproduces on my lucid system:
  $ dpkg -S /boot/vmlinuz-$(uname -r)
  linux-image-2.6.32-24-server: /boot/vmlinuz-2.6.32-24-server
  $ dpkg-query --show linux-image-2.6.32-24-server
  linux-image-2.6.32-24-server 2.6.32-24.43

A single test on my natty system with 2.6.38-1-generic did not reproduce.

Source code review indicates both natty and lucid kernels have [1] applied.

One way or another this is almost certainly a kernel bug, as mod_status just reads kernel tms.
--
[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
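
As an illustration of the property being discussed (cumulative user time should never go backwards), here is a minimal Python sketch. It is not the test program from the linked thread, which runs many threads; the function names `find_decreases` and `sample_user_time` are hypothetical, and a single-threaded sampler like this is unlikely to trigger the kernel bug on its own.

```python
import os


def find_decreases(samples):
    """Return (index, previous, current) for every sample lower than the one before it."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


def sample_user_time(count=5, spin=200_000):
    """Collect `count` readings of this process's user CPU time."""
    readings = []
    for _ in range(count):
        sum(range(spin))  # burn a little CPU so the user time advances
        readings.append(os.times().user)
    return readings


# The "u" values from the two server-status snapshots in the report.
print(find_decreases([388.82, 384.43]))
# Live readings from the local kernel; an empty list means no decrease was seen.
print(find_decreases(sample_user_time()))
```

Applied to the two `u` values from the report, `find_decreases` flags the second snapshot as lower than the first.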

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in apache2 (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Scott Moser (smoser) wrote :

Could you please verify that the linked test case shows the bug?
Download it to a file "mytest.c", then 'gcc -o mytest mytest.c -lpthread' and ./mytest

Changed in linux (Ubuntu):
importance: Undecided → Medium
tinodj (gjorgjioski) wrote :

Hi Scott,

Thank you for your time. I did what you suggested, and I noticed two strange things.

1. Compiling showed me the following warnings:

gcc -o mytest mytest.c -lpthread
mytest.c: In function ‘child’:
mytest.c:83: warning: format ‘%3d’ expects type ‘int’, but argument 4 has type ‘clock_t’
mytest.c:83: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘clock_t’
mytest.c:83: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘clock_t’
and so on ...

2. I was not able to test it with 500 threads because it is a production server, so I ran it with 16.

Here is the result:

 ./mytest 16
### Prep:
loops_per_tick: 289017
threads : 16

### Test:
looptype : cont
samplesleep : no
 ## start ##
 ## done. ##
 loop total:
  user : 8800 ms
  system : 80 ms
  elapse : 4000 ms
  error : 0

result: GOOD

Segmentation fault
^^^

Result is good - but there is a segmentation fault.

Furthermore, while testing with apache, I found that the source of the problem is the threads that are dying: if I set MaxSpareServers=MaxClients and MaxRequestsPerChild=0 (so that no thread is killed), then CPU usage seems normal.
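
One way a dying worker could make a total go down, purely as an illustration: if the total is computed by summing over the workers that currently exist, then a worker that exits takes its accumulated time with it. The toy model below is a sketch under that assumption. It does not reproduce mod_status or kernel code, and the worker names and numbers are made up.

```python
def total_cpu(workers):
    """Sum CPU seconds over the workers that are still alive."""
    return sum(workers.values())


# Hypothetical per-worker CPU seconds; names and numbers are invented.
workers = {"worker-1": 120.0, "worker-2": 95.5, "worker-3": 40.25}
before = total_cpu(workers)

# A worker is recycled (for example after MaxRequestsPerChild requests)
# and its accumulated time disappears from the table.
del workers["worker-3"]
workers["worker-4"] = 0.0
after = total_cpu(workers)

print(before, after)
```

In this model the total stays monotonic only if no worker is ever removed, which matches the observation that the numbers look normal when no thread is killed.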

tinodj (gjorgjioski) wrote :

Update:

Tried this script:

http://lkml.org/lkml/2009/11/3/522

Got this:
./mytest
utime decreased, was 1730, now 1577!
utime decreased, was 1754, now 1656!

uname -r
2.6.32-27-server

Stefan Bader (smb) wrote :

@tinodj, were you using the test from the reply (the original test case had a flaw)?

Anyway, the patch that Scott mentioned could be fixing things. But for Lucid it was only added in Ubuntu-2.6.32-25.44, and the comment above, at least, used a -24.43 kernel. So first question: is this still seen with a 2.6.32-28.55 kernel?

While browsing through the patches I had the feeling there could be other things falling into that area as well, for example:
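
The version reasoning here can be checked mechanically. The sketch below compares the version strings mentioned in this thread as tuples of integers. This is a simplification of real Debian/Ubuntu package version ordering that happens to be sufficient for purely numeric strings like these, and the helper name `parse_kernel_version` is hypothetical.

```python
def parse_kernel_version(version):
    """Split e.g. '2.6.32-25.44' into a tuple of integers for comparison."""
    base, _, abi_upload = version.partition("-")
    return tuple(int(part) for part in (base + "." + abi_upload).split("."))


# First Lucid kernel said to carry the patch.
PATCH_ADDED_IN = parse_kernel_version("2.6.32-25.44")

for tested in ("2.6.32-24.43", "2.6.32-28.55"):
    has_patch = parse_kernel_version(tested) >= PATCH_ADDED_IN
    print(tested, "includes the patch" if has_patch else "predates the patch")
```

By this comparison the -24.43 kernel predates the patch, so reproducing the problem there says nothing about whether the patch is effective.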

commit c28739375bf0d6e239b4fa939ec8372aa2c707d2
Author: Stanislaw Gruszka <email address hidden>
Date: Thu Mar 11 14:04:42 2010 -0800

    cpu-timers: Avoid iterating over all threads in fastpath_timer_check()

But I would first want to double check that it is the right test case failing with the right kernel.

tinodj (gjorgjioski) wrote :

@Stefan

Oops ... I missed that point. I just tried the test from the reply, and there is no output, so there is probably no problem there.
My current kernel, once again, is 2.6.32-27.

But the problem in apache persists.

Stefan Bader (smb) wrote :

@tinodj, ok thanks for the confirmation here. Hm, would you be able to install a test kernel on the host that exhibits the apache problem? That was a production system, right? I would try to create a test kernel based on 2.6.32-27 with just the patch added which I mentioned before. That at least was quite short and the description said something about exiting and getting the utime accounting wrong.

tinodj (gjorgjioski) wrote :

I can probably do that if it is not too complicated :) Please provide a test kernel and instructions on the best/safest way to run the test. Do you know in which kernel that patch will be incorporated, so that I could wait for it?

Stefan Bader (smb) wrote :

I took the kernel version that currently is in updates (2.6.32-28.55) and added the backport of the patch I was mentioning above. All the kernel packages are at: http://people.canonical.com/~smb/lp710319/.
There are packages for all kinds of installations, but don't worry, you will only need to download three. What you need is the linux-headers-*all* package plus the linux-headers and linux-image packages for your installed architecture and flavor (you can find the latter with uname -r). Then run "sudo dpkg -i <files>" with the three files and the test kernel is installed (I test booted the amd64 generic one, which at least does not explode in a heap for me).
After rebooting you should see the lp710319 in "uname -a" and then it would be interesting to hear whether that still gives the wrong apache stats.
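
A small sketch of the post-reboot check, assuming the only thing to verify is that the bug tag appears in the kernel identification. The helper name `is_test_kernel` is hypothetical, and the sample string is the `uname -a` output reported later in this thread.

```python
import platform


def is_test_kernel(uname_text, tag="lp710319"):
    """True if the bug-specific tag appears anywhere in the uname string."""
    return tag in uname_text


# `uname -a` output as reported from a host running the test build.
reported = (
    "Linux mail 2.6.32-28-server #55+lp710319v1 SMP "
    "Tue Feb 15 09:01:19 UTC 2011 x86_64 GNU/Linux"
)
print(is_test_kernel(reported))

# The same check against the kernel this script is running on.
local = platform.release() + " " + platform.version()
print(is_test_kernel(local))
```

Note that in the reported string the tag appears in the build field (`#55+lp710319v1`), not in the release string that `uname -r` prints, which is why the check looks at the full output.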

tinodj (gjorgjioski) wrote :

I have just installed 2.6.32-28.

(still not rebooted)

I will do what you suggest later today. I hope I won't have to go to my data center after trying this :)

tinodj (gjorgjioski) wrote :

Hi,

Thanks for taking care of this and preparing the files, as well as the instructions on how to install them.

uname -a
Linux mail 2.6.32-28-server #55+lp710319v1 SMP Tue Feb 15 09:01:19 UTC 2011 x86_64 GNU/Linux

Apache CPU Usage is still NOT correct after some threads die (e.g. after a graceful reload).

Probably the bug is somewhere else?!

Regarding the kernel, I have the following questions:
Is it safe for me to keep this kernel version?
Will I get the regular update when one is available in the Ubuntu repository?

Thanks...

Stefan Bader (smb) wrote :

Thanks for trying. In that case it's likely somewhere else (plenty of places but hard to find). :(

Regarding your question about safety: it should be relatively safe, as the change is also upstream and seems to address a corner case anyway. If you want to be completely on the official side, I think "sudo apt-get install --reinstall linux-image-server" should pull you back to the official version, as will any official update that replaces this kernel. There is already one under preparation as far as I know, so it is a matter of a few weeks. The package is versioned so that it is newer than the official package it is based on, but will always be replaced by anything that comes through updates.
