load average too low

Bug #838811 reported by bart bobrowski on 2011-09-01
This bug affects 2 people
Affects: linux (Ubuntu)
Importance: Medium
Assigned to: Brad Figg

Bug Description

Description: Ubuntu 10.04.3 LTS, 10.10, 11.04, 11.10, 12.04 pre-release
Release: 10.04, 10.10, 11.04, 11.10, 12.04 pre-release
2.6.37-02063706-generic, and all kernels since, including the current RC at kernel.org

load average doesn't seem to be registering for sleeping processes, as demoed here: http://www.youtube.com/watch?v=KjSZ-D0XBBQ

Actually, the load is included at first, but can get clobbered by the idle state transition.

First found this problem on squeeze as well, with the .32 and .38 kernels: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=639539

bart bobrowski (mrbart) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 838811

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
bart bobrowski (mrbart) on 2011-09-01
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Doug Smythies (dsmythies) wrote :

This issue is identical to bug #513848. Chase Douglas listed a patch there, but there have been some code changes since, and it would need to be changed to work (at least I couldn't get it to work). The root issue remains the same: on a tickless system, any idle transition during the 10 tick grace period will clobber the real load information. If the process or idle frequency is high enough, the load average will be reported as 0.00 when it might really be the number of CPUs minus a small amount (i.e., on an 8 CPU system a real load of 7.99 will be reported as 0.00). I am proposing to add a handshake flag so that the fold of idle information during the 10 tick grace period is only executed once. This seems to work reasonably well in all of the tests I have done.
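
To illustrate why a clobbered sample matters, here is a minimal Python sketch of the exponentially damped average that the load figures are built from. It is a simplified model, not the kernel code: the constants FSHIFT, FIXED_1 and EXP_1 and the rounding step follow the kernel's calc_load as I understand it for kernels of this era, while one_minute_average and the sample lists are hypothetical names made up for the illustration. It assumes the worst case described above, where the active count has been folded to zero at every sampling instant.

```python
# Simplified model of the fixed-point load average update (assumption:
# constants and rounding as in kernels of this era; not the real code path).
FSHIFT = 11                  # bits of fractional precision
FIXED_1 = 1 << FSHIFT        # 1.0 in fixed point
EXP_1 = 1884                 # 1/exp(5 s / 1 min) in fixed point


def calc_load(load, exp, active):
    """One damping step; 'active' is the sampled task count in fixed point."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)          # round to nearest
    return load >> FSHIFT


def one_minute_average(samples):
    """Hypothetical helper: feed one task count per 5 second sample."""
    load = 0
    for count in samples:
        load = calc_load(load, EXP_1, count * FIXED_1)
    return load / FIXED_1


steps = 120                                     # ten minutes of samples
visible = one_minute_average([8] * steps)       # sampler sees all 8 tasks
clobbered = one_minute_average([0] * steps)     # every sample folded to zero

print(f"tasks visible at each sample:   {visible:.2f}")
print(f"samples clobbered by idle fold: {clobbered:.2f}")
```

With eight tasks visible at every sample the one-minute figure settles just under 8; with every sample clobbered it stays at 0.00, whatever the real load is.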

All of my work has been done on Ubuntu server 11.10, kernel 3.0.0-15; however, I have verified the same issue on 10.10. I have also looked at the 3.3-rc2 code from kernel.org, and while it is structured differently (the code will be in kernel/sched/core.c instead of kernel/sched.c), the code is basically the same, so it will have this problem. I have not looked at the code for 10.04 to see if it is the same.

I have made a test program for this issue, and I will post it here.

I have a proposed solution to this issue, and I will post the code fragment for the calc_load area of kernel/sched.c here. I will also post diff output between the original code and my code (I do not know how to list the patch the same way Chase Douglas did in bug #513848).

On my web site, I have web notes on my investigation and test results: http://www.smythies.com/~doug/network/load_average/index.html
(No Java, no ads, just hand-written HTML 4.01 Strict code that passes the W3C validator. Oh, and one CSS file, which also passes the W3C CSS validator.) I struggle with Launchpad, but I might try to post some graphical test results here.

Oh, by the way Bart Bobrowski has agreed to try my proposed patch.

Doug Smythies (dsmythies) wrote :

I cannot seem to edit the tags. I want to expand them to other release names, but they keep disappearing after my edit.

tags: added: maverick natty oneric precise

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

 Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-16.25
Doug Smythies (dsmythies) wrote :

It seems that kernel 3.0.0-15 is still the latest for ubuntu server version 11.10, which is what runs on my test computers.
As mentioned in my original posting, I downloaded the most recent kernel source directly from kernel.org (3.3 RC2 at the time) and looked at the code area in question. It is from looking at the code that I suspect the problem will be there in the next release.

I had trouble adding to the tags. I wanted to add probably-precise instead of precise, but then it asked if I wanted to create a new tag. I did not know if I should do that or not, so I didn't. If I made an error, then sorry.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Doug Smythies (dsmythies) wrote :

My related comments were lost when I changed the status. I will re-type them:

I upgraded my very pathetic, very old test computer.

doug@test-smy:~/linux-3.2.0/kernel$ uname -a
Linux test-smy 3.2.0-16-generic-pae #25-Ubuntu SMP Tue Feb 14 04:00:45 UTC 2012 i686 i686 i386 GNU/Linux

It has the Low Load Averages issue.
I am unable to test my proposed fix because I cannot compile the kernel on this computer. (It takes 33 hours 4 minutes and 56.7 seconds before it gives up with an out of memory error. Did I mention it is a pathetic old thing?)

I prefer to leave my newer test computer at 11.10, at least for now.

I did look at the source code for 3.2.0-16.25 and the related area of kernel/sched.c looks the same (I did not do a diff).
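
A comparison of this kind can be made mechanical. The following Python sketch pulls out the calc_load area from two copies of the scheduler source and diffs just that region. The marker comment and the figure of about 380 lines are taken from the description of this code area elsewhere in this report; the function names, and the idea that the region can be cut out by a fixed line count, are assumptions for illustration only.

```python
import difflib

# Marker comment quoted in this report as the start of the calc_load area.
MARKER = "/* Variables and functions for calc_load */"


def calc_load_region(source_text, max_lines=380):
    """Return up to max_lines lines starting at the marker, or [] if absent."""
    lines = source_text.splitlines()
    for index, line in enumerate(lines):
        if MARKER in line:
            return lines[index:index + max_lines]
    return []


def region_diff(old_text, new_text):
    """Unified diff of the calc_load regions only; empty list if identical."""
    return list(difflib.unified_diff(
        calc_load_region(old_text),
        calc_load_region(new_text),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))
```

An empty result means the two regions are line-for-line identical. To use it, read the contents of kernel/sched.c (or kernel/sched/core.c for 3.3) from each source tree and pass the two strings in.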

bart bobrowski (mrbart) wrote :

I tested Doug's patch on 3.0.0-16 in tickless mode, and it seems to report the same or similar results as the same kernel without the patch and ticks enabled.

Attached is a before-and-after load average graph; ignore the spike, which was caused by compiling the kernel with Doug's patch.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-17.26
bart bobrowski (mrbart) on 2012-02-19
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Doug Smythies (dsmythies) wrote :

Issue remains (actually it is not possible for it to be solved without edits to the calc_load area of sched.c, either my proposed edits or some other solution). I see bart already set this back to confirmed.

doug@test-smy:~$ uname -a
Linux test-smy 3.2.0-17-generic-pae #26-Ubuntu SMP Fri Feb 17 23:47:19 UTC 2012 i686 i686 i386 GNU/Linux

description: updated
tags: added: oneiric
removed: oneric
Doug Smythies (dsmythies) wrote :
tags: added: patch

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-17.27
Doug Smythies (dsmythies) wrote :

Issue re-confirmed.

doug@test-smy:~$ uname -a
Linux test-smy 3.2.0-17-generic-pae #27-Ubuntu SMP Fri Feb 24 15:59:25 UTC 2012 i686 i686 i386 GNU/Linux

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.3 kernel[1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed by the mainline kernel, please add the following tag 'kernel-fixed-upstream-KERNEL-VERSION'. For example, if kernel version 3.3-rc5 fixed the issue, the tag would be: 'kernel-fixed-upstream-v3.3-rc5'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc5-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: needs-upstream-testing
Doug Smythies (dsmythies) wrote :

1.) Doing what was asked:
First, note that the image I needed was not available in the RC5 directory, so I used the appropriate image from the RC4 directory. See "2" below for the explanation as to why I know the results would be the same for RC5.

While the local terminal did not work with this version, I was able to connect via SSH and perform the test.

The issue is the same. See also "2" below.

doug@test-smy:~$ uname -a
Linux test-smy 3.3.0-030300rc4-generic-pae #201202181935 SMP Sun Feb 19 00:53:06 UTC 2012 i686 i686 i386 GNU/Linux
doug@test-smy:~$ cat /proc/version
Linux version 3.3.0-030300rc4-generic-pae (root@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #201202181935 SMP Sun Feb 19 00:53:06 UTC 2012

2.) Why these various kernel tests are not actually needed. (and please challenge me on this if you disagree):
Everything related to the load average calculations is contained in a fairly small piece of code in kernel/sched.c (or in the 3.3 kernel: /kernel/sched/core.c).
About 380 lines starting with "/* Variables and functions for calc_load */".
It is not possible for this issue to be fixed without code changes in this area, either my proposed patch or some other.
Therefore it is sufficient to compare this code area for changes. There are no changes. I got the 3.3RC5 code from kernel.org and compared it with the code from 3.3RC2 that I got a few weeks ago (see post #3 above) and compared it with the code I have been using on my 11.10 test machine.

3.) Some other notes:

I have done further reading on CodingStyle; my proposed patch had some violations. I have fixed them and will post a new proposed patch.

I didn't know how to post a patch showing the differences the way others I have seen were done. I read how to do it properly (I think) and the new posting will be better (I hope).

If the process idle frequency is high enough, I thought my proposed patch was showing lower load averages than the control tests done with CONFIG_NO_HZ=n. However, more detailed testing showed that it was not. Another proposed patch (variant 2) came out of this work.
Of course, it is understood that at some frequency everything will break down, and that, to also prevent incorrect high load averages, the nohz disabled/enabled results may need to differ under some conditions. It is now realized that there is no difference between the two patches, as the one call to idle_fold almost always returned zero.

I have struggled to understand the timing relationship between calc_load_account_active and calc_global_nohz, and the number of times that calc_load_fold_idle will be called in a LOAD_FREQ interval. On my (i7, 8 CPU) system I have seen that calc_load_account_active is typically executed 8 times, as expected, but not always. I have tried various counting schemes and handshakes between the two, but never had results as good as the mindless approach of just calling calc_load_fold_idle once (or not at all, see proposed patch variant 2) from calc_global_nohz during the 10 tick window.
So then one wonders what the proposed patch might have compromised. I.e., is there a scenario where the patch would cause incorrect high load averages? (I have not been able to find such a scenario. (Not saying th...

Doug Smythies (dsmythies) wrote :
tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
Joseph Salisbury (jsalisbury) wrote :

Thank you for providing a patch, and making Ubuntu better.

Can you provide some information on the status of the patch with regards to getting it merged upstream? Has it been sent upstream, what sort of feedback has it received, is it getting applied to a subsystem maintainer's tree, etc?

People affected by this bug are probably wondering why the kernel team doesn't just apply the patch and fix it. The reason is that the kernel team is reluctant (not opposed) to apply any patch to a stable kernel that is not from upstream. Applying patches that don't come from upstream adds greatly to the support burden of the kernel, as other upstream patches may touch the same area as the non-upstream patch, which may prevent them from applying cleanly.

Doug Smythies (dsmythies) wrote :

This issue has not been sent upstream. I have studied how to do so, and will do so, and will report back to this bug report as new information comes.

Joseph Salisbury (jsalisbury) wrote :

Thanks, Doug.

Doug Smythies (dsmythies) wrote :

I heard back from Peter Zijlstra, who is one of the sched.c maintainers at kernel.org.
Attached is his signed-off patch. There was a one-line change since that e-mail, which I will not bother to post here.
I tested what I refer to as the "peter02" patch, and it gives the same results as my proposed patch.
My proposed patch did not deal with the very, very long idle case for some computers; Peter's does. My computers never take this path, and the fact that this path also needed to be tested is included as a comment in my version 1 proposed patch, see post number 23.

I will post a couple of comparative test result graphs.

Doug Smythies (dsmythies) wrote :
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update and getting your patch upstream, Doug.

What mailing list was the mail from Peter posted to? I didn't see it on LKML.

I can build a test kernel once this patch lands in Linus' mainline tree. We can then cherry-pick the patch until it lands in the stable tree.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-18.28
Doug Smythies (dsmythies) wrote :

I have two test computers: One has my kernel build environment for this issue, and I don't want to mess that up just yet; The other is too pathetic to compile the kernel, and is the one on which I do these kernel request tests.

I do not know if it is relevant, but yesterday I did a complete new install of the daily ISO of 12.04 Beta 1 for 2012.03.03 on the pathetic computer (and gave the related "PASS" feedback to the daily ISO testing web page).

I did the "sudo apt-get update" and "sudo apt-get upgrade" just now, but I am still at kernel 3.2.0-17-generic-pae #27.

I'm not sure, but I don't think the fix is in the development kernel yet anyhow.
I'm setting this bug back to confirmed.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Doug Smythies (dsmythies) wrote :

O.K., I see that the request to test with kernel 3.2.0-18.28 just preceded, by many hours, that kernel actually being available via apt-get dist-upgrade. This morning it is available...
doug@test-smy:~$ cat /proc/version
Linux version 3.2.0-18-generic-pae (buildd@rothera) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu2) ) #28-Ubuntu SMP Fri Mar 2 22:11:12 UTC 2012
As expected the reported load averages being way too low issue remains.
I think we are on top of this issue with respect to getting it included upstream and how it might propagate downstream, so I am going to set the "bot-stop-nagging" tag. I am also going to play with the "precise" tag, in an attempt to get this bug to be listed in the precise list, it might need to change to "precise beta1" or similar.

tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Doug Smythies (dsmythies) wrote :

kernel 3.3-rc7 (kernel.org 2012.03.10) does not include the patch.

Doug Smythies (dsmythies) wrote :

Kernel 3.3 mainline (just released) does not contain the patch.
Based on an e-mail from 2012.03.12 from some tip-bot, I thought it might be included.

doug@test-smy:~$ uname -a
Linux test-smy 3.3.0-030300-generic-pae #201203182135 SMP Mon Mar 19 01:50:11 UTC 2012 i686 i686 i386 GNU/Linux

doug@test-smy:~$ cat /proc/version
Linux version 3.3.0-030300-generic-pae (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201203182135 SMP Mon Mar 19 01:50:11 UTC 2012

By the way, my terminal does not work with this kernel (see also posting #22 above, it didn't work with 3.3rc4 either), but I could SSH in to do the test.

Brad Figg (brad-figg) wrote :

This now appears to be upstream commit: c308b56b5398779cd3da0f62ab26b0453494c3d4

Changed in linux (Ubuntu):
assignee: nobody → Brad Figg (brad-figg)
status: Triaged → In Progress
Brad Figg (brad-figg) wrote :

I've backported the upstream commit (trivial backport). Test kernels are available at: http://people.canonical.com/~bradf/lp838811 . Please test the appropriate kernel for you and reply back to this bug whether it fixed the issue for you or not.

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Doug Smythies (dsmythies) wrote :

Note that I have already tested to death the exact patch of the above referenced commit. However, I mainly did so with an 11.10 computer. That computer has since been migrated to 12.04 beta for iso and other testing.

One test computer:

doug@s15:~$ uname -a
Linux s15 3.2.0-20-generic #33~lp838811 SMP Thu Mar 29 15:04:04 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
doug@s15:~$ cat /proc/version
Linux version 3.2.0-20-generic (bradf@tangerine) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3) ) #33~lp838811 SMP Thu Mar 29 15:04:04 UTC 2012

Yes, the backported kernel seems to work as expected with respect to this issue. I'm seeing some load average swings that appear to be larger than those observed with the 11.10 testing. I'll do an all-other-things-equal type of test over about 6 or 8 hours to compare with a previously posted graph.

Another test computer (this is the one used for any previous 12.04 and 3.3 RC tests; it was just rebuilt from scratch a couple of days ago for ISO testing):

doug@test-smy:~$ uname -a
Linux test-smy 3.2.0-20-generic-pae #33~lp838811 SMP Thu Mar 29 15:05:10 UTC 2012 i686 i686 i386 GNU/Linux
doug@test-smy:~$ cat /proc/version
Linux version 3.2.0-20-generic-pae (bradf@tangerine) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3) ) #33~lp838811 SMP Thu Mar 29 15:05:10 UTC 2012

I had not installed the compiler on this computer, which I need to compile my test program. I had some dependency troubles for a while...
Anyway, I have it all working now, and the issue is fixed on this computer with this test kernel.

Doug Smythies (dsmythies) wrote :

I did the 6 hour test so that I could make a graph to compare with a previously posted graph of the fix on an 11.10 computer. However, I seem to be a little dense because I picked a run case that I had not previously posted here (it is in my web notes).
So, in a moment I will add two attachments, the only variable being Brad's kernel vs. the same patch on an 11.10 computer (test program execution command "c/waiter 2 555"; the "555" is very computer dependent).

Summary: The results are very similar.

Brad Figg (brad-figg) wrote :

@Doug,

So that indicates 1) I didn't mess up the backport and 2) it's doing what is expected.

Doug Smythies (dsmythies) wrote :

Yes, it all looks fine.

Note that my test case is pretty extreme, and even an old style tickless kernel will start to have deviations between real and reported load averages under such conditions, as demonstrated in some previous postings.
I could do some less extreme tests, such as 25 hertz sleep frequency, but I don't see that there would be value added. That being said, I am willing to do whatever anybody wants.

Doug Smythies (dsmythies) wrote :

Oops, typo...
this:
"... and even an old style tickless kernel ..."
should be this:
"... and even an old style tick based kernel ..."

Brad Figg (brad-figg) on 2012-03-30
Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
Doug Smythies (dsmythies) wrote :

By the way, the fix is finally included in Kernel 3.4RC1. (it never made any 3.3 kernel)
doug@test-smy:~$ uname -a
Linux test-smy 3.4.0-030400rc1-generic-pae #201203312035 SMP Sun Apr 1 00:50:08 UTC 2012 i686 i686 i386 GNU/Linux
doug@test-smy:~$ cat /proc/version
Linux version 3.4.0-030400rc1-generic-pae (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201203312035 SMP Sun Apr 1 00:50:08 UTC 2012

Brad's backport fix missed the last 12.04 kernel update, I'll check the next one.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-22.35

---------------
linux (3.2.0-22.35) precise; urgency=low

  [ Andy Whitcroft ]

  * Revert "SAUCE: hotkey quirks for various Zeptro Znote and Fujitsu Amilo
    laptops"
  * SAUCE: (no-up) elide some ioctl warnings which are known benign
    - LP: #972355

  [ Brad Figg ]

  * SAUCE (no-up) Provide a param for allowing the BIOS to handle changing
    the brightness on AC/battery status changes.
    - LP: #949311
  * SAUCE (drop after 3.4) Quirk for enabling backlight hotkeys on Samsung
    N150P
    - LP: #875893

  [ Colin Ian King ]

  * SAUCE: PCI: Allow pcie_aspm=force to work even when FADT indicates it
    is unsupported
    - LP: #962038

  [ Daniel Vetter ]

  * SAUCE: (drop after 3.5) drm/i915: reinstate GM45 TV detection fix
    - LP: #638939

  [ Kees Cook ]

  * SAUCE: SECCOMP: audit: always report seccomp violations
  * SAUCE: SECCOMP: adjust prctl constant

  [ Leann Ogasawara ]

  * [Config] Enable CONFIG_USBIP_CORE=m
    - LP: #900384
  * Rebase to v3.2.14
  * [Config] Updateconfigs after rebase to v3.2.14

  [ Stefan Bader ]

  * d-i: Fix module name for dm-raid45
    - LP: #969248

  [ Tim Gardner ]

  * SAUCE: remove __initdata from vesafb_fix
    - LP: #969309

  [ Upstream Kernel Changes ]

  * Revert "sched: tg->se->load should be initialised to tg->shares"
  * toshiba_acpi: make one-bit bitfields unsigned
    - LP: #810015
  * ACPI: EC: Add ec_get_handle()
    - LP: #810015
  * toshiba_acpi: Support alternate hotkey interfaces
    - LP: #810015
  * toshiba_acpi: Support additional hotkey scancodes
    - LP: #810015
  * toshiba_acpi: Refuse to load on machines with buggy INFO
    implementations
    - LP: #810015
  * ata_piix: Add Toshiba Satellite Pro A120 to the quirks list due to
    broken suspend functionality.
    - LP: #886850
  * sweep the floors and convert some .get_drvinfo routines to strlcpy
    - LP: #921793
  * be2net: init (vf)_if_handle/vf_pmac_id to handle failure scenarios
    - LP: #921793
  * be2net: stop checking the UE registers after an EEH error
    - LP: #921793
  * be2net: don't log more than one error on detecting EEH/UE errors
    - LP: #921793
  * be2net: stop issuing FW cmds if any cmd times out
    - LP: #921793
  * be2net: Fix TX queue create for Lancer
    - LP: #921793
  * be2net: add register dump feature for Lancer
    - LP: #921793
  * be2net: Add EEPROM dump feature for Lancer
    - LP: #921793
  * be2net: Fix VLAN promiscous mode for Lancer
    - LP: #921793
  * be2net: Use V1 query link status command for lancer
    - LP: #921793
  * be2net: Move to new SR-IOV implementation in Lancer
    - LP: #921793
  * be2net: Fix error recovery paths
    - LP: #921793
  * be2net: Add error handling for Lancer
    - LP: #921793
  * be2net: Use new hash key
    - LP: #921793
  * be2net: Fix non utilization of RX queues
    - LP: #921793
  * be2net: netpoll support
    - LP: #921793
  * be2net: update some counters to display via ethtool
    - LP: #921793
  * be2net: workaround to fix a bug in BE
    - LP: #921793
  * be2net: fix ethtool ringparam reporting
    - LP: #921793
  * be2net: refactor/cleanup vf configuration code
    - LP: #921793
...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Doug Smythies (dsmythies) wrote :

Tested O.K., and as expected, on:
doug@s15:~$ uname -a
Linux s15 3.2.0-22-generic #35-Ubuntu SMP Tue Apr 3 18:33:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Hello,

I'm afraid this fix has the annoying side effect of reporting a higher than real load on idle systems. At least that's my conclusion after digging why my precise pangolin desktop was hardly getting below 0.5 even when idling.

Attached is the output of (ps waux; for i in 1 2 3; do uptime; vmstat -n 10 30; done; uptime; ps waux) | tee /tmp/idle.txt
during which I just watched time pass by and pressed <ctrl> a few times to prevent the screen blanker from kicking in. The chrome process had 3 tabs open (gmail, google reader and this bug's thread) and a fetchmail was triggered by cron every 5 minutes.

In summary, the vmstat never showed more than 3% cpu usage and uptime gave the following loads:
 15:35:53 up 22 min, 4 users, load average: 0.45, 0.46, 0.49
 15:40:43 up 26 min, 4 users, load average: 1.60, 0.87, 0.63
 15:45:33 up 31 min, 4 users, load average: 0.91, 1.00, 0.77
 15:50:23 up 36 min, 4 users, load average: 0.48, 0.68, 0.68

IMHO, 0.68 is a really high load for an idle system. :-)
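
For anyone wanting to collect figures like these over a longer run, a small Python sketch for pulling the three averages out of uptime output follows. The function name and the regular expression are illustrative assumptions, and the sketch assumes the English-locale format shown above, with a period as the decimal separator.

```python
import re

# Matches the tail of an uptime line, e.g. "load average: 0.48, 0.68, 0.68".
LOAD_RE = re.compile(r"load average: ([0-9.]+), ([0-9.]+), ([0-9.]+)")


def parse_uptime_loads(line):
    """Return the (1, 5, 15 minute) load averages from one line, or None."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


sample = " 15:50:23 up 36 min, 4 users, load average: 0.48, 0.68, 0.68"
print(parse_uptime_loads(sample))
```

Lines that do not contain a load average field yield None, so the function can be run over a whole captured log such as the attached /tmp/idle.txt without filtering it first.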

The desktop was a fresh unity on an HP Elite 8100 quad core i5@3.2GHz with 4GB of RAM and running
Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux.

I'll try to look at the kernel/sched.c later to understand how this fix induced this behaviour.

Cheers !

Doug Smythies (dsmythies) wrote :

While most of the entries in this bug report are about creating and testing with high load averages, as part of due diligence, tests and sanity checks were also done under low load conditions. Personally, and with the final patch version, I have never observed incorrectly high reported load averages under idle or low load conditions. For example, from right now:

(precise_i386)doug@s15:~/udev1/2770/~ubuntu-core-dev/ubuntu/precise/udev/ubuntu$ uptime
 16:12:06 up 1 day, 12:43, 0 users, load average: 0.00, 0.01, 0.05

However, please note that I only use Ubuntu server editions and have never used desktop versions.

My suggestion for further investigation on your system is to use the "top" command and observe what is using system resources. For example, and from your attachment, look at the resource use of "compiz", "/usr/lib/unity/unity-panel-service", "/usr/lib/indicator-appmenu/hud-service".

As a side note: Since the release of 12.04 my web site, with my web notes on this subject, has had a notable increase in search query traffic with search parameters such as "high load average on Ubuntu 12.04". I think this might be because reported load averages are now fixed and perhaps people got used to them being broken and reported too low.

Doug Smythies (dsmythies) wrote :

Addendum (to post #53): I see the errors and omissions in my previous reply, in that more time had passed between "ps waux" commands than I thought and I was glossing over the "vmstat" results. I'll see if I can create a scenario similar to yours, for context switch rates and interrupt rates with very high idle percentage, and report back (might take hours, might take days).

Doug Smythies (dsmythies) wrote :

O.K., so much for my due diligence testing speech above.

I have created a scenario similar to what Olivier reported above, and yes indeed, I am also seeing reported load averages that are very much too high (Oh, darn).

At the moment I am not set up for compiling the kernel, but I can be in a short time. I would like to verify proper reported load averages, under these conditions, with the final kernel.org patch taken out. For my own benefit, and because I understood it better, I will then try my version of the patch, which is different from what came back from Peter Z. at kernel.org. (My version did not account for a certain code path in sched.c because my computer never took that path.)

Currently, I have hacked up my test program to obtain these results. If I get around to fixing it up, I'll post it here.
In a minute I will post my test results as a text file.

Doug Smythies (dsmythies) wrote :

see also bug 985661 ( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/985661 )

I made a kernel with the patch taken out and verified a load average of basically 0 (as we already know, without the patch load averages are always at or near 0 under these conditions, no matter what the real load is).
I also verified that my patch version 2 behaves the same (version 2 is the one I escalated to kernel.org, but it got changed to cover a code path condition that my computers never take).

I went back even further to my patch version 1 (see posting 23, way above), and while it does better under these low load but high frequency enter/exit idle conditions, it still isn't very good (I'll try to post a graph tomorrow).

The root issue/challenge is still one of signal aliasing with the 10 tick grace period stuff that allows for catch-up due to long idle periods. Under the current structure of this part of the sched.c code, I am not sure that a solution to cover all conditions is possible in a tickless kernel. The code is also extremely difficult to follow. I do not understand why the code treats going into idle differently from exiting idle. To my way of thinking, the code should be completely symmetric in that regard. Anyway, I will continue to work on it.
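
The aliasing effect itself can be shown with a toy model, independent of the kernel code. The Python sketch below assumes a single task that runs and sleeps on a fixed period and a sampler that looks at it at fixed intervals; all names and numbers are hypothetical, and it does not model the idle fold or the 10 tick grace period.

```python
def sampled_fraction(run_ms, sleep_ms, sample_ms, offset_ms, samples=1000):
    """Fraction of sampling instants at which the periodic task is running.

    The task runs for run_ms, then sleeps for sleep_ms, repeating forever.
    The sampler looks at times offset_ms, offset_ms + sample_ms, and so on.
    """
    period = run_ms + sleep_ms
    hits = 0
    for k in range(samples):
        t = offset_ms + k * sample_ms
        if t % period < run_ms:        # inside the running part of the cycle
            hits += 1
    return hits / samples


# Task really runs 90% of the time: 9 ms running, 1 ms sleeping.
print(sampled_fraction(9, 1, 5000, 0))   # interval is a multiple of the period
print(sampled_fraction(9, 1, 5000, 9))   # same interval, different phase
print(sampled_fraction(9, 1, 5001, 0))   # interval not locked to the period
```

When the sampling interval is an exact multiple of the task's period, every sample lands on the same phase, so the result is either 1.0 or 0.0 depending on the offset, although the task really runs 90% of the time. An interval that shares no common factor with the period sweeps through all phases and recovers 0.9.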

Doug Smythies (dsmythies) wrote :

See this file added to the bug 985661 mentioned above:

https://launchpadlibrarian.net/104693152/commit_low_load.png

Myself, I still prefer the patch for overall better load average reporting, but then I don't use desktop versions of ubuntu.
Note: I also sent the same information upstream.

Doug Smythies (dsmythies) wrote :

See this updated file added to bug 985661

https://launchpadlibrarian.net/105809696/commit_low_load_rev2.png

The same update was also sent upstream to one of the mail lists about propagating the patch to other kernels.
