Ubuntu

High load averages on Lucid while idling

Reported by Rod on 2010-05-04
This bug affects 48 people
Affects | Status | Importance | Assigned to | Milestone
Linux | New | Undecided | Unassigned |
Pantheon | Confirmed | Medium | Unassigned |
Ubuntu on EC2 | | Undecided | Unassigned |
linux-ec2 (Ubuntu) | | Undecided | John Johansen |
linux-meta (Ubuntu) | | Undecided | Unassigned |

Bug Description

SRU Justification:

    Impact:
    Fixes loadavg reporting on EC2.

    Fix:
    This reverts commit 0d843425672f4d2dc99b9004409aae503ef4d39f, which fixed a bug in load
    accounting when a tickless (no idle HZ) kernel is used. However, the Xen patchset used
    on EC2 is not tickless, yet the accounting modifications are still applied, resulting
    in phantom load.

    Testcase:
    Start any Ubuntu Lucid based instance on EC2, let it idle while logging the load average.
         while true ; do cat /proc/loadavg >>load.log ; sleep 5 ; done
    Alternately simply run top or htop and monitor the load average.

    Without the revert, the reported load varies from 0 up to about 0.5 on a clean image
    with no extra tasks launched.

    With the revert, the load stays steady around 0, with only an occasional small bump
    when a background task runs.
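Since the testcase above only logs raw samples, here is a minimal sketch of how a resulting load.log could be summarized afterwards. The awk logic assumes standard /proc/loadavg format (field 1 is the 1-minute average); the sample lines written below are fabricated purely for illustration.

```shell
# Fabricated sample data in /proc/loadavg format: "1min 5min 15min run/total lastpid".
printf '0.10 0.05 0.01 1/100 200\n0.45 0.20 0.08 1/101 201\n0.02 0.10 0.03 1/99 199\n' > load.log

# Report sample count, peak, and mean of the 1-minute load average.
awk '{ if ($1 > max) max = $1; sum += $1 }
     END { printf "samples=%d max=%.2f avg=%.2f\n", NR, max, sum/NR }' load.log
```

With the revert in place, the peak reported by such a summary should stay near 0 on an idle instance.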

ami-2d4aa444
Description: Ubuntu 10.04 LTS
Linux domU-XX-XX-XX-XX-XX-XX 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 04:14:01 UTC 2010 i686 GNU/Linux

Description copied (and edited) from post at http://groups.google.com/group/ec2ubuntu/browse_thread/thread/4be26e81b7c597bc
Posted as a bug here as I'm not the only one experiencing these issues, see very similar post at http://groups.google.com/group/ec2ubuntu/browse_thread/thread/a7e9bc45cf923f8c

----------------------------------

I've been running a customised version of an Intrepid image by Eric Hammond for a long while now and decided it was time to upgrade so I've configured a fresh image based on the official Lucid 32-bit in us-east (ami-2d4aa444). And I'm having some strange issues.

I run on a c1.medium instance and normally expect a load average of between 0.2 and 0.6, roughly averaged throughout the day, with spikes usually no more than about 2.0. So it's fairly relaxed. When all my services are shut down the load averages go down to ~0.0.

Now I'm on Lucid I'm getting load averages roughly 10 times higher than I expect, hovering between around 1.8 and 2.5, and I can see no reason why it should be reported this high. There are no processes hogging CPU, just some occasionally coming in and out; watching 'top' doesn't reveal anything obvious, and it just looks like a major disconnect between the activity and the load averages. I can't catch any processes running uninterruptible [ ps auxw | awk '{if ($8 == "D") print }' ].
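A slightly more readable sketch of that D-state check, using ps's state column (standard ps options; the filter logic is the same as the awk one-liner above):

```shell
# List tasks in uninterruptible sleep (state D). Such tasks count toward
# the load average even while using no CPU, so a persistent D-state task
# is one classic source of real (non-phantom) load. Output is empty on a
# healthy idle system.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print }'
```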

If I run my custom image without any of my services running, load averages hover between approximately 0.1 and 0.6, nothing like the ~0.0 I used to get with nothing happening; I can't see any reason for it moving but it goes up and down, apparently at random. I've tried the same thing on a fresh instance of ami-2d4aa444 and it does roughly the same thing, so it doesn't seem to be anything I've done on top of the base image.

When I start my services it shoots up to the ~2.0 levels, even though they don't do much work, although they do take up a fair bit of memory. I've tried swapping to a new instance but it's the same.

The main applications run on this server are Apache, MySQL and a bunch of separate Tomcat (Java) instances. I have a number of EBS volumes mounted, a combination of ext3 and XFS.

Here's a [ top -bn1 | head -20 ] that's taken at random. 'java' and 'mysql' come in and out of the top of the list but never stay for very long.

top - 20:55:35 up 6:47, 3 users, load average: 2.33, 2.35, 2.31
Tasks: 137 total, 1 running, 134 sleeping, 2 stopped, 0 zombie
Cpu(s): 5.1%us, 0.5%sy, 0.2%ni, 93.5%id, 0.3%wa, 0.0%hi, 0.0%si, 0.5%st
Mem: 1781976k total, 1684628k used, 97348k free, 29108k buffers
Swap: 917496k total, 26628k used, 890868k free, 660448k cached
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    1 root 20 0 2804 1476 1204 S 0 0.1 0:00.13 init
    2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
    3 root RT 0 0 0 0 S 0 0.0 0:00.01 migration/0
    4 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0 0.0 0:00.01 events/0
    7 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0 0.0 0:00.00 netns
   10 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0 0.0 0:00.00 xenwatch
   12 root 20 0 0 0 0 S 0 0.0 0:00.00 xenbus
   14 root RT 0 0 0 0 S 0 0.0 0:00.03 migration/1

... looks like a system doing not much, except for those numbers at the top right.

Are these new kernels doing something different to calculate those averages now? The main thing I'd like to know is: are these numbers a true reflection of the load on my server or are they skewed or scaled somehow? I've got used to measuring the utilisation of my servers in the lower numbers, but now I have these large numbers I'm not sure what to make of it. The graphs of my load throughout the day look completely different to what they used to but the workload hasn't changed at all.

---------------------------

Having done a bit more playing, my current suspicion is that this is related to the amount of memory used by running applications. If I install mysql on a base system the load averages go up, and it's using ~140m; apparently the same thing happens if you install postgresql. I've tested on c1.medium and m1.small; the other reporting user is having the same issues on a 64-bit machine (ami-4b4ba522).

See posts at Google groups for more information and data.

Adam Nelson (adam-varud) wrote :

Here's my info for an EBS instance with Postgres on it. 2GB of memory are used out of 7.7GB available:

ubuntu@domU-12-31-36-00-39-C1:~$ uptime
 14:44:17 up 5 days, 16:26, 1 user, load average: 0.79, 1.09, 1.15
ubuntu@domU-12-31-36-00-39-C1:~$ iostat -k
Linux 2.6.32-305-ec2 (domU-12-31-36-00-39-C1) 05/04/2010 _x86_64_ (2 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
           0.01 0.00 0.01 0.03 0.01 99.94

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda1 0.90 0.30 6.91 147285 3396124
sdb 0.00 0.00 0.00 385 72

ubuntu@domU-12-31-36-00-39-C1:~$ uptime
 14:45:15 up 5 days, 16:27, 1 user, load average: 0.77, 1.03, 1.13
ubuntu@domU-12-31-36-00-39-C1:~$

Adam Nelson (adam-varud) wrote :

Ticket #575193 has information and has since been marked as a duplicate of this one.

Scott Moser (smoser) wrote :

I closed the "Ubuntu on EC2" task here as we are no longer using that project to track bugs. Instead, we're tracking EC2 issues the same as other Ubuntu bugs. The best way to open a bug is with 'ubuntu-bug' on the instance. That will set the appropriate flags and collect additional information about the system. Thank you.

Changed in ubuntu-on-ec2:
status: New → Invalid
tags: added: ec2-images lucid
John Johansen (jjohansen) wrote :

I have been able to replicate this, and actually watch the load averages fluctuate on an idle system. I have begun investigating and hope to have a solution soon.

Changed in linux-ec2 (Ubuntu):
status: New → Confirmed
status: Confirmed → In Progress
assignee: nobody → John Johansen (jjohansen)
Josh Koenig (joshkoenig) wrote :

Adding to PANTHEON since we were separately tracking here:

https://bugs.edge.launchpad.net/pantheon/+bug/588564

We're not kernel hackers, but we can provide smart feedback and testing. Let us know if we can help.

Changed in pantheon:
importance: Undecided → Medium
Greg Coit (gregcoit) on 2010-06-22
Changed in pantheon:
status: New → Triaged
Matthew Gregg (mcg) wrote :

I'm seeing something similar. We've migrated from Intrepid to Lucid instances, and processes that generated very low load under Intrepid are causing very high load on Lucid (real load; the system becomes unresponsive). The load spikes are seemingly random, with no correlation to other activity on the system, I/O, etc. We did not see variation like this with Intrepid instances. These are m2.2xlarge instances.

Matthew Gregg (mcg) wrote :

Related or not to this bug, I've had to revert to Karmic on EC2 due to "mystery" load on Lucid (it looks disk-I/O related, but I can't see it with vmstat/iostat/etc.). Having reverted, the load is gone.

John Johansen (jjohansen) wrote :

Matthew, is the "mystery" disk I/O load you mention in #7 the same as what you describe in #6? In #6, what else are you running on the machine? Do you have additional instances running? Also, what filesystems are you using on the host and instance?

In my initial round of testing this was looking more like a load average calculation bug, where load from the host system was showing up in the guest's reported figures.

Matthew Gregg (mcg) wrote :

The load issues from #6 and #7 look the same. We're running ext3 on all of them (Intrepid/Karmic/Lucid). We would most often see hangs and load shooting up during code builds, but not always; sometimes builds would proceed normally. We also saw random load/hangs during normal non-busy server activity. My first thought was bad instances or over-utilized instances; "steal time" was minimal, however, and we spawned many instances to rule that out. Our remaining Intrepid instances and new Karmic instances do not show this behavior.

Alex Howells (howells) wrote :

Adding a link to linux-image as this does not just affect EC2.

I have a decent number of HP ProLiant systems and about 90% are exhibiting the exact same problems. All of the systems affected have a load average between 0.7 and 1.1 just a short while after being rebooted, and are 100% idle.

Additionally I am seeing high memory usage when idle: on a box with 8GB RAM, 'free -m' shows 700MB as used. Unfortunately this multiplies up (though not linearly), so a box with 64GB RAM has 3-4GB used almost straight after reboot.

I should note that these are fresh installs, with no services added, just OpenSSH running essentially.

Attached file containing output from some diagnostic commands.

Alex Howells (howells) wrote :

Wakeups-from-idle per second : 184.6 interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
  28.1% (206.7) [kernel scheduler] Load balancing tick
  27.2% (200.2) [kernel core] add_timer (smi_timeout)

I just thought I'd say I don't think this is the root cause of the problem, contrary to what some folks on bugs and mailing lists have deduced thus far. I have two completely identical HP ProLiant BL460c G5 systems, and one exhibits the problem while the other does not. Running powertop on each shows almost identical output (see above), yet one has a load average of 0.92 doing pretty much nothing whilst the other has a zero load average.

Alex Howells (howells) wrote :

agh@thunder:~$ sudo dpkg -i linux-image-2.6.31-22-server_2.6.31-22.60_amd64.deb

Manually installing the latest kernel from Karmic Koala also acts as a fix for the problems, which is pretty much a smoking gun pointing at the kernel shipped with Lucid Lynx, in my opinion!

Things which were resolved:

    1) Memory usage is now over 500MB lighter according to 'free -m'
    2) Load average no longer idles between 0.7 - 1.1 on the box

Number of wakeups per second is pretty much identical, as is the top cause for wakeups, namely the scheduler.

I've uploaded another archive containing more diagnostic output, sourced using the same commands except this time running a Karmic Koala kernel whilst keeping a Lucid Lynx userland. Perhaps this will be of use to someone?

If someone needs help diagnosing the problem I can replicate it easily, am happy to assist, and within reason have a fairly large pool of different hardware (albeit all HP ProLiant) which can be quickly spun up for testing.

Alex Howells (howells) wrote :

For reference the systems used for testing ('thunder', 'lightning', 'aurora') are all HP ProLiant BL460c G5.

agh@thunder:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5450 @ 3.00GHz
stepping : 6
cpu MHz : 3000.366
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 6000.73
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:

Rest of the output snipped for brevity. I've spun up some additional hosts with Ubuntu 10.04 on HP ProLiant BL495c G5 hardware and am unable to reproduce the issue on the 3-4 of those readily available to me; memory usage upon 'first boot' seems abnormally high, but this is the case on both Karmic and Lucid kernels.

Linux ferret 2.6.32-21-server #32-Ubuntu SMP Fri Apr 16 09:17:34 UTC 2010 x86_64 GNU/Linux

agh@ferret:~$ free -m
             total used free shared buffers cached
Mem: 64560 1275 63285 0 0 36
-/+ buffers/cache: 1237 63322
Swap: 7629 0 7629

Linux ferret 2.6.31-22-server #60-Ubuntu SMP Thu May 27 03:42:09 UTC 2010 x86_64 GNU/Linux

agh@ferret:~$ free -m
             total used free shared buffers cached
Mem: 64561 1313 63247 0 0 37
-/+ buffers/cache: 1275 63285
Swap: 7629 0 7629

Output from /proc/cpuinfo on that beefier box is as follows, again snipped for brevity:

processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : Quad-Core AMD Opteron(tm) Processor 2384
stepping : 2
cpu MHz : 2699.504
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 5398.98
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Also possibly of interest, the AMD boxes have somewhat fewer wakeups per second, although this is s...


summary: - High load averages on Lucid EC2 while idling
+ High load averages on Lucid while idling
Russell Branca (chewbranca) wrote :

I'm running into the same issues on the Lucid 64bit EBS AMI, except with much higher load:

$ uptime
 18:06:22 up 18:31, 2 users, load average: 12.16, 13.54, 13.78

$ top -bn1 | head -20
top - 18:06:52 up 18:32, 2 users, load average: 11.53, 13.19, 13.65
Tasks: 108 total, 1 running, 107 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.4%us, 0.9%sy, 0.0%ni, 92.2%id, 0.2%wa, 0.0%hi, 0.0%si, 2.4%st
Mem: 7864548k total, 3910252k used, 3954296k free, 194372k buffers
Swap: 0k total, 0k used, 0k free, 461708k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    1 root 20 0 23704 1896 1268 S 0 0.0 0:00.15 init
    2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
    3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
    5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    6 root 20 0 0 0 0 S 0 0.0 0:00.00 events/0
    7 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset
    8 root 20 0 0 0 0 S 0 0.0 0:00.00 khelper
    9 root 20 0 0 0 0 S 0 0.0 0:00.00 netns
   10 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr
   11 root 20 0 0 0 0 S 0 0.0 0:00.00 xenwatch
   12 root 20 0 0 0 0 S 0 0.0 0:00.00 xenbus
   14 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1

$ iostat -k
Linux 2.6.32-305-ec2 (ip-10-196-38-176) 07/09/2010 _x86_64_ (2 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
           4.35 0.00 0.94 0.16 2.36 92.20

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda1 3.10 3.79 15.36 252813 1025840
sdb 0.00 0.01 0.00 ...


Alex Howells (howells) wrote :

Another bug over here is receiving more attention, and seems potentially linked --

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/524281

At the moment the only solution I've got is to revert to a Karmic Koala kernel, which is hardly ideal. Is there a reason this bug isn't receiving any attention from the kernel team?

If further information is required, as noted previously, I am more than happy to assist with diagnosis and reproduction.

Al Sutton (al-sutton) wrote :

This is getting silly. I currently have 2 Amazon EC2 instances running 10.04 LTS, both of which sit at around 1.0 load and can go to 4, and that's only serving a few users. They were "upgraded" from 8.04 LTS where the load rarely went above 0.5.

I'm guessing as this hasn't been resolved for nearly 3 months with an official patch the only solution we have is to switch to another distribution.

John Johansen (jjohansen) wrote :

The bug isn't being ignored, but you are correct that there is no official patch or solution to the problem yet.

I am currently looking at building test kernels, which will include the scheduler patch referenced in
    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/524281

This will let us determine what impact that patch has on this bug. I will make these kernels available
for public testing when they are ready.

Alex Howells (howells) wrote :

Please note, as mentioned, I see exactly the same scheduler behaviour on my hardware with Karmic Koala - downgrading from the Lucid Lynx kernel fixes all my other issues though. Thus I'm not sure this is the actual problem.

Al Sutton (al-sutton) wrote :

For reference; I've just updated to linux-image-2.6.32-308-ec2 and the issue still persists.

Chris (nakota07) wrote :

I too have problems with high load averages on "unloaded" systems. For a while I thought this was due to my running Ubuntu Lucid under Virtualbox. One of my hosts (violet) is a Virtualbox VM running 2.6.32-24-generic #38-Ubuntu SMP. It has Firefox 3.6, Pidgin and a terminal window open.

top - 22:13:25 up 2 days, 8:29, 2 users, load average: 1.87, 1.24, 1.07
Tasks: 142 total, 1 running, 141 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

It has been like this, unused, for over an hour. 'sar -P' shows that the lowest load average was 0.45 and the highest was 2.0 during the last 22 hours. The bulk of the time it was at a load avg of > 1.04.

Running iotop on violet I can see that the disk reads/writes are very small (iotop updates every other second, and only every third update shows activity, at the sub-100kB level). There is little to no network activity, and there is still free RAM left over for disk cache. iotop would be of more use if the maintainers of the Ubuntu kernel would re-enable CONFIG_TASK_DELAY_ACCT, but apparently that costs more lost cycles than enabling SMP in non-SMP environments does.

I thought for a while it may have been some sort of AMD-64 bug, but I also get this issue on my 64-bit Core 2 Duo system at work. (If I retrograde the 10.04 release to a 2.6.31-XX-generic kernel, where XX <= 19, the problem is reduced or becomes less pronounced.) My work host is running Ubuntu 10.04 with the 2.6.32-2X-generic kernel, and it went ballistic when I plugged in a new 4GB thumb drive and did a dd if=/dev/zero of=/path/to/thumb/drive/foo bs=1020 count=1M to test the speed of the device. The load average went above 10, though it was only a little sluggish when switching to the other VMs.
My home host is on 9.10 w/ Linux 2.6.31-20-generic #58-Ubuntu SMP, and my primary Linux host is Ubuntu 9.10 w/ Linux 2.6.31-17-generic; both run with little or no issues related to load average. If I use the default kernel with the Lucid release in a VM on either AMD or Core 2, the guest is nearly unusable due to the load averages going above 10 and causing issues (core dumps and all). I don't know the cause of the high load average, since swap and disk I/O were low to non-existent, and there were no conflicting VMs running, etc. I will try this out on my work laptop (32-bit Core 2).

Any idea when the issue might be resolved? Is there a Linux "distribution" where this is not an issue? Is this a kernel issue or something else in the release? I see that a few have complained about this issue. I see it has been an issue for some time. Is this currently a very low priority? Are there plans to extend support for 8.04 for an additional 4 months since this problem has not been resolved yet? I can imagine that this would be of great interest to those planning moves from 8.04 LTS.

flaccid (chris-xhost) wrote :

LTS only really applies to people that pay Canonical for support i.e. they only support LTS.
Ubuntu also does not release bug fixes in package updates during a release.
This basically means that LTS is worthless to the community and quite a big misnomer.

John Johansen (jjohansen) wrote :

Alex, I didn't expect it would make much of a difference either, but was willing to try it if for no other reason than to rule it out.

Al, the issue will persist until at least the next point release.

Al Sutton (al-sutton) wrote :

John; Just so I'm clear, are you saying it's not going to be fixed until 10.04.1 LTS (i.e. a few weeks away), or are we looking at several months until 10.04.2?

On Tue, 27 Jul 2010, Al Sutton wrote:

> John; Just so I'm clear, are you saying it's not going to be fixed until
> 10.04.1 LTS (i.e. a few weeks away), or are we looking at several months
> until 10.04.2?

10.04.1 is weeks away. It is not likely that a fix for this would make
the cutoff for 10.04.1. That said, our images on EC2 follow a different
release policy, which is described at
https://wiki.ubuntu.com/UEC/Images/RefreshPolicy . A tested resolution to
this issue would definitely warrant a refresh.

John is working on trying to get this fixed. We believe, at this point,
that it is simply an accounting/reporting problem.

Al Sutton (al-sutton) wrote :

Thanks for the update. The performance of the EC2 10.04 images is noticeably slower than the 8.04 LTS ones we were running, so I'm pretty sure it's not just a case of the load being incorrectly reported, which is why I've been actively monitoring this issue.

Given what you've said it seems unlikely that it will be resolved with an official stable patch any time soon, so, although I am grateful for your honesty, I'm going to run up some other distributions which support EC2 and have the relevant versions of the packages we need so we can compare performance and possibly switch.

Alexandre Bourget (wackysalut) wrote :

I have the same problems. My load increases (to some really scary values, even as high as 45), the number of processes increases too, and at some point my MySQL server is killed (but not always). I don't think it's just a reporting problem.

Alex Howells (howells) wrote :

That sounds unrelated to this bug and may be an issue with your software workload. I'm not seeing any fluctuation in memory usage beyond the fact that it uses 15-20x more than a Karmic Koala kernel upon initial boot-up.

Are you able to post diagnostic output showing this isn't your workload causing the issue?

Matthew Gregg (mcg) wrote :

I unfortunately do not have the cycles currently to properly help debug this, but can only say we had significant load and performance issues with Lucid, that we did not have with Intrepid, nor do we have with Karmic(that we had to fall back to).

Chris (nakota07) wrote :

> John is working on trying to get this fixed. We believe, at this point, that it is simply an accounting/reporting problem.

In a word: NO.

With 10.04 as the base OS on the host and in a Virtualbox VM, the load average issues are tantamount to making the distribution unusable as shipped, and only marginally usable with the latest kernel updates.

Before the release of 10.04, my work environment was: host running 8.04 and guests running 9.10, 9.04 and Win2K3. My primary guest, where I did my work, had 384MB of RAM and ran Firefox, Thunderbird, several bash sessions in screen, Pidgin, and a custom Java application that was/is memory hungry. Things ran fine.

I upgraded my host to 10.04 and made new guests running 10.04. I configured everything the same and did fresh installs and configurations of the software. In order to do work I had to increase the memory to 768MB on my guests; that helped but did not solve the issue. I cannot open Wireshark captures without it killing some other X application. The load average will randomly increase to 10, 20, or even 50, and things slow down so much the system becomes unusable (yet top/iotop/iftop show little to no processes that are causing the issue) when I'm doing telephone support. I kept retrograding kernel levels in order to obtain usable systems. Kernels at or below 2.6.31-18 perform best, but not at the 9.10 level.

My home system has 9.10 as the base OS with several guests at 9.04, 9.10, and 10.04. The 10.04 system (violet) has all of the patches and 2.6.32-24. On Violet [think resistor colours], as of this writing (20:00), sar shows that from 00:10:01 to 20:00:01 the lowest lavg1 is 0.40 and the highest is 1.95. Only 50 of the 125 sar samples are below 1.00, and only 26 are below 0.70. This system has been unused since 23:20 the previous night. It is running only Firefox, two shells in two tabs of a gnome terminal, and Pidgin. No cron jobs running, no SQL servers running, etc. With the 2.6.32-24 kernel, at a load average of 1.20 the system is tolerable but still has sluggish moments as I write this. This does not describe a system where "accounting errors" are happening.

So: NO! It is not 'just accounting errors'. My 9.xx systems are running fine as guest VMs in virtualbox.

Greg Coit (gregcoit) on 2010-07-28
Changed in pantheon:
importance: Medium → Critical
Greg Coit (gregcoit) on 2010-07-28
Changed in pantheon:
status: Triaged → Confirmed
Chris (nakota07) wrote :

Not that this will do much to track down the issue, but I have some more case history.

Today I was tidying up some vdi disks on my host system (work system running 10.04 w/ 2.6.31-17 on a Core 2 Duo E6750 2.66GHz with 4GB RAM). A disk-to-disk (two physical disks) copy of a 7.5GB vdi caused the load average to climb to about 8 today. An rzip of the vdi image caused the load average to climb to about 6. It slowed down the primary system and the virtual boxes. The systems that shared the spindle where the data was being compressed were worst of all (obviously). All systems were sluggish but not unusable.

So I brought the vdi image home. My home system (black) is an AMD Athlon(tm) II X2 250 Processor at 3GHz with 4GB of RAM and two 250GB (hardware) mirrored disks, where Ubuntu 9.10 w/ kernel 2.6.31-20 lives. During the entire time that rzip was running, the load average never went above 1.7. When it finished, the load sank very fast to 0.14, whereas on my work system, after a peak, it takes about 3 min to reach a "normal" level of load of at least . I know this is kind of like comparing apples to hammers. I don't have a "pure" 10.04 system at work or at home that has enough disk space free to uncompress a 7GB image without doing some work to create a new disk and attach it to an existing system.

So I did the next best thing I could come up with. On Violet (a guest on black), my 10.04 system that has not had the kernel retrograded, I did a tar of a folder containing 1.9GB of PDFs, images, mp3s, text, etc. Almost instantly violet went from 1.03 to 3.83. Firefox had a hard time keeping up with my typing and right-clicking for spelling checks. It took from 21:34 to 21:38 (while I was doing nothing but watching top) for the loadavg1 to drop from 3.72 to 1.08. I then did an rzip on the tar file. At 21:40, with a loadavg1 of 1.29, I started the test. It is with great interest that I only saw the loadavg1 go to about 4.33 (max), but the system was far more sluggish (a right click meant a 4-5 second delay) and top would freeze up for several updates (2s/update) and then rush to catch up. It stopped zipping at 22:01. Watching top, I saw that most of the time %us was between 40 and 70 and %sy was about 5-15. The amount of time needed to rzip the smaller 2GB of data was much longer (many minutes) than it took to un-rzip 7GB of data. I know that zip and unzip are not symmetrical, but 20 minutes versus about 5, while dealing with only about 28% of the data? It just doesn't seem right.

Chris (nakota07) wrote :

I should add: note the change in kernel for my work host from my post on 7/27 to the one on 7/29. There is an improvement from retrograding kernels, but it is still not like the 9.10 release.

Al Sutton (al-sutton) wrote :

Can we have an update to confirm that;

a) It's been agreed it's not "simply an accounting/reporting problem.", and is something that impacts performance

and

b) Details of what's being done to fix it

So people, like myself, who are waiting for this to be resolved in order to switch back to 10.04 LTS on EC2, can get an idea of whether it's worth putting resources into training staff on a different distribution, or if it's worth holding off for a fix in the near future.

Scott Moser (smoser) wrote :

On Mon, 2 Aug 2010, Al Sutton wrote:

> Can we have an update to confirm that;
>
> a) It's been agreed it's not "simply an accounting/reporting problem.",
> and is something that impacts performance

The ec2 issue definitely appears to not be simply accounting. Many of the
comments on this bug are not related to the ec2 issue.

> b) Details of what's being done to fix it

John Johansen is working on this bug. He has not yet determined what the
problem exactly is. However, we do know that
 i.) running a 9.10 user space on a 10.04 (lucid) kernel exhibits the problem
 ii.) running 10.04 user space on a 9.10 kernel does not exhibit the problem.

> So people, like myself, who are waiting for this to be resolved in order
> to switch back to 10.04 LTS on EC2, can get an idea of whether it's
> worth putting resources into training staff on a different distribution,
> or if it's worth holding off for a fix in the near future.

I really wish that I could tell you what to do, and that we had a fix. We
are working on it and recognize it as a serious issue.

Personally, I'd like for you not to go looking for another distribution.

Al Sutton (al-sutton) wrote :

Scott, thanks for the update. As always, I appreciate an open and honest answer.

I can understand that you can't predict how long it will take to fix the bug, but similarly we can't tell how long customers will continue to put up with the degraded service that is the best we can offer under 10.04 LTS; hence the move away.

I hope this does get resolved soon, because, at least for us, this really makes us doubt the commitment of the Ubuntu team to the EC2 version, which is a shame as we're more than happy with Ubuntu in every other respect (i.e. package availability, support lifecycle, etc.).

(On the EC2 point: Alex Howells removed the "EC2" part of the bug summary between #13 & #14; maybe it should be put back in to ensure this stays focused on the EC2 issue, and any other issue which may look related but isn't gets filed elsewhere. What do you think?)

SirFrankie (junkert-ferenc) wrote :

Same issue here, but maybe some interesting additional details.

My notebook runs 10.04 installed from CD (07.01), updated daily.
Configuration: Dell Latitude E5400, 4GB RAM, 2.5GHz CPU, 250GB HDD, 2 ext4 filesystems; it runs sweetly without any high load issue...

This notebook has a few Virtualbox guests installed (WinXP SP3, and one Lucid); I use the Lucid guest as a separate internet environment.
The Lucid Virtualbox guest configuration (installed 28/07 and updated as of 29/07):
- 1 CPU, 1 GB RAM (IO APIC, HW timer, absolute pointing device), PAE/NX, VT-x and memory virtualization enabled
- 64 MB video RAM, 1 screen, without any 3D or 2D acceleration; RDP enabled
- SATA-controlled .vdi disk (8 GB, ext4 fs on 2 partitions, plus 512 MB swap) and IDE CD-ROM controller
- Pulseaudio ICH AC97 enabled
- no network and no serial
- USB enabled with USB 2.0 EHCI (1 USB device: a HUAWEI HSDPA modem)
- no shared folders
So, summa summarum, I see the issue on this guest machine. I am currently happy with this state, and if I can help, just say...
regards,
SirFrankie

Chris (nakota07) wrote :

IMHO, if the problem is exhibited without the EC2 code, then the EC2 code is just that much less code to interact with while debugging this issue. The act of 'adding it back in' is a distraction (more or less) at this point, and effort should be placed on tracking down the high load average issue. Once that is found, adding the EC2 code back in can be tested in the same QA cycle.

James Jonas (jamesjonas) wrote :

Hack and / - Linux Troubleshooting, Part I: High Load
by Kyle Rankin
Linux Journal

http://www.linuxjournal.com/magazine/hack-and-linux-troubleshooting-part-i-high-load

Just another resource for how to diagnose a high load on linux.

James Jonas

Chris (nakota07) wrote :

Jonas thank you for that information.

I am aware of the resources to track down high load average. I have iftop/iotop/top/vmstat/iostat and sar installed. The article fails to mention vmstat and SAR at all. Iftop/top/iotop are useful only in that they all report that nothing is "hogging" the resources. At one point it was even suggested, in another thread, that there was some sort of 'scheduler' issue and that running iftop would reduce the load average. I have not found that to be true.

FYI: I have been tracking down rouge processes since AIX 3.2.3, Solaris 2.4 and Slackware running the 0.9x kernels. The issue is systemic and not point based. To reiterate: a VM running 10.04 (with all patches) had load averages of 0.70 or above for more than 80 percent of the day. The only things running on the host were two shells in two tabs of a gnome terminal and Firefox with no java/js running (noscript blocked) and pidgin. Sar did/does not show any processes being stuck. Since no disk I/O, or network I/O would have been operating there was no reason for the load average to be above 0.01. A systemic issue indicates a Kernel issue (particularly since SAR does not report any unusual disk or I/O issues. I am sorry if I gave the impression that I was new at this.

flaccid (chris-xhost) wrote :

Has anyone looked at Ubuntu's code where the actual load average is calculated?
If so where is this? Perhaps we should start from there, see http://en.wikipedia.org/wiki/Load_(computing)#Unix-style_load_calculation

Testing is great, but we have already verified the incorrect averages.
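For anyone who wants to start from the calculation itself: the Unix-style load average described on that Wikipedia page is an exponentially damped moving average of the run-queue length, which the kernel maintains in fixed-point arithmetic (calc_load() in kernel/sched.c in this era). A minimal awk sketch of one consequence of the formula; the variable names and the 5-second sampling interval are illustrative, not the kernel's exact constants:

```shell
# Simulate ten minutes of 5-second samples on a truly idle system.
# With zero runnable tasks, each step can only multiply the previous
# load by the decay factor, so the average must fall toward zero.
awk 'BEGIN {
    interval = 5; window = 60              # 60s window = the 1-minute figure
    decay = exp(-interval / window)
    load = 1.0                             # pretend we start at load 1.0
    for (i = 0; i < 120; i++)              # 120 samples = 10 minutes
        load = load * decay + 0 * (1 - decay)   # 0 = idle run queue
    printf "%.5f\n", load
}'
```

So a sustained non-zero load average on an idle system means the sampled task count is being miscounted (as with the tickless accounting change discussed in this bug), not that the damping formula itself is wrong.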

Chris (nakota07) wrote :

Load average is just a number. Aside from your "accounting errors" there are fundamental performance issues; how they relate I do not know. None of the "usual suspects" except for load average show up as issues (disk I/O, network I/O, memory swapping, etc.) for me to troubleshoot (or kill -9).

When my AMD Athlon(tm) II X2 250 system with 4GB of RAM and mirrored SATA drives is running a guest VM of 10.04 (512MB), and that guest 10.04 system has less performance than my Sparc 10 (128MB) did in 1999, I would call that odd. My other guests running 9.04 and 9.10 are running fine. (Stop me if I've written this before.)

If it is pure accounting errors, tell me what I should tune in my system to make the performance better and I will check it out. Once my system is running fine, I can ignore load average as I would a broken indicator on an instrument panel. As of right now I have no data to even go on to "tune" my system, other than load averages going "wonky" for no apparent reason on idle systems.

Seriously? You think all of us are having a conniption over a number rather than performance? I am getting the impression that the people from Ubuntu think this is not an issue, or that we are "dreaming" about our performance issues. I think I have documented my issues reasonably well considering I am using "live" systems rather than test VMs. I didn't do an FDA-approved 'blue book' series of tests like I have in other jobs I have had, but I did the best I could given the fact I have a day job to attend to.

Josh Koenig (joshkoenig) on 2010-08-05
Changed in pantheon:
importance: Critical → Medium
description: updated
Matthew Gregg (mcg) wrote :

Lucid is still unusable on EC2 due to some real load issue. I still do not have time to properly help debug this, so for me to open other tickets won't be much help.

Al Sutton (al-sutton) wrote :

Just as a follow-on to my comment in #58.

It's unlikely I'll be able to test this as we've now migrated everything away to a different distribution, so unfortunately I won't get a chance to try any new kernels.

slenova (seancasey) wrote :

@John Johansen,

The new test kernel seems to have fixed the phantom load average bug for me. Here's two instances of my web server sitting idle:

#1) Ubuntu 10.04 32-bit AMI with default kernel:
up 1:22, 1 user, load average: 1.03, 1.02, 0.93

#2) Same AMI with new test kernel (aki-84b75ded):
up 18 min, 1 user, load average: 0.02, 0.03, 0.08

Does this mean that it's safe to ignore the high load averages I'm seeing on the default kernel for now until a new kernel is released?

John Johansen (jjohansen) wrote :

Matthew,
  What kind of workloads are you running? That may help us track down your issue.

Al,
  Sorry to hear that. Note that the code that causes the load average bug is now upstream, so you may run into it on any newer kernel if it has been built with CONFIG_NOHZ (tickless). We are still looking into the fix for tickless kernels, but it can be fixed by just not building the kernel with that option. Thanks for your input and best of luck.

slenova,
  If the test kernels fix this for you then yes it is safe to ignore the high load averages on the default kernel until a new kernel is released.
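Since the affected accounting is only compiled in with CONFIG_NOHZ, a quick way to check whether a given kernel was built tickless is to grep its build config. A minimal sketch, assuming the Ubuntu convention of shipping the config under /boot; other distros may put it elsewhere:

```shell
# Check whether the running kernel was built with CONFIG_NOHZ (tickless),
# the option John mentions above. /boot/config-$(uname -r) is where
# Ubuntu installs each kernel's build configuration.
config="/boot/config-$(uname -r)"
if grep -qs '^CONFIG_NOHZ=y' "$config"; then
    echo "tickless (NO_HZ) kernel"
else
    echo "not tickless, or no config file found"
fi
```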

Matthew Gregg (mcg) wrote :

When we switched to Lucid we had load averages go from 0.01 to 20+, and periods of time where the system would become unresponsive. Having moved back to Karmic with the same code base rebuilt on Karmic, load averages are back down. Our workload is heavily IO bound, both network and disk.

Rod (rod-vagg) wrote :

I've been putting up with the high load averages for a few months now on our production system. I've also been experiencing what I thought was an unrelated problem but I've come to suspect is tied up with this bug: every now and again the system would appear to lock up and become unresponsive but because I often keep an SSH session open to the server I can see that it's still running and load averages have spiked to over 50 and nothing can be killed and only simple processes can be started (ps). I have been able to just wait it out in the past and it fixes itself but because this is an important production system my best option is to force a restart (it usually responds to a 'reboot').
This happened every couple of weeks but recently it seems to have been happening more often. As far as I can recall this is new since Lucid, so I'm suspecting that it's related to this load reporting problem.
It's happened 3 times now in the last week and is becoming increasingly frustrating, so I've restarted this system with one of the test kernels posted here (aki-84b75ded). I can confirm that this has fixed the original load average bug and the system has been running for 24 hours with no appreciable problems. I can report back here if the same load spike problem happens again; if it does then I guess it's a new bug, but I wouldn't be confident pinning it on Lucid in particular. I guess if it doesn't show up again we can assume that (a) the originally reported problem caused wider problems and (b) the new kernels have fixed those problems.

Matthew (mdl-mlemieux) wrote :

I've been experiencing times of high load reported when the machine is otherwise idle. I've also been experiencing sudden spikes in load (40+) when the machine is doing minimal work (CPU 50% idle), few processes running.

Is anybody able to confirm whether or not the fix from 20100827 solves these problems? (http://uec-images.ubuntu.com/releases/lucid/release/).

My issues are also real problems (unresponsiveness, even in already open shells) and do not seem to be entirely due to just the way load is calculated...

Rod, in #67, your comment implies that the new kernel does indeed fix your real unresponsiveness issue.

Marius Seritan (mseritan) wrote :

I have some Chef recipes that install nginx, ejabberd, monit, postfix and some ruby applications on a 32 bit system. I ran them on the latest Lucid image and the load went back to 12. I moved to Karmic and the load is 0. Unfortunately I need to release so I will just stay with Karmic until Maverick.

Maybe this is just a 'number of bars' measuring fluke, although it does feel like the performance is variable (system sluggish at the terminal). One way or another I cannot release with a load of 12 under no activity, because I cannot evaluate and react to the load of the system once customers start to hit it.

I do not have problems with Lucid and I am very happily using it on my home computer or on some Linode servers - I do not know what virtualization they use.

Marius

Rod (rod-vagg) wrote :

I didn't mean to imply that the occasional unresponsiveness was fixed by using the new test kernel; it was more of a hope. That hope was dashed a few days ago: while I was not around we experienced another spike that lasted for 1h 10 minutes, during which the system was effectively externally unresponsive. Serverdensity managed to capture the spike and I'm attaching a screenshot of that to this comment (times are GMT+10:00).
I think these spikes generally occur for me on the hour and last roughly the same amount of time if left alone. I have an EBS snapshot job that runs on the hour which I've since moved to 10 mins past the hour to see if that's related--I generally get an error message from that job when spikes happen either about failing to get a response from the snapshot call or failing to get MySQL to lock properly pre-snapshot.
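For lining spikes up against hourly cron or snapshot jobs like Rod's, a timestamped variant of the loadavg logging loop from the bug description can help. This sketch is bounded to three samples for illustration; in practice you would run it under `while true` with a longer sleep, as in the original testcase:

```shell
# Log UTC-timestamped 1/5/15-minute load averages so spikes can be
# correlated with scheduled jobs. Bounded to 3 samples for demo purposes.
for i in 1 2 3; do
    printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
        "$(cut -d' ' -f1-3 /proc/loadavg)"
    sleep 1
done >> load.log
```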

Eric Hammond (esh) wrote :

Rod: I think you are probably experiencing something completely unrelated to the bug identified in this launchpad report. It might be worth moving the conversation to a different bug or a different venue. If you are using ec2-consistent-snapshot to initiate your EBS snapshots (freezing an XFS file system) please see if the info in the following thread helps: http://groups.google.com/group/ec2ubuntu/browse_thread/thread/3aea02f7842d73e4/d1bd59ade2dbd36f

Rod (rod-vagg) wrote :

Quite possible Eric, will try the explicit unfreeze and see if it makes a difference. Thanks for that!

The summary from me is that the underlying high load average bug has been properly fixed for me since using the new kernel. My only remaining problem is this load spike issue, if it turns out to be as simple as an XFS lock then I'll be a happy chappy.

Rod (rod-vagg) wrote :

No-go. I just had it happen again! This time it was at 45 mins past the hour so nowhere near my snapshot time. System was so unresponsive that it kicked me off my ssh session. I tried a forced instance restart and once I did that I could suddenly log in again even though it didn't restart; it was still pretty unresponsive though and I couldn't do anything meaningful on it. Eventually the restart happened but it didn't come back up properly so I just had to keep on trying to reconnect. I got back on again but it locked up again as I was remounting some EBS volumes. So another attempt at a restart... The problems all disappeared again almost 1 hour after it had begun.
It's frustrating trying to run a web service like this and having to come up with explanations for our customers when we disappear. And I have no idea where the blame lies for this! AWS? EBS? Lucid? Something else? Do I spend a day reverting back to Karmic? Do I throw away my reserved instance and set myself up in another availability zone or region?

Alexandre Bourget (wackysalut) wrote :

My problems are similar to those of Rod: straight unresponsiveness, including high loads at some intermittent moments.

If anyone is looking for a solution to such a problem, they'll probably be driven to this thread. So should we open another bug and restart the conversation?

Chris (nakota07) wrote :

After installing all of the most recent updates, and then fixing the botched patch rollout so that I could boot the system again, I still have wretched load averages. My host, violet 2.6.32-24-generic #42-Ubuntu SMP, is still experiencing load averages of > 1 after a reboot and waiting 5 min for it to settle. I am writing this with a load average of 1.14. Firefox is using 8% CPU, Xorg is using 3% CPU and top is using 1% CPU. About a half dozen GNOME applications are using < 0.7% CPU.

I cannot fathom that moi is the only one to be complaining yet. What is the bug ID for the unreasonable load averages on unused systems, so that we can add to that case now? OOPS! Wow, typing into the box here on the web page just threw the load average to 2.0. FF still peaks at 18% CPU and that is about it. So if this is the kernel that has the fixes in it, it may be more "accurately" reporting a load average; a load average that is unreasonable for the kind of load running on a box. If this were my old K5 box with 64MB of RAM I might be more inclined to give slack, but a 3GHz 64-bit host CPU? No.

Has something catastrophic happened in Ubuntu? I see from the Canonical page you lost a member of the team. I can understand that is a hard thing, particularly if he was well liked across the company, but really guys, pull it together. If that is the reason for these two kernel issues and the botched patch, it really doesn't look good on Ubuntu.

John Johansen (jjohansen) wrote :

Rod, Alexandre, Matthew: your problem may be at least partially related to Bug #585092, which is actually a little more generic than just umount.

Chris, something certainly isn't right with your system, but the fix from this bug isn't going to help you as it only applied to the -ec2 kernel. Outside of EC2 the high load average for an idle machine hasn't been generally reported as a problem. Can you please open a new bug so that we can gather more information and start tracking this problem down? From a terminal do
  ubuntu-bug linux

and follow the prompts. This will gather information from your machine that will help in debugging this issue.

Matthew Gregg (mcg) wrote :

@john that does seem to look like my issue. I had that behavior, but never saw any iowait.

asasoft (asasoft) wrote :

@John Johansen
"Outside of EC2 the high load average for an idle machine hasn't been generally reported as a problem"

John, I think this is the same as Bug #524281, already mentioned in comment 15.

James Turk (james-turk) wrote :

Just echoing what others are saying here. This bug is definitely not resolved, I reported one of the original bugs (#575193) and just got a chance to try out the new lucid AMIs.

After the machine was up for two days load spiraled out of control until we had to shut it down due to unresponsiveness.

Just out of curiosity we also tried a maverick test kernel from September 2nd, the box has been up (nothing running on it) and the load is now at 9.92 after about 24 hours.

We're stuck with karmic or switching away from Ubuntu, neither of which is a desirable option.

Alex Howells (howells) wrote :

There are a lot of users here echoing my sentiments.

I appreciate there is a way to file bugs and that Ubuntu addresses them per that policy. What you have is quite a significant number of users saying the Lucid Lynx kernel is flat out broken.

I specifically tagged this bug to 'linux' (more generically) and reported there were issues outside EC2. This was many weeks ago and is clearly obvious in the bug history. What happened as a result of that? It sure seems like someone completely ignored it and applied a patchset to EC2 kernels only, and this bug is now 'dead' as far as work intention goes.

I am also experiencing the problems under Maverick and for me at the moment, I cannot recommend Ubuntu to anyone with a serious workload outside mostly casual desktop use. Frankly even with support from Canonical issues like this *always* seem to take too long to resolve, and the depth of talent someone like Red Hat has within the organisation to address these exact affairs doesn't seem to exist over the water with Ubuntu - or at least they're not motivated in the same way.

My stance has not changed: I want to help you diagnose and fix the problem. Whilst it takes this long to resolve though more people become disenchanted by Ubuntu, and in the case of my employer, I suspect the seed has been sown for distrust of the distribution for at least the next 12-18 months to come and we're falling back to Red Hat as "the usual suspect".

If we haven't brought this bug to your attention in the correct way, please clearly state what you expect so we can progress.

John Johansen (jjohansen) wrote :

asasoft,

I'll give you that Bug #524281 can and does affect load; however it is certainly not the only issue that affects load, and it doesn't result in quite the same random phantom loads I was seeing. Unfortunately one person's phantom loads are different from another's. I addressed one particular bug; there are likely others causing issues as well, each of which needs to be tackled.

John Johansen (jjohansen) wrote :

James,

thank you for the report. Is there any more data you can provide from the box that spiraled out of control?

Also, the Maverick report is interesting. Are you willing to run a couple of test kernels that I have, if I make them available?

John Johansen (jjohansen) wrote :

Alex,

I appreciate both your frustration and your desire to help. The Lucid kernel has seen literally hundreds of patches since release (see https://edge.launchpad.net/ubuntu/+source/linux/2.6.32-25.43 for just the latest round of updates to the proposed kernel), and more are already on the way.

The linux task on this bug has not been closed, and is not being ignored. In general the kernel team tries to get bugs filed separately, and then merge them/mark them as duplicates after the fact. This is done due to the nature of kernel bugs: just because bugs have similar symptoms does not mean they are related. E.g. graphics issues on Intel hardware are most likely different from bugs on ATI hardware, even if they exhibit many of the same symptoms.

I asked that a new bug be opened, as the issues you are experiencing seem to be different from the bug I had fixed, which specifically dealt with a load calculation issue in the EC2 kernel that is not present in the -generic or -server kernels. When this bug was opened it was focused on EC2, which runs a kernel that is quite different from the rest of Ubuntu, so generally bugs against EC2 are handled somewhat separately from the rest of the Lucid kernel (this will change in the Maverick kernel).

The fact is that there is more than one issue affecting load average on Lucid, and dealing with them in one meta bug just isn't effective. What I wanted to do was split specific issues out so that they can be tracked independently. This is of course difficult to do when the bug's symptoms are high load.

As for specific load problems, there are several issues that have affected Lucid and the upstream kernels (it's not just Ubuntu): the high number of wake-ups from the load-balancing tick; the writeback issue; and an upstream update to how load is computed, which fixes a bug where load was under-reported in older kernels. All newer kernels have this patch, so it should be expected that higher loads are reported on newer kernels.

What isn't expected is loads spiraling out of control, extremely high loads on idle systems, or phantom loads. We want to address all of these issues.

Chris (nakota07) wrote :

I created bug id #635181

Chris (nakota07) wrote :

The gist of it is this: enterprise cloud stuff goes here. Everyone else, into the pool I just created, unless there is a better one out there.

Alexandre Bourget (wackysalut) wrote :

Hello folks,

just want to clarify, this might shed some light on my problem so I'm sharing it here.

My Lucid system was in fact upgraded from a Karmic install on EC2, so the kernel that is running there is 2.6.31-302-ec2, the one distributed with Karmic. All the system was upgraded to Lucid, but the kernel is unchanged. I don't know if that could cause those types of problems, but to be of any help to the community, I guess I'd need to test an upgrade.

Also, I don't see a lot of threads elsewhere discussing this lucid kernel intermittent problem, so it probably is related to something specific (of that particular EC2 instance or host, or installation or software configuration).

I'll post updates when I do upgrade to a Lucid kernel with the Lucid software stack, when I have time.

thank you all

Scott Moser (smoser) on 2010-09-20
Changed in linux-meta (Ubuntu):
status: New → Invalid
James Turk (james-turk) wrote :

John,

I'm so frustrated with this that I'd be happy to try anything, including however many test kernels you throw at me.

I'm curious though: if you start up an EC2 instance of your own, are you not seeing this behavior? After noticing the latest AMI release I decided I should try spinning up a micro from ami-6006f309.

Without even waiting I can start up an instance and see the load go over 1 before I've had a chance to start any services or install anything.

A micro instance started 9 mins ago:

$ date && uptime
Sat Sep 25 03:19:21 UTC 2010
 03:19:21 up 3 min, 1 user, load average: 0.58, 0.22, 0.08

$ sudo aptitude install postgresql # doesn't have to be postgres, but basically any service that does anything (tried in past with nginx, mongodb, hudson, etc.)

$ date && uptime
Sat Sep 25 03:25:14 UTC 2010
 03:25:14 up 9 min, 1 user, load average: 1.40, 1.04, 0.48
$ sudo aptitude purge postgresql

$ date && uptime
Sat Sep 25 03:29:27 UTC 2010
 03:29:27 up 14 min, 1 user, load average: 0.35, 0.81, 0.52

the same test with the new amazon linux AMI

$ date && uptime
Sat Sep 25 03:40:38 UTC 2010
 03:40:38 up 5 min, 1 user, load average: 0.02, 0.18, 0.09

# install & run a postgresql server

$ date && uptime
Sat Sep 25 03:52:30 UTC 2010
 03:52:30 up 17 min, 1 user, load average: 0.00, 0.03, 0.06

Rod (rod-vagg) wrote :

My latest on this is that I haven't really had a problem since my last post here. Eric's suggestion to add an explicit XFS unfreeze after my regular snapshots may have fixed some of my load spike / lockup problems, even though it happened once soon after I added the unfreeze (as per my post above) it hasn't happened since. In fact, my only problem with my main instance in us-east-1 was when AWS had network hardware problems a couple of weeks ago which I initially thought might be due to this load spike issue until I saw the status reports for AWS that corresponded with my instance problems.

I'm using John's test kernel still with my Lucid instance, so it's booted up with aki-84b75ded (32-bit, see above in Scott's post). But I'm wondering: did the changes in this kernel make it into the standard Canonical Lucid images for EC2? I.e., from your post 12 hours ago, Scott, re updating UEC/EC2 images (http://uec-images.ubuntu.com/server/releases/lucid/release-20100923/): if we use those images or the AKIs associated with them, would we have John's kernel patch for the sustained high load averages bug?

Scott Moser (smoser) wrote :

@Rod
  The latest kernel attached to the latest set of images does not contain a fix for this bug.
  The kernel team is working on getting this fix [1] pulled in and uploaded to -proposed, hopefully early this week. At that point, this bug will have an automatic comment added to it indicating that it has been uploaded to -proposed. I will then upload the kernel to each ec2 region in a "sandbox" bucket and ask for people here to test it, and comment that they have done so, per the SRU process [3].
  Once in -updates, the daily builds of 10.04 will pull it in, and the kernel will appear in a "testing" bucket. From there, I would then test images using that kernel and refresh them which would get it into a "ubuntu-kernels" bucket, where it would sit for ever (and never be removed).
  I currently don't expect to refresh images explicitly for this fix. The fix only addresses "phantom load"; no performance issue is actually addressed (see John's comment #61). Instead, I'd like to hold off and release newly refreshed images in a month or so, hopefully picking up some other fixes such as bug 634487 (t1.micro instance hangs when installing Sun Java). That said, I would consider manually pushing kernels to a "ubuntu-kernels" bucket so they would never be deleted, and you could rely on them being there for rebundled AMIs.

--
[1] http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commit;h=3c91150d222bbef6efd7121f9ae1a9b3c103a5af
[2] https://wiki.ubuntu.com/UEC/Images/NamingConvention
[3] https://wiki.ubuntu.com/StableReleaseUpdates#Verification

Scott Moser (smoser) wrote :

I've uploaded kernels to the following akis. Please test these (from lucid-proposed) and report back.

x86_64 ubuntu-lucid-amd64-linux-image-2.6.32-309-ec2-v-2.6.32-309.17-kernel.img
us-west-1 aki-70065635
us-east-1 aki-acc236c5
eu-west-1 aki-8c4174f8
ap-southeast-1 aki-320d7360

i386 ubuntu-lucid-i386-linux-image-2.6.32-309-ec2-v-2.6.32-309.17-kernel.img
us-west-1 aki-72065637
us-east-1 aki-a2c236cb
eu-west-1 aki-8e4174fa
ap-southeast-1 aki-300d7362

Scott Moser (smoser) wrote :

I ran ami-6407f20d (ubuntu-lucid-10.04-i386-server-20100923) with kernel aki-a2c236cb, and in almost an hour, the highest load shown is a spike to .15.

I consider this bug fixed in the -proposed kernel.

flaccid (chris-xhost) wrote :

I also confirm the new kernels fix the bug. Thanks Scott!
Users, keep in mind that for existing images you'll need to install the corresponding kernel modules (a reboot will usually kernel panic if they do not exist).

Scott Moser (smoser) wrote :

This was fix-released in linux-image-2.6.32-309-ec2.
That kernel is the default in UEC images with serial 20101020.
https://lists.ubuntu.com/archives/ubuntu-cloud/2010-October/000307.html

Changed in linux-ec2 (Ubuntu):
status: In Progress → Fix Released
Peter Júnoš (petoju) wrote :

When will this fix be released for the x64 server kernel? I updated the server about a week ago (kernel 2.6.32-25-server) and the fix still isn't there.

Is this fix present in 2.6.32-312-ec2? I'm seeing high loads on machines doing almost nothing after upgrading to the grub kernel (thanks for that, BTW!), and then -312.

DLHDavidLH (dlhdavidlh-yahoo) wrote :

this bug affects others....
-------------------------------------------------------------------------

High load average on Ubuntu 10.04 ...( 64 bit )

- Ubuntu 10.04 (64bit)
- Linux kernel 2.6.32-30
- Gnome 2.30.2

---------------------------------------------------------------------------------

- - computer specs - -

* AMD Phenom II X6 1055T ( 2.8GHz )

* ECS A790GXM-AD3 ( AMD 790GX North bridge / SB750 south bridge )

* 6 GB of DD3 1333 Memory

* WD3000HLFS ( SATA and 10000 RPM ) 300GB Hard drive

Changed in ubuntu-on-ec2:
status: Invalid → New
Justin Riddiough (jriddiough) wrote :

Hi - another person reporting in. I spent a couple of days trying to troubleshoot what could possibly be bottlenecking our system to report such a high load. Finally I found that it is a known bug.

It looks like the fix is almost ready?

> uname -a
Linux ip-###.##-ec2 #9-Ubuntu SMP Thu Apr 15 08:05:38 UTC 2010 x86_64 GNU/Linux
> w
 23:51:51 up 1 day, 11:18, 2 users, load average: 1.13, 1.22, 1.16

Our organization is hoping to quickly roll out some new features to our customers, but is holding back until we are sure that our system isn't already overloaded. Thanks.

Scott Moser (smoser) on 2011-08-24
Changed in ubuntu-on-ec2:
status: New → Invalid
Matthew Gregg (mcg) wrote :

Why marked invalid, was this fixed?

flaccid (chris-xhost) wrote :

Which actual AKI is this bug against and is it occurring with the latest images?
Maybe it doesn't exist anymore in the latest with pvgrub AKI and AMI?
Does LTS mean anything here?

Scott Moser (smoser) wrote :

This bug is "Invalid" on "Ubuntu on EC2" where invalid means "not the correct project". Bugs found when running Ubuntu on EC2 should be filed against the Ubuntu project.

The Ubuntu project bug, open against the 'linux' package, is marked 'Fix Released'. See comment 93.

From the bug summary, it was originally opened against:
ami-2d4aa444 099720109477/ubuntu-images/ubuntu-lucid-10.04-i386-server-20100427.1
which ran/runs aki:
us-east-1 aki-754aa41c ubuntu-lucid-i386-linux-image-2.6.32-305-ec2-v-2.6.32-305.9-kernel

The latest akis published to the ubuntu-images bucket should not have this issue, nor should any of the current released images (after 20101020). I make the distinction between akis and images because the latest Ubuntu 10.04 images use pv-grub for loading, and thus do not use the akis. We continue to publish akis corresponding to the linux-virtual kernels for backwards compatibility.

Scott Moser (smoser) wrote :

I meant to add in comment 100.

I am most certainly not saying that the linux-ec2 kernel running on EC2 is flawless, nor am I saying that -virtual is flawless, but the high load in /proc/loadavg was fixed.

If you have other issues, with current images, please open bugs.
