Server 8.04 LTS: soft lockup - CPU#1 stuck for 11s! [bond1:3795] - bond - bond0

Bug #245779 reported by Sim on 2008-07-05
This bug affects 13 people
Affects            Status         Importance   Assigned to   Milestone
linux (Debian)     Fix Released   Unknown
linux (Ubuntu)                    High         Unassigned
Hardy                             High         Unassigned

Bug Description

Hi!
Ubuntu Server 8.04 LTS with all patches and the latest kernel
Hardware: HP DL360 G4 Xeon
Bonding with :
- bond0 2x1Gb Intel (802.3ad / 4)
- bond1 8x1Gb Intel (802.3ad / 4)
Nagios (only nrpe and plugin)
Heartbeat2 (without CRM)
Vlan

Today it crashed (after two weeks of uptime since the kernel upgrade) with this output:

6640927 firewall 11:46:54 kernel: [431168.944816] BUG: soft lockup - CPU#1 stuck for 11s! [bond1:3795]
6640928 firewall 11:46:54 kernel: [431168.944849]
6640929 firewall 11:46:54 kernel: [431168.944853] Pid: 3795, comm: bond1 Not tainted (2.6.24-19-server #1)
6640930 firewall 11:46:54 kernel: [431168.944856] EIP: 0060:[ipv6:_spin_lock+0xa/0x10] EFLAGS: 00000286 CPU: 1
6640931 firewall 11:46:54 kernel: [431168.944865] EIP is at _spin_lock+0xa/0x10
6640932 firewall 11:46:54 kernel: [431168.944867] EAX: f749f334 EBX: f749f25c ECX: 00000001 EDX: f749f25c
6640933 firewall 11:46:54 kernel: [431168.944870] ESI: 00000000 EDI: f7ca1000 EBP: f6c35c80 ESP: f6835cc0
6640934 firewall 11:46:54 kernel: [431168.944872] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
6640935 firewall 11:46:54 kernel: [431168.944875] CR0: 8005003b CR2: b7bfd0a0 CR3: 35908000 CR4: 000006b0
6640936 firewall 11:46:54 kernel: [431168.944878] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
6640937 firewall 11:46:54 kernel: [431168.944880] DR6: ffff0ff0 DR7: 00000400
6640938 firewall 11:46:54 kernel: [431168.944887] [<f8b67606>] ad_rx_machine+0x26/0x690 [bonding]
6640939 firewall 11:46:54 kernel: [431168.944899] [nf_nat:_read_lock_bh+0x8/0x50] _read_lock_bh+0x8/0x20
6640940 firewall 11:46:54 kernel: [431168.944920] [arp_process+0x8b/0x5f0] arp_process+0x8b/0x5f0
6640941 firewall 11:46:54 kernel: [431168.944930] [<f8b67e6a>] bond_3ad_lacpdu_recv+0x1fa/0x240 [bonding]
6640942 firewall 11:46:54 kernel: [431168.944946] [ip_local_deliver_finish+0xf9/0x210] ip_local_deliver_finish+0xf9/0x210
6640943 firewall 11:46:54 kernel: [431168.944955] [ip_rcv_finish+0xff/0x370] ip_rcv_finish+0xff/0x370
6640944 firewall 11:46:54 kernel: [431168.944960] [sock_def_write_space+0x12/0xa0] sock_def_write_space+0x12/0xa0
6640945 firewall 11:46:54 kernel: [431168.944968] [<f8967a4b>] e1000_alloc_rx_buffers+0xab/0x3a0 [e1000]
6640946 firewall 11:46:54 kernel: [431168.944982] [arp_rcv+0x0/0x140] arp_rcv+0x0/0x140
6640947 firewall 11:46:54 kernel: [431168.944994] [e1000:__netdev_alloc_skb+0x22/0x2a80] __netdev_alloc_skb+0x22/0x50
6640948 firewall 11:46:54 kernel: [431168.945000] [<f8b67c70>] bond_3ad_lacpdu_recv+0x0/0x240 [bonding]
6640949 firewall 11:46:54 kernel: [431168.945011] [tg3:netif_receive_skb+0x379/0x720] netif_receive_skb+0x379/0x440
6640950 firewall 11:46:54 kernel: [431168.945024] [<f8968474>] e1000_clean_rx_irq+0x174/0x500 [e1000]
6640951 firewall 11:46:54 kernel: [431168.945037] [<f8968378>] e1000_clean_rx_irq+0x78/0x500 [e1000]
6640952 firewall 11:46:54 kernel: [431168.945059] [<f8968300>] e1000_clean_rx_irq+0x0/0x500 [e1000]
6640953 firewall 11:46:54 kernel: [431168.945071] [<f896569e>] e1000_clean+0x5e/0x250 [e1000]
6640954 firewall 11:46:54 kernel: [431168.945085] [net_rx_action+0x12d/0x210] net_rx_action+0x12d/0x210
6640955 firewall 11:46:54 kernel: [431168.945099] [__do_softirq+0x82/0x110] __do_softirq+0x82/0x110
6640956 firewall 11:46:54 kernel: [431168.945109] [do_softirq+0x55/0x60] do_softirq+0x55/0x60
6640957 firewall 11:46:54 kernel: [431168.945113] [irq_exit+0x6d/0x80] irq_exit+0x6d/0x80
6640958 firewall 11:46:54 kernel: [431168.945117] [do_IRQ+0x40/0x70] do_IRQ+0x40/0x70
6640959 firewall 11:46:54 kernel: [431168.945121] [find_busiest_group+0x1bd/0x760] find_busiest_group+0x1bd/0x760
6640960 firewall 11:46:54 kernel: [431168.945130] [common_interrupt+0x23/0x28] common_interrupt+0x23/0x28
6640961 firewall 11:46:54 kernel: [431168.945142] [<f897007b>] e1000_init_hw+0x34b/0xb50 [e1000]
6640962 firewall 11:46:54 kernel: [431168.945156] [ipv6:_spin_lock+0x3/0x10] _spin_lock+0x3/0x10
6640963 firewall 11:46:54 kernel: [431168.945163] [<f8b67606>] ad_rx_machine+0x26/0x690 [bonding]
6640964 firewall 11:46:54 kernel: [431168.945179] [lock_timer_base+0x27/0x60] lock_timer_base+0x27/0x60
6640965 firewall 11:46:54 kernel: [431168.945183] [delayed_work_timer_fn+0x0/0x20] delayed_work_timer_fn+0x0/0x20
6640966 firewall 11:46:54 kernel: [431168.945194] [<f8b68290>] bond_3ad_state_machine_handler+0xf0/0x9b0 [bonding]
6640967 firewall 11:46:54 kernel: [431168.945206] [queue_delayed_work_on+0x7c/0xb0] queue_delayed_work_on+0x7c/0xb0
6640968 firewall 11:46:54 kernel: [431168.945214] [usbcore:queue_delayed_work+0x51/0x70] queue_delayed_work+0x51/0x70
6640969 firewall 11:46:54 kernel: [431168.945221] [<f8b681a0>] bond_3ad_state_machine_handler+0x0/0x9b0 [bonding]
6640970 firewall 11:46:54 kernel: [431168.945229] [run_workqueue+0xbf/0x160] run_workqueue+0xbf/0x160
6640971 firewall 11:46:54 kernel: [431168.945240] [worker_thread+0x0/0xe0] worker_thread+0x0/0xe0
6640972 firewall 11:46:54 kernel: [431168.945245] [worker_thread+0x84/0xe0] worker_thread+0x84/0xe0
6640973 firewall 11:46:54 kernel: [431168.945251] [<c0145fc0>] autoremove_wake_function+0x0/0x40
6640974 firewall 11:46:54 kernel: [431168.945260] [worker_thread+0x0/0xe0] worker_thread+0x0/0xe0
6640975 firewall 11:46:54 kernel: [431168.945265] [kthread+0x42/0x70] kthread+0x42/0x70
6640976 firewall 11:46:54 kernel: [431168.945269] [kthread+0x0/0x70] kthread+0x0/0x70
6640977 firewall 11:46:54 kernel: [431168.945274] [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10
6640978 firewall 11:46:54 kernel: [431168.945284] =======================

Can you help me?

Many thanks

---
Sim
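
For anyone collecting these reports from syslog, the `BUG: soft lockup` header lines can be pulled out mechanically. The following is a minimal Python sketch, not a tool used by anyone in this thread. The name `parse_soft_lockups` and the returned field names are made up for illustration, and the pattern is based only on the message format shown in the trace above.

```python
import re

# Pattern based on the "BUG: soft lockup" line quoted in this report.
# The optional leading group captures the printk timestamp when present.
SOFT_LOCKUP_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft lockup header found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP_RE.search(line)
        if match is None:
            continue  # register dumps and stack frames are skipped
        ts = match.group("ts")
        events.append({
            "timestamp": float(ts) if ts is not None else None,
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return events


if __name__ == "__main__":
    sample = [
        "6640927 firewall 11:46:54 kernel: [431168.944816] "
        "BUG: soft lockup - CPU#1 stuck for 11s! [bond1:3795]",
    ]
    for event in parse_soft_lockups(sample):
        print(event)
```

Grouping the results by `comm` makes it easy to see whether the stuck task is always a bonding thread or varies from one lockup to the next.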

Sim (simvirus) on 2008-07-05
description: updated

Thanks for taking the time to report this bug and helping to make Ubuntu better. We appreciate the difficulties you are facing, but this appears to be a "regular" (non-security) bug. I have unmarked it as a security issue since this bug does not show evidence of allowing attackers to cross privilege boundaries nor directly cause loss of data/privacy. Please feel free to report any other bugs you may find.

Brian Murray (brian-murray) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. This bug did not have a package associated with it, which is important for ensuring that it gets looked at by the proper developers. You can learn more about finding the right package at https://wiki.ubuntu.com/Bugs/FindRightPackage . I have classified this bug as a bug in linux the package for kernel bugs.

Sim (simvirus) wrote :

Hi!
After 5.5 days the problem has reappeared.
This is really serious and critical!
This never happened with previous kernels.
I have now rebooted into the previous kernel, hoping that works as a temporary fix.
Thank you for your attention.

BobBlack (bblack-ubuntu) wrote :

I have encountered the same issue with a similar setup:

Ubuntu: 8.04 + updates
Hardware: HP DL360 (Pentium 4 based Xeon, unsure of the DL360 "generation")

I am bonding with the onboard NIC, which is (a dual PHY?) Broadcom:
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)

My bonding config (in /etc/modprobe.d/bonding):
options bond0 mode=4 lacp_rate=1 miimon=100
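
For readers unfamiliar with the numeric values in that line: in the Linux bonding driver, `mode=4` selects 802.3ad (LACP), `lacp_rate=1` requests the fast LACPDU rate, and `miimon=100` polls link state every 100 ms. The Python sketch below only decodes such an `options` line into readable form. `describe_bonding_options` is a hypothetical helper, and its tables cover just the numeric forms of these three parameters.

```python
# Numeric bonding modes as documented for the Linux bonding driver.
BOND_MODES = {
    "0": "balance-rr",
    "1": "active-backup",
    "2": "balance-xor",
    "3": "broadcast",
    "4": "802.3ad",
    "5": "balance-tlb",
    "6": "balance-alb",
}
LACP_RATES = {"0": "slow", "1": "fast"}


def describe_bonding_options(line):
    """Decode a modprobe 'options' line for bonding into a readable dict."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "options":
        raise ValueError("not a modprobe 'options' line: %r" % line)
    described = {"target": fields[1]}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        if key == "mode":
            value = BOND_MODES.get(value, value)
        elif key == "lacp_rate":
            value = LACP_RATES.get(value, value)
        elif key == "miimon":
            value = "%s ms" % value  # MII link monitoring interval
        described[key] = value
    return described


if __name__ == "__main__":
    print(describe_bonding_options("options bond0 mode=4 lacp_rate=1 miimon=100"))
```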

Machine is basically locked, I see this message repeated at the console:
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:3980]

I don't see any OOPS type info, nor a stack trace (though I can't get into the machine to view syslog).

The machine had locked on me last weekend also, though I was unable to activate/view the console so I don't know if it was the same problem or not. I was running the -xen kernel at the time.

When I am able to reboot the machine, I will run memtest on it, but as it's a production machine, and it takes 5+ days to error, I'm not sure if I'll be able to do much tinkering to figure out the problem. (I'll probably just drop bonding for the time being.)

Bob Black

BobBlack (bblack-ubuntu) wrote :

Just throwing an idea out there, maybe another thing to try would be to disable hyperthreading in the BIOS?
(My DL360 has 2 P4 Xeons and hyperthreading is enabled.)

I'm not going to risk re-enabling bonding on my machine as I was only using it for failover (I don't need it for increased bandwidth).

Bob Black

Sim (simvirus) wrote :

I'm now running the previous kernel (2.6.24-18).
Other servers with the same hardware and software configuration, running the previous kernel, have no problems!

I can confirm that this message is repeated at the console:

BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]
BUG: soft lockup - CPU#1 stuck for 11s! [bond0:xxxx]

etc..

brupje (b-meijer) wrote :

I seem to have fixed the described problem with a BIOS update.

Sim (simvirus) wrote :

Hi Brupje,
excuse me, can you describe your "solution" and your hardware?
I have an HP DL360 G4 Xeon.
With the previous kernel (2.6.24-18-server) I now have 10 days of uptime.
I'm waiting and keeping an eye on it.

Sim (simvirus) on 2008-07-21
Changed in linux:
status: New → Confirmed
Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
brupje (b-meijer) wrote :

I experienced the problem with my Asus P5B Deluxe with a Core 2 Duo E6400. While browsing for a solution I found some hints that this problem requires a microcode update for the processor, provided by a BIOS update. I figured it couldn't hurt much, so I did it.

I rebooted a few times and the problem did not occur, while previously one out of two times my pc wouldn't boot. Although I must admit it's weird that the problem started to occur after a kernel update.

Sim (simvirus) wrote :

Hi brupje,
I'm not sure that it is the same problem.

Do you use bonding?
Have you read the changelog of your BIOS upgrade?

In our case the message is repeated at the console and everything freezes (Ethernet, keyboard, ping, bonding, etc.).

This is the message:

BUG: soft lockup - CPU#1 stuck for xx s! [bond0:xxxx]

No problem with previous kernel (2.6.24-18).
I think this bug was introduced with the new release (2.6.24-19).

Thanks

brupje (b-meijer) wrote :

I have the following facts:
- I got the same error message, after a kernel upgrade to -19. It repeats itself after hald is loaded. By googling for a solution with that message I found this topic.
- I run Ubuntu 8.04
- I do have a different processor
- The pc wouldn't accept any input (except reset button)
- After a BIOS update the problem no longer occurred, so it works for me

The BIOS changelog is not very detailed. It mentions new processors, enhancements (marketing talk for bugs solved) and revisions.

Simon Boggis (s-a-boggis) wrote :

I've seen a very similar bug with a debian testing 2.6.24-1-686 kernel, which will be very closely related to the ubuntu 8.04 one (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=490156#2).

I've had it twice on two different but identical chassis - Viglen branded Intel chassis with Intel Server Board SE7520JR2 with dual Intel Xeon 3.00GHz. I am up to date with the BIOS.

Does anyone know how to provoke the bug (other than waiting)? I'm running firewall/routers, and so far it has taken quite a long time to show up (a month or so) but that might just be luck.

Is the bug actually in the bonding module, and if so, has anyone tried disabling bonding and determined whether it clears the problem?

The info about which kernel version introduces the problem is very useful, but I wonder whether anyone has tried a newer kernel - I've had a suggestion that 2.6.26 might well fix it, although I have no hard evidence that this is true (yet).

Best wishes,

Simon

Sim (simvirus) wrote :

Hi!
20 days uptime with old kernel version!
I can confirm that 2.6.24-19 is affected and 2.6.24-18 is not!
Pls. fix it

Thanks!

Changed in linux:
status: Unknown → New
aty (attila-gombos) wrote :

Hi,

Same error here with 2.6.24-18 too, during boot...

aty

ScottMarlowe (scott-marlowe) wrote :

I too can confirm this problem with 8.04.01 with all updates / upgrades applied. I WAS running on a dual quad core Opteron machine with a Tyan server class mobo, 32 Gigs of 667MHz RAM, and an Areca 1680 SAS card with 16 drives in a large RAID-10 configuration. Under even moderate db loads I would see the cpu#x stuck for 11s message in my logs, and eventually I'd have 3 or 4 cores stuck spinning at 100% in postgres processes waiting for I/O. Note that these stuck processes would prevent a proper shutdown and require me to hard power cycle the machine to get it back up and running.

7.10 did not have this problem, but 7.10 kept hanging on install, and I could only get one of the machines installed. Not wanting to go into production with an OS that might or might not install (7.10) or one that routinely locks up my CPUs and never gives them back, I have resorted to installing CentOS 5.2, which runs perfectly on this hardware with no errors, hangs or locks. It is running the 2.6.18-92.el5 kernel. Hopefully some kind of comparison of the kernel in 8.04.01 and CentOS 5.2 can turn something up? I'm gonna have another couple of days with it available for testing where I can install 8.04.01 on a spare partition and see if I can recreate the problem if someone needs more info.

ScottMarlowe (scott-marlowe) wrote :

This bug was discussed on the LKML last December.

http://lkml.org/lkml/2007/12/7/299

It is not apparent that it was actually fixed.

ScottMarlowe (scott-marlowe) wrote :

It would appear from reading that thread on the lkml that the temporary fix is to set a boot option of NO_HZ=y. I no longer have my 8.04.01 machine to test this on, but might be able to set it up dual boot some weekend to play.
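
Whatever spelling of the option ends up being tested, it helps to confirm that the running kernel actually received it. A minimal Python sketch follows. `cmdline_params` is an illustrative name, and the parsing splits on whitespace only, so quoted values containing spaces are not handled. Note that mainline kernel documentation lists the dynamic-tick boot parameter as `nohz=on|off`, while `NO_HZ` is the name of the build-time config option; which of the two the LKML thread intends has not been verified here.

```python
def cmdline_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'ro' or 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    # Sample string for illustration. On a live system, pass the
    # contents of /proc/cmdline instead.
    sample = "root=/dev/sda1 ro quiet NO_HZ=y"
    params = cmdline_params(sample)
    for name in ("NO_HZ", "nohz"):
        print(name, "->", params.get(name, "not set"))
```

If the parameter does not show up in `/proc/cmdline` after a reboot, the bootloader entry was not updated and the test says nothing about the bug.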

Malex (mbaldov) wrote :

Hello everybody,

I also have the same problem that Sim describes at the top of this thread.

I know that a new kernel is out (Ubuntu Security Notice USN-637-1, August 25, 2008): linux-image-2.6.24-19 ---> 2.6.24-19.41,
but I am afraid to upgrade because I would have to stop some crucial servers, and I don't know whether this serious bug has been fixed.

Has anyone upgraded to the new kernel?
And, especially, has the bug disappeared?

Waiting for your response.

Regards.

Malex (mbaldov) wrote :

In addition to my previous post: this problem also happened in a VM (ESX 3.5i) without bonding.

Regards.

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Malex (mbaldov) wrote :

Hello Leann,
I would like to ask you a question:

Has the kernel team identified the cause of the bug reported here, and is it addressed in this new kernel release?

I ask because it is difficult for me to stop a crucial server if the bug is still present after the update.

Thanks in advance.

Hi Malex,

The kernel team sent out this call for testing to a majority of the bug reports to get immediate feedback regarding the 2.6.27 kernel from those willing to test. Unfortunately, it is unknown whether the newer 2.6.27 kernel will specifically resolve the issue reported here. Would anyone else experiencing this bug be willing to test? Sim, since you are the original bug reporter, would you possibly be able to test? Thanks.

Sim (simvirus) wrote :

Hi Leann,
hi to all.

I will try this release next day.

My servers are critical and I have to run these tests with care.
Unfortunately the problem does not appear immediately and takes 1 to 2 weeks of uptime.

I can confirm the same problem with Ubuntu 8.04 LTS running on VMware ESX 3.5 (and in this case without bonding).

Regards

Sim (simvirus) wrote :

Hi!
I'm now running the latest kernel on my server.

I have a question: how can I find the exact kernel version?

# cat /proc/sys/kernel/osrelease
  or
# uname -a

show only: 2.6.24-19-server
and not 2.6.24-19.xx

Thanks

Malex (mbaldov) wrote :

Hi Leann,
excuse me, but there is one thing I don't understand:

why must I upgrade from 8.04 "HH" to 8.10 "II" Alpha5 (that version isn't at the link you gave!)?

From what I read, Sim has upgraded 8.04 to the latest(?) available kernel, which is 2.6.24-19.41-server.

Could you clear up this doubt?

Thanks in advance.

Regards.

@Sim - 'cat /proc/version_signature' should give you the exact Ubuntu kernel version
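
The string that file returns can be split into its parts to tell, for example, 2.6.24-19.36 from 2.6.24-19.41, which `uname` does not distinguish. Below is a minimal Python sketch under stated assumptions: the format is taken from the value quoted later in this thread ("Ubuntu 2.6.24-19.41-server"), and the function name and the field labels `abi`, `upload` and `flavour` are chosen here for illustration rather than taken from any official tool.

```python
import re

# Matches strings such as "Ubuntu 2.6.24-19.41-server".
# Anything after the flavour field is ignored.
SIGNATURE_RE = re.compile(
    r"(?P<distro>\S+)\s+"
    r"(?P<upstream>\d+\.\d+\.\d+)-(?P<abi>\d+)\.(?P<upload>\d+)-(?P<flavour>\S+)"
)


def parse_version_signature(text):
    """Split a /proc/version_signature style string into named fields."""
    match = SIGNATURE_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognised version signature: %r" % text)
    return match.groupdict()


if __name__ == "__main__":
    # Sample value for illustration. On a live Ubuntu system, pass the
    # contents of /proc/version_signature instead.
    print(parse_version_signature("Ubuntu 2.6.24-19.41-server"))
```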

@Malex - I'm not quite sure I understand your question. Testing the upcoming Intrepid Ibex 8.10 Alpha# is completely voluntary if you choose to do so. Also, only Intrepid will contain this newer 2.6.27 kernel, it is not available in the Hardy repository, only the Intrepid repository. Thanks.

Sim (simvirus) wrote :

Update:

@Leann Ogasawara - Hi! I have been running kernel "Ubuntu 2.6.24-19.41-server" for 21 days without any problem (for now ;-) )

Sim (simvirus) wrote :

BAD NEWS!!!!!!BAD NEWS!!!!!!BAD NEWS!!!!!!BAD NEWS!!!!!!BAD NEWS!!!!!!BAD NEWS!!!!!!

My server crashed today, after 23 days, with kernel Ubuntu 2.6.24-19.41-server

Here is the output:

[2015093.126455] BUG: soft lockup - CPU#0 stuck for 11s! [bond0:3867]
[20150.................] BUG: soft lockup - CPU#0 stuck for 11s! [bond0:3867]
[..]

The only working kernel is 2.6.24-18 !!!

PLEASE FIX THIS BIG BUG!

Thanks
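
The bracketed number at the start of a kernel message is the printk timestamp, in seconds since boot, so the uptime at the moment of a lockup can be read straight from the log. A minimal Python sketch, assuming that timestamp format (`uptime_days` is an illustrative name): for the first line above it gives roughly 23.3 days, which agrees with the 23 days reported.

```python
import re

# printk timestamps look like "[2015093.126455]" (seconds since boot).
PRINTK_TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")
SECONDS_PER_DAY = 86400


def uptime_days(log_line):
    """Return the uptime in days encoded in a kernel log line's timestamp."""
    match = PRINTK_TS_RE.search(log_line)
    if match is None:
        raise ValueError("no printk timestamp in: %r" % log_line)
    return float(match.group(1)) / SECONDS_PER_DAY


if __name__ == "__main__":
    line = "[2015093.126455] BUG: soft lockup - CPU#0 stuck for 11s! [bond0:3867]"
    print("uptime at lockup: %.1f days" % uptime_days(line))
```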

Hark (ubuntu-komkommerkom) wrote :

I have the same problem on totally different hardware. Today one of our Dell PowerEdge R300 servers crashed with the message:

BUG: soft lockup - CPU#2 stuck for 11s! [kvm:5246]

This machine was still using kernel 2.6.24-19.36. I updated it to 2.6.24-19.41 now, but after reading the comments here I guess this will not solve this specific problem.

IMHO this is a huge problem. The Dell R300 is one of the best-selling servers in the world. We are just switching from Debian Stable to Ubuntu for better KVM support and support for newer hardware. This bug is a real show stopper for our migration plans.

Dave (ubuntu-comm) wrote :

I've been fighting with this bug for 2 months now. Sometimes I get uptime of a couple of weeks. Sometimes only a couple of days. Very weird and seems resilient to kernel changes.

My setup is a dual quad-core Xeon E5405 (6 MB cache) on a Supermicro X7DCL-i motherboard with 8 GB of DDR2 ECC RAM. The hardware completed a 48-hour memtest successfully, so I'm quite confident it's not a motherboard/RAM/hardware issue. The BIOS is the latest available (8/18/2008 from Supermicro).

I've seen the bug randomly with both 2.6.24-18-xen and 2.6.24-19-xen versions of the kernel. The process that dies can be anything in Dom0, DomU and seems unrelated to the actual process/module that is executed. It seems somewhat related to the order of processes loaded (in my case the order of domU startup). The 2.6.24-18 kernel seems a little more stable, but this could be a coincidence.

So far I've seen crashes in a simple ext2 format, in various processes in different domUs, and in various processes in dom0. The offending process is somewhat sticky (which led me to suspect a memory/hardware issue), but I ruled that out above.

The latest incarnation is a clamd process that won't live longer than a few hours without crashing:

[19240.984220] BUG: soft lockup - CPU#0 stuck for 11s! [clamd:2976]
[19240.984230]
[19240.984233] Pid: 2976, comm: clamd Not tainted (2.6.24-18-xen #1)
[19240.984237] EIP: 0061:[<c0327677>] EFLAGS: 00000286 CPU: 0
[19240.984245] EIP is at _spin_lock+0x7/0x10
[19240.984248] EAX: c1c2898c EBX: 00000000 ECX: c1c28980 EDX: 00000d88
[19240.984251] ESI: 50425067 EDI: 00000001 EBP: c0477158 ESP: e7f49ef4
[19240.984254] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
[19240.984260] CR0: 8005003b CR2: b70b1000 CR3: 28400000 CR4: 00002660
[19240.984264] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[19240.984267] DR6: ffff0ff0 DR7: 00000400
[19240.984271] [<c01759d5>] mprotect_fixup+0x395/0x800
[19240.984284] [<c013bb90>] autoremove_wake_function+0x0/0x40
[19240.984293] [<c0175fcb>] sys_mprotect+0x18b/0x230
[19240.984299] [<c0105832>] syscall_call+0x7/0xb
[19240.984305] [<c0320000>] vcc_def_wakeup+0x30/0x60
[19240.984310] =======================

I'm currently running this particular domU with the 2.6.24-21-xen kernel, for testing. Will report if it crashes.

Here's the CPU info. Might be useful:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 1999.999
cache size : 6144 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu de tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc up arch_perfmon pebs bts pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4004.96
clflush size : 64

ScottMarlowe (scott-marlowe) wrote :

Dave, what's the verdict on the -21 kernel? Did it fix your lockups? Also, did you try the NO_HZ=y boot option?

Ryan Sitzman (sitzmar) wrote :

Stuck in the same boat. Just wanted to add that I've been running 2.6.24-22-xen for a week now, with really no change in behavior. Random lockups still occur.

ScottMarlowe (scott-marlowe) wrote :

sitzmar, have you tried the NO_HZ=y boot option?

Ryan Sitzman (sitzmar) wrote :

No, I haven't tried that before, but I'm willing to give it a try. This parameter goes on dom0, correct?

Ryan Sitzman (sitzmar) wrote :

Well, no luck.

I first added it to the kernel params on dom0, and it didn't take long before I had a vm lockup. So then I added it into the domU kernel params. Things were going good for almost 2 days, but it did lock up again.

So I don't think NO_HZ=y has made any difference.

Changed in linux:
status: New → Fix Released
John Leach (johnleach) wrote :

I experienced this bond driver lockup problem in a few Xen dom0 instances (though on CentOS with a 2.6.18 kernel). Reducing the number of CPUs in dom0 to 1 worked around it for now. No crash in weeks since. Probably not an option for everyone, but it is fine for us.

Ryan Sitzman (sitzmar) wrote :

I noticed that debbugs #490156 was closed, and was marked as being fixed in 2.6.26.
Will this kernel (or the fix) ever be available in Hardy? Or am I going to have to upgrade all of my servers to Intrepid?

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Lily (starlily) wrote :

I have a Dell 6650 running whatever the latest xen server image is (and I run update/upgrade/dist-upgrade frequently). One DomU locks up pretty regularly with "BUG: soft lockup - CPU#1 stuck for 11s!", usually during large file transfers. DomU Kernel version is 2.6.24-19. The bug *requires* destroying the DomU and restarting it.

It's pretty clear after reading many bug reports about this that it is in the kernel somewhere (and the kernel team has responded by changing their policy about bug reporting). It is clearly NOT hardware or application specific, as this is reported on many platforms and appears not to have a consistent trigger.

Potentially, this is related to SMP, or PAE, although I find that listing these as an area of issue is an easy scapegoat, even if it may be true.

I'd really like someone who KNOWS what this bug is caused by to provide a definitive answer somewhere that can be seen by the public, and if possible provide a workaround or a target date for the release of a fix.

Thanks!
Lily

Warren V (verbanista) wrote :

I am seeing this same issue across multiple Dell platforms running CentOS 5.2. I have a bug report there: http://bugs.centos.org/view.php?id=3318

I'm seeing this item with both the 2.6.17 and 2.6.18 kernels- the base kernels seem to be the most stable. Occasionally, I don't get the soft lockup message- on my faster multicore machines, a core will go to 100%+, but the machine will not completely fail.

I sure wish that the kernel team hadn't run away from this bug.

Hark (ubuntu-komkommerkom) wrote :

Lily, you use kernel 2.6.24-19. That's not the newest one for Hardy; the newest one is currently 2.6.24-22. Using the latest kernels, this bug hasn't occurred on my servers for some time now. But when, on an updated server, I start a KVM guest with an older kernel and multiple processors configured, I usually hit this bug quite quickly!

So it looks like when you use the latest kernel on your server AND on all virtual servers this bug won't occur anymore. I'm wondering if other people have also noticed this.

Warren V, do your multiple Dell platforms have AMD or Intel processors? I use Ubuntu Hardy only on Intel hardware now, but I'm wondering if this bug also applies to AMD platforms (I'm planning to buy a bunch of AMD servers soon and this is quite an important issue for me).

John Leach (johnleach) wrote :

I think there might be two bugs here. Something regarding bonding, which I've seen on our Dell machines with Centos 5 too.

And then a general cpu softlock problem, which I'm also experiencing with Hardy as a Xen guest - that I think is Xen related (I see it come up with various processes - whatever is busy really). This bug here is probably the best place to report those types of problems: https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/259487

Warren V (verbanista) wrote :

Howdy-

All of my units are using dual or quad-core Intels. The quads don't seem to
throw the softlock error- top just shows that one of the cores is slammed,
and everything slows to a crawl. I can't seem to be able to kill the bonding
kmod- so I always end up having to reboot the units. I'm going to update my
dev environment to 2.6.24-22, and will advise on any oddness that I run
into.

-Warren V

On Tue, Jan 20, 2009 at 8:26 AM, John Leach <email address hidden> wrote:

> I think there might be two bugs here. Something regarding bonding,
> which I've seen on our Dell machines with Centos 5 too.
>
> And then a general cpu softlock problem, which I'm also experiencing
> with Hardy as a Xen guest - that I think is Xen related (I see it come
> up with various processes - whatever is busy really). This bug here is
> probably the best place to report those types of problems:
> https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/259487

Ryan Sitzman (sitzmar) wrote :

This isn't a solution to the bug, but you may find that using the backports repository to install xen 3.3.0 and the 2.6.24-23 kernel yields some positive results. On one of my boxes, I could consistently trigger the 'CPU#1 stuck' problem, and after upgrading it hasn't locked up once. Of course, on a different box with slightly different hardware, it locks up just as frequently as before... so ymmv.

Warren V (verbanista) wrote :

Hi-

Actually, I think I have one better. The latest Red Hat kernel patch release
for 2.6.18-128 seems to have fixed the issue (two weeks now, no reboot or
lockup), even though there is no "official" fix listed. It looks like they
made some alterations to the bonding code to fix some bogus MAC address
tracking silliness, which may be preventing the larger issue.

The patch discussion is at:
https://rhn.redhat.com/errata/RHSA-2009-0225.html
I downloaded the patch from:
http://people.redhat.com/dzickus/el5/128.el5/i686/

For those of us running CentOS, this is a straight rpm -ivh install. I
thought about doing the roll-my-own 2.6.24 install, but it was just too much
of a jump ahead in kernel versions for me to be comfortable.

Thanks for the message!

-Warren V

On Mon, Feb 2, 2009 at 9:30 AM, Ryan Sitzman <email address hidden> wrote:

> This isn't a solution to the bug, but you may find that using the
> backports repository to install xen 3.3.0 and the 2.6.24-23 kernel
> yields some positive results. On one of my boxes, I could consistently
> trigger the 'CPU#1 stuck' problem, and after upgrading it hasn't locked
> up once. Of course, on a different box with slightly different hardware,
> it locks up just as frequently as before... so ymmv.

Hark (ubuntu-komkommerkom) wrote :

Yesterday I got this error again:
 Feb 12 19:20:52 xxxxxxx kernel: [1410045.600863] BUG: soft lockup - CPU#3 stuck for 11s! [kvm:10534]

I had to use the remote power switch to get the machine running again, and that's definitely not something I want on a production machine!

Warren V (verbanista) wrote :

using 2.6.24 or 2.6.18-128?

-W

On Fri, Feb 13, 2009 at 12:12 AM, Hark <email address hidden> wrote:

> Yesterday I got this error again:
> Feb 12 19:20:52 xxxxxxx kernel: [1410045.600863] BUG: soft lockup - CPU#3
> stuck for 11s! [kvm:10534]
>
> I had to use the remote power switch to get the machine running again,
> and that's definitely not something I want on a production machine!

Hark (ubuntu-komkommerkom) wrote :

Sorry, I'm using the latest Ubuntu 8.04 kernel: 2.6.24-23-server

Sim (simvirus) wrote :

One year on and we still have the same problem!
For 4 months I didn't have any problems.
This week two servers (see my first post) crashed with the same results, without anything having been touched!
This is very, very, very ridiculous!
This big problem first appeared with kernel 2.6.24-19-server....

No comment!

Regards

---
Sim

Zoltán Vigh (zool) wrote :

Has there been any progress on this error?

I had this problem with kernel release 2.6.24.19-server on 4 cloned Ubuntu 8.04.2 guests on VMware ESX 3.5.
The problem happened randomly twice a day or even more on all cloned servers.

After doing dist-upgrade to 2.6.24.24-server the problem disappeared.

Kind regards
--
Ludovico

Sim, since you are the original bug reporter, care to comment if this is now resolved for you as well? Thanks.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete

Leann Ogasawara wrote:
> Sim, since you are the original bug reporter, care to comment if this is
> now resolved for you as well? Thanks.
>
> ** Changed in: linux (Ubuntu)
> Status: Confirmed => Incomplete
>

I have not had a recurrence on any of my equipment (of the order of 25
servers), so I say "yes".

Cheers,

Simon

--
Dr Simon A. Boggis Senior Network Analyst
Computing Services, Tel. 020 7882 7078
Queen Mary, University of London, London E1 4NS UK.

We are seeing this problem with 2.6.24-24-server. We were running with -23 recently but had this problem and upgraded after reading this report. Additionally, the CPU load on the machine is around 190 though summing the individual processes in top doesn't approach that total. kswapd is near the top though the machine still has a lot of real RAM unallocated. /var/log/kern.log grows rapidly at a rate of about 200 KB/minute. It has a lot of Call Traces and "Pid: 15713, comm: <process> Not tainted 2.6.24-24-server #1".
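
When kern.log grows this quickly, it can help to count which tasks the soft lockup reports name most often. The following is a minimal Python sketch, assuming the message format quoted elsewhere in this report ("BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]"). The helper name `summarize_soft_lockups` is hypothetical, and the sample lines are copied from other comments on this bug.

```python
import re
from collections import Counter

# Matches the message format quoted in this bug report, for example:
# "BUG: soft lockup - CPU#1 stuck for 11s! [bond1:3795]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft lockup reports per task name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Sample lines taken from messages quoted in other comments on this bug.
sample = [
    "Feb 12 19:20:52 xxxxxxx kernel: [1410045.600863] BUG: soft lockup - CPU#3 stuck for 11s! [kvm:10534]",
    "Oct 14 06:34:55 iSCSI-A kernel: [3158205.689661] BUG: soft lockup - CPU#0 stuck for 11s! [bond0:4387]",
    "Oct 14 06:34:55 iSCSI-A kernel: [3158205.689686]",
]

for comm, count in summarize_soft_lockups(sample).most_common():
    print(comm, count)

# On a real system one would pass the log file instead (path as mentioned above):
#   with open("/var/log/kern.log") as log:
#       print(summarize_soft_lockups(log))
```

This only tallies the headline messages; it does not attempt to parse the call traces that follow them.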


Upgrade your kernel to 2.6.28. CentOS is now on 2.6.28-128, I noted the
problem went away around 2.6.28-92.

Ubuntu is stuck with whatever is currently out.

On Wed, Jul 22, 2009 at 12:55 PM, Ryan Lovett <email address hidden> wrote:

> We are seeing this problem with 2.6.24-24-server. We were running with
> -23 recently but had this problem and upgraded after reading this
> report. Additionally, the CPU load on the machine is around 190 though
> summing the individual processes in top doesn't approach that total.
> kswapd is near the top though the machine still has a lot of real RAM
> unallocated. /var/log/kern.log grows rapidly at a rate of about 200
> KB/minute. It has a lot of Call Traces and "Pid: 15713, comm: <process>
> Not tainted 2.6.24-24-server #1".

On Wed, Jul 22, 2009 at 06:23:01PM -0000, Warren V wrote:
> Upgrade your kernel to 2.6.28. CentOS is now on 2.6.28-128, I noted the
> problem went away around 2.6.28-92.
>
> Ubuntu is stuck with whatever is currently out.

Do you know which patch addressed the issue? If so, the Ubuntu kernel devs
might be able to backport it to the LTS release.

Ryan


At one point I was able to track down a kernel.org post about the root of
the problem, but I can't remember exactly what it was. But I recall it was
due to a mistake on the part of one of the kernel devs.

-W

On Wed, Jul 22, 2009 at 2:36 PM, Ryan Lovett <email address hidden> wrote:

> On Wed, Jul 22, 2009 at 06:23:01PM -0000, Warren V wrote:
> > Upgrade your kernel to 2.6.28. CentOS is now on 2.6.28-128, I noted the
> > problem went away around 2.6.28-92.
> >
> > Ubuntu is stuck with whatever is currently out.
>
> Do you know which patch addressed the issue? If so, the Ubuntu kernel devs
> might be able to backport it to the LTS release.
>
> Ryan

Perhaps related to this kernel bug with a patch against 2.6.27-rc5:
http://bugzilla.kernel.org/show_bug.cgi?id=10753

I'm marking this Fix Released against the actively developed kernel based on the feedback from Sim, the original bug reporter:

https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/245779/comments/53

Changed in linux (Ubuntu):
status: Incomplete → Fix Released

On Thu, Aug 13, 2009 at 11:32:20PM -0000, Leann Ogasawara wrote:
> I'm marking this Fix Released against the actively developed kernel
> based on the feedback from Sim, the original bug reporter:
>
> https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/245779/comments/53

But since I am seeing the same problem against the LTS kernel do I have to
open a new bug?

Ryan

Chris Braun (cbraun) wrote :

I am seeing the same symptom as Sim. It seems to have appeared after upgrading from Debian 4 (2.6.18-6-686-BIGMEM) to Debian 5 (2.6.26-2-686-BIGMEM). I have about 25 VMs on ESX 4 and 3.5 that are all running Debian kernel 2.6.26-2. They are all 2-CPU VMs running on 2-, 4-, and 8-way systems. One core locks up at 100% after about 10-20 days of runtime. In some cases the problem occurs within a day or so; in others it takes a month. The guest itself does not show 100% utilization in top. The system does come out of this "lockup" after about 5 hours and then resumes as usual.

I am currently in the process of downgrading the kernel on each of these guests back to 2.6.18 to see if I can resolve this. Unfortunately that means losing many of the VMI paravirt features that were added since 2.6.18.

I don't understand why this issue was closed, as the original poster, Sim, did not indicate it was addressed, unless Simon Boggis is the same poster as Sim, which I don't think he is.

Sim (simvirus) wrote :

Hello everybody,

As Chris Braun says in the last sentence of his post, Sim and Simon Boggis are not the same person.
In fact I'm Sim, and I don't know who the other person pretending to be me is!!!

The issue seemed to be solved, but unfortunately tonight the same problem recurred on the hardware from my first post, AS IT DOES WITH EVERY KERNEL NEWER THAN 2.6.18

Regards

Sim

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Malex (mbaldov) wrote :

Hi all,
I can also confirm what Sim says: the issue isn't solved.

The status in the Debian section should be changed as well, because Simon Boggis said this bug is solved, but that isn't true!!!
@Chris Braun: You are right, this issue isn't closed/solved, and I don't understand why Simon Boggis (and not Sim) said that!

I would also like to ask Simon Boggis to explain to the whole community how he solved this issue (if it's really solved)

Regards.

Chris Braun (cbraun) wrote :

Following up on my previous post. I have downgraded the kernel back to 2.6.18-2 on 20 of our guests. These machines have been running for over a week with no incident. We have been able to reproduce this issue on 2.6.26-2 very consistently. We also attempted to upgrade to 2.6.30 but we experienced a very similar symptom, however not identical. In the 2.6.30 case both cpus locked up, not just one.

Our application runs the following: JBoss, MySQL. The application is very disk I/O intensive and generates a fair share of Network IO as well.

Came across a changelog entry for 2.6.31 that may be interesting:

http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.31-rc8

Chris Braun (cbraun) wrote :

Here is the change log entry from the link I posted above:

Date: Mon Aug 17 14:34:59 2009 -0700

    clockevent: Prevent dead lock on clockevents_lock

    Currently clockevents_notify() is called with interrupts enabled at
    some places and interrupts disabled at some other places.

    This results in a deadlock in this scenario.

    cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
    cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
    cpu C doing set_mtrr() which will try to rendezvous of all the cpus.

    This will result in C and A come to the rendezvous point and waiting
    for B. B is stuck forever waiting for the spinlock and thus not
    reaching the rendezvous point.

    Fix the clockevents code so that clockevents_lock is taken with
    interrupts disabled and thus avoid the above deadlock.

    Also call lapic_timer_propagate_broadcast() on the destination cpu so
    that we avoid calling smp_call_function() in the clockevents notifier
    chain.

    This issue left us wondering if we need to change the MTRR rendezvous
    logic to use stop machine logic (instead of smp_call_function) or add
    a check in spinlock debug code to see if there are other spinlocks
    which gets taken under both interrupts enabled/disabled conditions.

    Signed-off-by: Suresh Siddha <email address hidden>
    Signed-off-by: Venkatesh Pallipadi <email address hidden>
    Cc: "Pallipadi Venkatesh" <email address hidden>
    Cc: "Brown Len" <email address hidden>
    LKML-Reference: <email address hidden>
    Signed-off-by: Thomas Gleixner <email address hidden>

Charles (taylorc) wrote :

I am encountering soft lockups related to the bonding module. I suspect they are related to this bug; however, this situation is a bit different.

This server is dedicated, no virtualization, host or guest, here.

cat /proc/version_signature
Ubuntu 2.6.24-24.60-server

This unit runs bonding on 2 x Intel e1000 interfaces and serves up ISO and OVF files via NFS off a RAID6 array of 4 drives for use by VMWare ESXi clients.

Oct 14 06:34:55 iSCSI-A kernel: [3158205.689661] BUG: soft lockup - CPU#0 stuck for 11s! [bond0:4387]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689686]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689699] Pid: 4387, comm: bond0 Not tainted (2.6.24-24-server #1)
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689703] EIP: 0060:[sunrpc:_spin_lock+0x7/0x10] EFLAGS: 00000286 CPU: 0
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689715] EIP is at _spin_lock+0x7/0x10
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689720] EAX: f7ce4134 EBX: f7ce405c ECX: 00000001 EDX: f7ce405c
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689726] ESI: 00000000 EDI: f7d1f000 EBP: f7167480 ESP: df99dcc0
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689732] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689738] CR0: 8005003b CR2: b7fa9000 CR3: 0049e000 CR4: 000006b0
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689745] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689750] DR6: ffff0ff0 DR7: 00000400
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689760] [<f8d80606>] ad_rx_machine+0x26/0x6b0 [bonding]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689822] [<f8d80ce4>] bond_3ad_lacpdu_recv+0x54/0x240 [bonding]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689855] [<f8d80e8a>] bond_3ad_lacpdu_recv+0x1fa/0x240 [bonding]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689920] [<f8d80ce4>] bond_3ad_lacpdu_recv+0x54/0x240 [bonding]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689934] [deadline_dispatch_requests+0x44/0xd0] deadline_dispatch_requests+0x44/0xd0
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689954] [scsi_mod:elv_next_request+0xaf/0x760] elv_next_request+0xaf/0x1c0
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689964] [<f8982b90>] ata_scsi_rw_xlat+0x0/0x220 [libata]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.689997] [<f88e2a4b>] e1000_alloc_rx_buffers+0xab/0x3a0 [e1000]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690020] [e1000:__netdev_alloc_skb+0x22/0x2b00] __netdev_alloc_skb+0x22/0x50
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690040] [e1000:__netdev_alloc_skb+0x22/0x2b00] __netdev_alloc_skb+0x22/0x50
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690052] [<f8d80c90>] bond_3ad_lacpdu_recv+0x0/0x240 [bonding]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690075] [e1000:netif_receive_skb+0x381/0xcf0] netif_receive_skb+0x381/0x460
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690109] [<f88e356b>] e1000_clean_rx_irq+0x26b/0x530 [e1000]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690137] [<f88e337c>] e1000_clean_rx_irq+0x7c/0x530 [e1000]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690201] [<f88e3300>] e1000_clean_rx_irq+0x0/0x530 [e1000]
Oct 14 06:34:55 iSCSI-A kernel: [3158205.690229] [<f88e069e>] e...

Charles (taylorc) wrote :

Additionally, this is Ubuntu 8.04.3 LTS running on a hyperthreaded P4. Disabling hyperthreading has no effect. I am updating it to Ubuntu 2.6.24-24.61-server, but I'm not very hopeful...

Charles (taylorc) wrote :

Just curious... This is a high priority bug that affects many servers... I thought this distribution had Long Term Support. This bug has been sitting here for OVER A YEAR!!! Doesn't anyone care???

Sis Temes (sistemes) wrote :

try clocksource=tsc at boot time...

Jeremy Foshee (jeremyfoshee) wrote :

Charles,
     The main difficulty is the total of 10,000 bugs that need attention. I'd really appreciate dmesg logs and any other log file that is relevant to the issue. Have you had the opportunity to test this versus the latest Hardy release that just came out a few weeks ago?

I'd be very interested in your testing.

Thanks,

-JFo

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Jeremy Foshee (jeremyfoshee) wrote :

I neglected to mention that I would be very interested in whether you encounter this issue in one of the more current releases of Karmic or Lucid. Do you think you could run a LiveCD to verify this issue in one of these releases?

Thanks in advance,

-JFo

Sim (simvirus) wrote :

After 5 months, 2 crashes in 1 day!

LTS???... Ridiculous!

PS: At boot Ubuntu shows:

[..] intel_rng: FWH not detected
[..] iTCO_wdt: failed to reset NO_REBOOT flag, reboot disabled by hardware

Is this the cause?

However it is difficult to diagnose ...
Regards

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sim (simvirus) wrote :

Another crash after 14h!

Here is a screenshot.

http://img195.imageshack.us/img195/1930/screenshotjr.jpg

No comment!

JvA (jvanacht) wrote :

I just started seeing this bug on an NFS server (nfs-kernel-server 1:1.1.2-2ubuntu2.2) on 8.04LTS (2.6.24-27-server #1 SMP Fri Mar 12 01:23:09 UTC 2010 x86_64 GNU/Linux / Ubuntu 2.6.24-27.68-server) with four NFS exports from two iSCSI initiated volumes (open-iscsi 2.0.865-1ubuntu3.3). This NFS server is a virtual machine (VMware ESXi 3.5.0 build 169697) that was setup in October 2009. The NFS server is strictly serving an interim need of offloading data from an old (OpenSuSE 10.0) Samba server's overflowing hard drives.

Up until last week the machine ran without any trouble.

Last week we added the second of the two iSCSI volumes and added an NFS share to the space on that volume. (All volumes, local disk and iSCSI, are ext3.) We mounted the new NFS volume from the Samba machine and moved about 100GB of data off the old Samba server's local drives via rsync. No problem doing that. We then deleted the data from the Samba server and created symlinks in place of each moved folder, pointing to the respective folder on the new NFS volume.

The 100GB of data we just moved was backup data which the Windows users were backing up to using robocopy.exe. This system had worked just fine for years. But now, nearly every time a robocopy runs, we see the NFS server's kernel hang with the "soft lockup - CPU stuck for 11s" error being discussed in this thread. When this happens, the virtual machine is totally unresponsive, and we have to do a hard reset. The other virtual machines on the VMware server do not seem to be impacted in any way.

When we manually drag-and-drop 4GB of data (a typical amount being robocopied by the users) we do not have the problem. This is the first NFS folder which has to handle data being copied (through the Samba server, remember) using robocopy.

I'm no linux kernel developer, but my two cents are that the kernel is seeing a slow response from the iSCSI initiator when a heavy write load is placed on the iSCSI driver and it doesn't respond for a few seconds. After doing some research into this, we are going to try increasing the /proc/sys/kernel/softlockup_thresh from 10 to 60 seconds (the maximum allowed value short of turning off the threshold check) for now and see if that changes anything. If my hypothesis is correct, it likely would.
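
The threshold change described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the path is the one named in this comment and may not exist on other kernels, the upper bound of 60 comes from the description above, the lower bound of 1 is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path

# Path named in the comment above; it may be absent on other kernel versions.
THRESH_PATH = Path("/proc/sys/kernel/softlockup_thresh")
MAX_THRESH_SECONDS = 60  # upper limit as described above
MIN_THRESH_SECONDS = 1   # assumed lower limit, not taken from this report


def read_threshold(path=THRESH_PATH):
    """Return the current threshold in seconds, or None if the file is absent."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def checked_threshold(seconds):
    """Reject values outside the assumed valid range (hypothetical helper)."""
    if not MIN_THRESH_SECONDS <= seconds <= MAX_THRESH_SECONDS:
        raise ValueError(
            f"threshold must be between {MIN_THRESH_SECONDS} "
            f"and {MAX_THRESH_SECONDS} seconds, got {seconds}"
        )
    return seconds


def write_threshold(seconds, path=THRESH_PATH):
    """Write a validated threshold; on a real system this needs root."""
    path.write_text(f"{checked_threshold(seconds)}\n")


if __name__ == "__main__":
    print("current threshold:", read_threshold())
```

Raising the threshold only delays the warning; it would not remove whatever is keeping the CPU busy, so it is a diagnostic aid rather than a fix.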

Perhaps these observations will be of some value among the community and developers in piecing this puzzle together...

Andrew Cowie (afcowie) wrote :

We're hitting a similar problem; hard to say whether it's related to bug #556919 or bug #540378 but we get

   CPU#1 stuck for 66s

type stuff too. This is in a Lucid server host & guest, with lucid-proposed enabled.

AfC

David McGiven (davidmcgivenn) wrote :

Dear Ubuntu Users,

I'm hitting the same error/bug you are. My setup is the following :

- SunFire X4450 with 4 Intel Xeon 6-Core :
(Intel(R) Xeon(R) CPU E7450 @ 2.40GHz)

- Ubuntu 8.04 LTS :
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 8.04.4 LTS
Release: 8.04
Codename: hardy

- Kernel version :
Linux xxxxx 2.6.24-27-server #1 SMP Wed Mar 24 11:32:39 UTC 2010 x86_64 GNU/Linux

- I'm running a 24 processors NAMD job (http://www.ks.uiuc.edu/Research/namd/)

After less than 1 minute, the system becomes unresponsive for ~10 minutes and then it comes back to "normal" (no need to reboot if you are patient enough).

Checking the dmesg buffer shows the already discussed "[ 2618.201092] BUG: soft lockup - CPU#23 stuck for 11s! [events/23:98]"

I've also seen some messages regarding a RAID module :
[ 2625.160886] aacraid: Host adapter abort request (0,0,0,0)
[ 2625.161029] aacraid: Host adapter reset request. SCSI hang ?

But I don't know if they're very relevant because the software and the data are accessed through NFS so it's not really writing to a local disk.

Does this provide any help to solve the problem ? I can send more detailed logs to the ubuntu LTS team if needed, this bug has to be solved! it's been more than a year now.

David McGiven (davidmcgivenn) wrote :

Well, I've added the noapic boot option and now I am not able to reproduce the error ...

That's a step forward ...
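
To confirm that a boot option such as noapic (or the clocksource=tsc suggested earlier in this thread) actually took effect, the running kernel's command line can be read from /proc/cmdline. Below is a minimal Python sketch; the helper name `parse_cmdline` and the example command line are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical example command line, for illustration only.
example = "root=/dev/sda1 ro quiet noapic clocksource=tsc"
params = parse_cmdline(example)

print("noapic set:", "noapic" in params)
print("clocksource:", params.get("clocksource"))

# On a running Linux system the real value could be read with:
#   with open("/proc/cmdline") as f:
#       params = parse_cmdline(f.read())
```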

Wuestenschiff (wuestenschiff) wrote :

I've got the same problem on my Ubuntu 8.04 64-bit (home) server. Is there a solution yet? Should I try an update to 10.04 (I'm the only user of the server, so I could risk the downtime)?

David McGiven (davidmcgivenn) wrote :

It's very sad that this bug is not being fixed by Ubuntu developers. I'm very disappointed.

Sim (simvirus) wrote :

...me too!

Jeremy Foshee (jeremyfoshee) wrote :

Clearly several of you are under a misconception as to the role of the Ubuntu Kernel Team. I will attempt to clarify.

The Ubuntu Kernel Team is focused on providing the most stable kernel possible for a release schedule of 6 months. This means that they are committed to pulling in the patches from upstream that make the most sense and are stable enough to be included for a distro release. They are not the type of team that works through kernel bugs and resolve them in the upstream source. For that you would need to open bugs with https://bugzilla.kernel.org/ as they are the upstream maintainers of the Linux kernel. The Ubuntu Kernel Team accepts bugs so that we can pull the relevant fixes from the upstream source where possible. This is the only use case for our tracking of bugs and it is also what contributes in some cases to the length of time in response to open bugs.

Should your testing against upstream kernels, such as those available in the kernel PPA by following https://wiki.ubuntu.com/KernelMainlineBuilds, still show the problem, then my suggestion is that you file an upstream bug report to have this looked at by the kernel maintainers. Information and suggestions on filing upstream bug reports are available at https://wiki.ubuntu.com/Kernel/Bugs in the section titled 'Reporting Bugs Upstream'

~JFo

tags: added: kernel-core kernel-needs-review
Changed in linux (Ubuntu):
status: Confirmed → Triaged

On Tue, Jun 08, 2010 at 04:07:40PM -0000, Jeremy Foshee wrote:
> The Ubuntu Kernel Team is focused on providing the most stable kernel
> possible for a release schedule of 6 months.

Is there a document online, beyond this bug, that discusses where the
kernel team's support goals differ from the distribution itself? IMO, a 6
month support schedule for the kernel adversely affects the reputation of
a release with the LTS monicker.

Note that I understand that managing kernel packages for Ubuntu is an
incredibly complex task and this bug isn't doing you any favors, but I
think that if you're going to hedge your bets so to speak on the support
duration, "LTS" should be qualified in some way.

Ryan

Sim (simvirus) wrote :

TONIGHT THERE WAS A NEW CRASH AFTER SEVERAL MONTHS, WITH MANY HOURS OF DOWNTIME.
HEARTBEAT ALSO ISN'T ABLE TO SWITCH TO THE SECOND SERVER, BECAUSE THIS ISSUE DOES NOT RELEASE ALL RESOURCES!
I HAVE ANOTHER DISTRO WITH SOME "YEARS" OF UPTIME WITHOUT ANY PROBLEM.
THIS IS RIDICULOUS... UBUNTU SERVER IS RIDICULOUS!
IS UBUNTU SERVER A REAL SERVER/STABLE RELEASE OR IS IT A JOKE?
SINCE 2008-07-05: WHAT WAS INTRODUCED IN THE NEW KERNEL RELEASE (2.6.24-18)->(2.6.24-19) ????
NO COMMENT

tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
tags: removed: kernel-candidate
Stefan Bader (smb) on 2010-07-14
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Stefan Bader (smb) wrote :

First, it is not really clear why -18 works and -19 does not. Looking over the patches in between, none of them immediately looks like it could be the cause. And actually I wonder whether all comments really are about the same problem or just the same symptom.
Anyway, as the kernel in question is relatively old, it is hard to get any upstream help on this, especially as it seems to be no issue any more. I found one patch that looks like it could be related, but this is a bit of a wild guess.

commit 2bf86b7aa8e74bf81a9872f7b610f49b610a4649 (2.6.25.y)
Author: Jay Vosburgh <email address hidden>
Date: Fri Mar 21 22:29:33 2008 -0700

    bonding: Fix locking in 802.3ad mode

But it still might be worth trying, so I put test kernel packages at http://people.canonical.com/~smb/lp245779/
If that still fails, then we need the help of someone affected to do bisects between the working and non-working kernels.

Sim (simvirus) wrote :

Dear Stefan,
thanks for your support!

Can you tell me if this patch is in the new Ubuntu 10.04 LTS kernel?

If so, I can try to upgrade my servers to the new version.

The crash is very occasional (about every 3 to 6 months), so for this test with the patched kernel we would have to wait a long time.

Tell me what I have to do...

Thanks again!
Best regards
Sim

Stefan Bader (smb) wrote :

Sim, this patch was part of 2.6.26, so besides all the other things that have gone into 10.04 (which is based on 2.6.32), this is part of it as well. All the comments from various sources indicate this problem was not seen after 2.6.26, so 10.04 should be fine.

The patched test kernels would be for people who need to stay on Hardy and want to see whether this would fix the issue in Hardy (though there is no guarantee). As stated above, there is no clear reason for what would have caused the problem between the two mentioned kernel versions, but on the other hand the patch I found handled some lockup problem. But those are all things to consider for Hardy.

Changed in linux (Ubuntu Hardy):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: Triaged → Fix Released
Stefan Bader (smb) wrote :

There was no update to this from the OR after suggesting 10.04 LTS should be ok. So I assume this is now in use and works. As I don't see anybody requiring a fix in Hardy (no report back on the test kernel either) I assume the immediate need is gone there. It would take some amount of debugging/verification work to find out whether we could actually solve it (especially as it happens only rarely), so I set it to won't fix for the time being. Should there be a real need for it, we can re-open the bug.

Changed in linux (Ubuntu Hardy):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: Triaged → Won't Fix
semmelb (hajimemasudozo) wrote :

Bug reappears in Ubuntu Server 10.04.1 LTS
Using kernel 2.6.32-24 server running on an Asus P5K-E mainboard

Man this is annoying!

Sim (simvirus) wrote :

I'm very desperate.

Ubuntu is very bad and the support hasn't solved this problem since July 2008!!!

For three months nothing happened, and today the issue occurred three times!!!
As Semmelb said, the bug reappears in Ubuntu Server 10.04 LTS.

I have also added an attachment to show this long-standing issue.

Please, please, please solve it!!!!
