BUG: soft lockup - CPU#0 stuck for 61s!

Bug #214814 reported by TJ on 2008-04-09
This bug affects 6 people
Affects | Status | Importance | Assigned to | Milestone
Linux | Fix Released | Unknown | |
linux (Ubuntu) | | High | TJ |
Hardy | | High | Unassigned |

Bug Description

See also upstream bug:

http://bugzilla.kernel.org/show_bug.cgi?id=10396

Systems based on the Intel 450NX chipset may fail to recognise devices, which leads to failing drivers, unhandled IRQs, and other serious boot failures. The cause is that this chipset has 3 PCI root buses. When it was first released, some operating systems (read: Windows NT) didn't always correctly discover the 2nd and 3rd PCI buses. As a result, the PCI BIOS tables were 'hacked' to include a fake bridge device on PCI bus 0 that points to the same bus number as the peer bus, so that the OS would still scan those buses correctly.

$ lspci
00:0a.0 PCI bridge: Intel Corporation 21154 PCI-to-PCI Bridge
00:10.0 Host bridge: Intel Corporation 450NX - 82451NX Memory & I/O Controller (rev 03)
00:12.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
00:13.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
00:14.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)

As a result, in a well-behaved OS the 2nd and 3rd PCI buses would be scanned twice: once as secondaries of the 1st bus, and then again as root buses in their own right. This caused problems with devices being discovered twice.

A fix-up for all i450NX chipsets was introduced in arch/i386/pci/fixups.c::pci_fixup_i450nx(). (Note: arch/i386 was subsequently refactored into arch/x86/.) The fix-up checks the PCI config for the subsidiary buses and, if it finds them, scans them, which adds them to the root_pci_bus list. Later in the boot process the ACPI/PCI code reads the ACPI DSDT table, finds the PCI bus entries (PNP0A03) and tries to scan them. It fails when scanning the 2nd and 3rd buses with:

[ 0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
[ 0.912085] ACPI: Bus 0000:02 not present in PCI namespace
[ 0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
[ 0.920085] ACPI: Bus 0000:03 not present in PCI namespace

Unfortunately, the message is misleading: the real reason is that the bus is found to be already registered, so it is ignored. The situation can be worked around by booting with "pci=noacpi".
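To check whether a saved boot log shows this symptom, something like the following could be used. It is only a minimal sketch: it assumes the message format quoted above, and the function name find_skipped_buses is made up for illustration.

```python
import re

# Matches the message format quoted in this report, e.g.
# "[ 0.912085] ACPI: Bus 0000:02 not present in PCI namespace"
NOT_PRESENT = re.compile(
    r"ACPI: Bus ([0-9a-fA-F]{4}):([0-9a-fA-F]{2}) not present in PCI namespace"
)


def find_skipped_buses(log_text):
    """Return (domain, bus) pairs that the ACPI code reported as not present."""
    return [(m.group(1), m.group(2)) for m in NOT_PRESENT.finditer(log_text)]


sample = """\
[ 0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
[ 0.912085] ACPI: Bus 0000:02 not present in PCI namespace
[ 0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
[ 0.920085] ACPI: Bus 0000:03 not present in PCI namespace
"""

for domain, bus in find_skipped_buses(sample):
    print(f"bus {domain}:{bus} was skipped by the ACPI scan")
```

If any buses are listed, booting with "pci=noacpi" is the workaround to try first.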

The solution is to make the pci_fixup_i450nx() code selective based on the DMI of the system. I've introduced a patch that does this. Initially the only DMI it will match is Dell PowerEdge 6300 but if other systems are found to be affected the output of "sudo dmidecode" should be captured and reported. Additional DMI_MATCH entries can then be added to the patch.
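The matching idea can be illustrated outside the kernel. This is a rough user-space sketch, not the patch itself: the table layout, the function name is_listed_system and the match string "PowerEdge 6300" are assumptions for illustration, and the real strings should be taken from "sudo dmidecode" output on the affected machine.

```python
# Hypothetical table modelled on the DMI_MATCH idea: each entry maps DMI field
# names to substrings that must ALL occur in the system's reported values.
# The string below is an assumption, not taken from real dmidecode output.
AFFECTED_SYSTEMS = [
    {"product_name": "PowerEdge 6300"},
]


def is_listed_system(dmi, table=AFFECTED_SYSTEMS):
    """Return True if at least one table entry matches the given DMI values.

    `dmi` is a dict of field name -> string, for example built from the
    output of dmidecode. A missing field never matches.
    """
    return any(
        all(needle in dmi.get(field, "") for field, needle in entry.items())
        for entry in table
    )


# Example values, invented for illustration only.
this_machine = {
    "sys_vendor": "Dell Computer Corporation",
    "product_name": "PowerEdge 6300/550",
}
print(is_listed_system(this_machine))
```

Supporting another affected system would then only mean appending one more entry to the table, which mirrors adding another DMI_MATCH entry to the patch.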

I found this reference to the issue in AKPM's 2.6.0 mm tree and the linux-scsi mailing list archive:

"I can tell you what's going on here. This is a 450NX based motherboard. The 450NX chipset from Intel was the first chipset to have peer PCI busses. For backwards compatibility, some machine makers hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that points to the same bus number as the peer bus. This way if the OS didn't know about the peer bus registers it would still find the devices by scanning behind the bridge. In this case we are scanning behind this fake bridge and then also scanning based upon the peer bus registers in the chipset, and as a result we are finding the device twice. In order to fix this problem you need to change the peer bus quirk code for the 450NX chipset to scan the list of bus 0 devices looking for a bridge that has the same config as the peer bus registers and if so delete the bridge from the list. That will avoid double scanning and will avoid having the PCI code try and configure sub busses via a fake bridge when it should do all configurations via the 450NX peer bus registers.

--
  Doug Ledford <email address hidden>"

http://marc.info/?l=linux-scsi&m=106839680416899&w=2
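The approach Doug describes can be sketched in a few lines. This is only an illustration of the idea, not kernel code: the Device type, the function name drop_fake_bridges and the bus numbers in the example are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    slot: str                            # e.g. "00:0a.0"
    secondary_bus: Optional[int] = None  # set only for PCI-to-PCI bridges


def drop_fake_bridges(bus0_devices: List[Device], peer_buses: List[int]) -> List[Device]:
    """Remove bus 0 bridges whose secondary bus is also a chipset peer bus.

    `peer_buses` stands for the bus numbers read from the 450NX peer bus
    registers. A bridge that leads to one of those numbers is treated as the
    BIOS's fake bridge, so the bus is only scanned once, via the peer bus.
    """
    peers = set(peer_buses)
    return [
        dev for dev in bus0_devices
        if dev.secondary_bus is None or dev.secondary_bus not in peers
    ]


# Illustrative data: slots follow the lspci listing in this report, but the
# bus numbers are assumed for the example.
bus0 = [
    Device("00:0a.0", secondary_bus=2),  # the PCI-to-PCI bridge
    Device("00:10.0"),
    Device("00:12.0"),
]
kept = drop_fake_bridges(bus0, peer_buses=[2, 3])
print([dev.slot for dev in kept])
```

A bridge whose secondary bus is not a peer bus is left alone, so genuine bridges on bus 0 would still be scanned normally.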

In this particular case a Dell PowerEdge 6300 with a PERC 2 RAID array controller (aacraid) fails to boot on any kernel after v2.6.20 (Feisty). Reports show:

[ 0.000000] Linux version 2.6.24-15-generic (root@PowerEdge6300) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #1 SMP Fri Apr 4 09:18:39 BST 2008 (Ubuntu 2.6.24-15.26-generic)

[ 436.079664] Adaptec aacraid driver 1.1-5[2449]-ms

[ 492.476969] BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]
[ 492.483317]
[ 492.484874] Pid: 1376, comm: modprobe Not tainted (2.6.24-15-generic #1)
[ 492.491642] EIP: 0060:[<c0216641>] EFLAGS: 00000287 CPU: 2
[ 492.497226] EIP is at delay_tsc+0x41/0x50
[ 492.501302] EAX: 0000059e EBX: 0000003f ECX: 00000000 EDX: 0000003f
[ 492.507640] ESI: 17c02b3e EDI: df84f278 EBP: 17c025a0 ESP: df9dfd4c
[ 492.513972] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 492.519443] CR0: 8005003b CR2: 0812574c CR3: 1f97b000 CR4: 00000690
[ 492.525781] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 492.532114] DR6: ffff0ff0 DR7: 00000400
[ 492.536029] [<c02165c6>] __delay+0x6/0x10
[ 492.540264] [<f89496aa>] aac_fib_send+0x21a/0x2d0 [aacraid]
[ 492.546108] [<c012363a>] enqueue_task_fair+0x1a/0x30
[ 492.551318] [<f8945a94>] aac_get_adapter_info+0x74/0x620 [aacraid]
[ 492.557753] [<f8942f54>] aac_probe_one+0x224/0x450 [aacraid]
[ 492.563642] [<f8949b80>] aac_command_thread+0x0/0x6d0 [aacraid]
[ 492.569801] [<c0223136>] pci_device_probe+0x56/0x80
[ 492.574903] [<c027e85e>] driver_probe_device+0x8e/0x190
[ 492.580373] [<c027eace>] __driver_attach+0x9e/0xa0
[ 492.585385] [<c027dc7b>] bus_for_each_dev+0x3b/0x60
[ 492.590491] [<c027e6d6>] driver_attach+0x16/0x20
[ 492.595330] [<c027ea30>] __driver_attach+0x0/0xa0
[ 492.600259] [<c027e00a>] bus_add_driver+0x8a/0x1e0
[ 492.605281] [<c02232e3>] __pci_register_driver+0x53/0xa0
[ 492.610815] [<f8850033>] aac_init+0x33/0x74 [aacraid]
[ 492.616098] [<c0151511>] sys_init_module+0x151/0x1990
[ 492.621377] [<c01778fa>] __do_fault+0x21a/0x410
[ 492.626170] [<c0166421>] handle_fasteoi_irq+0x91/0xf0
[ 492.631465] [<c01053b2>] syscall_call+0x7/0xb
[ 492.636066] =======================

[ 17.155571] irq 10: nobody cared (try booting with the "irqpoll" option)
[ 17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1
[ 17.155571] [<c025ad74>] __report_bad_irq+0x24/0x80
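When comparing reports like the ones above, it helps to pull out which CPU and which task was stuck, since the same "soft lockup" symptom can come from different drivers. A small sketch, assuming the message format shown in these logs (the function name parse_soft_lockups is made up for illustration):

```python
import re

# Format as seen in this report:
# "BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]"
# The trailing "[task:pid]" part is treated as optional.
LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!(?: \[(.+?):(\d+)\])?"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup message found in the log text."""
    events = []
    for m in LOCKUP.finditer(log_text):
        cpu, seconds, task, pid = m.groups()
        events.append({
            "cpu": int(cpu),
            "seconds": int(seconds),
            "task": task,                      # None if not reported
            "pid": int(pid) if pid else None,  # None if not reported
        })
    return events


log = "[ 492.476969] BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]\n"
for event in parse_soft_lockups(log):
    print(event)
```

Here the stuck task is modprobe (loading aacraid); a report where the stuck task is something else may well have a different cause.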

This was first thought to be part of bug #149071 "-server kernel variant fails to boot on PowerEdge 2650 with AACRAID timeouts" but it now appears likely that it has a different root cause.

Attached here are patches for Gutsy and Hardy. An upstream patch for v2.6.25-rc8 is attached to the bugzilla report.

TJ (tj) wrote :
Changed in linux:
assignee: nobody → intuitivenipple
importance: Undecided → High
milestone: none → ubuntu-8.04
status: New → In Progress
description: updated
TJ (tj) on 2008-04-09
description: updated
Changed in linux:
status: Unknown → Incomplete
Changed in linux:
status: Incomplete → In Progress
TJ (tj) wrote :

A simpler fix was provided by Matthew Wilcox on the linux-pci mailing list. It is cleaner than my DMI-based patches. Matthew's suggested patch then came to the attention of Zhao Yakui on linux-acpi, who reported that a patch already in the -mm tree solves a similar report at bugzilla:

"Intel SC450NX system stops working with kernels later than 2.6.22.x"

http://bugzilla.kernel.org/show_bug.cgi?id=10124

That patch has been tested and confirmed working. The patch is found at:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.25-rc8/2.6.25-rc8-mm2/broken-out/acpi-unneccessary-to-scan-the-pci-bus-already-scanned.patch

Changed in linux:
status: In Progress → Invalid
Steve Langasek (vorlon) on 2008-04-26
Changed in linux:
milestone: ubuntu-8.04 → ubuntu-8.04.1

I get this on an older HP Pavilion a200n running as a server.

It previously had 7.10 on it and worked - as far as I know - flawlessly.

_ERROR_: BUG: soft lockup - CPU#0 stuck for 11s! [kacpi_notify:45]
_PROBLEM_: Computer locks up for $x amount of seconds
_TRIED_: Disable acpi by adding 'acpi=off' to '/boot/grub/menu.lst' -- didn't work

I will now attach:

$ lspci -vv > lspci.txt
$ cat /proc/cpuinfo > cpuinfo.txt
$ cat /proc/meminfo > meminfo.txt

How can I apply this patch? Or how can I compile the latest kernel?

Noticed you used a program called acpidump. Attaching.

Steve Langasek (vorlon) wrote :

There doesn't seem to have been much progress on getting this bugfix into the archive. I'm nominating this bug for hardy SRU, but dropping the milestone since it doesn't appear to be on track for inclusion in 8.04.1.

Changed in linux:
milestone: ubuntu-8.04.1 → none
importance: Undecided → High
status: New → Triaged

Hi TJ,

The patch you referenced in an earlier comment ( https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/214814/comments/3 ) seems to be in both the Hardy Heron 8.04 kernel and the upcoming Intrepid Ibex 8.10 kernel already. Can you confirm this is now resolved for you in both Hardy and the upcoming Intrepid release? Thanks in advance.

ogasawara@yoji:~/ubuntu-hardy$ git log 08f1c192c3c32797068bfe97738babb3295bbf42
commit 08f1c192c3c32797068bfe97738babb3295bbf42
Author: Muli Ben-Yehuda <email address hidden>
Date: Sun Jul 22 00:23:39 2007 +0300

    x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata

ogasawara@yoji:~/ubuntu-intrepid$ git log 08f1c192c3c32797068bfe97738babb3295bbf42
commit 08f1c192c3c32797068bfe97738babb3295bbf42
Author: Muli Ben-Yehuda <email address hidden>
Date: Sun Jul 22 00:23:39 2007 +0300

    x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata

TJ (tj) wrote :

Leann, the patch you reference is the one that introduced the problem.

The fix can be cherry-picked from:

commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
Author: <email address hidden> <email address hidden>
Date: Tue Apr 15 14:34:49 2008 -0700

    acpi: unneccessary to scan the PCI bus already scanned

My HP Pavilion a200n with an Intel Celeron 2.4GHz processor has this problem.

It happened in a fresh install of 8.04 and still happens after updating to 8.04.1. This never happened in 7.10.

Hope that helps!

TJ (tj) on 2008-08-15
Changed in linux:
milestone: none → ubuntu-8.04.2
TJ (tj) on 2008-08-15
Changed in linux:
status: Invalid → Unknown

Hi TJ,

Thanks for the clarification. This latest patch you pointed me to also seems to already be in the Hardy and Intrepid trees:

ogasawara@yoji:~/ubuntu-hardy$ git log b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
Author: <email address hidden> <email address hidden>
Date: Tue Apr 15 14:34:49 2008 -0700

    acpi: unneccessary to scan the PCI bus already scanned

    http://bugzilla.kernel.org/show_bug.cgi?id=10124

ogasawara@yoji:~/ubuntu-intrepid$ git log b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
Author: <email address hidden> <email address hidden>
Date: Tue Apr 15 14:34:49 2008 -0700

    acpi: unneccessary to scan the PCI bus already scanned

    http://bugzilla.kernel.org/show_bug.cgi?id=10124

Brett, I know you confirmed you still have an issue, but glancing at your lspci output it seems you have different hardware than what TJ originally reported. Even though you experience the same symptom that's reported here you might need to open a different bug report. But I'd like to hear from TJ first if this is actually fixed or not. Thanks.

Whoa, I think there is something wacky going on with my git tree. From the git log output I pasted it looks like the patch is already applied to Hardy, but on further examining the actual file(s) I'm not seeing the patch applied. I even re-cloned my git tree (git clone git://kernel.ubuntu.com/ubuntu/ubuntu-hardy.git) and see the same oddity. So please disregard my comment that this is already in Hardy.

On Fri, 2008-08-15 at 07:54 +0000, Leann Ogasawara wrote:
> Hi TJ,
>
> Thanks for the clarification. This latest patch you pointed me to also
> seems to already be in the Hardy and Intrepid trees:
>
> ogasawara@yoji:~/ubuntu-hardy$ git log b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
> commit b87e81e5c6e64ae0eae3b4f61bf07bfeec856184
>
> Author: <email address hidden> <email address hidden>
>
> Date: Tue Apr 15 14:34:49 2008 -0700

Leann, are you sure that is from ubuntu-hardy master branch? What I mean
is, I think it is from the <mainline>/master remote tracking branch if
you have one.

If I check the commits in master against one of the files touched the
commit b87e81e5 doesn't show up:

ubuntu-hardy$ git-status
# On branch master
nothing to commit (working directory clean)

ubuntu-hardy$ git-log -1 --pretty=oneline
b180a9b27d1875b970df1bcd74114300e0f7707a UBUNTU: if_arp: add a WiMax pseudo header

ubuntu-hardy$ git-log --pretty=format:"%h %ci %s" -- arch/x86/pci/pci.c | grep b87e81e5
ubuntu-hardy$

If I check the most recent tag prior to that commit I get:
ubuntu-hardy$ git-describe b87e81e5
v2.6.25-rc9-67-gb87e81e

ubuntu-hardy$ git-describe --contains b87e81e5
v2.6.25~6

which comes from my remote mainline tracking branch:

ubuntu-hardy$ git-remote show mainline
* remote mainline
  URL: /home/all/SourceCode/linux/linux-2.6/.git
  Tracked remote branches
    fix-bugzilla-10396 master pci-2.6 pci-iomem pci-resource-allocation-debug
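The underlying point is that "git log <commit>" succeeds whenever the commit object exists anywhere in the repository, including on a remote tracking branch, so it does not show that the commit is part of master. A more direct check is to ask git whether the commit is an ancestor of the branch. A minimal sketch, assuming a git new enough to support "merge-base --is-ancestor" and the "-C" option; the function name commit_in_branch and the injectable `run` parameter are just for illustration:

```python
import subprocess


def commit_in_branch(commit, branch="master", repo=".", run=subprocess.run):
    """Return True if `commit` is reachable from `branch` in the given repo.

    Relies on "git merge-base --is-ancestor", which exits with 0 when the
    commit is an ancestor of the branch and 1 when it is not. Any other exit
    status (unknown commit, not a repository, ...) is raised as an error.
    `run` can be replaced to try the logic without a real repository.
    """
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, branch],
        capture_output=True,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(
        "git failed: " + result.stderr.decode(errors="replace").strip()
    )
```

Calling commit_in_branch("b87e81e5", "master", repo="ubuntu-hardy") would then answer the question directly, whatever other remotes the tree happens to track.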

Changed in linux:
status: Unknown → Fix Released