[Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with 10.04.3

Bug #897773 reported by Brendan Donegan
This bug affects 2 people
Affects: HW-labs
Status: Fix Released
Importance: High
Assigned to: Jeff Lane
Milestone: (none listed)

Bug Description

We have two Acer servers which halt shortly after booting when running Lucid 10.04.3. This didn't happen with 10.04.2.

The message 'Uhhuh. NMI received for unknown reason 21 on CPU 0.' has been seen shortly before the halting occurred on one of the systems (the AR320).

We have not been able to reproduce the behaviour with the release Lucid kernel, no matter how long we wait.

Next steps are to run 'memtest' on both systems and do an apport-collect if possible.

Tags: lucid
summary: - [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting
+ [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with
+ Lucid (2.6.32-36.79) -proposed kernel
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 897773

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
Brad Figg (brad-figg) wrote : Re: [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with Lucid (2.6.32-36.79) -proposed kernel

Ignore the comment from the bot about needing log files.

Changed in linux (Ubuntu):
assignee: nobody → Brad Figg (brad-figg)
status: Incomplete → Confirmed
Brad Figg (brad-figg) wrote :

It has been reported in IRC that the 10.04.3 release kernel is not exhibiting this issue. Please install the kernels that have come out since 10.04.3 to see where this possible regression was first introduced.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
TienFu Chen (ctf) wrote :

Ran the memory test from the GRUB menu; it passed.
Got a more complete error message:
Uhhuh. NMI received for unknown reason 21 on CPU 0
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue.

TienFu Chen (ctf) wrote :

I removed the package checkbox-certification-cli and the halting problem disappeared. It has been about 1 hour so far, and I'm continuing to monitor it.
The attachment is from the AR320 after checkbox-certification-cli was removed.

TienFu Chen (ctf) wrote :

checkbox-certification.log, captured while the halt was happening.

TienFu Chen (ctf) wrote :
Brendan Donegan (brendan-donegan) wrote :

We see this problem in the release kernel as well, contrary to what it says in the bug description.

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → New
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 897773

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bot-stop-nagging
Ara Pulido (ara)
summary: [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with
- Lucid (2.6.32-36.79) -proposed kernel
+ 10.04.3
Victor Tuson Palau (vtuson) wrote :

Hi Brendan,

What is the release kernel - can you add the kernel number?

Based on Tim's comment it looks like it is an issue related to checkbox?

Thanks,

Victor

Ara Pulido (ara) wrote :

We are currently trying to find out when this regression was introduced.

description: updated
Brendan Donegan (brendan-donegan) wrote :

The release kernel is 2.6.32-33. As Ara said we only just established today that it was present in the release kernel, so I'm now attempting to trace further back.

tags: removed: bot-stop-nagging
Brendan Donegan (brendan-donegan) wrote :

To sum up the problem as it seems to occur, we established that running the Checkbox script 'udev_resource' (lp:checkbox ; scripts/udev_resource) is what causes the halting. This script runs 'udevadm info --export-db' and parses the results. It *does* touch /proc when doing this.

We've also established that the problem is exhibited in 10.04.2 with the same kernel the system was certified with. We're now running the memtest tool again to see if we can find anything more.

TienFu Chen (ctf) wrote :

Some test results:
1) Ran memtest for 14 hours; it passed.
2) Running udev_resource with sudo always halts the system on both 10.04.2 and 10.04.3. Running udev_resource as a normal user is fine.
3) Saved the output of "udevadm info --export-db" to a file, then made udev_resource read its data from that file (no /proc access); same problem as above.

TienFu Chen (ctf) wrote :

The note in 3) of comment #14 that there is no /proc access is wrong. Looking at the source, the script still accesses /proc.

TienFu Chen (ctf) wrote :

The following record from "udevadm info --export-db" triggers the halt.

P: /devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: DRIVER=megaraid_sas
E: PCI_CLASS=10400
E: PCI_ID=1000:0073
E: PCI_SUBSYS_ID=1000:9241
E: PCI_SLOT_NAME=0000:01:00.0
E: MODALIAS=pci:v00001000d00000073sv00001000sd00009241bc01sc04i00
E: SUBSYSTEM=pci

It corresponds to the device below.

01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 9240 [1000:0073] (rev 02)
 Subsystem: LSI Logic / Symbios Logic Device [1000:9241]
 Flags: bus master, fast devsel, latency 0, IRQ 16
 I/O ports at 2000 [size=256]
 Memory at df940000 (64-bit, non-prefetchable) [size=16K]
 Memory at df900000 (64-bit, non-prefetchable) [size=256K]
 [virtual] Expansion ROM at c0000000 [disabled] [size=256K]
 Capabilities: <access denied>
 Kernel driver in use: megaraid_sas
 Kernel modules: megaraid_sas
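
To make the link between the udev record and the lspci entry explicit, here is a minimal sketch that parses "udevadm info --export-db" style text and decodes PCI_ID and PCI_CLASS. It is not the Checkbox parser: parse_export_db and decode_pci are hypothetical names, and the sketch only handles the P: and E: lines shown in this comment.

```python
def parse_export_db(text):
    """Split `udevadm info --export-db` style text into one dict per device."""
    devices = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current record
            continue
        prefix, _, value = line.partition(": ")
        if prefix == "P":
            current = {"path": value, "env": {}}
            devices.append(current)
        elif prefix == "E" and current is not None:
            key, _, val = value.partition("=")
            current["env"][key] = val
    return devices


def decode_pci(env):
    """Decode PCI_ID (vendor:device) and PCI_CLASS (hex) from a record's env."""
    vendor, device = env["PCI_ID"].split(":")
    pci_class = int(env["PCI_CLASS"], 16)
    return {
        "vendor": vendor.lower(),
        "device": device.lower(),
        "base_class": (pci_class >> 16) & 0xFF,
        "sub_class": (pci_class >> 8) & 0xFF,
        "prog_if": pci_class & 0xFF,
    }


# The record quoted in this comment.
SAMPLE = """\
P: /devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: DRIVER=megaraid_sas
E: PCI_CLASS=10400
E: PCI_ID=1000:0073
E: PCI_SUBSYS_ID=1000:9241
E: PCI_SLOT_NAME=0000:01:00.0
E: MODALIAS=pci:v00001000d00000073sv00001000sd00009241bc01sc04i00
E: SUBSYSTEM=pci
"""

record = parse_export_db(SAMPLE)[0]
info = decode_pci(record["env"])
print(record["env"]["DRIVER"], info)
```

Base class 0x01 with sub-class 0x04 matches the [0104] RAID bus controller class in the lspci output, and 1000:0073 matches the vendor and device IDs shown there.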

TienFu Chen (ctf) wrote :

The problem is caused by the following file.

/sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/vpd
(MegaRAID SAS 9240 DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:01:00.0)
-rw------- 1 root root 32768 2011-12-02 14:51 vpd

Running this command causes the halt: sudo cat vpd
The vpd file can only be read or written by root, which explains why running udev_resource as a normal user does not reproduce the problem.
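
The permission observation can be checked without opening the file at all. The following is a rough sketch under assumptions: describe_attribute and demo are hypothetical helpers, and the demo uses a temporary stand-in file with mode 0600 rather than the real sysfs attribute, so it is safe to run on any machine.

```python
import os
import stat
import tempfile


def describe_attribute(path):
    """Report type and permission bits of a file without opening it."""
    st = os.lstat(path)
    mode = stat.S_IMODE(st.st_mode)
    return {
        "is_regular": stat.S_ISREG(st.st_mode),
        "mode_string": stat.filemode(st.st_mode),
        "owner_uid": st.st_uid,
        # True when neither group nor others have any access bits set.
        "owner_only": (mode & 0o077) == 0,
    }


def demo():
    """Create a stand-in file with the same mode as the vpd attribute."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vpd")
        with open(path, "wb") as f:
            f.write(b"\x00" * 16)
        os.chmod(path, 0o600)
        return describe_attribute(path)


print(demo())
```

On the affected server the same call on the real vpd path would only stat the file, not read it. A file that is owner-only and owned by uid 0 cannot be opened by an unprivileged user, so the script never reaches the read that triggers the halt.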

TienFu Chen (ctf) wrote :

Running the command sudo cat vpd causes the halt, and the server console immediately displays the following message:
-----
Uhhuh. NMI received for unknown reason 21 on CPU 0
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue.

TienFu Chen (ctf) wrote :

What:           /sys/bus/pci/devices/.../vpd
Date:           February 2008
Contact:        Ben Hutchings <email address hidden>
Description:
                A file named vpd in a device directory will be a
                binary file containing the Vital Product Data for the
                device. It should follow the VPD format defined in
                PCI Specification 2.1 or 2.2, but users should consider
                that some devices may have malformatted data. If the
                underlying VPD has a writable section then the
                corresponding section of this file will be writable.
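
For reference, here is a defensive sketch of parsing VPD bytes that are already in memory. It never touches the sysfs file. The layout (large-resource tags 0x82, 0x90 and 0x91 with a two-byte little-endian length, keyword fields with a one-byte length, end tag 0x78) is my reading of the PCI VPD format and should be checked against the specification; parse_vpd and its helpers are hypothetical names, and the sample bytes are made up for illustration.

```python
def _parse_fields(body):
    """Parse keyword fields: 2-byte keyword, 1-byte length, then the value."""
    fields = {}
    pos = 0
    while pos < len(body):
        if pos + 3 > len(body):
            raise ValueError("truncated keyword header")
        keyword = body[pos:pos + 2].decode("ascii", "replace")
        length = body[pos + 2]
        value = body[pos + 3:pos + 3 + length]
        if len(value) != length:
            raise ValueError("truncated value for keyword %r" % keyword)
        fields[keyword] = value
        pos += 3 + length
    return fields


def parse_vpd(data):
    """Parse VPD bytes, raising ValueError on anything malformed."""
    result = {"id": None, "ro": {}, "rw": {}}
    pos = 0
    while pos < len(data):
        tag = data[pos]
        if tag == 0x78:  # end tag
            return result
        if not tag & 0x80:
            raise ValueError("unexpected small resource tag 0x%02x" % tag)
        if pos + 3 > len(data):
            raise ValueError("truncated resource header")
        length = data[pos + 1] | (data[pos + 2] << 8)
        body = data[pos + 3:pos + 3 + length]
        if len(body) != length:
            raise ValueError("truncated resource body")
        if tag == 0x82:  # identifier string
            result["id"] = body.decode("ascii", "replace")
        elif tag == 0x90:  # read-only section
            result["ro"] = _parse_fields(body)
        elif tag == 0x91:  # read-write section
            result["rw"] = _parse_fields(body)
        else:
            raise ValueError("unknown large resource tag 0x%02x" % tag)
        pos += 3 + length
    raise ValueError("missing end tag")


def _large(tag, body):
    """Build a large resource: tag, little-endian length, body."""
    return bytes([tag, len(body) & 0xFF, len(body) >> 8]) + body


def _field(keyword, value):
    """Build one keyword field."""
    return keyword.encode("ascii") + bytes([len(value)]) + value


# Made-up sample data, not taken from any real card.
SAMPLE = (_large(0x82, b"Example Card")
          + _large(0x90, _field("PN", b"1234") + _field("SN", b"AB"))
          + b"\x78")

print(parse_vpd(SAMPLE))
```

Raising on truncated or unknown data, rather than guessing, follows the warning above that some devices may have malformatted data.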

TienFu Chen (ctf) wrote :

To fix the problem in checkbox, we could add "vpd" to the exclusion list in the statement below.

line 443 in udev_resource:
        for name in names:
            name_path = posixpath.join(sys_path, name)
            if name[0] == "." \
               or name in ["dev", "uevent", "vpd"] \
               or posixpath.isdir(name_path) \
               or posixpath.islink(name_path):
                continue
With "vpd" excluded, udev_resource no longer causes the halting problem.
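
The exclusion can also be sketched as a self-contained function. This is only an illustration, not the actual udev_resource code: read_attributes, EXCLUDED_NAMES and demo are hypothetical, and a temporary directory stands in for a sysfs device directory. Only the skip condition is taken from the snippet above.

```python
import os
import posixpath
import tempfile

# Attribute names that are never opened; "vpd" is the proposed addition.
EXCLUDED_NAMES = ["dev", "uevent", "vpd"]


def read_attributes(sys_path, excluded=EXCLUDED_NAMES):
    """Read plain attribute files in a directory, skipping excluded names."""
    attributes = {}
    for name in sorted(os.listdir(sys_path)):
        name_path = posixpath.join(sys_path, name)
        if name[0] == "." \
           or name in excluded \
           or posixpath.isdir(name_path) \
           or posixpath.islink(name_path):
            continue
        try:
            with open(name_path, "rb") as f:
                attributes[name] = f.read().strip()
        except OSError:
            # Unreadable attributes are skipped rather than treated as fatal.
            continue
    return attributes


def demo():
    """Build a fake device directory and show that vpd is never read."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, content in [("vendor", b"0x1000\n"),
                              ("vpd", b"\xff" * 8),
                              ("uevent", b"DRIVER=megaraid_sas\n")]:
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(content)
        os.mkdir(os.path.join(tmp, "power"))
        return read_attributes(tmp)


print(demo())
```

Because the name check runs before open(), the excluded file is never opened at all, which is the property that matters here.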

TienFu Chen (ctf) wrote :

Disks on AR320:
[ 2.431927] scsi 4:0:6:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5
[ 2.438752] scsi 4:0:7:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5

Disks on AR160:
[ 3.935828] scsi 4:0:13:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5
[ 3.941908] scsi 4:0:14:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5

Brad Figg (brad-figg) wrote :

@tim

Please try to capture a dmesg log at the point of the halt.

Jeff Lane  (bladernr) wrote :

So I've done some digging... First, an anecdotal story:

A man was experiencing extreme pain whenever he touched his arm in a particular place. So he goes to the doctor to have it checked out. The man says "Doctor, every time I push on this spot on my arm, I get horrible pain."

The doctor looks at the man and says, "Well then, stop pushing there."

After looking at this and recreating it a few times, I think I know what's going on here. Now keep in mind that this is just a guess... but I think that the vpd file in the sysfs directory for this card is actually a direct interface to the card itself, meant to be accessed only via an API of some sort... if you run file on it and other items in the directory you'll note that it's a "regular file" as opposed to "ASCII text" or some other file type.

For comparison, if you run file on /proc/kcore, it too is a "regular file".

I mention this because the behaviour seen here is alarmingly similar to older kernels (2.2.x or 2.4.x, IIRC) where you could cat /proc/kcore to bring a system crashing to a halt.

So what I think is happening is that by accessing that vpd file (either using cat or in the udev_resource script, which does an open() on the vpd file) we are inadvertently causing the kernel to hang. I'd like to poke at this further by checking into the uevent file in that device's sysfs directory, but at the moment I'm unable to access the system, even after a reboot. It may need to be re-installed or at least poked manually. Power-cycling it remotely hasn't worked so far.

In any case, I think Tim's already got this sorted out. My only concern with his suggested solution is whether or not blacklisting vpd will break data collection elsewhere. In other words, while /path/to/vpd here may be breaking the system, is there a case where /path/to/vpd actually contains parseable data that doesn't trigger a hung system when you try opening it?

Jeff Lane  (bladernr) wrote :

I was asked earlier for a dmesg dump from when this is triggered. Here it is, but you'll notice that there's nothing there. This is an instantaneous hang for the kernel, so it never has a chance to write anything to the ring buffer, let alone pass that on to log files.

To create this, I ran the following script, appending the output to the file dmesg.log:

#!/bin/bash
while true; do
        dmesg -c
done

sudo ./dmesg-monitor >> dmesg.log

Brendan Donegan (brendan-donegan) wrote :

This leads to the question, why don't any of our other systems hang when they touch the vpd file?

TienFu Chen (ctf) wrote :

I swapped the RAID cards between the two systems.
The AR160 now has the card that was originally installed in the AR320 [card A], and the problem can now always be reproduced on the AR160.
Card A has a sticker on the heatsink that says "Old stepping."; I don't know what that actually means.

Brad Figg (brad-figg) wrote :

Can we get an "apport-collect 897773" run on this system? Thanks.

TienFu Chen (ctf) wrote :

Hi Brad,
Please check the apport report in comment #5.

Brad Figg (brad-figg) wrote :

@tim,

Ok, it doesn't normally get added as a single attachment like that.

Jeff Lane  (bladernr) wrote :

So this problem seems to have been hardware related ultimately.

On further investigation, after Tim swapped identical cards between the two servers and the failure followed the older card, I got to looking and found that the firmware on the failing card was several revs below the firmware on the newer card, which was itself 1 rev down from the most recent firmware for the LSI 9240-4i.

So I flashed the firmware on both cards to the latest rev (20.10.1-0061) and this seems to have cured the problem.

I can manually cat the vpd file in sysfs repeatedly without failure and I have also been able to run checkbox (which initially triggered the issue) again without failure.

So I think we can call this a firmware issue. I'm closing it as such. If this re-occurs for some reason, we can reopen or just open a new bug then.

affects: linux (Ubuntu) → hw-labs
Changed in hw-labs:
assignee: Brad Figg (brad-figg) → Jeff Lane (bladernr)
status: Incomplete → Fix Released
Jeff Lane  (bladernr) wrote :

Also, re-assigning this to hw-labs as this was a hardware issue rather than an OS or kernel issue.
