HW-labs

[Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with 10.04.3

Bug #897773 reported by Brendan Donegan on 2011-11-29

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	HW-labs	Fix Released	High	Jeff Lane 

Bug Description

We have two Acer servers that halt shortly after booting when running Lucid 10.04.3. This did not happen with 10.04.2.

The message 'Uhhuh. NMI received for unknown reason 21 on CPU 0.' has been seen shortly before the halting occurred on one of the systems (the AR320).

We have not been able to reproduce the behaviour with the release Lucid kernel, no matter how long we wait.

Next steps are to run 'memtest' on both systems and do an apport-collect if possible.

Brendan Donegan (brendan-donegan) on 2011-11-29

summary:

- [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting
+ [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with
+ Lucid (2.6.32-36.79) -proposed kernel

Brad Figg (brad-figg) wrote on 2011-11-29: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 897773

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: lucid

Brad Figg (brad-figg) wrote on 2011-11-29: Re: [Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with Lucid (2.6.32-36.79) -proposed kernel

Ignore the comment from the bot about needing log files.

Changed in linux (Ubuntu):
assignee:	nobody → Brad Figg (brad-figg)
status:	Incomplete → Confirmed

Brad Figg (brad-figg) wrote on 2011-11-29:

It has been reported in IRC that the 10.04.3 release kernel does not exhibit this issue. Please install the kernels that have come out since 10.04.3 to see where this possible regression was first introduced.

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

TienFu Chen (ctf) wrote on 2011-11-30:

Ran the memory test from the GRUB menu; the test passed.
Got a fuller error message:
Uhhuh. NMI received for unknown reason 21 on CPU 0
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue.

TienFu Chen (ctf) wrote on 2011-11-30:

apport log from AR320 (420.9 KiB, text/plain)

I removed the package checkbox-certification-cli, and the halting problem disappeared. The system has now been up for about 1 hour and I am continuing to monitor it.
The attachment is from the AR320 after checkbox-certification-cli was removed.

TienFu Chen (ctf) wrote on 2011-11-30:

checkbox-certification.log (145.8 KiB, text/plain)

checkbox-certification.log from a run in which the halt occurred.

TienFu Chen (ctf) wrote on 2011-11-30:

checkbox-certification.log with log-level=debug (3.7 KiB, text/plain)

Brendan Donegan (brendan-donegan) wrote on 2011-11-30:

We see this problem in the release kernel as well, contrary to what it says in the bug description.

Brad Figg (brad-figg) on 2011-11-30

Changed in linux (Ubuntu):
importance:	Undecided → High
status:	Incomplete → New

Brad Figg (brad-figg) wrote on 2011-11-30: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 897773

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Joseph Salisbury (jsalisbury) on 2011-11-30

tags:

added: bot-stop-nagging

Ara Pulido (ara) on 2011-11-30

summary:

[Acer AR320 F1 and Acer AR160 F1] Halting shortly after booting with
- Lucid (2.6.32-36.79) -proposed kernel
+ 10.04.3

Victor Tuson Palau (vtuson) wrote on 2011-11-30:

#10

Hi Brendan,

What is the release kernel - can you add the kernel number?

Based on Tim's comment it looks like it is an issue related to checkbox?

Thanks,

Victor

Ara Pulido (ara) wrote on 2011-11-30:

#11

We are currently trying to find out when this regression was introduced

description:

updated

Brendan Donegan (brendan-donegan) wrote on 2011-11-30:

#12

The release kernel is 2.6.32-33. As Ara said we only just established today that it was present in the release kernel, so I'm now attempting to trace further back.

Joseph Salisbury (jsalisbury) on 2011-11-30

tags:

removed: bot-stop-nagging

Brendan Donegan (brendan-donegan) wrote on 2011-12-01:

#13

To sum up the problem as it seems to occur, we established that running the Checkbox script 'udev_resource' (lp:checkbox ; scripts/udev_resource) is what causes the halting. This script runs 'udevadm info --export-db' and parses the results. It *does* touch /proc when doing this.

We've also established that the problem is exhibited in 10.04.2 with the same kernel the system was certified with. We're now running the memtest tool again to see if we can find anything more.

TienFu Chen (ctf) wrote on 2011-12-02:

#14

memtest memory test for 14 hours. (832.4 KiB, image/jpeg)

Some tests:
1) Ran memtest for 14 hours; it passes.
2) Running udev_resource with sudo always halts the system on both 10.04.2 and 10.04.3. Running udev_resource as a normal user is OK.
3) Saved the output of "udevadm info --export-db" to a file, then made udev_resource read its data from that file (no /proc accessing); same problem as above.

TienFu Chen (ctf) wrote on 2011-12-02:

#15

Item 3) in comment #14 is wrong where it says "(no /proc accessing)". Looking at the source, it still accesses /proc.

TienFu Chen (ctf) wrote on 2011-12-02:

#16

The following data from "udevadm info --export-db" causes the halting.

P: /devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: UDEV_LOG=3
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: DRIVER=megaraid_sas
E: PCI_CLASS=10400
E: PCI_ID=1000:0073
E: PCI_SUBSYS_ID=1000:9241
E: PCI_SLOT_NAME=0000:01:00.0
E: MODALIAS=pci:v00001000d00000073sv00001000sd00009241bc01sc04i00
E: SUBSYSTEM=pci

It relates to the device below.

01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 9240 [1000:0073] (rev 02)
Subsystem: LSI Logic / Symbios Logic Device [1000:9241]
Flags: bus master, fast devsel, latency 0, IRQ 16
I/O ports at 2000 [size=256]
Memory at df940000 (64-bit, non-prefetchable) [size=16K]
Memory at df900000 (64-bit, non-prefetchable) [size=256K]
[virtual] Expansion ROM at c0000000 [disabled] [size=256K]
Capabilities: <access denied>
Kernel driver in use: megaraid_sas
Kernel modules: megaraid_sas
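
The record quoted in this comment follows the "P:" / "E:" layout of "udevadm info --export-db". As a minimal sketch of how such output could be split into per-device records, assuming blank-line-separated blocks, the following parses the text and decodes the PCI MODALIAS string. The function names and the shortened SAMPLE are illustrative only; this is not the parser used by udev_resource.

```python
import re

# Shortened copy of the record from this comment, for illustration only.
SAMPLE = """\
P: /devices/pci0000:00/0000:00:03.0/0000:01:00.0
E: DRIVER=megaraid_sas
E: PCI_ID=1000:0073
E: MODALIAS=pci:v00001000d00000073sv00001000sd00009241bc01sc04i00
"""


def parse_export_db(text):
    """Split export-db style text into one dict per device record."""
    devices = []
    # Assumption: records are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {"path": None, "env": {}}
        for line in block.splitlines():
            prefix, _, value = line.partition(": ")
            if prefix == "P":
                record["path"] = value
            elif prefix == "E":
                key, _, val = value.partition("=")
                record["env"][key] = val
        if record["path"] is not None:
            devices.append(record)
    return devices


# PCI modalias fields: vendor, device, subsystem vendor, subsystem device,
# then base class, subclass and programming interface, all in hex.
MODALIAS_RE = re.compile(
    r"pci:v(?P<vendor>[0-9A-Fa-f]{8})d(?P<device>[0-9A-Fa-f]{8})"
    r"sv(?P<subvendor>[0-9A-Fa-f]{8})sd(?P<subdevice>[0-9A-Fa-f]{8})"
    r"bc(?P<base_class>[0-9A-Fa-f]{2})sc(?P<sub_class>[0-9A-Fa-f]{2})"
    r"i(?P<interface>[0-9A-Fa-f]{2})$"
)


def parse_pci_modalias(modalias):
    """Return the modalias fields as integers, or None if it does not match."""
    match = MODALIAS_RE.match(modalias)
    if match is None:
        return None
    return {name: int(value, 16) for name, value in match.groupdict().items()}
```

Applied to SAMPLE, parse_export_db yields a single record whose env holds DRIVER, PCI_ID and MODALIAS, and parse_pci_modalias splits the MODALIAS value into the same vendor and device IDs that lspci shows in brackets.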

TienFu Chen (ctf) wrote on 2011-12-02:

#17

The problem is caused by the following file.

/sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/vpd
(MegaRAID SAS 9240 DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:01:00.0)
-rw------- 1 root root 32768 2011-12-02 14:51 vpd

Executing this command causes the halt: sudo cat vpd
The vpd file can only be read or written by root, which explains why running udev_resource as a normal user does not reproduce the problem.
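
To see which PCI devices on a system expose a vpd attribute, and with what permissions, without reading any of them, something like the following sketch could be used. It only calls stat() on each candidate path and never opens the file; the assumption here is that stat() alone does not trigger a VPD read from the card. The function name and the default directory argument are illustrative.

```python
import os
import stat


def find_vpd_files(sysfs_root="/sys/bus/pci/devices"):
    """Return (path, mode string, size) for each vpd attribute found.

    The files are only stat()ed, never opened, so their contents are not read.
    """
    found = []
    for name in sorted(os.listdir(sysfs_root)):
        vpd_path = os.path.join(sysfs_root, name, "vpd")
        try:
            info = os.stat(vpd_path)
        except OSError:
            # This device has no vpd attribute (or it is not accessible).
            continue
        found.append((vpd_path, stat.filemode(info.st_mode), info.st_size))
    return found
```

For a file with the permissions shown in the listing above, the mode string would be "-rw-------", matching the root-only access described in this comment.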

TienFu Chen (ctf) wrote on 2011-12-02:

#18

Executing the command "sudo cat vpd" causes the halt, and the server console immediately displays the following message:
-----
Uhhuh. NMI received for unknown reason 21 on CPU 0
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue.

TienFu Chen (ctf) wrote on 2011-12-02:

#19

What: /sys/bus/pci/devices/.../vpd
Date: February 2008
Contact: Ben Hutchings <email address hidden>
Description:
A file named vpd in a device directory will be a
binary file containing the Vital Product Data for the
device. It should follow the VPD format defined in
PCI Specification 2.1 or 2.2, but users should consider
that some devices may have malformatted data. If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.

TienFu Chen (ctf) wrote on 2011-12-02:

#20

To fix the problem for checkbox, we could add "vpd" to the excluded list in the statement below.

line 443 in udev_resource:
        for name in names:
            name_path = posixpath.join(sys_path, name)
            if name[0] == "." \
               or name in ["dev", "uevent", "vpd"] \
               or posixpath.isdir(name_path) \
               or posixpath.islink(name_path):
                continue
With "vpd" excluded, udev_resource no longer causes the halting problem.
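
The fragment above is only part of a larger loop in udev_resource. As a self-contained sketch of the same exclusion logic, the function below lists the attribute files in a sysfs directory that would remain after the filter. The function name, the EXCLUDED_ATTRIBUTES constant and the excluded parameter are illustrative and do not come from the checkbox source.

```python
import os
import posixpath

# Attribute names that are skipped; "vpd" is the proposed addition.
EXCLUDED_ATTRIBUTES = ("dev", "uevent", "vpd")


def safe_attribute_names(sys_path, excluded=EXCLUDED_ATTRIBUTES):
    """Return the names in sys_path that pass the exclusion filter."""
    names = []
    for name in sorted(os.listdir(sys_path)):
        name_path = posixpath.join(sys_path, name)
        # Skip hidden entries, excluded attributes, directories and symlinks,
        # mirroring the condition in the udev_resource fragment.
        if name.startswith(".") \
           or name in excluded \
           or posixpath.isdir(name_path) \
           or posixpath.islink(name_path):
            continue
        names.append(name)
    return names
```

Passing excluded=("dev", "uevent") reproduces the behaviour before the change, which makes it easy to compare the two attribute lists for a given device directory.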

TienFu Chen (ctf) wrote on 2011-12-02:

#21

Disks on AR320:
[ 2.431927] scsi 4:0:6:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5
[ 2.438752] scsi 4:0:7:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5

Disks on AR160:
[ 3.935828] scsi 4:0:13:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5
[ 3.941908] scsi 4:0:14:0: Direct-Access ATA WDC WD2002FYPS-0 5G04 PQ: 0 ANSI: 5

Brad Figg (brad-figg) wrote on 2011-12-02:

#22

@tim

Please try to capture a dmesg log at the point of the halt.

Jeff Lane  (bladernr) wrote on 2011-12-02:

#23

So I've done some digging... First, an anecdotal story:

A man was experiencing extreme pain whenever he touched his arm in a particular place. So he goes to the doctor to have it checked out. The man says "Doctor, every time I push on this spot on my arm, I get horrible pain."

The doctor looks at the man and says, "Well then, stop pushing there."

After looking at this and recreating it a few times, I think I know what's going on here. Now keep in mind that this is just a guess... but I think that the vpd file in the sysfs directory for this card is actually a direct interface to the card itself, meant to be accessed only via an API of some sort... if you run file on it and other items in the directory you'll note that it's a "regular file" as opposed to "ASCII Text" or some other filetype.

For comparison, if you run file on /proc/kcore, it too is a "regular file".

I mention this because the behaviour seen here is alarmingly similar to older kernels (2.2.x or 2.4.x, IIRC) where you could cat /proc/kcore to bring a system crashing to a halt.

So what I think is happening is that by accessing that vpd file (either using cat or in the udev_resource script, which is doing an open() on the vpd file) we are inadvertently causing the kernel to hang. I'd like to poke at this further by checking into the uevent file in that device's sysfs directory, but at the moment I'm unable to access the system, even after a reboot. It may need to be re-installed or at least poked manually. Power-cycling it remotely hasn't worked so far.

In any case, I think Tim's already got this sorted out. My only concern with his suggested solution is whether or not blacklisting vpd will break data collection for other places. In other words, while /path/to/vpd here may be breaking the system, is there a case where /path/to/vpd actually contains parseable data that doesn't trigger a hung system when you try opening it?

Jeff Lane  (bladernr) wrote on 2011-12-02:

#24

dmesg.log (53.2 KiB, text/plain)

I was asked earlier for a dmesg dump from when this is triggered. Here it is, but you'll notice that there's nothing there. This is an instantaneous hang for the kernel, so it never has a chance to write anything to the ring buffer, let alone pass that on to log files.

To create this, I ran the following script, appending the output to the file dmesg.log:

#!/bin/bash
while true; do
    dmesg -c
done

sudo ./dmesg-monitor >> dmesg.log
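
One caveat with capturing logs around a hard hang is that data appended through a shell redirect may still be sitting in the page cache when the machine stops. As a minimal sketch, assuming the captured lines are available to a Python process, each batch could be forced to disk with fsync before the next read. The helper name append_durably is hypothetical, and even this would not help if the kernel hangs before it logs anything, as appears to be the case here.

```python
import os


def append_durably(log_path, lines):
    """Append lines to log_path and force them to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as log:
        for line in lines:
            # Normalise line endings so every entry ends with one newline.
            log.write(line.rstrip("\n") + "\n")
        # Flush Python's buffer, then ask the OS to write the data out.
        log.flush()
        os.fsync(log.fileno())
```

The lines argument could, for example, be the split output of a "dmesg -c" call made from the same loop.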

Brendan Donegan (brendan-donegan) wrote on 2011-12-04:

#25

This leads to the question, why don't any of our other systems hang when they touch the vpd file?

TienFu Chen (ctf) wrote on 2011-12-05:

#26

I exchanged the raid cards on both systems.
Now the AR160 has the card originally installed in the AR320 [card A], and the problem can now always be reproduced on the AR160.
Card A has a sticker on the heatsink that reads "Old stepping."; I don't know what that actually means.

Brad Figg (brad-figg) wrote on 2011-12-07:

#27

Can we get an "apport-collect 897773" run on this system? Thanks.

TienFu Chen (ctf) wrote on 2011-12-09:

#28

Hi Brad,
Please check the apport report in comment #5.

Brad Figg (brad-figg) wrote on 2011-12-09:

#29

@tim,

Ok, it doesn't normally get added as a single attachment like that.

Jeff Lane  (bladernr) wrote on 2011-12-12:

#30

So this problem seems to have been hardware related ultimately.

On further investigation, after Tim swapped the identical cards between the two servers and the failure followed the older card, I looked into it and found that the firmware on the failing card was several revs below the firmware on the newer card, which was itself 1 rev down from the most recent firmware for the LSI 9240-4i.

So I flashed the firmware on both cards to the latest rev (20.10.1-0061) and this seems to have cured the problem.

I can manually cat the vpd file in sysfs repeatedly without failure and I have also been able to run checkbox (which initially triggered the issue) again without failure.

So I think we can call this a firmware issue, and I'm closing it as such. If this recurs for some reason, we can reopen or just open a new bug then.

affects:	linux (Ubuntu) → hw-labs
Changed in hw-labs:
assignee:	Brad Figg (brad-figg) → Jeff Lane (bladernr)
status:	Incomplete → Fix Released