Precise crashes hard when HP array rebuilds

Bug #1098262 reported by Nick Moffitt on 2013-01-10
This bug affects 3 people
Affects: linux (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

We've long suspected that on certain hardware, Precise will crash entirely when certain events (such as "the array has finished rebuilding") come from the HP storage array.

On a ProLiant SL335s G7 running 3.2.0-33-generic amd64 on Ubuntu 12.04.1 LTS, we noticed this with more conclusive information. The system in question is an openstack compute node, and we pulled the following out of its logs:

 {'class': 'POST Message',
  'count': 1,
  'description': 'POST Error: 1716-Slot X Drive Array - Unregenerable Media Errors Detected on Drives during previous Rebuild or Auto-Reliability Monitoring (ARM) scan. Problem will be fixed automatically when the sector(s) are overwritten.',
  'initial_update': '01/10/2013 17:09',
  'last_update': '01/10/2013 17:09',
  'severity': 'Caution'}]

This corresponded with the system crash.
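
For anyone triaging similar entries, here is a minimal Python sketch of pulling the POST error code out of a record shaped like the one above. The record is copied from the excerpt; the function name `post_error_code` and the assumption that the code is the first number after "POST Error:" are illustrative, not something the HP tooling documents.

```python
import re

# Entry copied from the log excerpt above.
entry = {
    'class': 'POST Message',
    'count': 1,
    'description': ('POST Error: 1716-Slot X Drive Array - Unregenerable Media '
                    'Errors Detected on Drives during previous Rebuild or '
                    'Auto-Reliability Monitoring (ARM) scan. Problem will be '
                    'fixed automatically when the sector(s) are overwritten.'),
    'initial_update': '01/10/2013 17:09',
    'last_update': '01/10/2013 17:09',
    'severity': 'Caution',
}


def post_error_code(record):
    """Return the numeric POST error code as an int, or None if absent.

    Hypothetical helper: assumes the code is the first run of digits
    following the literal text 'POST Error:'.
    """
    if record.get('class') != 'POST Message':
        return None
    match = re.search(r'POST Error:\s*(\d+)', record.get('description', ''))
    return int(match.group(1)) if match else None


print(post_error_code(entry))
```

Comparing the returned code and the `last_update` timestamp against the time of the crash is one way to check the correlation described here.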

Nick Moffitt (nick-moffitt) wrote :

I'm not sure what level of reboot-on-panic or host watchdog/ILO may have done this, but at least one crash resulted in an automatic reboot. More may have followed.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1098262

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced.

tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

Also, would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.8 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc2-raring/

Changed in linux (Ubuntu):
importance: Undecided → High
Nick Moffitt (nick-moffitt) wrote :

Brad: This system doesn't have outgoing HTTP access, so apport-collect seems to just hang. How do I generate the data you need for copy/paste?

Joseph Salisbury (jsalisbury) wrote :

You can do it this way:

apport-bug --save /tmp/report.1098262 linux

Then attach the file report.1098262 to the bug.

Nick Moffitt (nick-moffitt) wrote :

Okay, here's the apport data.

Joseph Salisbury (jsalisbury) wrote :

Did this issue just start happening with the 3.2.0-33-generic kernel, or did you see it in prior releases as well?

Nick Moffitt (nick-moffitt) wrote :

We never saw this problem before Precise, but we also didn't use this hardware much before Precise either.

At the moment we're reading http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/action.process/public/psi/topIssuesDisplay/?sp4ts.oid=4337682&sp4ts.sn=CZ3212BWKE&sp4ts.pn=614046-B21&javax.portlet.action=true&spf_p.tpst=psiContentDisplay&javax.portlet.begCacheTok=com.vignette.cachetoken&spf_p.prp_psiContentDisplay=wsrp-interactionState%3DdocId%253Demr_na-c03555882%257CdocLocale%253Den_US&javax.portlet.endCacheTok=com.vignette.cachetoken and assessing its relevance to this bug. It looks like the latest raring kernel would incorporate this driver update. The latest Precise kernel update seems to have been a couple weeks before this hpsa update.

Andrew Glen-Young (aglenyoung) wrote :

Another data point:

We have another machine with identical hardware, firmware and OS, but running an earlier kernel, and it is not suffering from the same problem. Of course, we haven't had a disk failure on that hardware, so we cannot say for sure that the bug does not exist in the earlier kernel version. We cannot simulate a failure on the sister hardware as the machine is in use with production workloads.

Joseph Salisbury (jsalisbury) wrote :

@Nick, it looks like the hpsa driver is version 2.0.2-1 in Raring (and also in Linus' tree). The following is from the kernel source for Raring:

~drivers/scsi/hpsa.c:

/* HPSA_DRIVER_VERSION must be 3 byte values (0-255) separated by '.' */
#define HPSA_DRIVER_VERSION "2.0.2-1"
#define DRIVER_NAME "HP HPSA Driver (v " HPSA_DRIVER_VERSION ")"
#define HPSA "hpsa"

It looks like the customer advisory suggests that version 3.1.0-7 is required to resolve this bug. Did you see somewhere that the hpsa driver would be a newer version in Raring?

Herton R. Krzesinski (herton) wrote :

Besides Joseph's note, I also see that the kernel being used in the logs is 3.2.0-33. There was a commit that later went into the Precise 3.2.0-34.53 kernel (commit 21e89afd325849eb38adccf382df16cc895911f9, [SCSI] hpsa: Use LUN reset instead of target reset) that may relate to the problem listed on HP's page pointed at commit #9.

So it looks like a good idea to upgrade the kernel to the latest version in -updates and see how it goes.
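
A rough way to check whether a given Precise kernel is new enough to carry that commit is to compare the ABI number in the release string. This is only a sketch: the helper names `kernel_abi` and `has_lun_reset_change` are made up for illustration, and the comparison assumes a release string in the `uname -r` style shown in this report (such as `3.2.0-33-generic`) from the Precise 3.2.0 series.

```python
import re


def kernel_abi(release):
    """Split a release such as '3.2.0-33-generic' into ((3, 2, 0), 33)."""
    match = re.match(r'(\d+)\.(\d+)\.(\d+)-(\d+)', release)
    if match is None:
        raise ValueError('unrecognised release string: %r' % release)
    major, minor, patch, abi = (int(part) for part in match.groups())
    return (major, minor, patch), abi


def has_lun_reset_change(release, first_fixed='3.2.0-34'):
    """True if `release` is at or after the first build said to carry the commit.

    Assumption: 3.2.0-34 is taken from the 3.2.0-34.53 version mentioned
    above; the check is only meaningful within the Precise 3.2.0 series.
    """
    return kernel_abi(release) >= kernel_abi(first_fixed)


for release in ('3.2.0-33-generic', '3.2.0-36-generic'):
    print(release, has_lun_reset_change(release))
```

On a live system the release string could be taken from `platform.release()` in the standard library instead of being typed in.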

Herton R. Krzesinski (herton) wrote :

I meant to say *comment #9*, not commit :)

Nick Moffitt (nick-moffitt) wrote :

We upgraded to 3.2.0-36-generic at your recommendation and added a new good disk to the array. We were unable to reproduce the problem. Our kernel logs now include a few of the following:

    hpsa 0000:06:00.0: cp ffff880036d00000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[1a 00 3f 00 ff 00 00 00 00 00 00 00 00 00 00 00]

We don't think we can help isolate this problem, but if we learn anything new we'll let you know.
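
For readers trying to interpret the hpsa line quoted above, here is a minimal Python sketch that decodes its fields. The lookup tables cover only the values that appear in that one line: sense key 0x5 is ILLEGAL REQUEST, ASC/ASCQ 0x20/0x00 is INVALID COMMAND OPERATION CODE, and opcode 0x1a is MODE SENSE(6) in the standard SCSI tables. The regular expression and the `decode` helper are assumptions based on this single sample, not on the driver's documented log format.

```python
import re

# Lookup tables limited to the values seen in the log line above.
SENSE_KEYS = {0x5: 'ILLEGAL REQUEST'}
ASC_ASCQ = {(0x20, 0x00): 'INVALID COMMAND OPERATION CODE'}
OPCODES = {0x1a: 'MODE SENSE(6)'}

# Log line copied from the comment above.
LINE = ('hpsa 0000:06:00.0: cp ffff880036d00000 has check condition: '
        'unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, '
        'Returning result: 0x2, '
        'cmd=[1a 00 3f 00 ff 00 00 00 00 00 00 00 00 00 00 00]')

# Hypothetical pattern, written against this one sample only.
PATTERN = re.compile(
    r'Sense: (0x[0-9a-f]+), ASC: (0x[0-9a-f]+), ASCQ: (0x[0-9a-f]+),'
    r'.*cmd=\[([0-9a-f ]+)\]')


def decode(line):
    """Return readable names for the sense data and opcode, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    sense, asc, ascq = (int(value, 16) for value in match.groups()[:3])
    opcode = int(match.group(4).split()[0], 16)  # first CDB byte
    return {
        'sense_key': SENSE_KEYS.get(sense, hex(sense)),
        'additional_sense': ASC_ASCQ.get((asc, ascq), '%#x/%#x' % (asc, ascq)),
        'command': OPCODES.get(opcode, hex(opcode)),
    }


print(decode(LINE))
```

Read this way, the line looks like the controller rejecting a MODE SENSE(6) request rather than reporting a media fault, though that interpretation is not confirmed anywhere in this report.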

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Alexander List (alexlist) wrote :

We just saw this using kernel 3.2.0-60-generic.

System Information
        Manufacturer: HP
        Product Name: ProLiant DL380 G6

Smart Array P410i in Slot 0 (Embedded)
   Hardware Revision: C
   Firmware Version: 6.00-2

Alexander List (alexlist) wrote :
Changed in linux (Ubuntu):
status: Expired → Confirmed
Alexander List (alexlist) wrote :

# uname -a
Linux bxxxxxx 3.2.0-60-generic #91-Ubuntu SMP Wed Feb 19 03:54:44 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

# strings /lib/modules/3.2.0-60-generic/kernel/drivers/scsi/hpsa.ko |grep -i version
driver version string '%s' unrecognized.
version=2.0.2-1
description=Driver for HP Smart Array Controller version 2.0.2-1
srcversion=4B94039325552B7B9DA9D7A
vermagic=3.2.0-60-generic SMP mod_unload modversions
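
The same check can be scripted. Below is a minimal Python sketch that picks the `version=` line out of the output above and compares it with the 3.1.0-7 figure quoted earlier from the HP advisory. The helper names are hypothetical, and whether the in-tree driver's version numbers are directly comparable with HP's packaged driver versions is itself an assumption.

```python
import re

# Output copied from the `strings ... | grep -i version` run above.
STRINGS_OUTPUT = """\
driver version string '%s' unrecognized.
version=2.0.2-1
description=Driver for HP Smart Array Controller version 2.0.2-1
srcversion=4B94039325552B7B9DA9D7A
vermagic=3.2.0-60-generic SMP mod_unload modversions
"""


def module_version(text):
    """Return the value of the line starting with 'version=', or None."""
    for line in text.splitlines():
        if line.startswith('version='):
            return line.split('=', 1)[1]
    return None


def version_key(version):
    """Turn a string like '2.0.2-1' into (2, 0, 2, 1) for comparison."""
    return tuple(int(part) for part in re.split(r'[.-]', version))


installed = module_version(STRINGS_OUTPUT)
required = '3.1.0-7'  # version the advisory is said to require (see above)
print(installed, version_key(installed) >= version_key(required))
```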

Alexander List (alexlist) wrote :

This is what we see in the controller's log:

Event: 29 Added: 03/28/2014 00:38
CAUTION: POST Messages - POST Error: 1786-Drive Array Recovery Needed.

The drive is now in predictive failure.

bigbrovar (bigbrovar) wrote :

I am seriously affected by this bug. What is the way forward for me? How can I install the HP patch on my version of Ubuntu? A production server just crashed, and while searching for the error I was brought to this report.
