Precise crashes hard when HP array rebuilds

Bug #1098262 reported by Nick Moffitt
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

We've long suspected that on certain hardware, Precise will crash entirely when certain events (such as "the array has finished rebuilding") come from the HP storage array.

On a ProLiant SL335s G7 running 3.2.0-33-generic amd64 on Ubuntu 12.04.1 LTS, we observed this with more conclusive evidence. The system in question is an OpenStack compute node, and we pulled the following out of its logs:

 {'class': 'POST Message',
  'count': 1,
  'description': 'POST Error: 1716-Slot X Drive Array - Unregenerable Media Errors Detected on Drives during previous Rebuild or Auto-Reliability Monitoring (ARM) scan. Problem will be fixed automatically when the sector(s) are overwritten.',
  'initial_update': '01/10/2013 17:09',
  'last_update': '01/10/2013 17:09',
  'severity': 'Caution'}]

This corresponded with the system crash.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

I'm not sure what level of reboot-on-panic or host watchdog/ILO may have done this, but at least one crash resulted in an automatic reboot. More may have followed.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1098262

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced.

tags: added: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.8 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc2-raring/

Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

Brad: This system doesn't have outgoing HTTP access, so apport-collect seems to just hang. How do I generate the data you need for copy/paste?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

You can do it this way:

apport-bug --save /tmp/report.1098262 linux

Then attach the file report.1098262 to the bug.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

Okay, here's the apport data.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue just start happening with the 3.2.0-33-generic kernel, or did you see it in prior releases as well?

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

We never saw this problem before Precise, but we also didn't use this hardware much before Precise either.

At the moment we're reading http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/action.process/public/psi/topIssuesDisplay/?sp4ts.oid=4337682&sp4ts.sn=CZ3212BWKE&sp4ts.pn=614046-B21&javax.portlet.action=true&spf_p.tpst=psiContentDisplay&javax.portlet.begCacheTok=com.vignette.cachetoken&spf_p.prp_psiContentDisplay=wsrp-interactionState%3DdocId%253Demr_na-c03555882%257CdocLocale%253Den_US&javax.portlet.endCacheTok=com.vignette.cachetoken and assessing its relevance to this bug. It looks like the latest raring kernel would incorporate this driver update. The latest Precise kernel update seems to have been a couple weeks before this hpsa update.

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

Another data point:

We have another machine with identical hardware, firmware and OS, but running an earlier kernel, and it is not suffering from this problem. Of course, we haven't had a disk failure on that hardware, so we cannot say for sure that the bug does not exist in the earlier kernel version. We cannot simulate a failure on the sister hardware, as the machine is in use with production workloads.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

@Nick, it looks like the hpsa driver is version 2.0.2-1 in Raring (and also in Linus' tree). The following is from the kernel source for Raring:

~drivers/scsi/hpsa.c:

/* HPSA_DRIVER_VERSION must be 3 byte values (0-255) separated by '.' */
#define HPSA_DRIVER_VERSION "2.0.2-1"
#define DRIVER_NAME "HP HPSA Driver (v " HPSA_DRIVER_VERSION ")"
#define HPSA "hpsa"

It looks like the customer advisory suggests that version 3.1.0-7 is required to resolve this bug. Did you see somewhere that the hpsa driver would be a newer version in Raring?
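As a rough illustration of the comparison being made here, the two version strings can be compared numerically. This is only a sketch: the helper name `parse_hpsa_version` is made up for this example, and it assumes the in-tree driver and the HP advisory share one numbering scheme, which nothing in this thread confirms.

```python
import re


def parse_hpsa_version(version):
    """Turn a string like '2.0.2-1' into (2, 0, 2, 1) so tuples compare numerically."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


# Version found in drivers/scsi/hpsa.c in the Raring kernel source.
installed = parse_hpsa_version("2.0.2-1")
# Version named in the HP customer advisory.
required = parse_hpsa_version("3.1.0-7")

if installed < required:
    print("in-tree hpsa is older than the version named in the advisory")
```

Comparing tuples of integers avoids the trap of plain string comparison, where "10" would sort before "9".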

Revision history for this message
Herton R. Krzesinski (herton) wrote :

Besides Joseph's note, I also see that the kernel being used in the logs is 3.2.0-33. A commit went into the later Precise 3.2.0-34.53 kernel (commit 21e89afd325849eb38adccf382df16cc895911f9, [SCSI] hpsa: Use LUN reset instead of target reset) that may relate to the problem listed on HP's page pointed at in commit #9.

So it looks like a good idea to upgrade the kernel to the latest version in -updates and see how it goes.

Revision history for this message
Herton R. Krzesinski (herton) wrote :

I meant to say *comment #9*, not commit :)

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

We upgraded to 3.2.0-36-generic at your recommendation and added a new good disk to the array. We were unable to reproduce the problem. Our kernel logs now include a few of the following:

    hpsa 0000:06:00.0: cp ffff880036d00000 has check condition: unknown type: Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, cmd=[1a 00 3f 00 ff 00 00 00 00 00 00 00 00 00 00 00]

We don't think we can help isolate this problem, but if we learn anything new we'll let you know.
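For anyone reading the hpsa log line above, here is a minimal sketch that decodes its fields. The lookup tables cover only the values that appear in that one line (sense key 0x5, ASC/ASCQ 0x20/0x00, opcode 0x1a) and use their standard SCSI meanings. The function name and the regular expression are assumptions for illustration, not part of the driver.

```python
import re

# Lookup tables limited to the values seen in the log line from this comment.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}
OPCODES = {0x1A: "MODE SENSE(6)"}

# Assumed pattern, written to match the message format quoted above.
LINE_RE = re.compile(
    r"Sense: (0x[0-9a-fA-F]+), ASC: (0x[0-9a-fA-F]+), ASCQ: (0x[0-9a-fA-F]+),"
    r".*cmd=\[([0-9a-fA-F ]+)\]"
)

LOG_LINE = (
    "hpsa 0000:06:00.0: cp ffff880036d00000 has check condition: unknown type: "
    "Sense: 0x5, ASC: 0x20, ASCQ: 0x0, Returning result: 0x2, "
    "cmd=[1a 00 3f 00 ff 00 00 00 00 00 00 00 00 00 00 00]"
)


def decode_check_condition(line):
    """Return the decoded sense data and command bytes, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    sense, asc, ascq = (int(match.group(i), 16) for i in (1, 2, 3))
    cdb = [int(byte, 16) for byte in match.group(4).split()]
    return {
        "sense_key": SENSE_KEYS.get(sense, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
        "opcode": OPCODES.get(cdb[0], "unknown"),
        "cdb": cdb,
    }


print(decode_check_condition(LOG_LINE))
```

Read this way, the message suggests the controller rejected a MODE SENSE(6) request as an unsupported command rather than reporting a media error, though that reading alone does not rule out a connection to the crash.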

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Alexander List (alexlist) wrote :

We just saw this using kernel 3.2.0-60-generic.

System Information
        Manufacturer: HP
        Product Name: ProLiant DL380 G6

Smart Array P410i in Slot 0 (Embedded)
   Hardware Revision: C
   Firmware Version: 6.00-2

Revision history for this message
Alexander List (alexlist) wrote :
Changed in linux (Ubuntu):
status: Expired → Confirmed
Revision history for this message
Alexander List (alexlist) wrote :

# uname -a
Linux bxxxxxx 3.2.0-60-generic #91-Ubuntu SMP Wed Feb 19 03:54:44 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

# strings /lib/modules/3.2.0-60-generic/kernel/drivers/scsi/hpsa.ko |grep -i version
driver version string '%s' unrecognized.
version=2.0.2-1
description=Driver for HP Smart Array Controller version 2.0.2-1
srcversion=4B94039325552B7B9DA9D7A
vermagic=3.2.0-60-generic SMP mod_unload modversions
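The `key=value` lines in the `strings` output above can be pulled apart programmatically, which helps when checking many hosts. This is a sketch only; `modinfo_fields` is a name invented for this example, and the sample text is copied from the output above.

```python
SAMPLE = """driver version string '%s' unrecognized.
version=2.0.2-1
description=Driver for HP Smart Array Controller version 2.0.2-1
srcversion=4B94039325552B7B9DA9D7A
vermagic=3.2.0-60-generic SMP mod_unload modversions
"""


def modinfo_fields(strings_output):
    """Collect key=value lines, skipping lines that are not simple fields."""
    fields = {}
    for line in strings_output.splitlines():
        key, sep, value = line.partition("=")
        # Only accept lines whose left-hand side looks like a bare field name.
        if sep and key.isidentifier():
            fields.setdefault(key, value)
    return fields


fields = modinfo_fields(SAMPLE)
print(fields["version"], "built for", fields["vermagic"].split()[0])
```

On a live system, `modinfo -F version hpsa` reports the same field directly for the module matching the running kernel.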

Revision history for this message
Alexander List (alexlist) wrote :

This is what we see in the controller's log:

Event: 29 Added: 03/28/2014 00:38
CAUTION: POST Messages - POST Error: 1786-Drive Array Recovery Needed.

The drive is now in predictive failure.

Revision history for this message
bigbrovar (bigbrovar) wrote :

I am seriously affected by this bug. What is the way forward for me? How can I install the HP patch on my version of Ubuntu? A production server just crashed, and while searching for the error I was brought to this report.
