Controller lockup detected on ProLiant DL380 Gen9 with P440 Controller

Bug #1720359 reported by Eric Desrochers on 2017-09-29
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned
Xenial
Medium
Eric Desrochers
Zesty
Medium
Eric Desrochers
Artful
Medium
Unassigned

Bug Description

Deploying ceph osd on Trusty/14.04 (LTS) with Ubuntu 4.4 series kernel on HP HW[1] system triggers "Controller lockup"[2]

[1] - HW
System Information
 Manufacturer: HP
 Product Name: ProLiant DL380 Gen9

BIOS
  Vendor: HP
  Version: P89
  Release Date: 02/17/2017

Smart Array Controller
 Smart Array P440 Controller

[2] - /var/log/kern.log
...
ceph-osd: 2017-09-26 15:34:42.259205 7f99a3d38700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f99ad69d700' had timed out after 60
eph-osd: 2017-09-26 15:34:42.259215 7f99a553b700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f99ace9c700' had timed out after 60
ceph-osd: 2017-09-26 15:34:42.259219 7f99a553b700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f99ad69d700' had timed out after 60
hpsa 0000:08:00.0: Controller lockup detected: 0x00130001 after 30
hpsa 0000:08:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:00000190: unknown abort service response 0x00
....

Eric Desrochers (slashd) on 2017-09-29
tags: added: sts
Eric Desrochers (slashd) wrote :

The above behavior is reproducible using upstream mainline kernel, so not Ubuntu specific.

A bisection against upstream kernel revealed :

# first bad commit: [a736e9b6a03283a2e0fc8190b748b3a672f289c1] hpsa: correct ioaccel2 sg chain len

Note that the behavior also seems to be fix in recent upstream kernel (tested with v4.14-rc2)

The upstream v4.14-rc2 kernel is using :
HPSA_DRIVER_VERSION : "3.4.20-0"

while

Ubuntu-4.4.0-97.120 is using :
HPSA_DRIVER_VERSION : "3.4.14-0"

Another bisection is in progress to find the first good commit between "a736e9b6a03283a2e0fc8190b748b3a672f289c1]" and "v4.14-rc2" HEAD.

- Eric

description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1720359

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → Medium
Eric Desrochers (slashd) on 2017-10-04
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
status: Confirmed → In Progress
assignee: nobody → Eric Desrochers (slashd)
Eric Desrochers (slashd) wrote :

The GOOD commit (found via a git bisect) is the following :

git show e2c7b433f729cedb32514480af8cbdf2fe5cf264
commit e2c7b433f729cedb32514480af8cbdf2fe5cf264
Author: Yadan Fan <email address hidden>
Date: Fri Jun 23 17:40:05 2017 +0800

    scsi: hpsa: limit transfer length to 1MB

    The hpsa firmware will bypass the cache for any request larger than 1MB,
    so we should cap the request size to avoid any performance degradation
    in kernels later than v4.3

    This degradation is caused from d2be537c3ba3568acd79cd178327b842e60d035e,
    which changed max_sectors_kb to 1280k, but the hardware is able to work
    fine with it, so the true fix should be from hpsa driver.

    Signed-off-by: Yadan Fan <email address hidden>
    Reviewed-by: Johannes Thumshirn <email address hidden>
    Acked-by: Don Brace <email address hidden>
    Signed-off-by: Martin K. Petersen <email address hidden>

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8914eab..4f7cdb2 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -938,7 +938,7 @@ static struct scsi_host_template hpsa_driver_template = {
 #endif
        .sdev_attrs = hpsa_sdev_attrs,
        .shost_attrs = hpsa_shost_attrs,
- .max_sectors = 8192,
+ .max_sectors = 1024,
        .no_write_same = 1,
 };

Eric Desrochers (slashd) on 2017-10-06
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Released
Changed in linux (Ubuntu Zesty):
assignee: nobody → Eric Desrochers (slashd)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Eric Desrochers (slashd)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Eric Desrochers (slashd)
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Zesty):
importance: Undecided → Medium
Changed in linux (Ubuntu Trusty):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Zesty):
status: New → In Progress
Eric Desrochers (slashd) wrote :

Patch has been submitted to kernel team ML for review :

[PATCH][SRU][X] scsi: hpsa: limit transfer length to 1MB
[PATCH][SRU][Z] scsi: hpsa: limit transfer length to 1MB

no longer affects: linux (Ubuntu Trusty)
Changed in linux (Ubuntu Artful):
assignee: Eric Desrochers (slashd) → nobody
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-zesty

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Eric Desrochers (slashd) on 2017-10-18
tags: added: verification-done-xenial verification-done-zesty
removed: verification-needed-xenial verification-needed-zesty
Launchpad Janitor (janitor) wrote :
Download full text (4.4 KiB)

This bug was fixed in the package linux - 4.10.0-38.42

---------------
linux (4.10.0-38.42) zesty; urgency=low

  * linux: 4.10.0-38.42 -proposed tracker (LP: #1722330)

  * Controller lockup detected on ProLiant DL380 Gen9 with P440 Controller
    (LP: #1720359)
    - scsi: hpsa: limit transfer length to 1MB

  * [Dell Docking IE][0bda:8153] Realtek USB Ethernet leads to system hang
    (LP: #1720977)
    - r8152: fix the list rx_done may be used without initialization

  * Touchpad not detected in Lenovo X1 Yoga / Yoga 720-15IKB (LP: #1700657)
    - mfd: intel-lpss: Add missing PCI ID for Intel Sunrise Point LPSS devices

  * Add installer support for Broadcom BCM573xx network drivers. (LP: #1720466)
    - d-i: Add bnxt_en to nic-modules.

  * CVE-2017-1000252
    - KVM: VMX: Do not BUG() on out-of-bounds guest IRQ

  * CVE-2017-10663
    - f2fs: sanity check checkpoint segno and blkoff

  * xfstest sanity checks on seek operations fails (LP: #1696049)
    - xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()

  * [P9, Power NV][ WSP][Ubuntu 16.04.03] : perf hw breakpoint command results
    in call traces and system goes for reboot. (LP: #1706033)
    - powerpc/64s: Handle data breakpoints in Radix mode

  * 5U84 - ses driver isn't binding right - cannot blink lights on 1 of the 2
    5u84 (LP: #1693369)
    - scsi: ses: do not add a device to an enclosure if enclosure_add_links()
      fails.

  * Vlun resize request could fail with cxlflash driver (LP: #1713575)
    - scsi: cxlflash: Fix vlun resize failure in the shrink path

  * More migrations with constant load (LP: #1713576)
    - sched/fair: Prefer sibiling only if local group is under-utilized

  * New PMU fixes for marked events. (LP: #1716491)
    - powerpc/perf: POWER9 PMU stops after idle workaround

  * CVE-2017-14340
    - xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present

  * [Zesty][Yakkety] rtl8192e bug fixes (LP: #1698470)
    - staging: rtl8192e: rtl92e_fill_tx_desc fix write to mapped out memory.
    - staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
    - staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
    - staging: rtl8192e: GetTs Fix invalid TID 7 warning.

  * Stranded with ENODEV after mdadm --readonly (LP: #1706243)
    - md: MD_CLOSING needs to be cleared after called md_set_readonly or
      do_md_stop

  * multipath -ll is not showing the disks which are actually multipath
    (LP: #1718397)
    - fs: aio: fix the increment of aio-nr and counting against aio-max-nr

  * ETPS/2 Elantech Touchpad inconsistently detected (Gigabyte P57W laptop)
    (LP: #1594214)
    - Input: i8042 - add Gigabyte P57 to the keyboard reset table

  * CVE-2017-10911
    - xen-blkback: don't leak stack data via response ring

  * CVE-2017-11176
    - mqueue: fix a use-after-free in sys_mq_notify()

  * implement 'complain mode' in seccomp for developer mode with snaps
    (LP: #1567597)
    - Revert "UBUNTU: SAUCE: seccomp: log actions even when audit is disabled"
    - seccomp: Provide matching filter for introspection
    - seccomp: Sysctl to display available actions
    - seccomp: Operation for checking if an a...

Read more...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (11.5 KiB)

This bug was fixed in the package linux - 4.4.0-98.121

---------------
linux (4.4.0-98.121) xenial; urgency=low

  * linux: 4.4.0-98.121 -proposed tracker (LP: #1722299)

  * Controller lockup detected on ProLiant DL380 Gen9 with P440 Controller
    (LP: #1720359)
    - scsi: hpsa: limit transfer length to 1MB

  * [Dell Docking IE][0bda:8153] Realtek USB Ethernet leads to system hang
    (LP: #1720977)
    - r8152: fix the list rx_done may be used without initialization

  * Add installer support for Broadcom BCM573xx network drivers. (LP: #1720466)
    - d-i: Add bnxt_en to nic-modules.

  * snapcraft.yaml: add dpkg-dev to the build deps (LP: #1718886)
    - snapcraft.yaml: add dpkg-dev to the build deps

  * Support setting I2C_TIMEOUT via ioctl for i2c-designware (LP: #1718578)
    - i2c: designware: Use transfer timeout from ioctl I2C_TIMEOUT

  * 5U84 - ses driver isn't binding right - cannot blink lights on 1 of the 2
    5u84 (LP: #1693369)
    - scsi_transport_sas: add function to get SAS endpoint address
    - ses: fix discovery of SATA devices in SAS enclosures
    - scsi: sas: provide stub implementation for scsi_is_sas_rphy
    - scsi: ses: Fix SAS device detection in enclosure

  * multipath -ll is not showing the disks which are actually multipath
    (LP: #1718397)
    - fs: aio: fix the increment of aio-nr and counting against aio-max-nr

  * Support Dell Wireless DW5819/5818 WWAN devices (LP: #1721455)
    - SAUCE: USB: serial: qcserial: add Dell DW5818, DW5819

  * CVE-2017-10911
    - xen-blkback: don't leak stack data via response ring

  * implement 'complain mode' in seccomp for developer mode with snaps
    (LP: #1567597)
    - seccomp: Provide matching filter for introspection
    - seccomp: Sysctl to display available actions
    - seccomp: Operation for checking if an action is available
    - seccomp: Sysctl to configure actions that are allowed to be logged
    - seccomp: Selftest for detection of filter flag support
    - seccomp: Action to log before allowing

  * implement errno action logging in seccomp for strict mode with snaps
    (LP: #1721676)
    - seccomp: Provide matching filter for introspection
    - seccomp: Sysctl to display available actions
    - seccomp: Operation for checking if an action is available
    - seccomp: Sysctl to configure actions that are allowed to be logged
    - seccomp: Selftest for detection of filter flag support
    - seccomp: Filter flag to log all actions except SECCOMP_RET_ALLOW

  * [Xenial] update OpenNSL kernel modules to 6.5.10 (LP: #1721511)
    - SAUCE: update OpenNSL kernel modules to 6.5.10

  * Xenial update to 4.4.90 stable release (LP: #1721550)
    - cifs: release auth_key.response for reconnect.
    - mac80211: flush hw_roc_start work before cancelling the ROC
    - KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()
    - tracing: Fix trace_pipe behavior for instance traces
    - tracing: Erase irqsoff trace with empty write
    - md/raid5: fix a race condition in stripe batch
    - md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
    - scsi: scsi_transport_iscsi: fix the issue that iscsi_if_rx doesn't parse
      nlms...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers