General protection fault panic on module hpsa with lockup_detected attribute

Bug #1581169 reported by Eric Desrochers on 2016-05-12
This bug affects 4 people
Affects          Status         Importance   Assigned to
linux (Ubuntu)   Fix Released   Medium       Eric Desrochers
Wily             Fix Released   Medium       Eric Desrochers

Bug Description

The following has been brought to my attention:

Kernel version: 4.2.0-30-generic #36~14.04.1-Ubuntu

When running an sosreport on HP DL380 gen8 machines running this kernel (Ubuntu 14.04.4 using linux-generic-lts-wily), which includes hpsa 3.4.10-0, hpsa causes a kernel panic when sosreport is scanning block devices. These are machines with an onboard p420i and a daughtercard p420 RAID controller, with each drive in a single raid0 configuration (not ideal, but the machines do not boot when the card is in HBA mode).

This panic does not happen on kernel 3.13 with hpsa 3.4.1-0 when using sosreport.

The funny thing is that kernel 4.2 with hpsa 3.4.10-0 is still the more stable solution: on this version I have yet to see the earlier issue in which the p420 would lock up. One problem is that HP requires an sosreport 99% of the time when we raise a hardware issue, and I can no longer produce one on kernel 4.2 machines because they kernel panic.

I can reproduce this consistently on several other machines in our environment. Please let me know if you would like more info.

Eric Desrochers (slashd) wrote :

The problem is reproducible on demand :

root@dob2-bfs-r5n09:~# udevadm --debug info -ap /sys/block/sdf
calling: info
device 0x154a300 has devpath '/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host2/target2:0:0/2:0:0:5/block/sdf'

Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.

looking at device '/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host2/target2:0:0/2:0:0:5/block/sdf':
KERNEL=="sdf"
SUBSYSTEM=="block"
DRIVER==""
ATTR{ro}=="0"
ATTR{size}=="1562758832"
ATTR{stat}==" 526 0 168573 232 11691 5205 2997588 3820 0 4016 4040"
ATTR{range}=="16"
ATTR{discard_alignment}=="0"
ATTR{events}==""
ATTR{ext_range}=="256"
ATTR{events_poll_msecs}=="-1"
ATTR{alignment_offset}=="0"
ATTR{inflight}==" 0 0"
ATTR{removable}=="0"
ATTR{capability}=="50"
ATTR{events_async}==""

device 0x154b220 has devpath '/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host2/target2:0:0/2:0:0:5'
looking at parent device '/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host2/target2:0:0/2:0:0:5':
KERNELS=="2:0:0:5"
SUBSYSTEMS=="scsi"
DRIVERS=="sd"
ATTRS{rev}=="7.02"
ATTRS{type}=="0"
ATTRS{scsi_level}=="6"
ATTRS{lunid}=="0x0500004000000000"
ATTRS{model}=="LOGICAL VOLUME "
ATTRS{state}=="running"
ATTRS{queue_type}=="simple"
ATTRS{iodone_cnt}=="0x328b"
ATTRS{iorequest_cnt}=="0x328b"
ATTRS{unique_id}=="600508B1001C6ACDCAA167F467871EAB"
ATTRS{queue_ramp_up_period}=="120000"
ATTRS{device_busy}=="0"
ATTRS{evt_capacity_change_reported}=="0"
ATTRS{timeout}=="30"
ATTRS{evt_media_change}=="0"
ATTRS{ioerr_cnt}=="0x2"
ssh: Write failed: Broken pipe

Eric Desrochers (slashd) wrote :

We can confirm this is the last line we see regardless of the disk:

ATTRS{ioerr_cnt}=="0x2"

On a working disk the next line is :

"ATTRS{lockup_detected}"

It seems to get stuck at the lockup_detected attribute.

Eric Desrochers (slashd) wrote :

Possible commit to fix the issue, found upstream:

I'll create a testfix and will make it public shortly.

$ git show fb53c43
commit fb53c439d84387621c53808a3957ffd9876e5094
Author: Tomas Henzl <email address hidden>
Date: Fri Nov 6 16:24:09 2015 +0100

hpsa: move lockup_detected attribute to host attr

This patch fixes a 'general protection fault' issue by
moving the attribute to where it was likely meant.

Signed-off-by: Tomas Henzl <email address hidden>
Signed-off-by: Don Brace <email address hidden>
Signed-off-by: Martin K. Petersen <email address hidden>

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 57166e6..6d44123 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -867,7 +867,6 @@ static struct device_attribute *hpsa_sdev_attrs[] = {
 	&dev_attr_unique_id,
 	&dev_attr_hp_ssd_smart_path_enabled,
 	&dev_attr_path_info,
-	&dev_attr_lockup_detected,
 	NULL,
 };

@@ -879,6 +878,7 @@ static struct device_attribute *hpsa_shost_attrs[] = {
 	&dev_attr_resettable,
 	&dev_attr_hp_ssd_smart_path_status,
 	&dev_attr_raid_offload_debug,
+	&dev_attr_lockup_detected,
 	NULL,
 };
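
The diff above moves lockup_detected from the per-device attribute array (hpsa_sdev_attrs) to the per-host one (hpsa_shost_attrs). A rough way to see which of the two a running kernel exposes is to look for the file in sysfs without opening it, since it is the read that reaches host_show_lockup_detected. The sketch below is only an illustration: the helper name locate_attr is hypothetical, and the two sysfs locations (/sys/block/*/device for SCSI devices, /sys/class/scsi_host/* for hosts) are assumptions about the usual layout rather than something stated in this report.

```python
import glob
import os

ATTR = "lockup_detected"  # attribute name taken from the diff above


def locate_attr(sysfs_root="/sys", attr=ATTR):
    """Report where `attr` is exposed. Only checks existence, never reads it."""
    # Assumed location of per-device (sdev) attributes.
    per_device = sorted(
        glob.glob(os.path.join(sysfs_root, "block", "*", "device", attr))
    )
    # Assumed location of per-host (shost) attributes.
    per_host = sorted(
        glob.glob(os.path.join(sysfs_root, "class", "scsi_host", "*", attr))
    )
    return {"scsi_device": per_device, "scsi_host": per_host}


if __name__ == "__main__":
    found = locate_attr()
    print("exposed on SCSI devices:", found["scsi_device"])
    print("exposed on SCSI hosts:  ", found["scsi_host"])
```

On a kernel without the fix one would expect entries only in the first list, and on a fixed kernel only in the second; neither outcome has been verified here.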

Changed in linux (Ubuntu):
importance: Undecided → Medium
Eric Desrochers (slashd) wrote :

Stack Trace:

[46586.194135] general protection fault: 0000 [#1] SMP
[46586.194933] Modules linked in: ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw input_leds joydev sb_edac edac_core hpilo lpc_ich ioatdma dca shpchp wmi xfs ipmi_si 8250_fintek ipmi_msghandler tpm_infineon acpi_power_meter mac_hid libcrc32c bonding lp parport raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic psmouse raid6_pq sfc raid1 usbhid raid0 mtd hid tg3 pata_acpi i2c_algo_bit multipath ptp mdio hpsa pps_core linear
[46586.206368] CPU: 24 PID: 53367 Comm: udevadm Not tainted 4.2.0-30-generic #36~14.04.1-Ubuntu
[46586.207925] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
[46586.208898] task: ffff880386240000 ti: ffff880130fac000 task.ti: ffff880130fac000
[46586.210030] RIP: 0010:[<ffffffffc0030ddc>] [<ffffffffc0030ddc>] host_show_lockup_detected+0x2c/0x50 [hpsa]
[46586.213065] RSP: 0018:ffff880130fafd88 EFLAGS: 00010282
[46586.213882] RAX: 2030203020302030 RBX: ffffffffc00422e0 RCX: ffff88274bb66000
[46586.215502] RDX: ffff881fbf980000 RSI: ffffffffc00422e0 RDI: ffff881fb1416968
[46586.216560] RBP: ffff880130fafd88 R08: ffff881fb1416978 R09: 0000000000000000
[46586.217616] R10: 0000000000001000 R11: 0000000000000246 R12: ffffffff818749b0
[46586.218674] R13: 0000000000000001 R14: ffff880130faff20 R15: ffff881fb1d46180
[46586.219969] FS: 00007f060a57d880(0000) GS:ffff881fbf980000(0000) knlGS:0000000000000000
[46586.221181] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46586.222000] CR2: 0000000001748000 CR3: 0000000a6ad8f000 CR4: 00000000001406e0
[46586.223039] Stack:
[46586.223334] ffff880130fafdb8 ffffffff814f0eb0 ffff88274bb66000 ffffffff817ba686
[46586.224518] ffff881fa016aa00 ffff881fb1d46180 ffff880130fafdd8 ffffffff81263bc2
[46586.225716] 0000000000000000 ffff881fa016aa00 ffff880130fafde8 ffffffff81262413
[46586.227165] Call Trace:
[46586.227539] [<ffffffff814f0eb0>] dev_attr_show+0x20/0x50
[46586.228313] [<ffffffff817ba686>] ? mutex_lock+0x16/0x37
[46586.229113] [<ffffffff81263bc2>] sysfs_kf_seq_show+0xc2/0x1a0
[46586.229957] [<ffffffff81262413>] kernfs_seq_show+0x23/0x30
[46586.231564] [<ffffffff8120d8a5>] seq_read+0xe5/0x350
[46586.232303] [<ffffffff81262bcd>] kernfs_fop_read+0x10d/0x170
[46586.233180] [<ffffffff811ea708>] __vfs_read+0x18/0x40
[46586.233947] [<ffffffff811eace6>] vfs_read+0x86/0x130
[46586.235451] [<ffffffff811ebb06>] SyS_read+0x46/0xa0
[46586.236180] [<ffffffff817bc332>] entry_SYSCALL_64_fastpath+0x16/0x75
[46586.237160] Code: 1f 44 00 00 55 48 89 d1 48 8b 87 e0 02 00 00 48 89 e5 65 8b 15 a6 93 fd 3f 48 63 d2 48 8b 80 b8 4b 00 00 48 8b 14 d5 80 c1 d2 81 <8b> 14 02 48 c7 c6 5c fb 03 c0 48 89 cf 31 c0 e8 50 0b 38 c1 5d
[46586.242101] RIP [<ffffffffc0030ddc>] host_show_lockup_detected+0x2c/0x50 [hpsa]
[46586.243446] RSP <ffff880130fafd88>
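
The register dump in the trace above gives a hint about why this is a general protection fault rather than an ordinary page fault. The short sketch below takes the RAX and RDX values from the dump and checks two things: whether RAX decodes as printable text, and whether the values are canonical x86-64 addresses (assuming 48-bit virtual addressing, where bits 63..47 must all be equal). Reading RAX as "text that was mistaken for a pointer" is only one plausible interpretation, consistent with the show function being handed the wrong kind of device.

```python
RAX = 0x2030203020302030  # from the register dump above
RDX = 0xFFFF881FBF980000  # from the register dump above
MASK64 = 2**64 - 1


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all zeros or all ones (48-bit addressing assumed)."""
    top = addr >> 47  # the upper 17 bits
    return top == 0 or top == 0x1FFFF


def as_ascii(value: int) -> str:
    """Decode a 64-bit value as eight ASCII characters, most significant byte first."""
    return value.to_bytes(8, "big").decode("ascii")


if __name__ == "__main__":
    print("RAX as text:        ", repr(as_ascii(RAX)))
    print("RAX canonical:      ", is_canonical(RAX))
    print("RDX canonical:      ", is_canonical(RDX))
    print("RAX + RDX canonical:", is_canonical((RAX + RDX) & MASK64))
```

RAX consists entirely of the bytes 0x20 and 0x30 (space and the digit 0), and neither RAX nor RAX + RDX is a canonical address, which is the condition under which an x86-64 memory access raises a general protection fault.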

Eric Desrochers (slashd) wrote :

According to the vmcore[1], the crash seems to occur while performing the "udevadm"[2] command which is part of the sosreport under the block plugin.

[1] - vmcore

KERNEL: /usr/lib/debug/boot/vmlinux-4.2.0-30-generic
DUMPFILE: dump.201605101421 [PARTIAL DUMP]
CPUS: 40
DATE: Wed Dec 31 19:00:00 1969
UPTIME: 12:56:48
LOAD AVERAGE: 4.66, 5.51, 4.64
TASKS: 37318
NODENAME: XXXXXXXXXXXXXX
RELEASE: 4.2.0-30-generic
VERSION: #36~14.04.1-Ubuntu SMP Fri Feb 26 18:49:23 UTC 2016
MACHINE: x86_64 (2992 Mhz)
MEMORY: 256 GB
PANIC: ""
PID: 53367
COMMAND: "udevadm"
TASK: ffff880386240000 [THREAD_INFO: ffff880130fac000]
CPU: 24
STATE: TASK_RUNNING (PANIC)

----
[2] - sosreport - block plugin

sosreport-3.1/sos/plugins/block.py: self.add_cmd_output("udevadm info -ap /sys/block/%s" % (disk))
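
For illustration, the plugin line above formats one udevadm command per disk. The sketch below shows how such a command list could be built, assuming the plugin loops over the entries of /sys/block (the loop itself is not quoted in this report). block_commands is a hypothetical helper, not part of sosreport, and it only builds the command strings without running them.

```python
import os


def block_commands(sys_block="/sys/block"):
    """Build one 'udevadm info -ap' command string per entry in sys_block."""
    commands = []
    for disk in sorted(os.listdir(sys_block)):
        # Same format string as the sosreport line quoted above.
        commands.append("udevadm info -ap /sys/block/%s" % (disk))
    return commands


if __name__ == "__main__":
    if os.path.isdir("/sys/block"):
        for cmd in block_commands():
            print(cmd)  # printed only; running these on an affected kernel panics
```

Each of these commands walks up the parent devices and reads every attribute, which is how an sosreport run ends up reading lockup_detected on every disk.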

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1581169

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

Hi Brad,

I do have a vmcore + dmesg, but this is very large (total: ~16GB).

I will provide the test kernel based on Ubuntu-lts-4.2.0-36.41_14.04.1 including commit "fb53c43" and ask my contact to test it and confirm whether it mitigates the issue on their affected systems, where they can reproduce the panic on demand using the "udevadm" command.

Eric

Changed in linux (Ubuntu):
status: Incomplete → Triaged
tags: added: kernel-da-key wily
Changed in linux (Ubuntu Wily):
importance: Undecided → Medium
status: New → Triaged
Eric Desrochers (slashd) wrote :

Here's a test kernel based on "4.2.0-36.42~14.04.1" including the upstream commit "fb53c439 - hpsa: move lockup_detected attribute to host attr".

Please test and provide feedback.

Instructions:

# Pre-installation
$ sudo add-apt-repository ppa:slashd/bug1581169
$ sudo apt-get update

# Installation
$ sudo apt-get install linux-image-4.2.0-36-generic=4.2.0-36.42~14.04.1hf98078v20160516b1 -y
$ sudo apt-get install linux-image-extra-4.2.0-36-generic=4.2.0-36.42~14.04.1hf98078v20160516b1 -y
$ sudo apt-get install linux-headers-4.2.0-36-generic=4.2.0-36.42~14.04.1hf98078v20160516b1 -y

# Reboot
$ sudo reboot

Eric
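
After the reboot it may be worth confirming that the machine actually booted the test build before retrying udevadm. The sketch below is one way to do that, assuming the hotfix suffix from the package version (hf98078v20160516b1) shows up in the kernel version string reported by uname, as the version strings quoted elsewhere in this report suggest. is_test_kernel is a hypothetical helper name.

```python
import platform

EXPECTED_RELEASE = "4.2.0-36-generic"  # from the package names above
HOTFIX_TAG = "hf98078v20160516b1"      # suffix of the test package version above


def is_test_kernel(release=None, version=None):
    """True if the release matches and the version string carries the hotfix tag."""
    release = platform.release() if release is None else release  # like `uname -r`
    version = platform.version() if version is None else version  # like `uname -v`
    return release == EXPECTED_RELEASE and HOTFIX_TAG in version


if __name__ == "__main__":
    print("release:", platform.release())
    print("version:", platform.version())
    print("running the test kernel:", is_test_kernel())
```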

Eric Desrochers (slashd) wrote :

Here's what has been brought to my attention :

I've now got everything installed and can confirm that, with the kernel build (42~14.04.1hf98078v20160516b1-Ubuntu), both sosreport and the udevadm probe succeed on 4 affected machines.

It looks like this solves the issue.

Changed in linux (Ubuntu):
assignee: nobody → Eric Desrochers (slashd)
Changed in linux (Ubuntu Wily):
assignee: nobody → Eric Desrochers (slashd)
Eric Desrochers (slashd) wrote :

An email has been sent to "<email address hidden>" for an SRU on Wily.

Eric Desrochers (slashd) on 2016-05-18
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Wily):
status: Triaged → In Progress
Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-wily
Eric Desrochers (slashd) on 2016-06-14
tags: added: verification-done-wily
removed: verification-needed-wily
Eric Desrochers (slashd) on 2016-06-26
Changed in linux (Ubuntu):
status: In Progress → Fix Released
summary: - kernel panic (General protection fault) on module hpsa (lockup_detected)
+ General protection fault panic on module hpsa with lockup_detected
+ attribute
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-41.48

---------------
linux (4.2.0-41.48) wily; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1595914

  [ Upstream Kernel Changes ]

  * netfilter: x_tables: validate e->target_offset early
    - LP: #1555338
    - CVE-2016-3134
  * netfilter: x_tables: make sure e->next_offset covers remaining blob
    size
    - LP: #1555338
    - CVE-2016-3134
  * netfilter: x_tables: fix unconditional helper
    - LP: #1555338
    - CVE-2016-3134
  * netfilter: x_tables: don't move to non-existent next rule
    - LP: #1595350
  * netfilter: x_tables: validate targets of jumps
    - LP: #1595350
  * netfilter: x_tables: add and use xt_check_entry_offsets
    - LP: #1595350
  * netfilter: x_tables: kill check_entry helper
    - LP: #1595350
  * netfilter: x_tables: assert minimum target size
    - LP: #1595350
  * netfilter: x_tables: add compat version of xt_check_entry_offsets
    - LP: #1595350
  * netfilter: x_tables: check standard target size too
    - LP: #1595350
  * netfilter: x_tables: check for bogus target offset
    - LP: #1595350
  * netfilter: x_tables: validate all offsets and sizes in a rule
    - LP: #1595350
  * netfilter: x_tables: don't reject valid target size on some
    architectures
    - LP: #1595350
  * netfilter: arp_tables: simplify translate_compat_table args
    - LP: #1595350
  * netfilter: ip_tables: simplify translate_compat_table args
    - LP: #1595350
  * netfilter: ip6_tables: simplify translate_compat_table args
    - LP: #1595350
  * netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
    - LP: #1595350
  * netfilter: x_tables: do compat validation via translate_table
    - LP: #1595350
  * netfilter: x_tables: introduce and use xt_copy_counters_from_user
    - LP: #1595350

linux (4.2.0-40.47) wily; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1595725

  [ Serge Hallyn ]

  * SAUCE: add a sysctl to disable unprivileged user namespace unsharing
    - LP: #1555338, #1595350

linux (4.2.0-39.46) wily; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1591301

  [ J. R. Okajima ]

  * SAUCE: AUFS: mm/mmap: fix oopsing on remap_file_pages aufs mmap:
    bugfix, mainly for linux-4.5-rc5, remap_file_pages(2) emulation
    - LP: #1558120

  [ Kamal Mostafa ]

  * [debian] getabis: Only git add $abidir if running in local repo
    - LP: #1584890
  * [debian] getabis: Fix inconsistent compiler versions check
    - LP: #1584890

  [ Tim Gardner ]

  * Revert "SAUCE: mm/mmap: fix oopsing on remap_file_pages"
    - LP: #1558120
  * [Config] Remove arc4 from nic-modules
    - LP: #1582991

  [ Upstream Kernel Changes ]

  * Revert "usb: hub: do not clear BOS field during reset device"
    - LP: #1582864
  * hpsa: move lockup_detected attribute to host attr
    - LP: #1581169
  * ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
    - LP: #1580379
    - CVE-2016-4569
  * ALSA: timer: Fix leak in events via snd_timer_user_ccallback
    - LP: #1581866
    - CVE-2016-4578
  * ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
    - LP: #1581866
    - CVE-2016-4578
  * net: fix a kernel inf...


Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released