Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode, sig: 5 [#1] " (lpfc)

Bug #1648873 reported by bugproxy on 2016-12-09
This bug affects 1 person
Affects                      Importance  Assigned to
debian-installer (Ubuntu)    Undecided   Adam Conrad
  Xenial                     Undecided   Adam Conrad
  Yakkety                    Undecided   Adam Conrad
linux (Ubuntu)               High        Canonical Kernel Team
  Xenial                     Undecided   Unassigned
  Yakkety                    Undecided   Tim Gardner

Bug Description

== Comment: #33 - Mauricio Faria De Oliveira - 2016-12-09 06:49:57 ==

Hi Canonical,

Can you please apply this patch [1] to 16.10 and 16.04.x HWE (4.8) ?
It fixes a regression introduced in 4.8.

As you can see, it's in the 'fixes' branch of the SCSI maintainer (James Bottomley), but it didn't make 4.9-rc8 (maybe he considered it too late for that one).

We have installer, boot, and post-boot problems due to this one.
It'd be good if the netboot images for 16.04.x HWE kernel can get it too.

Thank you,

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
    http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

Historical context:

== Comment: #0 - HARSHA THYAGARAJA - 2016-11-21 02:39:35 ==
---Problem Description---
Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode. " (kernel: 4.8.0-27)

Machine Type = Power8 baremetal

---boot type---
QEMU direct boot kernel/initrd

---Kernel cmdline used to launch install---
On a Power8 server, using kernel and initrd images:

netcfg/disable_dhcp=true netcfg/confirm_static=true netcfg/choose_interface=98:BE:94:00:4C:68 netcfg/get_ipaddress=9.47.67.159/20 netcfg/get_gateway=9.47.79.254 netcfg/get_nameservers=

---Install repository type---
Internet repository

---Install repository Location---
http://ports.ubuntu.com/ubuntu-ports/dists/yakkety/main/installer-ppc64el/current/images/netboot/ubuntu-installer/ppc64el/

---Point of failure---
Other failure during installation (stage 1)

== Comment: #1 - HARSHA THYAGARAJA - 2016-11-21 02:41:54 ==
The netboot install fails, and call traces are seen at the disk detection step.

== Comment: #8 - Mauricio Faria De Oliveira - 2016-11-21 15:58:25 ==
Finally got it.

The assembly offset/code and the trap signal both point to this BUG_ON(); its second condition triggered the trap.

Checking why piocb is not NULL but piocb->vport is NULL.
This might have happened in the lpfc_linkdown_port() path, in the stack trace.

We would need a more readable console log (i.e., dmesg, as requested in comments 5 and 3) to help understand it.

--

static int
lpfc_sli_ringtxcmpl_put(struct lpfc_hba *phba, struct lpfc_sli_ring *pring,
                        struct lpfc_iocbq *piocb)
{
        lockdep_assert_held(&phba->hbalock);

        BUG_ON(!piocb || !piocb->vport);
<...>
}
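
For readers following along, here is a minimal userspace sketch of the two-part check in that BUG_ON(). The struct and function names (vport_model, iocbq_model, bug_on_cause, model_ringtxcmpl_put) are hypothetical stand-ins and not the real lpfc definitions; assert() is used only as a rough model of BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real lpfc structures. */
struct vport_model { int id; };
struct iocbq_model { struct vport_model *vport; };

enum bug_cause { BUG_NONE, BUG_NULL_IOCB, BUG_NULL_VPORT };

/*
 * Mirrors the short-circuit order of BUG_ON(!piocb || !piocb->vport):
 * the vport field is only read once piocb is known to be non-NULL.
 */
static enum bug_cause bug_on_cause(const struct iocbq_model *piocb)
{
        if (!piocb)
                return BUG_NULL_IOCB;   /* first half of the OR */
        if (!piocb->vport)
                return BUG_NULL_VPORT;  /* second half of the OR */
        return BUG_NONE;
}

/* Rough model of the guarded entry point: abort if either half is true. */
static void model_ringtxcmpl_put(const struct iocbq_model *piocb)
{
        assert(bug_on_cause(piocb) == BUG_NONE);
}
```

The case described in this comment corresponds to BUG_NULL_VPORT: a valid piocb whose vport pointer is NULL.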

[ 226.147886] NIP [d00000000b7324c0] lpfc_sli_ringtxcmpl_put+0x48/0x120 [lpfc]

0x2478 + 0x48 = 0x24c0 (tdnei; trap doubleword not equal immediate)
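
The offset arithmetic above can be sketched as a small helper. This is only an illustration: the function names are made up here, and it assumes the oops symbol has the usual "name+0xOFF/0xSIZE" form and that the symbol start address was read from a disassembly of the module.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the "+0xOFF/0xSIZE" part of an oops symbol such as
 * "lpfc_sli_ringtxcmpl_put+0x48/0x120". Returns 0 on success, -1 on failure.
 */
static int oops_offset(const char *sym, unsigned long *off, unsigned long *size)
{
        char name[128];

        assert(sym != NULL);
        if (sscanf(sym, "%127[^+]+%lx/%lx", name, off, size) != 3)
                return -1;
        return 0;
}

/*
 * Map an oops symbol to an address in the disassembly, given the address
 * at which the function starts there. Returns 0 if the symbol cannot be
 * parsed or the offset lies outside the function.
 */
static unsigned long disasm_addr(unsigned long sym_start, const char *sym)
{
        unsigned long off, size;

        if (oops_offset(sym, &off, &size) != 0 || off >= size)
                return 0;
        return sym_start + off;
}
```

With the values from this report, disasm_addr(0x2478, "lpfc_sli_ringtxcmpl_put+0x48/0x120") gives 0x24c0, the address of the tdnei instruction.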

0000000000002478 <lpfc_sli_ringtxcmpl_put>:
<...>
    2498: 78 2b bf 7c mr r31,r5 // r31 is *piocb (r5 is the 3rd function parameter)
    249c: 78 1b 7d 7c mr r29,r3
    24a0: 78 23 9e 7c mr r30,r4
    24a4: 01 00 00 48 bl 24a4 <lpfc_sli_ringtxcmpl_put+0x2c> // probably converted at module load time to the call lockdep_assert_held()
    24a8: 00 00 00 60 nop
    24ac: 00 00 bf 2f cmpdi cr7,r31,0 // compare piocb with 0. checking for NULL.
    24b0: 70 00 9e 41 beq cr7,2520 <lpfc_sli_ringtxcmpl_put+0xa8> // if equal to zero, branch out. done w/ the former part of the OR check.
    24b4: e8 00 3f e9 ld r9,232(r31) // an offset of piocb. probably piocb->vport in the bug_on
    24b8: 74 00 29 7d cntlzd r9,r9 // count leading zeroes. if r9 is null (0), leading zeroes is 64 (binary: 0100_0000, bit 6 is 1, and 6 LSbs [bits 5-0] are 0)
    24bc: 82 d1 29 79 rldicl r9,r9,58,6 // rotate left 58 (ie, those 6 LSbs are now MSbs, and that bit 6 from 64 was rotated in the register and is now bit 0, the LSb), now AND the 6 MSbs w/ 0-bits, and all the lower bits with 1-bits (ie, save the LSb).
    24c0: 00 00 09 0b tdnei r9,0 // trap if not equal to zero. (ie, the whole r9 was zero, with 64 leading/consecutive zeroes, then bit 6 is 1, it becomes bit 0.. and since bit 0 is now 1, r9 is thus non-zero, and the trap triggers.) this checked the latter part of the OR.
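
The cntlzd / rldicl / tdnei sequence can be checked with a small userspace emulation. This is a sketch, not kernel code: the helper names are invented here, and the rldicl emulation is hard-coded for the (58, 6) operands seen in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 64-bit value; returns 64 for 0, like cntlzd. */
static unsigned clz64(uint64_t x)
{
        unsigned n = 0;

        if (x == 0)
                return 64;
        while (!(x & (UINT64_C(1) << 63))) {
                x <<= 1;
                n++;
        }
        return n;
}

/* Rotate left by n bits, 0 <= n < 64. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
        assert(n < 64);
        return n ? (x << n) | (x >> (64 - n)) : x;
}

/* rldicl rA,rS,58,6: rotate left by 58, then clear the 6 most significant bits. */
static uint64_t rldicl_58_6(uint64_t x)
{
        return rotl64(x, 58) & (UINT64_MAX >> 6);
}

/* Returns 1 when "tdnei r9,0" would trap, i.e. when the loaded pointer is 0. */
static int would_trap(uint64_t vport)
{
        return rldicl_58_6(clz64(vport)) != 0;
}
```

In this emulation the result is non-zero only when the input is 0: any non-zero pointer has at most 63 leading zeroes, so bit 6 of the count is clear and the rotated, masked value is 0.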

bugproxy (bugproxy) on 2016-12-09
tags: added: architecture-ppc64le bugnameltc-148978 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Adam Conrad (adconrad) on 2016-12-10
Changed in debian-installer (Ubuntu Xenial):
assignee: nobody → Adam Conrad (adconrad)
Changed in debian-installer (Ubuntu Yakkety):
assignee: nobody → Adam Conrad (adconrad)
Changed in linux (Ubuntu Xenial):
status: New → Invalid
Changed in debian-installer (Ubuntu):
assignee: nobody → Adam Conrad (adconrad)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)

------- Comment From <email address hidden> 2016-12-12 11:19 EDT-------
Canonical,

The patch has been accepted into mainline/4.9 [1].
It was submitted to the kernel-team mailing list a while ago [2], but it is not in the archives yet.

Updated netboot files (lpfc module) required.

Thanks!

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/lpfc?id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

[2] subject: "[SRU][Xenial HWE 4.8][Yakkety][PATCH] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()"

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-13 05:51 EDT-------
*** Bug 149899 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-13 20:52 EDT-------
*** Bug 149899 has been marked as a duplicate of this bug. ***

Luis Henriques (henrix) on 2016-12-14
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
bugproxy (bugproxy) on 2016-12-30
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-34.36

---------------
linux (4.8.0-34.36) yakkety; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1651800

  * Miscellaneous Ubuntu changes
    - SAUCE: Do not build the xr-usb-serial driver for s390

linux (4.8.0-33.35) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1651721

  [ Luis Henriques ]

  * crypto : tolerate new crypto hardware for z Systems (LP: #1644557)
    - s390/zcrypt: Introduce CEX6 toleration

  * Several new Asus laptops are missing touchpad support (LP: #1650895)
    - HID: asus: Add i2c touchpad support

  * Acer, Inc ID 5986:055a is useless after 14.04.2 installed. (LP: #1433906)
    - uvcvideo: uvc_scan_fallback() for webcams with broken chain

  * cdc_ether fills kernel log (LP: #1626371)
    - cdc_ether: Fix handling connection notification

  * Kernel Fixes to get TCMU File Backed Optical to work (LP: #1646204)
    - SAUCE: target/user: Fix use-after-free of tcmu_cmds if they are expired

  * CVE-2016-9756
    - KVM: x86: drop error recovery in em_jmp_far and em_ret_far

  * On boot excessive number of kworker threads are running (LP: #1649905)
    - slub: move synchronize_sched out of slab_mutex on shrink

  * Ethernet not work after upgrade from kernel 3.19 to 4.4 [10ec:8168]
    (LP: #1648279)
    - ACPI / blacklist: Make Dell Latitude 3350 ethernet work

  * Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode,
    sig: 5 [#1] " (lpfc) (LP: #1648873)
    - scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()

  * CVE-2016-9793
    - net: avoid signed overflows for SO_{SND|RCV}BUFFORCE

  * [Hyper-V] Kernel panic not functional on 32bit Ubuntu 14.10, 15.04, and
    15.10 (LP: #1400319)
    - Drivers: hv: avoid vfree() on crash

  * d-i is missing usb support for platforms that use the xhci-platform driver
    (LP: #1625222)
    - d-i initrd needs additional usb modules to support the merlin platform

  * overlayfs no longer supports nested overlayfs mounts, but there is a fix
    upstream (LP: #1647007)
    - ovl: fix d_real() for stacked fs

  * Yakkety: arm64: CONFIG_ARM64_ERRATUM_845719 isn't enabled (LP: #1647793)
    - [Config] CONFIG_ARM64_ERRATUM_845719=y

  * Ubuntu16.10 - EEH on BELL3 adapter fails to recover (serial/tty)
    (LP: #1646857)
    - serial: 8250_pci: Detach low-level driver during PCI error recovery

  * Driver for Exar USB UART (LP: #1645591)
    - SAUCE: xr-usb-serial: Driver for Exar USB serial ports
    - SAUCE: xr-usb-serial: interface for switching modes
    - SAUCE: cdc-acm: Exclude Exar USB serial ports

  * [Bug] (Purley) x86/hpet: Reduce HPET counter read contention (LP: #1645928)
    - x86/hpet: Reduce HPET counter read contention

  * Need Alps upstream their new touchpad driver (LP: #1571530)
    - Input: ALPS - add touchstick support for SS5 hardware
    - Input: ALPS - handle 0-pressure 1F events
    - Input: ALPS - allow touchsticks to report pressure
    - Input: ALPS - set DualPoint flag for 74 03 28 devices

  * CONFIG_NR_CPUS=256 is too low (LP: #1579205)
    - [Config] Increase the NR_CPUS to 512 for amd64 to support systems with a...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-01-11 17:07 EDT-------
*** Bug 150058 has been marked as a duplicate of this bug. ***
