Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode, sig: 5 [#1] " (lpfc)

Bug #1648873 reported by bugproxy
This bug affects 1 person
Affects                     Status         Importance   Assigned to             Milestone
debian-installer (Ubuntu)   Fix Released   Undecided    Adam Conrad
  Xenial                    Fix Released   Undecided    Adam Conrad
  Yakkety                   Fix Released   Undecided    Adam Conrad
linux (Ubuntu)              Fix Released   High         Canonical Kernel Team
  Xenial                    Invalid        Undecided    Unassigned
  Yakkety                   Fix Released   Undecided    Tim Gardner

Bug Description

== Comment: #33 - Mauricio Faria De Oliveira - 2016-12-09 06:49:57 ==

Hi Canonical,

Can you please apply this patch [1] to 16.10 and 16.04.x HWE (4.8) ?
It fixes a regression introduced in 4.8.

As you can see, it's in the 'fixes' branch of the SCSI maintainer (James Bottomley), but it didn't make 4.9-rc8 (maybe he considered it too late for that one).

We have installer, boot, and post-boot problems due to this one.
It'd be good if the netboot images for 16.04.x HWE kernel can get it too.

Thank you,

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
    http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

Historical context:

== Comment: #0 - HARSHA THYAGARAJA - 2016-11-21 02:39:35 ==
---Problem Description---
Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode. " (kernel: 4.8.0-27)

Machine Type = Power8 baremetal

---boot type---
QEMU direct boot kernel/initrd

---Kernel cmdline used to launch install---
On a Power8 server, Using kernel and initrd images,

netcfg/disable_dhcp=true netcfg/confirm_static=true netcfg/choose_interface=98:BE:94:00:4C:68 netcfg/get_ipaddress=9.47.67.159/20 netcfg/get_gateway=9.47.79.254 netcfg/get_nameservers=

---Install repository type---
Internet repository

---Install repository Location---
http://ports.ubuntu.com/ubuntu-ports/dists/yakkety/main/installer-ppc64el/current/images/netboot/ubuntu-installer/ppc64el/

---Point of failure---
Other failure during installation (stage 1)

== Comment: #1 - HARSHA THYAGARAJA - 2016-11-21 02:41:54 ==
The netboot install fails and Call traces are seen at the Disk detection step.

== Comment: #8 - Mauricio Faria De Oliveira - 2016-11-21 15:58:25 ==
Finally got it.

The assembly offset/code + the trap signal is due to this BUG_ON(), and the second condition triggered the trap.

Checking why piocb is not NULL but piocb->vport is NULL.
This might have happened in the lpfc_linkdown_port() path, in the stack trace.

Would need a more readable console log (ie, dmesg, as requested in comments 5, 3) to help understand it.

--

static int
lpfc_sli_ringtxcmpl_put(struct lpfc_hba *phba, struct lpfc_sli_ring *pring,
                        struct lpfc_iocbq *piocb)
{
        lockdep_assert_held(&phba->hbalock);

        BUG_ON(!piocb || !piocb->vport);
<...>
}

[ 226.147886] NIP [d00000000b7324c0] lpfc_sli_ringtxcmpl_put+0x48/0x120 [lpfc]

0x2478 + 0x48 = 0x24c0 (tdnei; trap doubleword not equal immediate)

0000000000002478 <lpfc_sli_ringtxcmpl_put>:
<...>
    2498: 78 2b bf 7c mr r31,r5 // r31 is *piocb (r5 is the 3rd function parameter)
    249c: 78 1b 7d 7c mr r29,r3
    24a0: 78 23 9e 7c mr r30,r4
    24a4: 01 00 00 48 bl 24a4 <lpfc_sli_ringtxcmpl_put+0x2c> // probably converted at module load time to the call lockdep_assert_held()
    24a8: 00 00 00 60 nop
    24ac: 00 00 bf 2f cmpdi cr7,r31,0 // compare piocb with 0. checking for NULL.
    24b0: 70 00 9e 41 beq cr7,2520 <lpfc_sli_ringtxcmpl_put+0xa8> // if equal to zero, branch out. done w/ the former part of the OR check.
    24b4: e8 00 3f e9 ld r9,232(r31) // an offset of piocb. probably piocb->vport in the bug_on
    24b8: 74 00 29 7d cntlzd r9,r9 // count leading zeroes. if r9 is null (0), leading zeroes is 64 (binary: 0100_0000, bit 6 is 1, and 6 LSbs [bits 5-0] are 0)
    24bc: 82 d1 29 79 rldicl r9,r9,58,6 // rotate left 58 (ie, those 6 LSbs are now MSbs, and that bit 6 from 64 was rotated in the register and is now bit 0, the LSb), then AND the 6 MSbs w/ 0-bits, and all the lower bits with 1-bits (ie, save the LSb).
    24c0: 00 00 09 0b tdnei r9,0 // trap if not equal to zero. (ie, the whole r9 was zero, with 64 leading/consecutive zeroes, then bit 6 is 1, it becomes bit 0.. and since bit 0 is now 1, r9 is thus non-zero, and the trap triggers.) this checked the latter part of the OR.
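
The cntlzd / rldicl / tdnei sequence above can be sanity-checked with a small user-space C sketch. This is not kernel or driver code: the helper names cntlzd(), rldicl(), vport_check_traps() and check_walkthrough() are made up for illustration, and they only emulate the instructions as described in the comments above. The sketch also re-checks the NIP offset arithmetic (0x2478 + 0x48).

```c
#include <assert.h>
#include <stdint.h>

/* Emulates cntlzd: count leading zero bits of a 64-bit value (64 for 0). */
static uint64_t cntlzd(uint64_t x)
{
        uint64_t n = 0;

        if (x == 0)
                return 64;
        while (!(x & (UINT64_C(1) << 63))) {
                x <<= 1;
                n++;
        }
        return n;
}

/* Emulates rldicl: rotate left by sh, then clear the mb most-significant bits. */
static uint64_t rldicl(uint64_t x, unsigned int sh, unsigned int mb)
{
        uint64_t rot = sh ? (x << sh) | (x >> (64 - sh)) : x;
        uint64_t mask = mb ? (~UINT64_C(0) >> mb) : ~UINT64_C(0);

        return rot & mask;
}

/*
 * Hypothetical helper: returns 1 when "tdnei r9,0" would trap for a given
 * raw value of piocb->vport, 0 otherwise.
 */
static int vport_check_traps(uint64_t vport)
{
        uint64_t r9 = cntlzd(vport);    /* 64 only when vport == 0 */

        r9 = rldicl(r9, 58, 6);         /* same as r9 >> 6: 1 only for 64 */
        return r9 != 0;                 /* tdnei r9,0 traps if r9 != 0 */
}

/* Compares the emulation with the C condition "!vport" on sample values. */
static void check_walkthrough(void)
{
        const uint64_t samples[] = {
                0, 1, 63, 64, UINT64_C(0xc000000012345678), ~UINT64_C(0),
        };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                assert(vport_check_traps(samples[i]) == (samples[i] == 0));

        /* Function start + offset from the NIP line gives the trap address. */
        assert(0x2478 + 0x48 == 0x24c0);
}
```

Calling check_walkthrough() from a test program should pass silently. vport_check_traps(0) returning 1 corresponds to the second BUG_ON() condition, !piocb->vport, being the one that fires.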


bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-148978 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Adam Conrad (adconrad)
Changed in debian-installer (Ubuntu Xenial):
assignee: nobody → Adam Conrad (adconrad)
Changed in debian-installer (Ubuntu Yakkety):
assignee: nobody → Adam Conrad (adconrad)
Changed in linux (Ubuntu Xenial):
status: New → Invalid
Changed in debian-installer (Ubuntu):
assignee: nobody → Adam Conrad (adconrad)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-12-12 11:19 EDT-------
Canonical,

The patch has been accepted into mainline/4.9 [1].
Submitted to the kernel-team mailing list a while ago [2], but it's not in the archives yet.

Updated netboot files (lpfc module) required.

Thanks!

[1] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/lpfc?id=2319f847a8910cff1d46c9b66aa1dd7cc3e836a9

[2] subject: "[SRU][Xenial HWE 4.8][Yakkety][PATCH] scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()"

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-13 05:51 EDT-------
*** Bug 149899 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-13 20:52 EDT-------
*** Bug 149899 has been marked as a duplicate of this bug. ***

Luis Henriques (henrix)
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
bugproxy (bugproxy)
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-34.36

---------------
linux (4.8.0-34.36) yakkety; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1651800

  * Miscellaneous Ubuntu changes
    - SAUCE: Do not build the xr-usb-serial driver for s390

linux (4.8.0-33.35) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1651721

  [ Luis Henriques ]

  * crypto : tolerate new crypto hardware for z Systems (LP: #1644557)
    - s390/zcrypt: Introduce CEX6 toleration

  * Several new Asus laptops are missing touchpad support (LP: #1650895)
    - HID: asus: Add i2c touchpad support

  * Acer, Inc ID 5986:055a is useless after 14.04.2 installed. (LP: #1433906)
    - uvcvideo: uvc_scan_fallback() for webcams with broken chain

  * cdc_ether fills kernel log (LP: #1626371)
    - cdc_ether: Fix handling connection notification

  * Kernel Fixes to get TCMU File Backed Optical to work (LP: #1646204)
    - SAUCE: target/user: Fix use-after-free of tcmu_cmds if they are expired

  * CVE-2016-9756
    - KVM: x86: drop error recovery in em_jmp_far and em_ret_far

  * On boot excessive number of kworker threads are running (LP: #1649905)
    - slub: move synchronize_sched out of slab_mutex on shrink

  * Ethernet not work after upgrade from kernel 3.19 to 4.4 [10ec:8168]
    (LP: #1648279)
    - ACPI / blacklist: Make Dell Latitude 3350 ethernet work

  * Ubuntu 16.10 netboot install fails with "Oops: Exception in kernel mode,
    sig: 5 [#1] " (lpfc) (LP: #1648873)
    - scsi: lpfc: fix oops/BUG in lpfc_sli_ringtxcmpl_put()

  * CVE-2016-9793
    - net: avoid signed overflows for SO_{SND|RCV}BUFFORCE

  * [Hyper-V] Kernel panic not functional on 32bit Ubuntu 14.10, 15.04, and
    15.10 (LP: #1400319)
    - Drivers: hv: avoid vfree() on crash

  * d-i is missing usb support for platforms that use the xhci-platform driver
    (LP: #1625222)
    - d-i initrd needs additional usb modules to support the merlin platform

  * overlayfs no longer supports nested overlayfs mounts, but there is a fix
    upstream (LP: #1647007)
    - ovl: fix d_real() for stacked fs

  * Yakkety: arm64: CONFIG_ARM64_ERRATUM_845719 isn't enabled (LP: #1647793)
    - [Config] CONFIG_ARM64_ERRATUM_845719=y

  * Ubuntu16.10 - EEH on BELL3 adapter fails to recover (serial/tty)
    (LP: #1646857)
    - serial: 8250_pci: Detach low-level driver during PCI error recovery

  * Driver for Exar USB UART (LP: #1645591)
    - SAUCE: xr-usb-serial: Driver for Exar USB serial ports
    - SAUCE: xr-usb-serial: interface for switching modes
    - SAUCE: cdc-acm: Exclude Exar USB serial ports

  * [Bug] (Purley) x86/hpet: Reduce HPET counter read contention (LP: #1645928)
    - x86/hpet: Reduce HPET counter read contention

  * Need Alps upstream their new touchpad driver (LP: #1571530)
    - Input: ALPS - add touchstick support for SS5 hardware
    - Input: ALPS - handle 0-pressure 1F events
    - Input: ALPS - allow touchsticks to report pressure
    - Input: ALPS - set DualPoint flag for 74 03 28 devices

  * CONFIG_NR_CPUS=256 is too low (LP: #1579205)
    - [Config] Increase the NR_CPUS to 512 for amd64 to support systems with a...


Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-01-11 17:07 EDT-------
*** Bug 150058 has been marked as a duplicate of this bug. ***

Adam Conrad (adconrad)
Changed in debian-installer (Ubuntu Xenial):
status: New → Fix Released
Changed in debian-installer (Ubuntu Yakkety):
status: New → Fix Released
Changed in debian-installer (Ubuntu):
status: New → Fix Released
Brad Figg (brad-figg)
tags: added: cscc