Kernel panic after the ubuntu_nbd_smoke_test on Xenial kernel

Bug #1793464 reported by Po-Hsu Lin on 2018-09-20
This bug affects 1 person
Affects              Status        Importance  Assigned to     Milestone
ubuntu-kernel-tests  Fix Released  Undecided   Unassigned
linux (Ubuntu)       Fix Released  High        Colin Ian King
  Xenial             Fix Released  Undecided   Unassigned

Bug Description

== SRU Justification ==

When running the Ubuntu nbd autotest regression test we trip a hang
and then, a little later, a panic message. Two upstream fixes are
required because this is actually two issues in one. The first fix is
to not shut down the sock when IRQs are disabled, and the second is to
fix a race in the nbd ioctl.

== Fix ==

Upstream commits:

23272a6754b81ff6503e09c743bb4ceeeab39997
  nbd: Remove signal usage

1f7b5cf1be4351e60cf8ae7aab976503dd73c5f8
  nbd: Timeouts are not user requested disconnects

0e4f0f6f63d3416a9e529d99febfe98545427b81
  nbd: Cleanup reset of nbd and bdev after a disconnect

c261189862c6f65117eb3b1748622a08ef49c262
  nbd: don't shutdown sock with irq's disabled

97240963eb308d8d21a89c0459822f7ea98463b4
  nbd: fix race in ioctl

The first 3 patches are prerequisites required for the latter two fixes to apply and work correctly. Most of these backports need minor patch wiggles
because later upstream patches were already applied to the driver by earlier fixes.
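
The ordering constraint above can be made explicit with a small sketch. The commit IDs and subjects are the ones listed in this section; the use of `git cherry-pick -x` is an assumption for illustration only (the real backports needed manual adjustment, so a plain cherry-pick will not necessarily apply cleanly), and the helper name `cherry_pick_commands` is hypothetical.

```python
# Commits from the Fix section, in the order they must be applied.
# The first three are prerequisites; the last two are the actual fixes.
NBD_COMMITS = [
    ("23272a6754b81ff6503e09c743bb4ceeeab39997", "nbd: Remove signal usage"),
    ("1f7b5cf1be4351e60cf8ae7aab976503dd73c5f8", "nbd: Timeouts are not user requested disconnects"),
    ("0e4f0f6f63d3416a9e529d99febfe98545427b81", "nbd: Cleanup reset of nbd and bdev after a disconnect"),
    ("c261189862c6f65117eb3b1748622a08ef49c262", "nbd: don't shutdown sock with irq's disabled"),
    ("97240963eb308d8d21a89c0459822f7ea98463b4", "nbd: fix race in ioctl"),
]
PREREQUISITE_COUNT = 3


def cherry_pick_commands(commits=NBD_COMMITS):
    """Return one `git cherry-pick -x` command line per commit, in order."""
    return [f"git cherry-pick -x {sha}  # {subject}" for sha, subject in commits]


if __name__ == "__main__":
    for index, command in enumerate(cherry_pick_commands()):
        role = "prerequisite" if index < PREREQUISITE_COUNT else "fix"
        print(f"[{role}] {command}")
```

Keeping the list ordered, rather than sorting by hash or subject, is the point of the sketch: the two fixes at the end depend on the three prerequisites before them.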

== Regression Potential ==

These fixes touch only nbd, so the regression potential is limited to that driver. In addition, we are pulling in upstream fixes that already exist in the Bionic and Cosmic kernels, so these are tried and tested fixes.

== Test Case ==

  1. Deploy a node with 4.4 Xenial
  2. Run the ubuntu_nbd_smoke_test

Without the fix, we get hangs/crashes. With the fix, one can run this test
multiple times without any issues at all.
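
Before running the test case it can help to know whether the node's kernel already carries the fix. The following is a minimal sketch, assuming the fix first shipped in ABI 139 of the Xenial 4.4.0 series (package 4.4.0-139.165, per the changelog later in this report). The function name `has_nbd_fix` is hypothetical, and kernels outside the 4.4.0 series are deliberately reported as unknown rather than guessed.

```python
import re

# Assumption: first fixed Xenial kernel is 4.4.0-139 (see changelog in this report).
FIXED_BASE = (4, 4, 0)
FIXED_ABI = 139


def has_nbd_fix(release):
    """Return True/False for 4.4.0 Xenial kernels, None for anything else.

    `release` is a string in the form printed by `uname -r`,
    for example "4.4.0-136-generic".
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        return None  # not a recognisable Ubuntu kernel release string
    major, minor, patch, abi = (int(part) for part in match.groups())
    if (major, minor, patch) != FIXED_BASE:
        return None  # outside the series this report covers
    return abi >= FIXED_ABI


if __name__ == "__main__":
    import platform

    # platform.release() returns the same string as `uname -r`.
    print(has_nbd_fix(platform.release()))
```

For the kernel in the original report, `has_nbd_fix("4.4.0-136-generic")` evaluates to `False`, so a hang after the test would be expected there.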

----

This issue can be reproduced on AMD64 KVM / bare-metal nodes and on an s390x zKVM node.

The test itself will pass, but the system will hang after a few seconds.

Steps:
  1. Deploy a node with 4.4 Xenial
  2. Run the ubuntu_nbd_smoke_test

If you have access to the console, you will see that the system actually hit a kernel panic:

 Unable to handle kernel pointer dereference in virtual kernel address space
 failing address: 000003ff802c1000 TEID: 000003ff802c1803
 Fault in home space mode while using kernel ASCE.
 Log here (s390x KVM): https://pastebin.ubuntu.com/p/dNmtvbGjmz/

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-136-generic 4.4.0-136.162
ProcVersionSignature: Ubuntu 4.4.0-136.162-generic 4.4.144
Uname: Linux 4.4.0-136-generic s390x
NonfreeKernelModules: zfs zunicode zcommon znvpair zavl
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Thu Sep 20 03:46:00 2018
HibernationDevice: RESUME=UUID=ca468a9c-9563-442c-85c6-6055e800a66e
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
PciMultimedia:

ProcFB: Error: [Errno 2] No such file or directory: '/proc/fb'
ProcKernelCmdLine: root=UUID=b65b756a-ba4e-4c53-aa32-0db2bdb50bb3 crashkernel=196M
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-136-generic N/A
 linux-backports-modules-4.4.0-136-generic N/A
 linux-firmware 1.157.20
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)


Po-Hsu Lin (cypressyew) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1793464

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Po-Hsu Lin (cypressyew) wrote :

Tested on node "rumford" with the AMD64 X kernel in -released; this issue still exists.

Po-Hsu Lin (cypressyew) on 2018-10-08
description: updated
Colin Ian King (colin-king) wrote :

I can reproduce this on a -133 Xenial kernel too, in a Xenial VM. After the test, I ran vmstat 1 and observed the machine hang after about 20 seconds.

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
status: Incomplete → In Progress
Colin Ian King (colin-king) wrote :

It also occurs as far back as 4.4.0-21.

Colin Ian King (colin-king) wrote :

Coarse sanity check with mainline kernels:

4.2 OK
4.3 hangs
4.4 hangs
4.5 hangs
4.6 - 4.8 no hang, but dumps message (see below)
4.9 OK

Oct 8 11:04:03 ubuntu kernel: [ 31.788232] block nbd0: NBD_DISCONNECT
Oct 8 11:04:03 ubuntu kernel: [ 31.788286] block nbd0: shutting down socket
Oct 8 11:04:03 ubuntu kernel: [ 31.788290] ------------[ cut here ]------------
Oct 8 11:04:03 ubuntu kernel: [ 31.788299] WARNING: CPU: 0 PID: 1807 at /home/kernel/COD/linux/kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
Oct 8 11:04:03 ubuntu kernel: [ 31.788301] Modules linked in: squashfs loop fuse nbd nls_iso8859_1 vfat fat snd_hda_codec_generic snd_hda_intel snd_hda_codec ppdev snd_hda_core virtio_console snd_hwdep virtio_balloon snd_pcm joydev input_leds efi_pstore led_class snd_timer efivars serio_raw snd i2c_piix4 soundcore acpi_cpufreq parport_pc 8250_fintek processor parport qemu_fw_cfg mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi efivarfs autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic virtio_net virtio_blk pata_acpi qxl ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul sysfillrect sysimgblt crc32c_intel fb_sys_fops ghash_clmulni_intel drm ata_piix intel_agp libata aesni_intel intel_gtt aes_x86_64 lrw gf128mul glue_helper uhci_hcd ablk_helper ehci_pci cryptd ehci_hcd agpgart scsi_mod virtio_pci psmouse usbcore virtio_ring virtio usb_common floppy button
Oct 8 11:04:03 ubuntu kernel: [ 31.788400] CPU: 0 PID: 1807 Comm: nbd-client Not tainted 4.6.0-040600-generic #201606100558
Oct 8 11:04:03 ubuntu kernel: [ 31.788402] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Oct 8 11:04:03 ubuntu kernel: [ 31.788404] 0000000000000086 0000000047e82912 ffff88039807faa8 ffffffff8135f483
Oct 8 11:04:03 ubuntu kernel: [ 31.788409] 0000000000000000 0000000000000000 ffff88039807fae8 ffffffff810814db
Oct 8 11:04:03 ubuntu kernel: [ 31.788412] 0000009647e82912 0000000000000200 ffff8803987d8840 ffff880393e74070
Oct 8 11:04:03 ubuntu kernel: [ 31.788416] Call Trace:
Oct 8 11:04:03 ubuntu kernel: [ 31.788445] [<ffffffff8135f483>] dump_stack+0x63/0x90
Oct 8 11:04:03 ubuntu kernel: [ 31.788448] [<ffffffff810814db>] __warn+0xcb/0xf0
Oct 8 11:04:03 ubuntu kernel: [ 31.788451] [<ffffffff8108160d>] warn_slowpath_null+0x1d/0x20
Oct 8 11:04:03 ubuntu kernel: [ 31.788455] [<ffffffff81086bfb>] __local_bh_enable_ip+0x6b/0x80
Oct 8 11:04:03 ubuntu kernel: [ 31.788461] [<ffffffff81579217>] lock_sock_nested+0x57/0x70
Oct 8 11:04:03 ubuntu kernel: [ 31.788471] [<ffffffff8160e50b>] inet_shutdown+0x3b/0x110
Oct 8 11:04:03 ubuntu kernel: [ 31.788474] [<ffffffff815738a0>] kernel_sock_shutdown+0x10/0x20
Oct 8 11:04:03 ubuntu kernel: [ 31.788481] [<ffffffffc064ae6a>] sock_shutdown+0x4a/0xa0 [nbd]
Oct 8 11:04:03 ubuntu kernel: [ 31.788486] [<ffffffffc064b4d5>] __nbd_ioctl+0x615/0xb70 [nbd]
Oct 8 11:04:03 ubuntu kernel: [ 31.788492...
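
The trace above shows the softirq warning being reached from the nbd driver's `sock_shutdown`. As a rough aid for spotting the same signature in other logs, here is a minimal sketch that extracts call-trace frames and checks for that path. The regular expression and the helper names `parse_frames` and `is_nbd_shutdown_warning` are illustrative assumptions, and the sample holds just two frames copied from the trace.

```python
import re

# Matches call-trace frames such as:
#   [<ffffffffc064ae6a>] sock_shutdown+0x4a/0xa0 [nbd]
# The trailing "[module]" part is optional (absent for built-in code).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)

# Two frames taken from the trace in this comment.
SAMPLE = """\
[ 31.788455] [<ffffffff81086bfb>] __local_bh_enable_ip+0x6b/0x80
[ 31.788481] [<ffffffffc064ae6a>] sock_shutdown+0x4a/0xa0 [nbd]
"""


def parse_frames(log_text):
    """Return (function, module_or_None) tuples in the order they appear."""
    return [(m.group("func"), m.group("module")) for m in FRAME_RE.finditer(log_text)]


def is_nbd_shutdown_warning(log_text):
    """True if the trace contains __local_bh_enable_ip and nbd's sock_shutdown."""
    frames = parse_frames(log_text)
    funcs = [func for func, _ in frames]
    return "__local_bh_enable_ip" in funcs and ("sock_shutdown", "nbd") in frames


if __name__ == "__main__":
    print(parse_frames(SAMPLE))
    print(is_nbd_shutdown_warning(SAMPLE))
```

The check only looks for the two frames being present in one trace; it does not verify their relative order or that they belong to the same warning.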


description: updated
Po-Hsu Lin (cypressyew) on 2018-10-19
tags: added: i386
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Colin Ian King (colin-king) wrote :

Tested against -proposed kernel 4.4.0-139-generic; the nbd tests no longer fail. Marking as verified.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-139.165

---------------
linux (4.4.0-139.165) xenial; urgency=medium

  * linux: 4.4.0-139.165 -proposed tracker (LP: #1799401)

  * Kernel panic after the ubuntu_nbd_smoke_test on Xenial kernel (LP: #1793464)
    - nbd: Remove signal usage
    - nbd: Timeouts are not user requested disconnects
    - nbd: Cleanup reset of nbd and bdev after a disconnect
    - nbd: don't shutdown sock with irq's disabled
    - nbd: fix race in ioctl

  * fscache: bad refcounting in fscache_op_complete leads to OOPS (LP: #1797314)
    - SAUCE: fscache: Fix race in decrementing refcount of op->npages

  * xenial: virtio-scsi: CPU soft lockup due to loop in
    virtscsi_target_destroy() (LP: #1798110)
    - SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI command
      requeue

  * Error reported when creating ZFS pool with "-t" option, despite successful
    pool creation (LP: #1769937)
    - SAUCE: (noup) Update zfs to 0.6.5.6-0ubuntu26

  * Xenial update: 4.4.160 upstream stable release (LP: #1798770)
    - crypto: skcipher - Fix -Wstringop-truncation warnings
    - tsl2550: fix lux1_input error in low light
    - vmci: type promotion bug in qp_host_get_user_memory()
    - x86/numa_emulation: Fix emulated-to-physical node mapping
    - staging: rts5208: fix missing error check on call to rtsx_write_register
    - uwb: hwa-rc: fix memory leak at probe
    - power: vexpress: fix corruption in notifier registration
    - Bluetooth: Add a new Realtek 8723DE ID 0bda:b009
    - USB: serial: kobil_sct: fix modem-status error handling
    - 6lowpan: iphc: reset mac_header after decompress to fix panic
    - md-cluster: clear another node's suspend_area after the copy is finished
    - media: exynos4-is: Prevent NULL pointer dereference in __isp_video_try_fmt()
    - powerpc/kdump: Handle crashkernel memory reservation failure
    - media: fsl-viu: fix error handling in viu_of_probe()
    - x86/tsc: Add missing header to tsc_msr.c
    - x86/entry/64: Add two more instruction suffixes
    - scsi: target/iscsi: Make iscsit_ta_authentication() respect the output
      buffer size
    - scsi: klist: Make it safe to use klists in atomic context
    - scsi: ibmvscsi: Improve strings handling
    - usb: wusbcore: security: cast sizeof to int for comparison
    - powerpc/powernv/ioda2: Reduce upper limit for DMA window size
    - alarmtimer: Prevent overflow for relative nanosleep
    - s390/extmem: fix gcc 8 stringop-overflow warning
    - ALSA: snd-aoa: add of_node_put() in error path
    - media: s3c-camif: ignore -ENOIOCTLCMD from v4l2_subdev_call for s_power
    - media: soc_camera: ov772x: correct setting of banding filter
    - media: omap3isp: zero-initialize the isp cam_xclk{a,b} initial data
    - staging: android: ashmem: Fix mmap size validation
    - drivers/tty: add error handling for pcmcia_loop_config
    - media: tm6000: add error handling for dvb_register_adapter
    - ALSA: hda: Add AZX_DCAPS_PM_RUNTIME for AMD Raven Ridge
    - ath10k: protect ath10k_htt_rx_ring_free with rx_ring.lock
    - rndis_wlan: potential buffer overflow in rndis_wlan_auth_indication()
    - wlcore: Add missing PM call fo...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Po-Hsu Lin (cypressyew) on 2018-11-20
Changed in ubuntu-kernel-tests:
status: New → Fix Released