KVM system crashes after starting guest

Bug #1596635 reported by bugproxy on 2016-06-27
Affects          Importance   Assigned to
linux (Ubuntu)   High         Canonical Kernel Team
Xenial           Undecided    Tim Gardner

Bug Description

== Comment: #0 - Chanh H. Nguyen - 2016-06-25 00:24:28 ==
We have Ubuntu 16.04.1 installed on our SuperMicro system, along with some of the virtualization packages. Defining a guest with PCI passthrough works fine, but the system then crashes at xhci_irq+0x1bc/0xf50 after we start the guest.

7c:mon> e
cpu 0x7c: Vector: 300 (Data Access) at [c000001e1b80f760]
    pc: c00000000088217c: xhci_irq+0x1bc/0xf50
    lr: c000000000882050: xhci_irq+0x90/0xf50
    sp: c000001e1b80f9e0
   msr: 9000000102009033
   dar: 28
 dsisr: 40000000
  current = 0xc000001e1bc82a20
  paca = 0xc000000007b89a00 softe: 0 irq_happened: 0x01
    pid = 4026, comm = libvirtd
7c:mon> t
[c000001e1b80fb00] c00000000080ebb0 usb_hcd_irq+0x50/0xa0
[c000001e1b80fb30] c00000000082af58 usb_hcd_pci_remove+0x68/0x1c0
[c000001e1b80fb70] c00000000088a118 xhci_pci_remove+0x78/0xb0
[c000001e1b80fba0] c0000000005e54b0 pci_device_remove+0x70/0x110
[c000001e1b80fbe0] c0000000006d1550 __device_release_driver+0xc0/0x190
[c000001e1b80fc10] c0000000006d1660 device_release_driver+0x40/0x70
[c000001e1b80fc40] c0000000006cf860 unbind_store+0x170/0x1b0
[c000001e1b80fc80] c0000000006ce1d4 drv_attr_store+0x64/0xa0
[c000001e1b80fcc0] c0000000003978d0 sysfs_kf_write+0x80/0xb0
[c000001e1b80fd00] c0000000003967e8 kernfs_fop_write+0x188/0x200
[c000001e1b80fd50] c0000000002e126c __vfs_write+0x6c/0xe0
[c000001e1b80fd90] c0000000002e1fa0 vfs_write+0xc0/0x230
[c000001e1b80fde0] c0000000002e2fdc SyS_write+0x6c/0x110
[c000001e1b80fe30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff7f6e6708
SP (3fff7abfd520) is in userspace
7c:mon> r
R00 = c000000000882050 R16 = 00003fff7a400000
R01 = c000001e1b80f9e0 R17 = c000000000df4200
R02 = c0000000015b4200 R18 = c000000000b84200
R03 = d000080081560024 R19 = c000000000de4200
R04 = c000000004880000 R20 = 0000000000000001
R05 = c000000004884000 R21 = 00003fff5400565d
R06 = c000000004884000 R22 = 00003fff5875aa80
R07 = 000000000000003e R23 = 00003fff7fa914e0
R08 = 0000000000000000 R24 = 00003fff7fa90b90
R09 = 0000000000000006 R25 = c000000000df4200
R10 = 0000000000000000 R26 = c000001e1b80fe00
R11 = 0000000000000006 R27 = c000001e3a2d1698
R12 = c000000000881fc0 R28 = c000000001550f98
R13 = c000000007b89a00 R29 = c000000004880260
R14 = 0000000000000000 R30 = c0000000048802ac
R15 = 0000000000000000 R31 = c000000004880000
pc = c00000000088217c xhci_irq+0x1bc/0xf50
cfar= c000000000008468 slb_miss_realmode+0x50/0x78
lr = c000000000882050 xhci_irq+0x90/0xf50
msr = 9000000102009033 cr = 28028882
ctr = c000000000881fc0 xer = 0000000000000000 trap = 300
dar = 0000000000000028 dsisr = 40000000
7c:mon> d c000000000b000f0
c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|
c000000000b00100 342e302d32342d67 656e657269632028 |4.0-24-generic (|
c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|
c000000000b00120 3634656c2d303233 2920286763632076 |64el-023) (gcc v|
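
For readers unfamiliar with xmon: the `d` output above is a raw memory dump of linux_banner, with an address, two 64-bit words in hex, and an ASCII rendering on each line. A minimal Python sketch of recovering the running kernel release from such a dump follows. The function names are illustrative and not part of any existing tool.

```python
import re

# The four dump lines from the xmon session above.
DUMP = """\
c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|
c000000000b00100 342e302d32342d67 656e657269632028 |4.0-24-generic (|
c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|
c000000000b00120 3634656c2d303233 2920286763632076 |64el-023) (gcc v|
"""


def decode_xmon_dump(text):
    """Concatenate the two hex words on each dump line and decode them as ASCII."""
    raw = b""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a dump line
        raw += bytes.fromhex(fields[1]) + bytes.fromhex(fields[2])
    return raw.decode("ascii")


def kernel_release(banner):
    """Pull the release string out of a (possibly truncated) Linux banner."""
    match = re.search(r"Linux version (\S+)", banner)
    return match.group(1) if match else None


banner = decode_xmon_dump(DUMP)
print(kernel_release(banner))  # 4.4.0-24-generic
```

So the crashing host was running the 4.4.0-24-generic kernel.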

== Comment: #9 - Gabriel Krisman Bertazi - 2016-06-27 08:43:33 ==

(In reply to comment #0)

Hi,

From a quick look, it seems you are missing this commit:

commit 27a41a83ec54d0edfcaf079310244e7f013a7701
Author: Gabriel Krisman Bertazi <email address hidden>
Date: Wed Jun 1 18:09:07 2016 +0300

    xhci: Cleanup only when releasing primary hcd

==

Canonical,

Please backport this commit to 16.04.1.
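
To make the failure mode easier to picture, here is a toy Python model. It is only one reading of the backtrace (xhci_irq being reached from usb_hcd_pci_remove during an unbind) combined with the commit title, and it assumes the controller exposes a primary and a shared HCD, as the two USB buses registered for one PCI device would suggest. The class and function names are hypothetical, and none of this is the kernel code.

```python
class ToyXhci:
    """Toy stand-in for a controller whose two HCDs share one event ring."""

    def __init__(self, cleanup_only_on_primary):
        self.event_ring = {"dequeue": 0}  # shared state released at teardown
        self.cleanup_only_on_primary = cleanup_only_on_primary

    def stop(self, is_primary):
        # With the guard enabled, stopping the shared HCD leaves the ring alone.
        if self.cleanup_only_on_primary and not is_primary:
            return
        self.event_ring = None

    def irq(self):
        # Stand-in for an interrupt handler that dereferences the event ring.
        if self.event_ring is None:
            raise RuntimeError("event ring used after cleanup")
        return self.event_ring["dequeue"]


def remove(ctrl):
    """Assumed teardown order: shared HCD, one last IRQ pass, then primary HCD."""
    ctrl.stop(is_primary=False)
    ctrl.irq()
    ctrl.stop(is_primary=True)


def survives_remove(cleanup_only_on_primary):
    """Return True if the toy teardown completes without touching a freed ring."""
    try:
        remove(ToyXhci(cleanup_only_on_primary))
        return True
    except RuntimeError:
        return False


print(survives_remove(False))  # False: ring freed on the first stop()
print(survives_remove(True))   # True: cleanup deferred to the primary HCD
```

In this model, deferring the cleanup until the primary HCD is released is what keeps the last interrupt pass from touching freed state.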

bugproxy (bugproxy) wrote : full log

tags: added: architecture-ppc64le bugnameltc-143075 severity-critical targetmilestone-inin16041

bugproxy (bugproxy) wrote : lspci -vv

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

------- Comment From <email address hidden> 2016-06-29 14:00 EDT-------
(In reply to comment #28)

Hello Canonical,

I applied the -proposed kernel and we still hit this issue. The system is in xmon now.
0:mon> ls linux_banner
linux_banner: c000000000b000f0
0:mon> d c000000000b000f0
c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|
c000000000b00100 342e302d32382d67 656e657269632028 |4.0-28-generic (|
c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|
c000000000b00120 3634656c2d303138 2920286763632076 |64el-018) (gcc v|
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000000fe9eaf760]
pc: c000000000882bfc: xhci_irq+0x1bc/0xf50
lr: c000000000882ad0: xhci_irq+0x90/0xf50
sp: c000000fe9eaf9e0
msr: 9000000102009033
dar: 28
dsisr: 40000000
current = 0xc000000fe5c2b710
paca = 0xc000000007b40000 softe: 0 irq_happened: 0x01
pid = 3945, comm = libvirtd
0:mon>
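
The exception summary here can be checked against the one from the original report to confirm that it is the same failure. A rough Python sketch of that comparison is below. The helper name is made up for illustration, and the two inputs are abbreviated copies of the `e` output from each session.

```python
import re

# Abbreviated `e` output from the original report (4.4.0-24) and from this comment (4.4.0-28).
FIRST = """\
pc: c00000000088217c: xhci_irq+0x1bc/0xf50
lr: c000000000882050: xhci_irq+0x90/0xf50
dar: 28
"""
SECOND = """\
pc: c000000000882bfc: xhci_irq+0x1bc/0xf50
lr: c000000000882ad0: xhci_irq+0x90/0xf50
dar: 28
"""


def crash_signature(xmon_text):
    """Return (symbol, offset, faulting address) from xmon `e` output, or None."""
    pc = re.search(r"pc:\s*[0-9a-f]+:\s*(\w+)\+(0x[0-9a-f]+)/", xmon_text)
    dar = re.search(r"dar:\s*([0-9a-f]+)", xmon_text)
    if not pc or not dar:
        return None
    return pc.group(1), int(pc.group(2), 16), int(dar.group(1), 16)


print(crash_signature(FIRST) == crash_signature(SECOND))  # True
```

The absolute pc differs between the two kernels, but the symbol, the offset into it (0x1bc), and the faulting address (0x28) match. An address that small is usually a sign of a field access through a NULL pointer, although the dump alone does not prove that.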

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 15:39 EDT-------
Hello Canonical,
Sorry, I used apt-get dist-upgrade and it installed the -28 kernel.
When I used the "aptitude" command instead, the system was upgraded to the -29 kernel.
With the -29 kernel, I am able to start my guest that has the PCI passthrough.

root@micro:~# uname -r
4.4.0-29-generic

root@micro:~# virsh list
 Id    Name                           State
----------------------------------------------------
 6     microg4                        running

I also see this "xhci_hcd" error. Should I be worried about that init failure?
root@micro:~# dmesg |grep "xhci_hcd"
[ 1.884017] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 1.884079] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
[ 1.884166] xhci_hcd 0001:09:00.0: Using 64-bit DMA iommu bypass
[ 1.884229] xhci_hcd 0001:09:00.0: hcc params 0x0270f06d hci version 0x96 quirks 0x00000000
[ 1.884936] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 1.884941] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 2
[ 2.193049] usb 1-3: new high-speed USB device number 2 using xhci_hcd
[ 2.433162] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[ 2.561045] usb 1-4: new high-speed USB device number 3 using xhci_hcd
[ 2.801107] usb 2-4: new SuperSpeed USB device number 3 using xhci_hcd
[ 2.913045] usb 1-3.1: new low-speed USB device number 4 using xhci_hcd
[ 68.765623] xhci_hcd 0001:09:00.0: remove, state 1
[ 68.865172] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
[ 68.865175] xhci_hcd 0001:09:00.0: Host controller not halted, aborting reset.
[ 68.865244] xhci_hcd 0001:09:00.0: USB bus 2 deregistered
[ 68.865299] xhci_hcd 0001:09:00.0: remove, state 1
[ 69.329779] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
[ 70.233109] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 70.233116] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
[ 70.264505] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
[ 70.264507] xhci_hcd 0001:09:00.0: can't setup: -110
[ 70.264586] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
[ 70.264597] xhci_hcd 0001:09:00.0: init 0001:09:00.0 fail, -110
[ 70.264652] xhci_hcd: probe of 0001:09:00.0 failed with error -110
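
As a reading aid for the log above: the -110 at the end is a negated Linux errno, and 110 is ETIMEDOUT, which fits the "Host not halted after 16000 microseconds" lines. A small Python sketch for picking such codes out of dmesg output follows. The lookup table is hand-copied and deliberately partial, and the function name is illustrative.

```python
import re

# Partial, hand-copied table of Linux errno values; extend as needed.
LINUX_ERRNO = {110: "ETIMEDOUT"}

PROBE_FAILURE = re.compile(r"probe of (\S+) failed with error -(\d+)")


def explain_probe_failures(dmesg_lines):
    """Return (device, errno, name) for every driver probe failure in the log."""
    results = []
    for line in dmesg_lines:
        match = PROBE_FAILURE.search(line)
        if match:
            code = int(match.group(2))
            results.append((match.group(1), code, LINUX_ERRNO.get(code, "UNKNOWN")))
    return results


log = [
    "[ 70.264507] xhci_hcd 0001:09:00.0: can't setup: -110",
    "[ 70.264652] xhci_hcd: probe of 0001:09:00.0 failed with error -110",
]
print(explain_probe_failures(log))  # [('0001:09:00.0', 110, 'ETIMEDOUT')]
```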

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 16:57 EDT-------
(In reply to comment #32)

This log was taken from the host after the guest was destroyed, right? That's a different issue, which also reproduces upstream. I think it has something to do with an erratum for this hardware.

Does the controller probe successfully from inside the guest?

We should have a new bug opened to track it.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 17:47 EDT-------
> Does the controller probe successfully from inside the guest?
It probes successfully inside the guest.

bugproxy (bugproxy) on 2016-07-01
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-30.49

---------------
linux (4.4.0-30.49) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597897

  * FCP devices are not detected correctly nor deterministically (LP: #1567602)
    - scsi_dh_alua: Disable ALUA handling for non-disk devices
    - scsi_dh_alua: Use vpd_pg83 information
    - scsi_dh_alua: improved logging
    - scsi_dh_alua: sanitze sense code handling
    - scsi_dh_alua: use standard logging functions
    - scsi_dh_alua: return standard SCSI return codes in submit_rtpg
    - scsi_dh_alua: fixup description of stpg_endio()
    - scsi_dh_alua: use flag for RTPG extended header
    - scsi_dh_alua: use unaligned access macros
    - scsi_dh_alua: rework alua_check_tpgs() to return the tpgs mode
    - scsi_dh_alua: simplify sense code handling
    - scsi: Add scsi_vpd_lun_id()
    - scsi: Add scsi_vpd_tpg_id()
    - scsi_dh_alua: use scsi_vpd_tpg_id()
    - scsi_dh_alua: Remove stale variables
    - scsi_dh_alua: Pass buffer as function argument
    - scsi_dh_alua: separate out alua_stpg()
    - scsi_dh_alua: Make stpg synchronous
    - scsi_dh_alua: call alua_rtpg() if stpg fails
    - scsi_dh_alua: switch to scsi_execute_req_flags()
    - scsi_dh_alua: allocate RTPG buffer separately
    - scsi_dh_alua: Use separate alua_port_group structure
    - scsi_dh_alua: use unique device id
    - scsi_dh_alua: simplify alua_initialize()
    - revert commit a8e5a2d593cb ("[SCSI] scsi_dh_alua: ALUA handler attach should
      succeed while TPG is transitioning")
    - scsi_dh_alua: move optimize_stpg evaluation
    - scsi_dh_alua: remove 'rel_port' from alua_dh_data structure
    - scsi_dh_alua: Use workqueue for RTPG
    - scsi_dh_alua: Allow workqueue to run synchronously
    - scsi_dh_alua: Add new blacklist flag 'BLIST_SYNC_ALUA'
    - scsi_dh_alua: Recheck state on unit attention
    - scsi_dh_alua: update all port states
    - scsi_dh_alua: Send TEST UNIT READY to poll for transitioning
    - scsi_dh_alua: do not fail for unknown VPD identification

linux (4.4.0-29.48) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597015

  * Wireless hotkey fails on Dell XPS 15 9550 (LP: #1589886)
    - intel-hid: new hid event driver for hotkeys
    - intel-hid: fix incorrect entries in intel_hid_keymap
    - intel-hid: allocate correct amount of memory for private struct
    - intel-hid: add a workaround to ignore an event after waking up from S4.
    - [Config] CONFIG_INTEL_HID_EVENT=m

  * cgroupfs mounts can hang (LP: #1588056)
    - Revert "UBUNTU: SAUCE: (namespace) mqueue: Super blocks must be owned by the
      user ns which owns the ipc ns"
    - Revert "UBUNTU: SAUCE: kernfs: Do not match superblock in another user
      namespace when mounting"
    - Revert "UBUNTU: SAUCE: cgroup: Use a new super block when mounting in a
      cgroup namespace"
    - (namespace) bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
    - (namespace) bpf, inode: disallow userns mounts
    - (namespace) ipc: Initialize ipc_namespace->user_ns early.
    - (namespace) vfs: Pass data, ns, and ns->userns to mount_ns
    - SAUCE: (namespace) S...

Changed in linux (Ubuntu):
status: Triaged → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-12 10:35 EDT-------
(In reply to comment #37)
> This bug was fixed in the package linux - 4.4.0-30.49

Thanks!

Chanh, please give 4.4.0-30.49 one last try so that we can close this.
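
Given the earlier mix-up between the -28 and -29 kernels, it may help to check the running kernel against the package version before retesting. The sketch below assumes that the `uname -r` string (for example 4.4.0-29-generic) and the package version (for example 4.4.0-30.49) share the same leading "X.Y.Z-ABI" part; the helper names are hypothetical.

```python
import re


def base_and_abi(version):
    """Parse the leading 'X.Y.Z-ABI' part of a kernel release or package version."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


def running_at_least(running, package):
    """True if the running kernel is at or beyond the given package's ABI."""
    return base_and_abi(running) >= base_and_abi(package)


print(running_at_least("4.4.0-29-generic", "4.4.0-30.49"))  # False
print(running_at_least("4.4.0-30-generic", "4.4.0-30.49"))  # True
```

This only compares the base version and ABI number; it ignores the upload number after the dot.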

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-31.50

---------------
linux (4.4.0-31.50) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1602449

  * nouveau: boot hangs at blank screen with unsupported graphics cards
    (LP: #1602340)
    - SAUCE: drm: check for supported chipset before booting fbdev off the hw

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released