xhci hangs; reset results in NULL pointer dereference

Bug #1763594 reported by Bas Zoetekouw on 2018-04-13
This bug affects 14 people
Affects              Status        Importance  Assigned to  Milestone
HWE Next             Fix Released  Undecided   Unassigned
linux (Arch Linux)   New           Undecided   Unassigned
linux (Ubuntu)       Fix Released  Medium      Unassigned
  Bionic             Fix Released  Medium      Unassigned
linux-oem (Ubuntu)   Fix Released  Undecided   Unassigned
  Bionic             Fix Released  Undecided   Unassigned

Bug Description

===SRU Justification===
[Impact]
The xHC stops working after some time. This happens when the xHC is
constantly runtime-resumed and runtime-suspended.
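The runtime suspend/resume cycling described above can be watched from userspace. A minimal sketch, assuming the controller is the PCI device 0000:00:14.0 seen in the logs of this report and relying on the standard sysfs runtime-PM attributes `power/control` and `power/runtime_status`; the helper name `show_runtime_pm` is hypothetical, not part of any tool.

```shell
# show_runtime_pm DEV [ROOT]: print the runtime-PM policy and current state
# of a PCI device. DEV is a PCI address such as 0000:00:14.0; ROOT defaults
# to the real sysfs tree but can point elsewhere (e.g. for testing).
# Hypothetical helper.
show_runtime_pm() {
  dev=$1
  root=${2:-/sys/bus/pci/devices}
  pm="$root/$dev/power"
  # "auto" allows runtime suspend; "on" keeps the device powered.
  printf 'control=%s\n' "$(cat "$pm/control")"
  # Typically "active", "suspended", "suspending", "resuming" or "unsupported".
  printf 'runtime_status=%s\n' "$(cat "$pm/runtime_status")"
}
```

Running `show_runtime_pm 0000:00:14.0` repeatedly (for example under `watch`) should show `runtime_status` flipping between `active` and `suspended` if the controller is being runtime-suspended as the [Impact] section describes.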

[Test]
A user reports that this backport fixes the issue.

[Fix]
When resuming, do not rely on the Event Interrupt (EINT) bit alone to detect a pending port change; also check the ports' status.

[Regression Potential]
Low. It fixes a known bug and it's in -stable.

===Original Bug Report===

Now and then, my xhci bus will hang, resulting in these kinds of messages in dmesg:

[252220.002102] xhci_hcd 0000:00:14.0: xHC is not running.
[252220.037491] xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
[252220.037500] xhci_hcd 0000:00:14.0: HC died; cleaning up
[252220.133794] usb 1-2: USB disconnect, device number 2
[252220.135042] usb 1-7: USB disconnect, device number 3
[252220.137455] usb 1-8: USB disconnect, device number 4
[252220.243317] usb 1-9: USB disconnect, device number 5
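To spot this hang without reading through dmesg by hand, a minimal sketch that scans kernel log text for the "HC died" message quoted above and prints the PCI address of each controller reported dead. It assumes the message format shown in this report (both the dmesg and the syslog forms); the helper name `find_dead_xhci` is hypothetical.

```shell
# find_dead_xhci: read kernel log lines on stdin and print, once each, the
# PCI address of every xhci_hcd controller reported as "HC died".
# Matches lines such as:
#   [252220.037500] xhci_hcd 0000:00:14.0: HC died; cleaning up
# Hypothetical helper; depends on the message format quoted in this report.
find_dead_xhci() {
  sed -n 's/.*xhci_hcd \([0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-7]\): HC died.*/\1/p' |
    sort -u
}
```

Typical use would be `sudo dmesg | find_dead_xhci` or `journalctl -k | find_dead_xhci`.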

Usually, I can fix this by resetting the bus with a script called reset-xhci:

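#!/bin/sh
# Unbind and rebind every PCI device bound to an ?hci_hcd driver
# (xhci_hcd, ehci_hcd, ...). Must be run as root.
for xhci in /sys/bus/pci/drivers/?hci_hcd ; do
  [ -d "$xhci" ] || continue       # glob matched no loaded driver
  echo "Resetting devices from $xhci..."
  for dev in "$xhci"/????:??:??.? ; do
    [ -e "$dev" ] || continue      # glob matched no bound device
    i=${dev##*/}                   # PCI address, e.g. 0000:00:14.0
    printf '%s' "$i" > "$xhci/unbind"
    printf '%s' "$i" > "$xhci/bind"
  done
done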

But doing this today resulted in a kernel bug:

[252243.401814] xhci_hcd 0000:00:14.0: remove, state 4
[252243.401887] usb usb2: USB disconnect, device number 1
[252243.470365] xhci_hcd 0000:00:14.0: USB bus 2 deregistered
[252243.470378] xhci_hcd 0000:00:14.0: remove, state 4
[252243.470383] usb usb1: USB disconnect, device number 1
[252243.470831] xhci_hcd 0000:00:14.0: Host halt failed, -19
[252243.470837] xhci_hcd 0000:00:14.0: Host not accessible, reset failed.
[252243.475918] xhci_hcd 0000:00:14.0: USB bus 1 deregistered
[252243.475938] ------------[ cut here ]------------
[252243.475939] xhci_hcd 0000:00:14.0: disabling already-disabled device
[252243.475951] WARNING: CPU: 2 PID: 1787 at /build/linux-bdpCf2/linux-4.15.0/drivers/pci/pci.c:1642 pci_disable_device+0x9c/0xc0
[252243.475951] Modules linked in: cpuid snd_seq_dummy usb_storage hid_generic hidp ip6t_REJECT nf_reject_ipv6 ip6table_nat nf_nat_ipv6 ip6table_mangle xt_hashlimit ip6table_raw nf_conntrack_ipv6 nf_defrag_ipv6 nf_log_ipv6 xt_recent xt_comment ipt_REJECT nf_reject_ipv4 xt_mark iptable_mangle xt_tcpudp xt_CT iptable_raw xt_multiport xt_NFLOG nfnetlink_log nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda ipt_MASQUERADE nf_nat_masquerade_ipv4
[252243.475984] xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter aufs vhost_net vhost tap ccm rfcomm bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter msr cmac bnep binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 arc4 snd_soc_skl snd_hda_codec_realtek snd_soc_skl_ipc snd_hda_ext_core snd_hda_codec_generic snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel dell_laptop dell_smbios_smm dcdbas snd_hda_codec dell_smm_hwmon snd_hda_core snd_hwdep snd_pcm intel_rapl snd_seq_midi snd_seq_midi_event x86_pkg_temp_thermal intel_powerclamp coretemp snd_rawmidi kvm_intel kvm btusb irqbypass intel_cstate intel_rapl_perf snd_seq btrtl
[252243.476023] iwlmvm btbcm btintel mac80211 hid_multitouch uvcvideo joydev input_leds dell_smbios_wmi snd_seq_device dell_wmi bluetooth serio_raw snd_timer videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 dell_smbios videobuf2_core iwlwifi sparse_keymap ecdh_generic snd wmi_bmof dell_wmi_descriptor videodev cfg80211 media soundcore rtsx_pci_ms memstick shpchp mei_me mei processor_thermal_device intel_pch_thermal intel_soc_dts_iosf int3400_thermal acpi_thermal_rel dell_rbtn mac_hid acpi_pad int3403_thermal int340x_thermal_zone tpm_crb sch_fq_codel cuse parport_pc ppdev nfsd lp parport auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs zstd_compress algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
[252243.476067] raid0 multipath linear dm_mirror dm_region_hash dm_log usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc i915 rtsx_pci_sdmmc i2c_algo_bit drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt fb_sys_fops ptp aesni_intel psmouse drm pps_core rtsx_pci aes_x86_64 ahci crypto_simd glue_helper libahci wmi cryptd video
[252243.476089] CPU: 2 PID: 1787 Comm: reset-xhci Tainted: G U W 4.15.0-13-generic #14-Ubuntu
[252243.476090] Hardware name: Dell Inc. Latitude E7470/0T6HHJ, BIOS 1.18.5 12/11/2017
[252243.476092] RIP: 0010:pci_disable_device+0x9c/0xc0
[252243.476092] RSP: 0018:ffffa61206edfd40 EFLAGS: 00010286
[252243.476094] RAX: 0000000000000000 RBX: ffff9356fcc25000 RCX: ffffffffa9862888
[252243.476095] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[252243.476096] RBP: ffffa61206edfd50 R08: 0000000000000038 R09: 000000000000c694
[252243.476097] R10: ffffa61206edfcf0 R11: 0000000000000000 R12: ffff9356fced8700
[252243.476098] R13: ffffffffa99d52c0 R14: ffffffffa99d5330 R15: 0000000000000060
[252243.476100] FS: 00007f13a7aea740(0000) GS:ffff93570fd00000(0000) knlGS:0000000000000000
[252243.476102] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[252243.476102] CR2: 000055f5dcf9cef0 CR3: 000000011db38006 CR4: 00000000003626e0
[252243.476103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[252243.476104] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[252243.476105] Call Trace:
[252243.476110] usb_hcd_pci_remove+0xcf/0x130
[252243.476112] xhci_pci_remove+0x6b/0x70
[252243.476116] pci_device_remove+0x3e/0xb0
[252243.476124] device_release_driver_internal+0x15b/0x220
[252243.476126] device_release_driver+0x12/0x20
[252243.476127] unbind_store+0x87/0x150
[252243.476130] drv_attr_store+0x27/0x40
[252243.476132] sysfs_kf_write+0x3c/0x50
[252243.476135] kernfs_fop_write+0x125/0x1a0
[252243.476138] __vfs_write+0x1b/0x40
[252243.476140] vfs_write+0xb1/0x1a0
[252243.476142] SyS_write+0x55/0xc0
[252243.476145] do_syscall_64+0x73/0x130
[252243.476148] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[252243.476150] RIP: 0033:0x7f13a71f0154
[252243.476151] RSP: 002b:00007fff8cf40498 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[252243.476153] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f13a71f0154
[252243.476154] RDX: 000000000000000c RSI: 000055ef072cc230 RDI: 0000000000000001
[252243.476155] RBP: 000055ef072cc230 R08: 00007f13a74cd8c0 R09: 00007f13a7aea740
[252243.476156] R10: 00000000fffffff4 R11: 0000000000000246 R12: 00007f13a74cc760
[252243.476157] R13: 000000000000000c R14: 00007f13a74c82a0 R15: 00007f13a74c7760
[252243.476158] Code: 00 c6 05 5a 6f 12 01 01 4d 85 e4 74 36 48 8d bb a0 00 00 00 e8 26 55 15 00 4c 89 e2 48 89 c6 48 c7 c7 28 90 51 a9 e8 e4 11 ba ff <0f> 0b eb 82 48 89 df e8 d8 fe ff ff 80 a3 c1 07 00 00 f7 5b 41
[252243.476192] ---[ end trace abf3a4d94dd3a5a8 ]---
[252243.513857] BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
[252243.513866] IP: check_root_hub_suspended+0x10/0x60
[252243.513868] PGD 0 P4D 0
[252243.513872] Oops: 0000 [#1] SMP PTI
[252243.513876] Modules linked in: cpuid snd_seq_dummy usb_storage hid_generic hidp ip6t_REJECT nf_reject_ipv6 ip6table_nat nf_nat_ipv6 ip6table_mangle xt_hashlimit ip6table_raw nf_conntrack_ipv6 nf_defrag_ipv6 nf_log_ipv6 xt_recent xt_comment ipt_REJECT nf_reject_ipv4 xt_mark iptable_mangle xt_tcpudp xt_CT iptable_raw xt_multiport xt_NFLOG nfnetlink_log nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda ipt_MASQUERADE nf_nat_masquerade_ipv4
[252243.513913] xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter aufs vhost_net vhost tap ccm rfcomm bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter msr cmac bnep binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 arc4 snd_soc_skl snd_hda_codec_realtek snd_soc_skl_ipc snd_hda_ext_core snd_hda_codec_generic snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel dell_laptop dell_smbios_smm dcdbas snd_hda_codec dell_smm_hwmon snd_hda_core snd_hwdep snd_pcm intel_rapl snd_seq_midi snd_seq_midi_event x86_pkg_temp_thermal intel_powerclamp coretemp snd_rawmidi kvm_intel kvm btusb irqbypass intel_cstate intel_rapl_perf snd_seq btrtl
[252243.513954] iwlmvm btbcm btintel mac80211 hid_multitouch uvcvideo joydev input_leds dell_smbios_wmi snd_seq_device dell_wmi bluetooth serio_raw snd_timer videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 dell_smbios videobuf2_core iwlwifi sparse_keymap ecdh_generic snd wmi_bmof dell_wmi_descriptor videodev cfg80211 media soundcore rtsx_pci_ms memstick shpchp mei_me mei processor_thermal_device intel_pch_thermal intel_soc_dts_iosf int3400_thermal acpi_thermal_rel dell_rbtn mac_hid acpi_pad int3403_thermal int340x_thermal_zone tpm_crb sch_fq_codel cuse parport_pc ppdev nfsd lp parport auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs zstd_compress algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
[252243.513989] raid0 multipath linear dm_mirror dm_region_hash dm_log usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc i915 rtsx_pci_sdmmc i2c_algo_bit drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt fb_sys_fops ptp aesni_intel psmouse drm pps_core rtsx_pci aes_x86_64 ahci crypto_simd glue_helper libahci wmi cryptd video
[252243.514007] CPU: 2 PID: 31613 Comm: kworker/2:1 Tainted: G U W 4.15.0-13-generic #14-Ubuntu
[252243.514008] Hardware name: Dell Inc. Latitude E7470/0T6HHJ, BIOS 1.18.5 12/11/2017
[252243.514012] Workqueue: pm pm_runtime_work
[252243.514014] RIP: 0010:check_root_hub_suspended+0x10/0x60
[252243.514016] RSP: 0018:ffffa61207057cb0 EFLAGS: 00010286
[252243.514017] RAX: 0000000000000000 RBX: ffff9356fcc250a0 RCX: 0000000000000000
[252243.514019] RDX: ffffffffa99d52c0 RSI: 0000000000000001 RDI: ffff9356fcc250a0
[252243.514020] RBP: ffffa61207057cb0 R08: 0000000000000000 R09: ffffa61207057db8
[252243.514021] R10: 0000000000000000 R11: 0000000000000274 R12: 0000000000000001
[252243.514022] R13: ffffffffa92ec040 R14: 0000000000000000 R15: ffffffffa88ec000
[252243.514024] FS: 0000000000000000(0000) GS:ffff93570fd00000(0000) knlGS:0000000000000000
[252243.514025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[252243.514026] CR2: 0000000000000128 CR3: 000000009ac0a003 CR4: 00000000003626e0
[252243.514028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[252243.514029] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[252243.514030] Call Trace:
[252243.514033] suspend_common+0x22/0x160
[252243.514035] hcd_pci_runtime_suspend+0x1b/0x50
[252243.514038] pci_pm_runtime_suspend+0x64/0x180
[252243.514040] ? pci_pm_runtime_resume+0xa0/0xa0
[252243.514042] __rpm_callback+0xca/0x210
[252243.514045] ? __switch_to_asm+0x34/0x70
[252243.514047] ? __switch_to_asm+0x40/0x70
[252243.514050] rpm_callback+0x24/0x80
[252243.514051] ? pci_pm_runtime_resume+0xa0/0xa0
[252243.514053] rpm_suspend+0x137/0x640
[252243.514056] rpm_idle+0x58/0x2a0
[252243.514058] pm_runtime_work+0x92/0xa0
[252243.514061] process_one_work+0x1de/0x410
[252243.514062] worker_thread+0x32/0x410
[252243.514065] kthread+0x121/0x140
[252243.514067] ? process_one_work+0x410/0x410
[252243.514069] ? kthread_create_worker_on_cpu+0x70/0x70
[252243.514072] ? do_syscall_64+0x73/0x130
[252243.514074] ? SyS_exit_group+0x14/0x20
[252243.514076] ret_from_fork+0x35/0x40
[252243.514077] Code: 48 8d b2 a0 00 00 00 48 81 c7 a0 00 00 00 48 89 e5 e8 65 a0 f1 ff 5d c3 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 <f6> 80 28 01 00 00 20 75 2c 48 8b 90 f8 01 00 00 31 c0 48 85 d2
[252243.514103] RIP: check_root_hub_suspended+0x10/0x60 RSP: ffffa61207057cb0
[252243.514104] CR2: 0000000000000128
[252243.514106] ---[ end trace abf3a4d94dd3a5a9 ]---
[252243.533589] xhci_hcd 0000:00:14.0: xHCI Host Controller
[252243.533600] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 1
[252243.534713] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x00109810
[252243.534721] xhci_hcd 0000:00:14.0: cache line size of 64 is not supported
[252243.534892] usb usb1: runtime PM trying to activate child device usb1 but parent (0000:00:14.0) is not active

This is a plain linux-image-extra-4.15.0-13-generic kernel on Ubuntu 18.04 running on a Dell Latitude E7470.

For completeness' sake:

╰─▶ lsb_release -rd
Description: Ubuntu Bionic Beaver (development branch)
Release: 18.04

╰─▶ uname -a
Linux regan 4.15.0-13-generic #14-Ubuntu SMP Sat Mar 17 13:44:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

╰─▶ dpkg -l linux-image-\*|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                Version      Architecture Description
+++-===================================-============-============-===============================================================
ii  linux-image-4.14.0-16-generic       4.14.0-16.19 amd64        Linux kernel image for version 4.14.0 on 64 bit x86 SMP
ii  linux-image-4.15.0-12-generic       4.15.0-12.13 amd64        Linux kernel image for version 4.15.0 on 64 bit x86 SMP
ii  linux-image-4.15.0-13-generic       4.15.0-13.14 amd64        Linux kernel image for version 4.15.0 on 64 bit x86 SMP
ii  linux-image-extra-4.14.0-16-generic 4.14.0-16.19 amd64        Linux kernel extra modules for version 4.14.0 on 64 bit x86 SMP
ii  linux-image-extra-4.15.0-12-generic 4.15.0-12.13 amd64        Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
ii  linux-image-extra-4.15.0-13-generic 4.15.0-13.14 amd64        Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
ii  linux-image-generic                 4.15.0.13.14 amd64        Generic Linux kernel image

╰─▶ cat /proc/version_signature
Ubuntu 4.15.0-13.14-generic 4.15.10

╰─▶ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/regan-root ro nosplash acpi_backlight=vendor intel_iommu=off
---
ApportVersion: 2.20.9-0ubuntu4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: bas 2245 F.... pulseaudio
CurrentDesktop: GNOME
DistroRelease: Ubuntu 18.04
HibernationDevice:
 RESUME=UUID=16195e13-9fb3-41b2-9671-fb4e1df1ff93
 #RESUME=/dev/dm-2
 #RESUME=/dev/mapper/regan-swap
InstallationDate: Installed on 2016-12-22 (476 days ago)
InstallationMedia: Ubuntu 16.10 "Yakkety Yak" - Release amd64 (20161012.2)
MachineType: Dell Inc. Latitude E7470
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/hostname-root ro nosplash acpi_backlight=vendor intel_iommu=off
ProcVersionSignature: Ubuntu 4.15.0-13.14-generic 4.15.10
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-13-generic N/A
 linux-backports-modules-4.15.0-13-generic N/A
 linux-firmware 1.173
Tags: bionic apport-hook-error
Uname: Linux 4.15.0-13-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: Upgraded to bionic on 2017-09-18 (206 days ago)
UserGroups: adm cdrom dialout dip docker libvirt lp lpadmin lxd plugdev sambashare scanner src sudo tss wireshark
_MarkForUpload: False
dmi.bios.date: 12/11/2017
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.18.5
dmi.board.name: 0T6HHJ
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 9
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.18.5:bd12/11/2017:svnDellInc.:pnLatitudeE7470:pvr:rvnDellInc.:rn0T6HHJ:rvrA00:cvnDellInc.:ct9:cvr:
dmi.product.family: Latitude
dmi.product.name: Latitude E7470
dmi.sys.vendor: Dell Inc.

Bas Zoetekouw (baszoetekouw) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1763594

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic

apport information

tags: added: apport-collected apport-hook-error
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.17-rc kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc1/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key needs-bisect
Changed in linux (Ubuntu Bionic):
status: Confirmed → Incomplete
bp0 (bullet-proof-0) wrote :

This is above my competence level in Linux, but I can confirm this problem in:

$ uname -a
Linux 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:58:14 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

bp0 (bullet-proof-0) wrote :

So I installed the latest upstream kernel (4.17-rc1) and the bug described above still occurs. It's frustrating because it means my Bluetooth mouse stops working after about 5 minutes, and I can't get it to work again without rebooting. The log is copied below:

[ 721.603218] xhci_hcd 0000:00:14.0: xHC is not running.
[ 750.046445] xhci_hcd 0000:00:14.0: xHC is not running.
[ 759.047364] xhci_hcd 0000:00:14.0: xHC is not running.
[ 779.310499] xhci_hcd 0000:00:14.0: xHC is not running.
[ 795.342998] xhci_hcd 0000:00:14.0: xHC is not running.
[ 801.529731] xhci_hcd 0000:00:14.0: xHC is not running.
[ 805.647450] xhci_hcd 0000:00:14.0: xHC is not running.
[ 812.578673] xhci_hcd 0000:00:14.0: xHC is not running.
[ 825.775154] xhci_hcd 0000:00:14.0: xHC is not running.
[ 865.768631] xhci_hcd 0000:00:14.0: xHC is not running.
[ 865.773773] xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
[ 865.773790] xhci_hcd 0000:00:14.0: HC died; cleaning up
[ 865.892098] usb 1-4: USB disconnect, device number 2
[ 866.006922] usb 1-6: USB disconnect, device number 3
[ 866.010148] usb 1-8: USB disconnect, device number 4
[ 957.285300] xhci_hcd 0000:00:14.0: remove, state 4
[ 957.285314] usb usb2: USB disconnect, device number 1
[ 957.286028] xhci_hcd 0000:00:14.0: USB bus 2 deregistered
[ 957.286045] xhci_hcd 0000:00:14.0: remove, state 4
[ 957.286058] usb usb1: USB disconnect, device number 1
[ 957.287117] BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
[ 957.287132] PGD 0 P4D 0
[ 957.287145] Oops: 0000 [#1] SMP PTI
[ 957.287151] Modules linked in: btrfs zstd_compress xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c hidp thunderbolt rfcomm ccm cmac bnep uvcvideo videobuf2_vmalloc videobuf2_memops btusb videobuf2_v4l2 btrtl btbcm videobuf2_common btintel videodev bluetooth media ecdh_generic msr joydev hid_sensor_als hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_custom snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_iso8859_1 wacom usbhid hid_sensor_hub hid_multitouch hid_generic 8250_dw arc4 wmi_bmof intel_wmi_thunderbolt snd_hda_intel intel_rapl snd_hda_codec x86_pkg_temp_thermal snd_hda_core intel_powerclamp snd_hwdep coretemp snd_pcm snd_seq_midi kvm irqbypass snd_seq_midi_event crct10dif_pclmul
[ 957.287291] iwlmvm crc32_pclmul ghash_clmulni_intel snd_rawmidi pcbc mac80211 snd_seq aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_rapl_perf iwlwifi idma64 virt_dma snd_seq_device snd_timer input_leds cfg80211 serio_raw ucsi_acpi snd typec_ucsi mei_me intel_lpss_pci processor_thermal_device shpchp mei intel_lpss intel_pch_thermal intel_soc_dts_iosf typec soundcore ideapad_laptop sparse_keymap int3403_thermal int340x_thermal_zone wmi int3400_thermal acpi_pad mac_hid acpi_thermal_rel sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_hid hid pinctrl_sunrisepoint pinctrl_intel video
[ 957.287418] CPU: 1 PID: 17454 Comm: sh Not tainted ...

Kai-Heng Feng (kaihengfeng) wrote :

Does this happen in previous kernel versions?

Bas Zoetekouw (baszoetekouw) wrote :

This seems similar indeed, although I don't have any Thunderbolt hardware. I'll try running a kernel with CONFIG_USB_XHCI_DBGCAP=n to see if that solves the issue.
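Before building such a kernel, it may help to check how the installed kernel was configured. A minimal sketch, assuming Ubuntu's convention of shipping each kernel's configuration as `/boot/config-<release>`; the helper name `kconfig_value` is hypothetical. Options that are absent from the file (as CONFIG_USB_XHCI_DBGCAP would be on a kernel that predates it) or marked "is not set" are reported as `n`.

```shell
# kconfig_value OPTION [CONFIG_FILE]: print the value (y, m, ...) of a
# kernel config option, or "n" if it is absent or "# ... is not set".
# CONFIG_FILE defaults to the running kernel's /boot/config-<release>.
# Hypothetical helper.
kconfig_value() {
  opt=$1
  cfg=${2:-/boot/config-$(uname -r)}
  if grep -q "^$opt=" "$cfg"; then
    # Enabled options look like CONFIG_FOO=y or CONFIG_FOO=m.
    sed -n "s/^$opt=//p" "$cfg"
  else
    echo n
  fi
}
```

For example, `kconfig_value CONFIG_USB_XHCI_DBGCAP` prints `y` if the debug capability is built in.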

bp0 (bullet-proof-0) wrote :

Also maybe related as I am running TLP: https://www.spinics.net/lists/linux-usb/msg167941.html

bp0 (bullet-proof-0) wrote :

Maybe related: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413440.

When I use the workaround described in comment 29 in bug 1413440 (booting with GRUB options 'pci=nomsi iommu=soft') then XHCI doesn't hang, and when I run the reset script described in this bug report, it doesn't cause a kernel bug.
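On Ubuntu, boot options such as 'pci=nomsi iommu=soft' are usually made persistent through GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `sudo update-grub` and a reboot. A minimal sketch of that edit, assuming the usual one-line, double-quoted form of that variable; the helper name `add_kernel_params` is hypothetical, and it keeps a `.bak` copy of the file.

```shell
# add_kernel_params FILE PARAM...: append PARAMs to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE, skipping any already there.
# Assumes the value sits on one line in double quotes. Hypothetical helper.
add_kernel_params() {
  file=$1; shift
  # Current value between the quotes, e.g. "quiet splash".
  cur=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
  new=$cur
  for p in "$@"; do
    case " $new " in
      *" $p "*) ;;                 # already present: skip
      *) new=${new:+$new }$p ;;    # append, space-separated
    esac
  done
  # Rewrite the line in place, keeping FILE.bak as a backup.
  sed -i.bak "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" "$file"
}
```

For the workaround above this would be `sudo sh -c '. ./helpers.sh; add_kernel_params /etc/default/grub pci=nomsi iommu=soft'` (where `helpers.sh` is a hypothetical file holding the function), then `sudo update-grub`.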

Mark van Beek (flipvb) wrote :

Looking at the changelogs for the kernel, it seems someone has already created a fix for this and submitted it for 4.17.0-rc6 (http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc6/CHANGES)

excerpt:
---
      ARM: 8772/1: kprobes: Prohibit kprobes on get_user functions

Mathias Nyman (1):
      xhci: Fix USB3 NULL pointer dereference at logical disconnect.

Mathieu Malaterre (2):
---

Bas Zoetekouw (baszoetekouw) wrote :

I just encountered this bug again in cosmic with kernel 4.17.0-041700rc1-generic. Definitely not fixed in 4.17.

Note that with kernel 4.15, the xhci hang occurs too, but then I can reset the bus without causing a kernel Oops.

I'll attach the Oops for 4.17

Bas Zoetekouw (baszoetekouw) wrote :

I just noticed that this 4.17 kernel is from Debian Experimental. No idea why it was installed on my system, though...

Kai-Heng Feng (kaihengfeng) wrote :

Please try this kernel, which has some backported commits to solve the issue:
https://people.canonical.com/~khfeng/lp1763594/

Szobonya Csaba (csaba215) wrote :

I reproduced the same(?) issue by passing a USB controller through to KVM (https://www.redhat.com/archives/vfio-users/2018-February/msg00030.html). When I shut down the virtual machine and try to start it again, it always fails. I tested your kernel and still see the same issue. Hope this helps.

Szobonya Csaba (csaba215) wrote :
Kai-Heng Feng (kaihengfeng) wrote :

I don't see "xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead" in the logs you attached, so it's quite possibly another issue.

Please file a new bug for that, thanks!

Bas Zoetekouw (baszoetekouw) wrote :

Re #29: I've been running the patched kernel for a day now, and so far I haven't seen any xhci issues. I'll give an update later this week.

Kai-Heng Feng (kaihengfeng) wrote :

Bas,

Do you still see the issue? I intend to backport the fixes to Bionic's kernel if the fix works.

Bas Zoetekouw (baszoetekouw) wrote :

I haven't encountered the bug since I've installed your new kernel. However, I haven't used my computer very much over the weekend, but so far, so good!

description: updated
tags: added: originate-from-1776806 somerville
Timo Aaltonen (tjaalton) on 2018-07-05
Changed in linux-oem (Ubuntu Bionic):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-oem (Ubuntu):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem - 4.15.0-1012.15

---------------
linux-oem (4.15.0-1012.15) bionic; urgency=medium

  * linux-oem: 4.15.0-1012.15 -proposed tracker (LP: #1782181)

  * Miscellaneous Ubuntu changes
    - Rebase to 4.15.0-29.31

  [ Ubuntu: 4.15.0-29.31 ]

  * linux: 4.15.0-29.31 -proposed tracker (LP: #1782173)
  * [SRU Bionic][Cosmic] kernel panic in ipmi_ssif at msg_done_handler
    (LP: #1777716)
    - ipmi_ssif: Fix kernel panic at msg_done_handler
  * Update to ocxl driver for 18.04.1 (LP: #1775786)
    - misc: ocxl: use put_device() instead of device_unregister()
    - powerpc: Add TIDR CPU feature for POWER9
    - powerpc: Use TIDR CPU feature to control TIDR allocation
    - powerpc: use task_pid_nr() for TID allocation
    - ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action
    - ocxl: Expose the thread_id needed for wait on POWER9
    - ocxl: Add an IOCTL so userspace knows what OCXL features are available
    - ocxl: Document new OCXL IOCTLs
    - ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
  * Critical upstream bugfix missing in Ubuntu 18.04 - frequent Xorg crash after
    suspend (LP: #1776887)
    - ocxl: Document the OCXL_IOCTL_GET_METADATA IOCTL
  * Hard LOCKUP observed on stressing Ubuntu 18 04 (LP: #1777194)
    - powerpc: use NMI IPI for smp_send_stop
    - powerpc: Fix smp_send_stop NMI IPI handling
  * IPL: ppc64_cpu --frequency hang with INFO: rcu_sched detected stalls on
    CPUs/tasks on w34 and wsbmc016 with 920.1714.20170330n (LP: #1773964)
    - rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops
  * [Regression] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383:
    comm stress-ng: bg 4705: bad block bitmap checksum (LP: #1781709)
    - SAUCE: Revert "UBUNTU: SAUCE: ext4: fix ext4_validate_inode_bitmap: comm
      stress-ng: Corrupt inode bitmap"
    - SAUCE: ext4: check for allocation block validity with block group locked

 -- Timo Aaltonen <email address hidden> Wed, 18 Jul 2018 15:56:13 +0300

Changed in linux-oem (Ubuntu Bionic):
status: Fix Committed → Fix Released
Claudius Thomas (ctde) on 2018-07-21
tags: added: kernel-bug-exists-upstream
Claudius Thomas (ctde) wrote :

Verified this on a ThinkPad X230 with the latest Ubuntu 18.04/Linux Mint 19 kernel and with Arch Linux (4.17.8-1-ARCH).
For me, however, the symptoms are a bit different:

After boot, once a device has been removed, devices are no longer recognized when plugged in.

"xhci_hcd 0000:00:14.0: HC died; cleaning up" is in the logs.

sudo bash -c 'cd /sys/bus/pci/drivers/xhci_hcd; for d in ????:??:??.? ; do echo -n "$d" > unbind; echo -n "$d" > bind; done'
fixes this - but only once.

Kernel parameters 'pci=nomsi iommu=soft' do not fix this (iommu=soft is the default anyway).
However, "usbcore.autosuspend=-1" does seem to fix the issue.

Haven't had the chance to try out the new kernel so far...
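The `usbcore.autosuspend=-1` boot parameter mentioned above sets usbcore's default autosuspend delay in seconds, with -1 disabling autosuspend. The same value is exposed at runtime as `/sys/module/usbcore/parameters/autosuspend`. A minimal sketch for reading and changing it without rebooting; the helper names are hypothetical, and, as far as I know, only devices enumerated after the change pick up the new default, so the boot parameter remains the reliable way to apply it.

```shell
# usb_autosuspend_delay [PARAMS_DIR]: print usbcore's default autosuspend
# delay in seconds (-1 = autosuspend disabled). PARAMS_DIR defaults to the
# usbcore module-parameter directory in sysfs. Hypothetical helper.
usb_autosuspend_delay() {
  params=${1:-/sys/module/usbcore/parameters}
  cat "$params/autosuspend"
}

# set_usb_autosuspend_delay VALUE [PARAMS_DIR]: change the default at
# runtime (needs root for the real sysfs file). Devices that are already
# connected are assumed to keep the delay they were enumerated with.
set_usb_autosuspend_delay() {
  value=$1
  params=${2:-/sys/module/usbcore/parameters}
  echo "$value" > "$params/autosuspend"
}
```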

Changed in linux-oem (Ubuntu):
status: Confirmed → Fix Released
Deltik (deltik) on 2018-08-05
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Bionic):
status: Incomplete → Confirmed
Deltik (deltik) wrote :

Kai-Heng,

Your fix works in linux-oem for Bionic! We're pretty happy about it in Ask Ubuntu: https://askubuntu.com/a/1059424/18979

Can you backport the fix to the standard Ubuntu Bionic kernel?

Thanks!

Kai-Heng Feng (kaihengfeng) wrote :

The kernel team doesn't like the big pull I've made, so I made a backported version of the patch which is now in upstream linux stable.

So when the next Bionic kernel update pulls patches from linux-stable, the fix will be there.

Steve Chadsey (schadsey) wrote :

I've been getting similar disconnects on a Dell Latitude 5580 that is connected to a Dell D6000 docking station via USB-C. I'm also using DisplayLink driver version 4.4.24 (https://www.displaylink.com/downloads/ubuntu) in order to connect to two external displays. With Ubuntu 17.10 I was not seeing any disconnects. With 18.04 I see at least one per day where the external monitors go dark, and these messages in syslog:

Nov 13 10:26:39 steveclx5580 kernel: usb 3-1: USB disconnect, device number 2
Nov 13 10:26:39 steveclx5580 kernel: usb 3-1.2: USB disconnect, device number 3
Nov 13 10:26:39 steveclx5580 kernel: usb 3-1.2.2: USB disconnect, device number 5
Nov 13 10:26:39 steveclx5580 kernel: xhci_hcd 0000:3d:00.0: xHCI host controller not responding, assume dead
Nov 13 10:26:39 steveclx5580 kernel: xhci_hcd 0000:3d:00.0: HC died; cleaning up
Nov 13 10:26:39 steveclx5580 kernel: usb 4-1: USB disconnect, device number 2
Nov 13 10:26:39 steveclx5580 kernel: usb 4-1.1: USB disconnect, device number 3
Nov 13 10:26:39 steveclx5580 kernel: cdc_ncm 4-1.1:1.5 enx9cebe8551508: unregister 'cdc_ncm' usb-0000:3d:00.0-1.1, CDC NCM
Nov 13 10:26:39 steveclx5580 NetworkManager[1089]: <info> [1542122799.4322] device (enx9cebe8551508): state change: unavailable -> unmanaged (reason 'removed', sys-iface-state: 'removed')

This morning I tried the linux-oem package based on comments in this bug report.
$ uname -a
Linux steveclx5580 4.15.0-1024-oem #29-Ubuntu SMP Tue Oct 16 08:14:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I am still seeing disconnects, and they are more frequent than before:

Nov 14 10:43:46 steveclx5580 kernel: usb 3-1: USB disconnect, device number 2
Nov 14 10:43:46 steveclx5580 kernel: usb 3-1.2: USB disconnect, device number 3
Nov 14 10:43:46 steveclx5580 kernel: usb 3-1.2.2: USB disconnect, device number 5
Nov 14 10:43:46 steveclx5580 kernel: xhci_hcd 0000:3d:00.0: xHCI host controller not responding, assume dead
Nov 14 10:43:46 steveclx5580 kernel: xhci_hcd 0000:3d:00.0: HC died; cleaning up
Nov 14 10:43:46 steveclx5580 kernel: usb 4-1: USB disconnect, device number 2
Nov 14 10:43:46 steveclx5580 kernel: usb 4-1.1: USB disconnect, device number 3
Nov 14 10:43:46 steveclx5580 kernel: cdc_ncm 4-1.1:1.5 enx9cebe8551508: unregister 'cdc_ncm' usb-0000:3d:00.0-1.1, CDC NCM
Nov 14 10:43:46 steveclx5580 upowerd[2272]: unhandled action 'unbind' on /sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:3d:00.0/usb4/4-1/4-1.1/4-1.1:1.1
Nov 14 10:43:46 steveclx5580 upowerd[2272]: unhandled action 'unbind' on /sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:3d:00.0/usb4/4-1/4-1.1/4-1.1:1.0
Nov 14 10:43:46 steveclx5580 dhclient[3858]: receive_packet failed on enx9cebe8551508: Network is down

I do regularly suspend/resume this laptop, but the latest disconnects have happened without a prior resume.

Shirish S (shirish.s) wrote :

Kai-Heng,

I see "xHCI host controller not responding, assume dead" pretty much regularly during suspend/resume.

I verified that the kernel I am using has the patches mentioned in https://patchwork.ozlabs.org/project/ubuntu-kernel/list/?series=53395, and I am using an AMD platform.

I also have the IOMMU disabled. The impact is that the system hangs for a while, then the watchdog kicks in and reboots it.

Kai-Heng Feng (kaihengfeng) wrote :

@Steve,

Docking station is a whole different matter. Please file a new bug.

@Shirish
What's the kernel version? Please try latest -generic kernel, I think it's fixed there.

Deltik (deltik) wrote :

Kai-Heng,

Actually, I saw that the fix just landed in bionic-proposed in version 4.15.0-46.49 on 2019-02-15, so bionic-updates and bionic-security don't have the fix yet as of today.

Changelog for anyone wanting to know: https://launchpad.net/ubuntu/+source/linux/+changelog

Relevant snippet:

  * Bionic update: upstream stable patchset 2019-01-17 (LP: #1812229)
    - xhci: Fix perceived dead host due to runtime suspend race with event handler
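To check whether a given Bionic kernel already carries this fix, the running kernel release can be compared against 4.15.0-46. A minimal sketch, assuming Ubuntu-style release strings such as `4.15.0-46-generic` and GNU `sort -V`; the helper name `kernel_at_least` is hypothetical. The comparison is only meaningful within the -generic 4.15.0 series, since linux-oem uses its own numbering (4.15.0-10xx).

```shell
# kernel_at_least MIN [RELEASE]: succeed if RELEASE (default: uname -r) is
# the same as or newer than MIN. Anything after the ABI number (the flavour
# such as -generic, or an upload suffix such as .49) is ignored.
# Hypothetical helper; needs GNU sort for -V.
kernel_at_least() {
  # Reduce both strings to "major.minor.patch-abi", e.g. 4.15.0-46.
  min=$(printf '%s\n' "$1" | sed 's/^\([0-9.]*-[0-9]*\).*/\1/')
  rel=$(printf '%s\n' "${2:-$(uname -r)}" | sed 's/^\([0-9.]*-[0-9]*\).*/\1/')
  # Version-sort both; MIN must come first (or tie) for RELEASE >= MIN.
  [ "$(printf '%s\n%s\n' "$min" "$rel" | sort -V | head -n 1)" = "$min" ]
}
```

For example, `kernel_at_least 4.15.0-46.49 && echo "fix should be included"`.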

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Released
Changed in hwe-next:
status: New → Fix Released