khubd/usbhid deadlock(?) creates processes in state D

Bug #1557172 reported by Mike Gerow on 2016-03-14
38
This bug affects 7 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned

Bug Description

(Note that the attached logs are not a machine exhibiting the issue)

I've observed this issue specifically on Dell's PowerEdge R410 with its DRAC. I manage quite a few other machines with DRACs and haven't seen this issue so it may be limited to the R410. I've seen cases of this both with the precise and trusty kernel.

Basically, what happens is that things that read any info about the DRAC's usbhid device will block in an uninterruptible state. A quick way to do this is with "cat /sys/devices/pci0000:00/0000:00:1d.0/usb5/5-2/manufacturer". This means that commands like lspci will block as well. The thing is that it doesn't happen every time, but once it happens the first time all future reads on the device will block in state D too.

This is also associated with the following from the kernel (first from a precise machine):
[197852.595820] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[197857.716899] usb 5-2: device descriptor read/64, error -71
[197857.944966] usb 5-2: device descriptor read/64, error -71
[197858.165020] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[197858.285090] usb 5-2: device descriptor read/64, error -71
[197858.513108] usb 5-2: device descriptor read/64, error -71
[197858.733154] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[197859.145214] usb 5-2: device not accepting address 2, error -71
[197859.261234] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[197859.669322] usb 5-2: device not accepting address 2, error -71
[197859.675357] usb 5-2: USB disconnect, device number 2
[198047.530714] INFO: task khubd:203 blocked for more than 120 seconds.
[198047.537147] Not tainted 3.13.0-74-generic #118~precise1-Ubuntu
[198047.543691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[198047.551792] khubd D ffff880803fe18a0 0 203 2 0x00000000
[198047.551803] ffff8810036b1c18 0000000000000046 ffff88102fc53180 ffff8810036b1fd8
[198047.551809] 0000000000013180 0000000000013180 ffff881003f7c800 ffff880804634800
[198047.551814] ffff8810036b1c38 ffff8810020908e8 ffff8810020908ec 00000000ffffffff
[198047.551819] Call Trace:
[198047.551828] [<ffffffff81764b59>] schedule+0x29/0x70
[198047.551833] [<ffffffff81764e7e>] schedule_preempt_disabled+0xe/0x10
[198047.551837] [<ffffffff81766cb4>] __mutex_lock_slowpath+0x114/0x1b0
[198047.551845] [<ffffffff8156e66e>] ? usb_alloc_urb+0x1e/0x50
[198047.551848] [<ffffffff81766d73>] mutex_lock+0x23/0x37
[198047.551852] [<ffffffff8156597a>] usb_disconnect+0x5a/0x210
[198047.551857] [<ffffffff8156ff6a>] ? usb_control_msg+0xea/0x110
[198047.551860] [<ffffffff8156833f>] hub_port_connect_change+0xcf/0x9c0
[198047.551864] [<ffffffff81562715>] ? hub_port_status+0xd5/0x120
[198047.551868] [<ffffffff81569094>] hub_events+0x464/0x8c0
[198047.551871] [<ffffffff8176472e>] ? __schedule+0x38e/0x700
[198047.551875] [<ffffffff81569525>] hub_thread+0x35/0x150
[198047.551881] [<ffffffff810b0410>] ? __wake_up_sync+0x20/0x20
[198047.551885] [<ffffffff815694f0>] ? hub_events+0x8c0/0x8c0
[198047.551890] [<ffffffff8108ff09>] kthread+0xc9/0xe0
[198047.551893] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[198047.551898] [<ffffffff81771768>] ret_from_fork+0x58/0x90
[198047.551902] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[198047.551908] INFO: task kworker/9:1:246 blocked for more than 120 seconds.
[198047.558892] Not tainted 3.13.0-74-generic #118~precise1-Ubuntu
[198047.565491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[198047.573524] kworker/9:1 D ffff880803fe18a0 0 246 2 0x00000000
[198047.573547] Workqueue: events hid_reset [usbhid]
[198047.573550] ffff8808011f9788 0000000000000046 ffff88080fc93180 ffff8808011f9fd8
[198047.573555] 0000000000013180 0000000000013180 ffff881003f94800 ffff8808011f0000
[198047.573559] ffff8808011f9798 ffff8808011f98d8 7fffffffffffffff 7fffffffffffffff
[198047.573564] Call Trace:
[198047.573579] [<ffffffff81764b59>] schedule+0x29/0x70
[198047.573582] [<ffffffff81763e25>] schedule_timeout+0x1e5/0x250
[198047.573587] [<ffffffff810a4b1d>] ? wake_affine+0x16d/0x2d0
[198047.573591] [<ffffffff81765df7>] wait_for_completion+0xa7/0x160
[198047.573596] [<ffffffff8109fe00>] ? try_to_wake_up+0x210/0x210
[198047.573602] [<ffffffff81089509>] flush_work+0x29/0x40
[198047.573605] [<ffffffff81085170>] ? start_worker+0x40/0x40
[198047.573609] [<ffffffff8108962c>] __cancel_work_timer+0x9c/0x1b0
[198047.573623] [<ffffffff81089770>] cancel_work_sync+0x10/0x20
[198047.573632] [<ffffffffa00a6a19>] hid_cancel_delayed_stuff+0x29/0x30 [usbhid]
[198047.573640] [<ffffffffa00a88cc>] usbhid_close+0xcc/0x110 [usbhid]
[198047.573650] [<ffffffffa0036772>] hidinput_close+0x22/0x30 [hid]
[198047.573659] [<ffffffff815b3bf1>] input_close_device+0x61/0x90
[198047.573663] [<ffffffff815b9b7f>] evdev_cleanup+0xbf/0xd0
[198047.573670] [<ffffffff815b9bc6>] evdev_disconnect+0x36/0x70
[198047.573674] [<ffffffff815b5b75>] __input_unregister_device+0xc5/0x1a0
[198047.573678] [<ffffffff815b5ca5>] input_unregister_device+0x55/0x80
[198047.573687] [<ffffffffa0036105>] hidinput_disconnect+0x95/0xc0 [hid]
[198047.573695] [<ffffffffa0033548>] hid_disconnect+0x68/0x70 [hid]
[198047.573708] [<ffffffffa0033625>] hid_device_remove+0xd5/0xf0 [hid]
[198047.573720] [<ffffffff814b7c7f>] __device_release_driver+0x7f/0xf0
[198047.573724] [<ffffffff814b7d1c>] device_release_driver+0x2c/0x40
[198047.573728] [<ffffffff814b76c4>] bus_remove_device+0x104/0x170
[198047.573736] [<ffffffff814b46e8>] device_del+0x118/0x1a0
[198047.573744] [<ffffffffa0032af0>] hid_destroy_device+0x30/0x70 [hid]
[198047.573754] [<ffffffffa00a6a4b>] usbhid_disconnect+0x2b/0x50 [usbhid]
[198047.573758] [<ffffffff81573380>] usb_unbind_interface+0x60/0x1b0
[198047.573764] [<ffffffff814b7c7f>] __device_release_driver+0x7f/0xf0
[198047.573768] [<ffffffff814b7d1c>] device_release_driver+0x2c/0x40
[198047.573772] [<ffffffff81573560>] usb_driver_release_interface+0x90/0xa0
[198047.573780] [<ffffffffa00a8c51>] ? hid_post_reset+0x51/0x200 [usbhid]
[198047.573785] [<ffffffff815735a7>] usb_forced_unbind_intf+0x37/0x60
[198047.573788] [<ffffffff81573629>] unbind_marked_interfaces.isra.8+0x59/0x80
[198047.573792] [<ffffffff815737dd>] usb_unbind_and_rebind_marked_interfaces+0x1d/0x30
[198047.573796] [<ffffffff81567c1b>] usb_reset_device+0x15b/0x1d0
[198047.573804] [<ffffffffa00a7e2f>] hid_reset+0x16f/0x1e0 [usbhid]
[198047.573808] [<ffffffff81087c5f>] process_one_work+0x17f/0x4c0
[198047.573811] [<ffffffff81088dbb>] worker_thread+0x11b/0x3d0
[198047.573815] [<ffffffff81088ca0>] ? manage_workers.isra.21+0x190/0x190
[198047.573819] [<ffffffff8108ff09>] kthread+0xc9/0xe0
[198047.573823] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[198047.573827] [<ffffffff81771768>] ret_from_fork+0x58/0x90
[198047.573830] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0

And these from a trusty machine (which includes info from one of the stuck processes too):
[1367979.659164] SysRq : Show Blocked State
[1367979.663114] task PC stack pid father
[1367979.669300] khubd D ffff88100054f160 0 203 2 0x00000000
[1367979.676574] ffff88100352fc18 0000000000000046 ffff88080fc93180 ffff88100352ffd8
[1367979.684216] 0000000000013180 0000000000013180 ffff8808049f4800 ffff8808043f4800
[1367979.691855] ffff88100352fc38 ffff881002e100e8 ffff881002e100ec 00000000ffffffff
[1367979.699495] Call Trace:
[1367979.702134] [<ffffffff81764b69>] schedule+0x29/0x70
[1367979.707288] [<ffffffff81764e8e>] schedule_preempt_disabled+0xe/0x10
[1367979.713831] [<ffffffff81766cc4>] __mutex_lock_slowpath+0x114/0x1b0
[1367979.720288] [<ffffffff8156e67e>] ? usb_alloc_urb+0x1e/0x50
[1367979.726049] [<ffffffff81766d83>] mutex_lock+0x23/0x37
[1367979.731376] [<ffffffff8156598a>] usb_disconnect+0x5a/0x210
[1367979.737138] [<ffffffff8156ff7a>] ? usb_control_msg+0xea/0x110
[1367979.743160] [<ffffffff8156834f>] hub_port_connect_change+0xcf/0x9c0
[1367979.749703] [<ffffffff81562725>] ? hub_port_status+0xd5/0x120
[1367979.755724] [<ffffffff815690a4>] hub_events+0x464/0x8c0
[1367979.761223] [<ffffffff8176473e>] ? __schedule+0x38e/0x700
[1367979.766897] [<ffffffff81569535>] hub_thread+0x35/0x150
[1367979.772313] [<ffffffff810b0410>] ? __wake_up_sync+0x20/0x20
[1367979.778160] [<ffffffff81569500>] ? hub_events+0x8c0/0x8c0
[1367979.784942] [<ffffffff8108ff09>] kthread+0xc9/0xe0
[1367979.790008] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[1367979.796378] [<ffffffff81771768>] ret_from_fork+0x58/0x90
[1367979.801965] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[1367979.808345] kworker/8:1 D ffff88100054f160 0 329 2 0x00000000
[1367979.815632] Workqueue: events hid_reset [usbhid]
[1367979.820445] ffff881002625788 0000000000000046 ffff88102fc93180 ffff881002625fd8
[1367979.828088] 0000000000013180 0000000000013180 ffff8808049f3000 ffff881002639800
[1367979.835726] ffff881002625798 ffff8810026258d8 7fffffffffffffff 7fffffffffffffff
[1367979.843367] Call Trace:
[1367979.845999] [<ffffffff81764b69>] schedule+0x29/0x70
[1367979.851151] [<ffffffff81763e35>] schedule_timeout+0x1e5/0x250
[1367979.857174] [<ffffffff810aa9cf>] ? enqueue_entity+0x2bf/0x760
[1367979.863196] [<ffffffff81765e07>] wait_for_completion+0xa7/0x160
[1367979.869392] [<ffffffff8109fe00>] ? try_to_wake_up+0x210/0x210
[1367979.875414] [<ffffffff81089509>] flush_work+0x29/0x40
[1367979.880741] [<ffffffff81085170>] ? start_worker+0x40/0x40
[1367979.886415] [<ffffffff8108962c>] __cancel_work_timer+0x9c/0x1b0
[1367979.892609] [<ffffffff81089770>] cancel_work_sync+0x10/0x20
[1367979.898461] [<ffffffffa0072a19>] hid_cancel_delayed_stuff+0x29/0x30 [usbhid]
[1367979.905801] [<ffffffffa00748cc>] usbhid_close+0xcc/0x110 [usbhid]
[1367979.912176] [<ffffffffa0042772>] hidinput_close+0x22/0x30 [hid]
[1367979.918374] [<ffffffff815b3c01>] input_close_device+0x61/0x90
[1367979.924397] [<ffffffff815b9b8f>] evdev_cleanup+0xbf/0xd0
[1367979.929983] [<ffffffff815b9bd6>] evdev_disconnect+0x36/0x70
[1367979.935832] [<ffffffff815b5b85>] __input_unregister_device+0xc5/0x1a0
[1367979.942547] [<ffffffff815b5cb5>] input_unregister_device+0x55/0x80
[1367979.949007] [<ffffffffa0042105>] hidinput_disconnect+0x95/0xc0 [hid]
[1367979.955638] [<ffffffffa003f548>] hid_disconnect+0x68/0x70 [hid]
[1367979.961837] [<ffffffffa003f625>] hid_device_remove+0xd5/0xf0 [hid]
[1367979.968298] [<ffffffff814b7c8f>] __device_release_driver+0x7f/0xf0
[1367979.974752] [<ffffffff814b7d2c>] device_release_driver+0x2c/0x40
[1367979.981095] [<ffffffff814b76d4>] bus_remove_device+0x104/0x170
[1367979.987202] [<ffffffff814b46f8>] device_del+0x118/0x1a0
[1367979.992707] [<ffffffffa003eaf0>] hid_destroy_device+0x30/0x70 [hid]
[1367979.999253] [<ffffffffa0072a4b>] usbhid_disconnect+0x2b/0x50 [usbhid]
[1367980.005968] [<ffffffff81573390>] usb_unbind_interface+0x60/0x1b0
[1367980.012252] [<ffffffff814b7c8f>] __device_release_driver+0x7f/0xf0
[1367980.018710] [<ffffffff814b7d2c>] device_release_driver+0x2c/0x40
[1367980.024992] [<ffffffff81573570>] usb_driver_release_interface+0x90/0xa0
[1367980.031885] [<ffffffffa0074c51>] ? hid_post_reset+0x51/0x200 [usbhid]
[1367980.038603] [<ffffffff815735b7>] usb_forced_unbind_intf+0x37/0x60
[1367980.044971] [<ffffffff81573639>] unbind_marked_interfaces.isra.8+0x59/0x80
[1367980.052121] [<ffffffff815737ed>] usb_unbind_and_rebind_marked_interfaces+0x1d/0x30
[1367980.059981] [<ffffffff81567c2b>] usb_reset_device+0x15b/0x1d0
[1367980.066004] [<ffffffffa0073e2f>] hid_reset+0x16f/0x1e0 [usbhid]
[1367980.072199] [<ffffffff81087c5f>] process_one_work+0x17f/0x4c0
[1367980.078221] [<ffffffff81088dbb>] worker_thread+0x11b/0x3d0
[1367980.083982] [<ffffffff81088ca0>] ? manage_workers.isra.21+0x190/0x190
[1367980.090699] [<ffffffff8108ff09>] kthread+0xc9/0xe0
[1367980.095765] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[1367980.102132] [<ffffffff81771768>] ret_from_fork+0x58/0x90
[1367980.107719] [<ffffffff8108fe40>] ? flush_kthread_worker+0xb0/0xb0
[1367980.114098] python2.7 D ffff88100054f160 0 18303 1 0x00000004
[1367980.121369] ffff881001497d08 0000000000000082 ffff88080fc53180 ffff881001497fd8
[1367980.129007] 0000000000013180 0000000000013180 ffff881003f86000 ffff881001688000
[1367980.136648] 0000881000000024 ffff881002e100e8 ffff881002e100ec 00000000ffffffff
[1367980.144286] Call Trace:
[1367980.146921] [<ffffffff81764b69>] schedule+0x29/0x70
[1367980.152073] [<ffffffff81764e8e>] schedule_preempt_disabled+0xe/0x10
[1367980.158615] [<ffffffff81766cc4>] __mutex_lock_slowpath+0x114/0x1b0
[1367980.165069] [<ffffffff81766d83>] mutex_lock+0x23/0x37
[1367980.170397] [<ffffffff81576a8b>] manufacturer_show+0x2b/0x60
[1367980.176331] [<ffffffff814b4890>] dev_attr_show+0x20/0x60
[1367980.181918] [<ffffffff81766d76>] ? mutex_lock+0x16/0x37
[1367980.187418] [<ffffffff811eec6b>] ? seq_buf_alloc+0x1b/0x50
[1367980.193180] [<ffffffff81242875>] sysfs_seq_show+0x145/0x260
[1367980.199027] [<ffffffff811ef3fb>] seq_read+0xfb/0x3c0
[1367980.204267] [<ffffffff811cbc8b>] vfs_read+0xab/0x180
[1367980.209506] [<ffffffff811cbf12>] SyS_read+0x52/0xa0
[1367980.214657] [<ffffffff8177181d>] system_call_fastpath+0x1a/0x1f

lsusb for the device in question:
$ sudo lsusb -v -d 0624:0248

Bus 005 Device 002: ID 0624:0248 Avocent Corp. Virtual Hub
Device Descriptor:
  bLength 18
  bDescriptorType 1
  bcdUSB 1.10
  bDeviceClass 0 (Defined at Interface level)
  bDeviceSubClass 0
  bDeviceProtocol 0
  bMaxPacketSize0 64
  idVendor 0x0624 Avocent Corp.
  idProduct 0x0248 Virtual Hub
  bcdDevice 0.00
  iManufacturer 1 Avocent
  iProduct 2 USB Composite Device-0
  iSerial 3 20080519
  bNumConfigurations 1
  Configuration Descriptor:
    bLength 9
    bDescriptorType 2
    wTotalLength 59
    bNumInterfaces 2
    bConfigurationValue 1
    iConfiguration 0
    bmAttributes 0xc0
      Self Powered
    MaxPower 2mA
    Interface Descriptor:
      bLength 9
      bDescriptorType 4
      bInterfaceNumber 0
      bAlternateSetting 0
      bNumEndpoints 1
      bInterfaceClass 3 Human Interface Device
      bInterfaceSubClass 1 Boot Interface Subclass
      bInterfaceProtocol 1 Keyboard
      iInterface 4 Keyboard
        HID Device Descriptor:
          bLength 9
          bDescriptorType 33
          bcdHID 1.00
          bCountryCode 0 Not supported
          bNumDescriptors 1
          bDescriptorType 34 Report
          wDescriptorLength 65
         Report Descriptors:
           ** UNAVAILABLE **
      Endpoint Descriptor:
        bLength 7
        bDescriptorType 5
        bEndpointAddress 0x81 EP 1 IN
        bmAttributes 3
          Transfer Type Interrupt
          Synch Type None
          Usage Type Data
        wMaxPacketSize 0x0008 1x 8 bytes
        bInterval 8
    Interface Descriptor:
      bLength 9
      bDescriptorType 4
      bInterfaceNumber 1
      bAlternateSetting 0
      bNumEndpoints 1
      bInterfaceClass 3 Human Interface Device
      bInterfaceSubClass 1 Boot Interface Subclass
      bInterfaceProtocol 2 Mouse
      iInterface 5 Mouse
        HID Device Descriptor:
          bLength 9
          bDescriptorType 33
          bcdHID 1.00
          bCountryCode 0 Not supported
          bNumDescriptors 1
          bDescriptorType 34 Report
          wDescriptorLength 63
         Report Descriptors:
           ** UNAVAILABLE **
      Endpoint Descriptor:
        bLength 7
        bDescriptorType 5
        bEndpointAddress 0x82 EP 2 IN
        bmAttributes 3
          Transfer Type Interrupt
          Synch Type None
          Usage Type Data
        wMaxPacketSize 0x0008 1x 8 bytes
        bInterval 8
Device Status: 0x0000
  (Bus Powered)

And now for my purely speculative analysis of what might be going on after digging through the code...

There are two devices interacting here, the usbhid that is the keyboard for the DRAC and the hub that it's connected to.

As a part of the usbhid initialization (usbhid_probe) it creates a tasklet for hid_reset (I have no idea why, but perhaps it needs to reset the underlying hid/usb devices to start working) which unloads and reloads all the parts of the hidusb driver (usb, hid, input, evdev). As a part of the unload for the usbhid it calls hid_cancel_delayed_stuff (don't let it fool you, it actually works with a usbhid device, not a hid device) iff it is no longer opened, which normally it should already be opened, but...

I haven't been able to confirm this, but I think that if the hub closes it can close the devices associated with it, which gives us a fun race condition where if we start the probe for the usbhid device and then the underlying hub gets removed it will have no instances open, which causes usbhid to call hid_cancel_delayed_stuff, from its own cancel tasklet. Unsurprisingly when it does this it ends up waiting on itself forever.

I'm currently working around the issue by adding a udev rule to prevent the device from being used by setting authorized=0 for it. This obviously prevents the DRAC from actually working, though.

Mike Gerow (gerow) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1557172

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Mike Gerow (gerow) wrote :

Sorry, I'm not super familiar with apport (our security folks make us disable it because they get upset for reasons I don't quite grasp).

I tried to install apport and python-apport, but no dice on actually running apport-collect.

$ apport-collect 1557172
Traceback (most recent call last):
  File "/usr/bin/apport-cli", line 370, in <module>
    if not app.run_argv():
  File "/usr/lib/python2.7/dist-packages/apport/ui.py", line 653, in run_argv
    return self.run_update_report()
  File "/usr/lib/python2.7/dist-packages/apport/ui.py", line 525, in run_update_report
    pkgs = self.crashdb.get_affected_packages(self.options.update_report)
  File "/usr/lib/python2.7/dist-packages/apport/crashdb_impl/memory.py", line 79, in get_affected_packages
    return [self.reports[id]['report']['SourcePackage']]
IndexError: list index out of range

Is there some other way for me to get the info you need?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.5 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key
Mike Gerow (gerow) wrote :

Hi Joseph

I'll see if I can nab some extra hardware to try it out. Keep in mind that I haven't found a way to reproduce this reliably yet, with our usual setup it can take a few weeks before it happens. Basically we have a cron job that periodically exports the state of our machines, which includes calling lsusb. Eventually this bug happens, all future calls to the job get stuck in D, and the machine eventually runs out of resources, requiring it to be rebooted.

I did browse kernel sources upstream and didn't notice anything that would appear to fix this issue, but I'm certainly no kernel-guru.

Romiras (romiras-users) wrote :

Hi,
can you look please at http://ubuntuforums.org/showthread.php?t=2321068
it appears pretty close to described issue.
I can provide more details upon request.

Mike Gerow (gerow) wrote :

Hi, sorry for the lack of update. In the meantime I've seen reports of a few cases of this on actual desktops and laptops, so the workaround of disabling the usbhid dirver via udev isn't really an option there :)

Our setup relies on a handful of out-of-tree modules that don't seem to build on 4.5.2, and since this tends to only happen after about a week of use I don't think I'll be able to find anyone who'd be able to test the 4.5.2 mainline kernel. I do know that these modules compile on the wily kernel, though. Would it be helpful to test against that?

I'll start trying to get some machines trying to reproduce this on the wily kernel, but let me know if there's anything else I can do that might help to debug this.

Romiras (romiras-users) wrote :

All problems disappeared when I disconnected malfunctioning USB hub model EL-UDON-77151 (monitor stand).
The issue appeared in kernels version 3.13 and 3.16.

Romiras (romiras-users) wrote :

Anybody knows who is maintainer of khubd?

Hans Bausewein (hansbausewein) wrote :

For a few weeks I have the same bug in Debian/stable: linux-image-3.16.0-4-amd64
That kernel image is from 9th March.
I upgraded on April 4. dpkg log says 3.16.7-ckt20-1+deb8u4 3.16.7-ckt25-1

Hans Bausewein (hansbausewein) wrote :

Maybe this patch "HID: usbhid: fix recursive deadlock" fixes the Ubuntu problem:

  https://lists.ubuntu.com/archives/kernel-team/2016-February/071250.html

But that fix is in my Debian kernel:
   https://tracker.debian.org/media/packages/l/linux/changelog-3.16.7-ckt25-2

Maybe it did not fix all or it enabled another path to the same dead-lock.

On my machine I got the same dead-lock after a few suspends or hibernates.
I'm running the Debian/testing kernel (4.5.4-1) now: linux-image-4.5.0-2-amd64
With the 3.16 kernel it happened quite often, every 3 to 5 suspends or hibernates.

Hans Bausewein (hansbausewein) wrote :

I'm running Debian stable with the 4.5 kernel for 8 days without restart, now.
First I did some 5 suspend/resumes without any error. Then I used it as my desktop machine,
doing at least one suspend/resume every day. I'm convinced the issue I had is fixed in the 4.5 kernel.

Attached the difference between the Debian 3.16 and 4.5 hid-core.c (hid-core.c-3.16-to-4.5.diff).
It's quite a bit more than a one-line fix.

Hans Bausewein (hansbausewein) wrote :

Made it a valid patch file (for Debian stable):

amd64:/usr/src/linux-3.16.35# cd /usr/src/linux-3.16.35
amd64:/usr/src/linux-3.16.35# patch -p1 --dry-run </tmp/hid-core.c-3.16-to-4.5.diff
checking file drivers/hid/usbhid/hid-core.c

See attached hid-core.c-3.16-to-4.5.diff

I don't know how it will work on Ubuntu. Maybe they differ too much.

tags: added: patch
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Vivien GUEANT (vivienfr) wrote :
Download full text (10.8 KiB)

I have the same problem with a Dell PowerEdge R210 II, with Dell DRAC under Ubuntu Server 14.04 LTS and 3.13 kernel: Process khubd will block in state D too.
=> need to do a hard reboot

Screenshot of htop: https://lafibre.info/serveur-linux/load-average/msg372058/#msg372058

The "dmesg" data corresponding ;

[11273834.551052] type=1400 audit(1473524094.532:28): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/sbin/dhclient" pid=22297 comm="apparmor_parser"
[11273834.551236] type=1400 audit(1473524094.532:29): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=22297 comm="apparmor_parser"
[11273834.551388] type=1400 audit(1473524094.532:30): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=22297 comm="apparmor_parser"
[11359185.861753] usb 1-1.1: reset full-speed USB device number 3 using ehci-pci
[11359190.938076] usb 1-1.1: device descriptor read/64, error -32
[11359191.114227] usb 1-1.1: device descriptor read/64, error -32
[11359191.290375] usb 1-1.1: reset full-speed USB device number 3 using ehci-pci
[11359191.362427] usb 1-1.1: device descriptor read/64, error -32
[11359191.538586] usb 1-1.1: device descriptor read/64, error -32
[11359191.714748] usb 1-1.1: reset full-speed USB device number 3 using ehci-pci
[11359192.123084] usb 1-1.1: device not accepting address 3, error -32
[11359192.195140] usb 1-1.1: reset full-speed USB device number 3 using ehci-pci
[11359192.603502] usb 1-1.1: device not accepting address 3, error -32
[11359192.604316] usb 1-1.1: USB disconnect, device number 3
[11359384.590873] INFO: task khubd:71 blocked for more than 120 seconds.
[11359384.591241] Not tainted 3.13.0-85-generic #129-Ubuntu
[11359384.591574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11359384.592047] khubd D ffff88043fc93180 0 71 2 0x00000000
[11359384.592052] ffff880428cd7c80 0000000000000046 ffff8804293d6000 ffff880428cd7fd8
[11359384.592057] 0000000000013180 0000000000013180 ffff8804293d6000 ffff880428c290e8
[11359384.592061] ffff880428c290ec ffff8804293d6000 00000000ffffffff ffff880428c290f0
[11359384.592065] Call Trace:
[11359384.592074] [<ffffffff81730089>] schedule_preempt_disabled+0x29/0x70
[11359384.592078] [<ffffffff81731ef5>] __mutex_lock_slowpath+0x135/0x1b0
[11359384.592082] [<ffffffff81731f8f>] mutex_lock+0x1f/0x2f
[11359384.592088] [<ffffffff81545c04>] usb_disconnect+0x64/0x200
[11359384.592091] [<ffffffff815487d3>] hub_port_connect_change+0xd3/0xb20
[11359384.592096] [<ffffffff8154333d>] ? hub_port_status+0xdd/0x120
[11359384.592099] [<ffffffff815496f4>] hub_events+0x4d4/0xa20
[11359384.592102] [<ffffffff81549c75>] hub_thread+0x35/0x160
[11359384.592106] [<ffffffff810add60>] ? prepare_to_wait_event+0x100/0x100
[11359384.592109] [<ffffffff81549c40>] ? hub_events+0xa20/0xa20
[11359384.592113] [<ffffffff8108deb2>] kthread+0xd2/0xf0
[11359384.592117] [<ffffffff8108dde0>] ? kthread_create_on_node+0x1c0/0x1c0
[11359384.592121] [<ffffffff8173c2e8>] ret_from_fork+0x58/0x90
[11359384.592125] [<f...

Changed in linux (Ubuntu):
status: Expired → Confirmed
Romiras (romiras-users) wrote :

Looks like nobody cares about this issue?!

Mike Gerow (gerow) wrote :

Checking in again, no real changes in my observations of this bug. It's worth noting that this only occurs in 0.5% of our machines (yes, half a percent!). A good chunk of them are servers with DRACs, but there's a nontrivial amount of desktops that hit this issue too.

I'm having affected users try running the linux-generic-lts-xenial kernel to see if that makes the issue go away. Will update this bug with my findings.

Aakash Roy (aakashrc) wrote :
Download full text (3.5 KiB)

Mike - checking if you have found a workaround or any temporary fix for this issue ? I have faced the same problem on an R410 with iDRAC running trusty kernel.

Following are the logs from my server:

[3903494.645100] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[3903499.760216] usb 5-2: device descriptor read/64, error -110
[3903499.984358] usb 5-2: device descriptor read/64, error -71
[3903500.200497] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[3903500.320567] usb 5-2: device descriptor read/64, error -71
[3903500.544709] usb 5-2: device descriptor read/64, error -71
[3903500.760848] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[3903501.169123] usb 5-2: device not accepting address 2, error -71
[3903501.281153] usb 5-2: reset full-speed USB device number 2 using uhci_hcd
[3903501.689411] usb 5-2: device not accepting address 2, error -71
[3903501.689842] usb 5-2: USB disconnect, device number 2
[3903717.219659] INFO: task khubd:119 blocked for more than 120 seconds.
[3903717.220045] Not tainted 3.13.0-100-generic #147-Ubuntu
[3903717.220441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[3903717.220975] khubd D ffff88012b293180 0 119 2 0x00000000
[3903717.220980] ffff880128f85c80 0000000000000046 ffff880128ece000 0000000000013180
[3903717.220984] ffff880128f85fd8 0000000000013180 ffff880128ece000 ffff8801253da0e8
[3903717.220988] ffff8801253da0ec ffff880128ece000 00000000ffffffff ffff8801253da0f0
[3903717.220992] Call Trace:
[3903717.221002] [<ffffffff8172ee59>] schedule_preempt_disabled+0x29/0x70
[3903717.221006] [<ffffffff81730c95>] __mutex_lock_slowpath+0x135/0x1b0
[3903717.221010] [<ffffffff81730d2f>] mutex_lock+0x1f/0x2f
[3903717.221016] [<ffffffff81544894>] usb_disconnect+0x64/0x200
[3903717.221019] [<ffffffff81547483>] hub_port_connect_change+0xd3/0xb30
[3903717.221023] [<ffffffff81541fcd>] ? hub_port_status+0xdd/0x120
[3903717.221026] [<ffffffff815483b4>] hub_events+0x4d4/0xa20
[3903717.221030] [<ffffffff81548935>] hub_thread+0x35/0x150
[3903717.221034] [<ffffffff810adad0>] ? prepare_to_wait_event+0x100/0x100
[3903717.221037] [<ffffffff81548900>] ? hub_events+0xa20/0xa20
[3903717.221041] [<ffffffff8108dc79>] kthread+0xc9/0xe0
[3903717.221045] [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
[3903717.221050] [<ffffffff8173afa8>] ret_from_fork+0x58/0x90
[3903717.221053] [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
[3903717.221064] INFO: task kworker/2:1:12543 blocked for more than 120 seconds.
[3903717.221495] Not tainted 3.13.0-100-generic #147-Ubuntu
[3903717.221828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[3903717.222285] kworker/2:1 D ffff88012b253180 0 12543 2 0x00000000
[3903717.222299] Workqueue: events hid_reset [usbhid]
[3903717.222301] ffff880036091850 0000000000000046 ffff8800c4c40000 0000000000013180
[3903717.222304] ffff880036091fd8 0000000000013180 ffff8800c4c40000 ffff880036091998
[3903717.222308] ffff8800360919a0 7fffffffffffffff ffff8800c4c40000 ffff8800c4c40000
[3903717.222312] Call Trace:
[3903717.222315] [<ffffffff817...

Read more...

Mike Gerow (gerow) wrote :

aakashrc: From the original post:

> I'm currently working around the issue by adding a udev rule to prevent the device from being used by setting authorized=0 for it. This obviously prevents the DRAC from actually working, though.

Aakash Roy (aakashrc) wrote :

Mike: I did read that from the original post. Just that preventing the iDRAC from working would prevent remote access to the terminal which is sometimes required at our end - typically contingencies when there is some issue with network, we use the iDRAC for remote access.

Unlike your situation, this bug affects all our servers with iDRAC (R410 & R610) running Ubuntu 14.05.1.

Can you help me with the command which can be run if iDRAC access is required? I know this might create the hung process again but for us we use iDRAC in contingency scenarios only & would not mind restarting once things are restored.

Vladimir Mokrozub (mogaba2009) wrote :

Hi,

we have the same problem with an old Celeron PC running Debian 8.8, is there a workaround?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers