xhci_hcd: TRB DMA errors reported with ASMedia ASM1142 USB 3.1 Controller

Bug #1749961 reported by Guilherme G. Piccoli on 2018-02-16
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Debian)
Confirmed
Undecided
Guilherme G. Piccoli
linux (Ubuntu)
Medium
Guilherme G. Piccoli
Trusty
Medium
Guilherme G. Piccoli
Xenial
Medium
Guilherme G. Piccoli
Artful
Medium
Guilherme G. Piccoli
Bionic
Medium
Guilherme G. Piccoli

Bug Description

It was observed that while trying to use a 4K USB webcam connected to USB port provided by ASMedia ASM1142 USB 3.1 Controller, the webcam does not work and kernel log shows the following messages:

[431.928016] xhci_hcd 0000:12:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[431.928021] xhci_hcd 0000:12:00.0: Looking for event-dma 0000003f3330e020 trb-start 0000003f3330e000 trb-end 0000003f3330e000 seg-start 0000003f3330e000 seg-end 0000003f3330eff0
[431.928024] xhci_hcd 0000:12:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[431.928026] xhci_hcd 0000:12:00.0: Looking for event-dma 0000003f3330e030 trb-start 0000003f3330e000 trb-end 0000003f3330e000 seg-start 0000003f3330e000 seg-end 0000003f3330eff0
[431.928027] xhci_hcd 0000:12:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[431.928029] xhci_hcd 0000:12:00.0: Looking for event-dma 0000003f3330e050 trb-start 0000003f3330e000 trb-end 0000003f3330e000 seg-start 0000003f3330e000 seg-end 0000003f3330eff0
[431.928386] xhci_hcd 0000:12:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13

A similar issue was already reported on Launchpad: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1667750

The fix to this issue seems to be the following patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9da5a109

Tests in our scenario with this patch proved still broken. Our next approach is to modify the patch a bit and re-test.

This LP will be used to document our progress in the investigation.

no longer affects: linux-meta (Ubuntu)
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1749961

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → In Progress
importance: Undecided → Medium
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Artful):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
importance: Undecided → Medium
status: New → In Progress
Guilherme G. Piccoli (gpiccoli) wrote :

Patch was modified (by adding the PCI_ID of device 1142A, which confusingly is 1242!) and still the problem reproduces.

New approaches to be tried soon.

tags: added: kernel-da-key
imperia (imperia777) wrote :

Hello,
Looks like I am having the same problem.

After some hours(random time) my USB 3.1 asmedia controller crashes the driver with following error:
[ 873.661534] xhci_hcd 0000:00:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 0 comp_code 3
[ 873.661629] xhci_hcd 0000:00:00.0: Looking for event-dma 00000002722ed630 trb-start 00000002722ed9b0 trb-end 00000002722ed9d0 seg-start 00000002722ed000 seg-end 00000002722edff0
[ 875.673409] xhci_hcd 0000:00:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
I am struggling with this error for more than year. Its very annoying to have to restart the PC every few hours. USB tuner card is connected to the port.
I would like to provide whatever information and support is necessary to fix this damn bug. Logs, ssh access to the affected box and everything else what is needed.

Please ask me here or write to my e mail imperia777_yahoo.com
Thanks.

Guilherme G. Piccoli (gpiccoli) wrote :

Nice imperia, thanks for the report here. First we need to be sure it's exactly the same adapter.
Can you provide the the output of "lspci -nn" ?

Then, if it's the same adapter:

0) Which Ubuntu version are you running? Which kernel version are you using? Can you try in the latest 4.13 for Xenial? (or even better, the hwe-edge 4.15)
Instructions to run the latest 4.15 version: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable?field.series_filter=xenial

1) You said "after some hours" - can you provide some details? You've been using the USB tuner for like 2 hours? 12 hours? The tuner is in constant use and suddenly the issue happens?

2) If possible, enable xhci dynamic debug and provide logs after the issue; in order to do this, run the following command as root:
echo "module xhci_hcd +flpt" > /sys/kernel/debug/dynamic_debug/control

After issue reproduces, collect the /var/log/kern.log file.

Thanks,

Guilherme

imperia (imperia777) wrote :

Hello,

00:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller [1b21:1242]

Actually I am on debian buster. I am running kernel 4.16-rc6 from experimental repository.

I am running program for watching satellite channels called vdr.
When I am not watching TV, while idle, every few minutes vdr scans for channel list updates from satellites. It is safe to say that tuner is occupied every few minutes for a scan, but not occupied with bandwidth like when watching TV. While in this mode vdr is able to crash the driver in ~6-30 hours.

There is program that you use to initially create your channels list for vdr. When I use it I am able to crash the driver in ~1-2 hours.

But when I just watch one channel and don't change it for hours, driver is least likely to crash.

I think something in consecutive opening (initializing) of the usb port/driver forces this error.
Because the program that scans for channels crash it much faster.
This program work like this:

:go
open port
scan some frequency
write to file new channels
close port
goto go

I made this script that I will use to capture the log.

echo "module xhci_hcd +flpt" > /sys/kernel/debug/dynamic_debug/control
(tail -F -n0 /var/log/kern.log &) | grep -q "TRB DMA"
cp /var/log/kern.log /home/imperia/log1.log

And I will run initial channels list scan to force it faster.

I will be back later with the logs.
Thanks for your help.

imperia (imperia777) wrote :
Download full text (8.6 KiB)

Mar 29 20:20:03 vdr kernel: [119370.230528] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Removing canceled TD starting at 0x2ae36c590 (dma).
Mar 29 20:20:03 vdr kernel: [119370.230533] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Finding endpoint context
Mar 29 20:20:03 vdr kernel: [119370.230537] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Cycle state = 0x0
Mar 29 20:20:03 vdr kernel: [119370.230542] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: New dequeue segment = 00000000573583cc (virtual)
Mar 29 20:20:03 vdr kernel: [119370.230547] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: New dequeue pointer = 0x2ae36c5a0 (DMA)
Mar 29 20:20:03 vdr kernel: [119370.230553] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000573583cc (0x2ae36c000 dma), new deq ptr = 0000000041e92668 (0x2ae36c5a0 dma), new cycle = 0
Mar 29 20:20:03 vdr kernel: [119370.230558] <intr> xhci_ring_cmd_db:282: xhci_hcd 0000:00:00.0: // Ding dong!
Mar 29 20:20:03 vdr kernel: [119370.230631] [27868] xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Cancel URB 0000000060641c50, dev 2, ep 0x82, starting at offset 0x2ae36c5a0
Mar 29 20:20:03 vdr kernel: [119370.230638] [27868] xhci_ring_cmd_db:282: xhci_hcd 0000:00:00.0: // Ding dong!
Mar 29 20:20:03 vdr kernel: [119370.230650] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Successful Set TR Deq Ptr cmd, deq = @2ae36c5a0
Mar 29 20:20:03 vdr kernel: [119370.230700] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Removing canceled TD starting at 0x2ae36c5a0 (dma).
Mar 29 20:20:03 vdr kernel: [119370.230705] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Finding endpoint context
Mar 29 20:20:03 vdr kernel: [119370.230710] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Cycle state = 0x0
Mar 29 20:20:03 vdr kernel: [119370.230715] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: New dequeue segment = 00000000573583cc (virtual)
Mar 29 20:20:03 vdr kernel: [119370.230719] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: New dequeue pointer = 0x2ae36c5b0 (DMA)
Mar 29 20:20:03 vdr kernel: [119370.230725] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000573583cc (0x2ae36c000 dma), new deq ptr = 0000000050070757 (0x2ae36c5b0 dma), new cycle = 0
Mar 29 20:20:03 vdr kernel: [119370.230730] <intr> xhci_ring_cmd_db:282: xhci_hcd 0000:00:00.0: // Ding dong!
Mar 29 20:20:03 vdr kernel: [119370.230798] [27868] xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Cancel URB 00000000588cca08, dev 2, ep 0x82, starting at offset 0x2ae36c5b0
Mar 29 20:20:03 vdr kernel: [119370.230805] [27868] xhci_ring_cmd_db:282: xhci_hcd 0000:00:00.0: // Ding dong!
Mar 29 20:20:03 vdr kernel: [119370.230816] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Successful Set TR Deq Ptr cmd, deq = @2ae36c5b0
Mar 29 20:20:03 vdr kernel: [119370.230865] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Removing canceled TD starting at 0x2ae36c5b0 (dma).
Mar 29 20:20:03 vdr kernel: [119370.230870] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Finding endpoint context
Mar 29 20:20:03 vdr kernel: [119370.230874] <intr> xhci_dbg_trace:31: xhci_hcd 0000:00:00.0: Cycle state = 0x0
Mar 29 20:...

Read more...

Guilherme G. Piccoli (gpiccoli) wrote :

Thanks a lot Imperia! It's indeed the same PCI adapter, and it's even better you're running an upstream kernel like this.

I'll analyze your logs in order to match with the ones I have here.
I might need some xhci traces to understand the TRBs operations (like the enqueue and completion of TRBs). I'll comment here in case I need it.

Cheers,

Guilherme

imperia (imperia777) wrote :

echo xhci-hcd >> /sys/kernel/debug/tracing/set_event
(tail -F -n0 /var/log/kern.log &) | grep -q "TRB DMA"
cp /var/log/kern.log /home/imperia/log1.log

Is this correct command to get traces?
I will run it in advance.

Somebody told me to run this before when I was looking for help.

BTW did you download the full logs so I can remove it from web page?

I will can provide ssh access to box affected if needed.

Guilherme G. Piccoli (gpiccoli) wrote :

Wow Imperia, you're being really helpful here, thank you very much!

To enable traces, these are the instructions I've provided to other people affected so far:

0) Reboot the machine in order to put it in a consistent state;
1) echo "module xhci_hcd +flpt" > /sys/kernel/debug/dynamic_debug/control
2) echo nop > /sys/kernel/debug/tracing/current_tracer
3) echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
4) echo 0 > /sys/kernel/debug/tracing/trace
5) echo 1 > /sys/kernel/debug/tracing/tracing_on
6) echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable

After reproduce the issue, you should collect /sys/kernel/debug/tracing/trace. Problem is that the file might be huge, much larger than the kernel log you provided for instance.

About the SSH access, I'm interested in getting it next week, if it doesn't annoy you too much. It'll be really helpful, but I might need to reboot the machine.

Oh, I've downloaded the logs from your website, so you can delete it now.
Cheers,

Guilherme

imperia (imperia777) wrote :

Hello,

I think I am ready with the trace log. Hopefully it is full, because machine run out of disk space :)
http://imperia.mine.nu/trace1.log.bz2
Interesting is that it took ~12 hours to crash it this time.

The problem with ssh access is that this is virtual machine under XEN and when you reboot it, the USB controller is gone(not assigned to virtual machine anymore). I have to re-assign the USB controller for passthrough from xen host. (this is xen bug I think, it wasn't like this before).

This is what I do when I have to restart vdr virtual machine:
xl pci-assignable-remove 03:00.0
xl pci-assignable-add 03:00.0
xl create /etc/xen/vdr.cfg

Anyway we can get in touch on irc and I can do restarts for you.

BTW, I shutdown the whole xen server. Then I turn off the power button on PSU and pressed the power button on the case to discharge any electricity left and put it in consistent state before getting the trace logs.

Thanks again Imperia, the traces are fine. They're only 25MB, shouldn't have caused any kind of disk issues, like out of space condition. Also, I'd like to see the correlated kernel log to match the problematic TRBs from the kernel log with trace information. Can you provide me the relevant kern.log file?

I've already downloaded the traces from your server, in case you want to remove the file.

About the SSH, thanks for the offering and let's talk on IRC in case I need it. I'll try the logs first, not sure they're enough for me to understand the issue completely.

Cheers,

Guilherme

Hi Imperia, I built a mainline kernel (version 4.16) with a different quirk that I think might help here. Can you test it? Thanks in advance!

Instructions (run all as root):

1) wget people.canonical.com/~gpiccoli/imperia416.tgz
2) mv imperia416.tgz /
3) tar -zxf imperia416.tgz
4) update-initramfs -c -k 4.16.0-imperia+

Now that's important: if you have access to some serial console in the machine (or if you have physical access), you can reboot into this new kernel. In case _you only have ssh_, I'd suggest to remove the kernel boot entry from grub, and boot through kexec for safety reasons:

a) Remove boot entries from grub.cfg (you can copy away vmlinuz-4.16-imperia+ to some place outside /boot and run "update-grub" for this)
b) apt-get install kexec-tools
c) kexec vmlinuz-4.16-imperia+ --initrd initrd.img-4.16-imperia+ --append="$(cat /proc/cmdline)"
----

After machine (hopefully!) boot to the new kernel, check in dmesg if the quirk is there:
#$ dmesg|grep QUIRK
[0.813486] QUIRK: XHCI_AVOID_BEI

If you can see that output ("QUIRK: XHCI_AVOID_BEI"), then the quirk was applied.
Now, just need to try to reproduce the issue again.

Thanks a lot,

Guilherme

imperia (imperia777) wrote :

Hello,

I am unable to test with the kernel you provided, because my tuner card doesn't have driver in mainline kernel tree. So I have to compile it myself and I need kernel headers for this.

So I compiled kernel 4.16 from debian linux-source-4.16 package and applied the patch you provided:

From dd0375ffba55172194999d40b35344e9dc2682df Mon Sep 17 00:00:00 2001
From: "Guilherme G. Piccoli" <email address hidden>
Date: Wed, 11 Apr 2018 11:04:13 +0000
Subject: [PATCH] xhci: Add quirk to ASMedia 0x1242 adapter to avoid BEI

Signed-off-by: Guilherme G. Piccoli <email address hidden>
---
 drivers/usb/host/xhci-pci.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index d9f831b..0654461 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -213,6 +213,12 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
                xhci->quirks |= XHCI_TRUST_TX_LENGTH;

        if (pdev->vendor == PCI_VENDOR_ID_ASMEDIA &&
+ pdev->device == 0x1242) {
+ xhci->quirks |= XHCI_AVOID_BEI;
+ pr_warn("QUIRK: XHCI_AVOID_BEI");
+ }
+
+ if (pdev->vendor == PCI_VENDOR_ID_ASMEDIA &&
                pdev->device == PCI_DEVICE_ID_ASMEDIA_1042A_XHCI)
                xhci->quirks |= XHCI_ASMEDIA_MODIFY_FLOWCONTROL;

--
2.7.4

Compiled my tuner card driver now and I am testing.

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status: In Progress → Won't Fix
imperia (imperia777) wrote :

this is dmidecode output of my machine, in case the fix is FW related, it may be useful in order to contact the motherboard vendor

Roy Thompson (royt77) wrote :

I am running into this same issue with an ASMedia 2142 USB board. Was a fix ever identified?

Hi Roy, thanks for the report. What is your motherboard? What kernel are you running? And what tests are triggering this issue for you?
If you have logs, it'll be pretty useful.

Maybe it's a similar but different case..or the logs may help to confirm it's exact the same issue.

ASMedia seems to have a FW fix but that depends on your motherboard vendor to provide it. They don't provide the fix themselves...it needs some cooking from the vendor, to match subsystem IDs and whatnot.

Cheers,

Guilherme

Changed in linux (Debian):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
status: New → Confirmed
Roy Thompson (royt77) wrote :

Hi Guilherme,

Thanks for the response. I have several (3) quad port ASMedia 2142 PCIe/USB 3.1 cards installed in a Dell R740 rack server. I am using the standard Ubuntu 18.04 kernel (Linux dell-PowerEdge-R740 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux).

For one of my applications, it runs a loop that opens and closes a high speed connection to a USB device connected through the ASMedia board. After this goes on for several minutes without any issues, I see this in dmesg:

[Oct 5 10:12] xhci_hcd 0000:be:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
[ +3.418076] xhci_hcd 0000:be:00.0: WARN Successful completion on short TX
[ +0.000035] xhci_hcd 0000:be:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 12 comp_code 1
[ +0.000003] xhci_hcd 0000:be:00.0: Looking for event-dma 0000001fe9759610 trb-start 0000001fe9759620 trb-end 0000001fe9759620 seg-start 0000001fe9759000 seg-end 0000001fe9759ff0

This is then followed shortly after by several kernel dump messages, and then the whole system starts behaving erratically, requiring a hard reboot to recover.

The condition is easy for me to reproduce and I will happily provide any logs that may be of use to help debug this. Please just let me know what you would like and how to get them (as I am not a kernel expert).

Thanks,
Roy

Hi Roy, thanks for your quick response. First thing, I'd like to ask you to attach the output of "lspci -vvv" and "dmidecode" in this LP so we can validate the adapters and be sure they are exactly the same, and also the motherboard type. Run both commands as root user.

After that, i'll ask you to reproduce the issue and attach the output of "dmesg" command right after reproduction. If you can also elaborate more about the test you're running, I'd really be glad.

I'll then provide you custom commands to use the kernel trace system to infer more about the issue. One final thing: are you willing to test with mainline kernel in order to check if there's some upstream fix for your instance of the issue?
If so, you can get it here: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable
This PPA provides a build from kernel 4.18.

Thanks in advance,

Guilherme

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments