TB16 dock ethernet corrupts data with hw checksum silently failing

Bug #1729674 reported by Dave Chiluk on 2017-11-02
This bug affects 8 people
Affects          Status      Importance   Assigned to
Dell Sputnik     -           High         Unassigned
linux (Fedora)   Confirmed   Undecided    -
linux (Ubuntu)   -           High         Kai-Heng Feng
  Xenial         -           Undecided    Unassigned
  Artful         -           High         Unassigned
  Bionic         -           High         Kai-Heng Feng

Bug Description

It looks like TCP rx and tx checksum offloading is broken on the TB16 dock's ethernet adapter. For example, downloading a large file such as the Ubuntu ISO and then running md5sum on it yields an incorrect md5sum. This is because both of the following are set to on by default:
rx-checksumming: on
tx-checksumming: on

Running sudo ethtool -K <TB16 eth device> tx off rx off allows the download to complete correctly. This is very bad, since corruption that goes undetected makes everything received through the dock untrustworthy.
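
To check which state an interface is currently in before and after applying the workaround, the output of ethtool -k can be filtered. This is a minimal sketch: the function names checksum_state and disable_checksum_offload are hypothetical helpers (not part of ethtool), and the interface name has to be supplied by the caller.

```shell
# Hypothetical helper: read `ethtool -k <dev>` output on stdin and print the
# state ("on" or "off") of the named feature, e.g. rx-checksumming.
checksum_state() {
    feature="$1"
    awk -v f="$feature:" '$1 == f { print $2; exit }'
}

# Hypothetical helper: turn off rx and tx checksum offload on one interface.
# Needs root; this is the same ethtool invocation as in the description.
disable_checksum_offload() {
    dev="$1"
    ethtool -K "$dev" rx off tx off
}
```

Usage would look like: ethtool -k <TB16 eth device> | checksum_state rx-checksumming. The setting is not persistent, so it has to be reapplied after a reboot or after re-plugging the dock.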

This was conducted using a Dell Precision 5520 on Ubuntu 16.04+linux-generic-hwe-16.04-edge.

Thank you

This is a Dell XPS 13 connected to the network via the TB16 dock.
Kernel is: Linux ag13.local 4.12.0-0.rc3.git0.2.fc27.x86_64 #1 SMP Tue May 30 19:36:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Host controller of the dock:
09:00.0 USB controller: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller

USB network interface in the dock:
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/7p, 5000M
        |__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=r8152, 5000M

[32930.573816] usb 4-1.2: new SuperSpeed USB device number 3 using xhci_hcd
[32930.591744] usb 4-1.2: New USB device found, idVendor=0bda, idProduct=8153
[32930.591752] usb 4-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[32930.591757] usb 4-1.2: Product: USB 10/100/1000 LAN
[32930.591761] usb 4-1.2: Manufacturer: Realtek
[32930.591766] usb 4-1.2: SerialNumber: 000001000000
[32930.739428] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci_hcd

I *sometimes* get the following in the log and with that the ethernet port stops working.
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec010 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec020 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec030 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec040 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec050 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09:00.0: Looking for event-dma 00000001c3eec060 trb-start 00000001c3eebfe0 trb-end 00000001c3eebfe0 seg-start 00000001c3eeb000 seg-end 00000001c3eebff0
Jun 12 19:00:04 ag13.local kernel: xhci_hcd 0000:09...

There is an upstream patch for the ASM1042A host controller[1] that has been reported to help with the issue (see corresponding launchpad issue[2]).

[1] http://www.spinics.net/lists/linux-usb/msg157958.html
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1667750

After an initial hiccup with the LAN cable in the dock (and plugging it into a different socket), the performance is now much better (not sure if I can say it's perfect, yet) using the patched kernel.
Thanks!

For future reference, the mentioned patch got merged upstream, as commit 9da5a1092b13468839b1a864b126cacfb72ad016
It also made it into stable, 4.12.4 I believe, as 5cc9b698a494827b15f74ef70a31d7911d00e52a

So I think this should be fixed (or at least better) in F26, because we currently ship 4.12.5-300.fc26.x86_64

(In reply to Christian Kellner from comment #5)
> For future reference, the mentioned patch got merged upstream, as commit
> 9da5a1092b13468839b1a864b126cacfb72ad016
> It also made it into stable, 4.12.4 I believe, as
> 5cc9b698a494827b15f74ef70a31d7911d00e52a
>
> So I think this should be fixed (or at least better) in F26, because we
> currently ship 4.12.5-300.fc26.x86_64

The network works, but sadly it corrupts packets. Martin says that because of this he has difficulty downloading things, connecting to services...

@Jiri,

Are you sure that's a result of this patch? This is the first report I've heard of that.

@Mario, I think what Jiri means is that without the patch it doesn't work well at all, but even with the patch the situation is not perfect. Let me cc Benjamin, maybe we can add a test in our Fedora Hardware test suite for that. We still have the TB16 dock in Munich right now, maybe we can be of help.

I'll let Martin speak for himself because it was him who complained about it to me.
I've been using kernel 4.12.8 which should have the patch included since the morning and haven't experienced any noticeable problems with the network.

Yes, for me, the Ethernet on the Docks is pretty broken. For example, when downloading a whole Koji build with about 13 packages, each time the download got broken at about 4th or 5th package, with (I think) a SSL handshake error. Also when downloading a Fedora ISO 4 times in a row, each of them got corrupted (md5 check just didn't pass).

Also, the USB performance of the dock is terrible, I'm not sure if this is related to the issue the patch in question is supposed to solve but after updating the laptop firmware to 1.2.1.0, my mouse and keyboard get disconnected very often. On the other hand, dock audio works just fine and one would assume all of these devices are on the same USB hub.

I'm currently working around this by plugging a USB-C adapter with ethernet into the Thunderbolt port on the docking station.

Martin, could you maybe try disabling RX checksum offloading and see if that helps? Then the corrupted packets should be discarded by the kernel (even if they are only corrupted during the transfer over USB). I.e., try again after running:

  ethtool --offload $DEVICE rx off
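
If it is not obvious which interface name the dock's adapter got, the driver binding in sysfs can be used to find it. The following is a rough sketch under the assumption that the adapter is the only device bound to r8152; r8152_ifaces is a hypothetical helper name, and the optional directory argument only exists so the function can be tried against a fake tree.

```shell
# Hypothetical helper: list network interfaces bound to the r8152 driver.
# The optional argument is the sysfs net directory (default /sys/class/net).
r8152_ifaces() {
    root="${1:-/sys/class/net}"
    for ifdir in "$root"/*; do
        link="$ifdir/device/driver"
        # Virtual interfaces (lo, bridges, ...) have no driver symlink.
        [ -L "$link" ] || continue
        drv=$(basename "$(readlink "$link")")
        if [ "$drv" = "r8152" ]; then
            basename "$ifdir"
        fi
    done
    return 0
}
```

With that, the command above could be applied as: for dev in $(r8152_ifaces); do sudo ethtool --offload "$dev" rx off; done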

@Martin

Just to make sure - this is a TB16 not TB15 right? This is sounding suspiciously like a hardware problem to me.

(In reply to Mario Limonciello from comment #12)
> @Martin
>
> Just to make sure - this is a TB16 not TB15 right? This is sounding
> suspiciously like a hardware problem to me.

It's TB16.
You mean the ethernet or USB problem? I think we've started mixing two (most likely) unrelated problems. I have not been able to reproduce the ethernet problem for the whole day. Martin also has Windows 10 installed on his XPS 13, so he could try it there and if the problem still occurs it's very likely a hardware problem.

The USB one doesn't seem like a hardware problem because I'm affected by that, too, after the last firmware update. Devices connected to the USB ports don't work at all or just for a short period of time after they're plugged in.

Well, I'm not sure if they're related, but since the Ethernet device is a USB device on the hub, I would suspect them to be.

Can you please clarify which XPS machines you are seeing this on? There are at least 4 different XPS models that support TB16.
Please comment your last working and last failed BIOS versions too.

We both have XPS 13 9360. I had problems with Ethernet from the very beginning until I used a patched kernel. But after updating the firmware to 1.3.7 USB devices stopped working*. Now we're on 2.1.0 and they still don't work, no matter if we use the kernel patch or not. I have to have a USB hub connected directly to the laptop. The last working firmware for me was 1.3.5.

* It really depends on the type of the device. The mouse and keyboard don't work at all or just for a very short time after plugging in. I also have a USB sound card. It seems to work, the system identifies the sound card as an audio output, it plays sound, but there are audible corruptions (cracks etc) which don't occur when the sound card is connected directly to the laptop. What I'm experiencing with sound may be similar to what Martin is experiencing with the Ethernet.

Ah OK thanks. I just poked around the Dell forums a little bit and you guys aren't the first ones reporting this on 9360 after upgrade.

http://en.community.dell.com/support-forums/laptop/f/3518/t/20017063?pi41097=1

I'll poke some of the Dell support guys to look at this, it sounds like it might have slipped through the cracks.

I also checked internally on what went into 1.3.6/1.3.7.
At least 1.3.6 had some tweaks for addressing noise, which would be most suspicious to me as a possible impact.

For now, can you two downgrade to 1.3.5? Fwupd probably won't let you, but you can place the .EXE file on a FAT32 partition and do it from F12 menu at POST I expect.

We'll try to downgrade for the time being. BTW I also reported the issue to @DellCaresPRO like Barton George instructed me on Twitter. They said 10 days ago they had people looking into it, but there hasn't been any update since then, so I have no idea if someone is really looking into it and if they've made any progress, and who is "they".

I won't be able to shortcut the process by pinging people, but I understand this is being investigated, it will just take some time.

(In reply to Benjamin Berg from comment #11)
> Martin, could you maybe try disabling RX checksum offloading and see if that
> helps? Then the corrupted packets should be discarded by the kernel (even
> if they are only corrupted during the transfer over USB). i.e. try again
> after running:
>
> ethtool --offload $DEVICE rx off

With this, it seems to work alright, thanks! Kernel 4.13.0-0.rc5.git1.1.fc27.x86_64 BTW.

(In reply to Mario Limonciello from comment #16)
> For now, can you two downgrade to 1.3.5? Fwupd probably won't let you, but
> you can place the .EXE file on a FAT32 partition and do it from F12 menu at
> POST I expect.

I'm able to function this way so I'll probably not go for that - unless it'll be necessary to verify it actually happened between the mentioned versions.
I'd rather track if there's a new release and then upgrade when it's out and see if it fixes the USB problem.

Every now and then (especially when downloading large files), the ethernet simply stops working with the following log in dmesg.
Unloading the r8152 module results in gnome-shell dying. After reloading it, ethernet still doesn't work. Disconnecting the Dock in this state kills everything from GDM down to my user session.

[159642.248648] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
[159642.248666] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00e0(Transmitter ID)
[159642.248680] pcieport 0000:00:1c.0: device [8086:9d10] error status/mask=00001000/00002000
[159642.248690] pcieport 0000:00:1c.0: [12] Replay Timer Timeout
[159661.087306] xhci_hcd 0000:0a:00.0: port 1 resume PLC timeout
[159667.687492] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.687514] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc010 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.687610] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.687627] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc020 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.687722] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.687735] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc030 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.687829] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.687838] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc040 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.687971] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.687988] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc050 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.723135] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.723158] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc060 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 seg-start 00000003a0cfe000 seg-end 00000003a0cfeff0
[159667.723202] xhci_hcd 0000:09:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[159667.723219] xhci_hcd 0000:09:00.0: Looking for event-dma 00000004694bc070 trb-start 00000003a0cfefe0 trb-end 00000003a0cfefe0 ...

As I understand it, the change in BIOS 1.3.6/1.3.7 that is linked to this issue adjusts a voltage regulator (to fix something else; this was an unanticipated, undiscovered regression). For now I would recommend downgrading to 1.3.5 until a fixed BIOS is issued.

It got really annoying lately. How do I downgrade to 1.3.5, please? I can't find it on the Dell website and fwupd doesn't provide anything either.

Running kernel-4.13.0-1.fc27.x86_64.

BIOS 2.2.1 finally hit the Dell website. I can confirm that with this, the USB overall experience is now much much better (except the occasional mouse stutter but that may as well be on the OS side). There seems to be no problem at all with the dock Ethernet adapter.

On 4.13.4-300.fc27.x86_64, I still experience the SSL errors when downloading larger amounts of data, like git repositories and such. It gets fixed after disabling RX checksum offloading with the ethtool command you provided before.

Mario Limonciello (superm1) wrote :

Does this same behavior happen in 4.14-rc7? There's a few interesting commits that have happened since 4.10 (eg https://github.com/torvalds/linux/commit/b20cb60e2b865638459e6ec82ad3536d3734e555#diff-d45f6c5dfa1088acc4fb00e7636dbba7)

Dave Chiluk (chiluk) wrote :

I should have been more specific: I'm on 4.13.0-16-generic, which already contains that change. Good to see you are still around watching this project.

Dave Chiluk (chiluk) wrote :

For completeness, the ethernet device is:

Bus 004 Device 003: ID 0bda:8153 Realtek Semiconductor Corp.
...
  idVendor 0x0bda Realtek Semiconductor Corp.
  idProduct 0x8153
  bcdDevice 30.11
  iManufacturer 1 Realtek
  iProduct 2 USB 10/100/1000 LAN
...

Dave Chiluk (chiluk) on 2017-11-02
summary: - TB16 dock ethernet is broken by default
+ TB16 dock ethernet corrupts data with hw checksum silently failing
description: updated
Dave Chiluk (chiluk) wrote :

Going the opposite direction it looks like 4.10.0-38-generic may be working fine. b20cb60 may actually be a regression for rtl8153.

Dave Chiluk (chiluk) wrote :

I spoke too soon. It looks like both 4.10.0-38 and 4.13.0-16-generic have the issue.

Mario Limonciello (superm1) wrote :

Going the opposite direction you may or may not have https://github.com/torvalds/linux/commit/9da5a1092b13468839b1a864b126cacfb72ad016#diff-8e88b7e83565580efd59e852f42341a5

That's supposed to be fixing the problems with Ethernet.

If you can trivially reproduce this, could you maybe bisect?

That one applies pretty exclusively to asmedia devices. I don't see how
that would affect a realtek device. Either way, I'll try to carve out some
time to check the mainline kernel and bisect if possible. It's pretty
straightforward to reproduce. All I've been doing is downloading an
Ubuntu ISO and checking its md5sum. If I have hardware offloading
on, it will not pass the md5sum.
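
The check described here can be scripted so that it gives a pass/fail exit status, which is convenient when testing several kernels in a row. This is only a sketch: md5_matches is a hypothetical helper name, and the expected digest has to come from a trusted source such as the MD5SUMS file published next to the ISO.

```shell
# Hypothetical helper: succeed only if FILE's MD5 digest equals EXPECTED.
md5_matches() {
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}
```

For example: md5_matches ubuntu.iso "<digest from MD5SUMS>" && echo good || echo bad, where ubuntu.iso stands for whatever file was downloaded through the dock.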

Is it possible that there is an updated firmware for the tb16 dock that I
may need? Otherwise you might want to contact the chip vendor to get them
working on this.

Mario Limonciello (superm1) wrote :

The asmedia host controller is what the realtek device is hooked up to.
That patch for asmedia host controller was developed specifically because
of problems with Ethernet not working at >100mbps and causing errors in the
syslog. So yeah that patch does help in that regard :(

It's possible that with the bisect you'll find that you'll come down to
that commit and the problem goes away traded for the poor performance
problem. If so that's not ideal, but I guess let's cross that bridge when
we come to it.

Dave Chiluk (chiluk) wrote :

I just upgraded to 17.10, and tested out 4.14.0-041400rc8-generic. The issue still exists in 4.14.0-041400rc8-generic. It's pretty simple to reproduce @superm1, you really should get your device partners alerted about this.

Changed in dell-sputnik:
assignee: nobody → Kai-Heng Feng (kaihengfeng)
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Changed in dell-sputnik:
assignee: Kai-Heng Feng (kaihengfeng) → nobody

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1729674

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Kai-Heng Feng (kaihengfeng) wrote :

I can reproduce the issue on a TB15 (which should be the same?).

Changed in linux (Ubuntu):
status: Incomplete → In Progress
Kai-Heng Feng (kaihengfeng) wrote :

Tried two other r8152 devices,
- r8152 <-> USB-C <-> Host system. No checksum issue.
- r8152 <-> Genesys Logic Hub <-> USB-C <-> Host system. No checksum issue.

So it's more likely to be an ASMedia issue.

Kai-Heng Feng (kaihengfeng) wrote :

This issue only happens at 1Gbps link speed with checksum offloading enabled.
Turning off checksum offloading or changing the speed to 100Mbps works around the issue.
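
For the second workaround, the link speed can be restricted with ethtool -s. The sketch below only prints the command instead of running it, so it can be reviewed first; limit_speed_cmd is a hypothetical helper name, and whether the link renegotiates cleanly at 100Mbps on a given switch is an assumption that needs checking.

```shell
# Hypothetical helper: print the ethtool command that limits DEV to
# 100 Mb/s full duplex while leaving autonegotiation enabled.
# Nothing is executed here; run the printed command as root to apply it.
limit_speed_cmd() {
    dev="$1"
    printf 'ethtool -s %s speed 100 duplex full autoneg on\n' "$dev"
}
```

Disabling rx checksum offload keeps the gigabit link, so it is usually the less costly of the two workarounds.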

Mario Limonciello (superm1) wrote :

@Dave:

I was glancing at the r8152 driver and noticed that it has some special handling for ipv6. Is this issue reproducing only on ipv6 for you?
https://github.com/torvalds/linux/commit/6128d1bb30748d0ff56a63898d14f312126e404c

I am not using ipv6.

Dave Chiluk (chiluk) wrote :

I also just went through the process of reproducing this while watching the kern.log. Absolutely 0 messages came out. If you find some verbose debugging you want me to turn on let me know.

This is happening even on my 9560 with 4.13.9 vanilla; when running a background rsync backup job, packages downloaded in a Debian docker build frequently do not match their checksum and need multiple runs to succeed.

And just to illustrate my point, on 4.14.0 vanilla:

while true; do
    dd if=/nfsmount/debian-live-9.1.0-amd64-xfce+nonfree.iso bs=16M iflag=direct 2>/dev/null | sha1sum
done

With rx offload on (default):

489ed92b17aa9a4582899356d3123621b5d92189 -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
f11ba5f624dbab5a52319801c28a7032cc9b5100 -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
e925ff013c99a1b732a99aeaf5d3f1f02c8dfa40 -

With rx offload off:

742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
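
The same comparison can be reduced to a single number by counting distinct digests over a fixed number of passes. This is a sketch under the same assumptions as the loop above (the file lives on a network mount reached through the dock, and the filesystem supports direct I/O); hash_passes and count_distinct_digests are hypothetical helper names.

```shell
# Hypothetical helper: read `sha1sum` output lines on stdin and print the
# number of distinct digests. 1 means every pass read identical data.
count_distinct_digests() {
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: hash FILE COUNT times, bypassing the page cache with
# iflag=direct as in the loop above, and emit one sha1sum line per pass.
hash_passes() {
    file="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if="$file" bs=16M iflag=direct 2>/dev/null | sha1sum
        i=$((i + 1))
    done
}
```

For example: hash_passes /nfsmount/debian-live-9.1.0-amd64-xfce+nonfree.iso 5 | count_distinct_digests. Fed the two listings above, the counter gives 4 with rx offload on and 1 with it off.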

Changed in linux (Fedora):
importance: Unknown → Undecided
status: Unknown → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Mario, Dave,

Do you use TB16? I only have TB15 at hand.
Can you attach the output of `udevadm info -e` here if you use TB16?

Thanks.

Dave Chiluk (chiluk) wrote :

udevadm info -e as requested.

Kai-Heng Feng (kaihengfeng) wrote :

Dave,

Please try this kernel,
http://people.canonical.com/~khfeng/lp1729674/

The temporary workaround is what we can get before chip vendors solve the issue.

I reviewed your patch, and it appears as if it only turns off receive
checksumming. The internets are saying that transmit needs to be turned off
as well. Have you checked to see if transmit is affected as well?

I will try to do some transmit tests. I'll also "test" your kernel when I
get a chance, but by the looks of the patch there shouldn't be too much to
test as it simply turns off RX checksumming which is already a known good
solution.

Kai-Heng Feng (kaihengfeng) wrote :

> On 20 Nov 2017, at 11:47 PM, Dave Chiluk <email address hidden> wrote:
>
> I reviewed your patch, and it appears as if it only turns off receive
> checksumming. The internets are saying that transmit needs to be turned off
> as well. Have you checked to see if transmit is affected as well?

TX is not affected under my simple NFS testing.

>
> I will try to do some transmit tests. I'll also "test" your kernel when I
> get a chance, but by the looks of the patch there shouldn't be too much to
> test as it simply turns off RX checksuming which is already a known good
> solution.

Yea, just want to make sure USB/PCI ids are the same for both TB15 and TB16.
The TB15 at my hand is not even the mass production one.

Kai-Heng Feng (kaihengfeng) wrote :

Dave, can you share the output of `sudo lsusb -v`?

Dave Chiluk (chiluk) wrote :

I'm sorry I've been unable to test this from my end. Have you been able to make any progress on this?

Kai-Heng Feng (kaihengfeng) wrote :

Yes.
Asmedia folks are working on a workaround for this issue.
I'll poke around to see if there's any timeline.

Kai-Heng Feng (kaihengfeng) wrote :

This kernel disables RX aggregation instead, please check if it works on your side.
http://people.canonical.com/~khfeng/lp1729674-2/

Dave Chiluk (chiluk) wrote :

Those changes test as good.

@Kai-Heng Feng. In the future you should consider setting LOCALVERSION, or using a PPA and adding a +lp1729674 suffix to the version string in the changelog. With what you did, it's hard to distinguish between your test package and an official package.

See https://wiki.ubuntu.com/Kernel/Dev/KernelBugFixing#Publish_a_Package_for_Testing for more info.

Thanks for the work.
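
As a rough illustration of the LOCALVERSION suggestion above, the sketch below derives a bug-specific suffix and shows (as a comment) how it could be passed to a kernel build. `lp_localversion` is a hypothetical helper, not something from this thread. `LOCALVERSION` and the `bindeb-pkg` target belong to the kernel's own build system, but the command actually used to build the test kernels here is not known, so treat the `make` line as an assumption.

```shell
#!/bin/sh
# lp_localversion BUGNUMBER
# Hypothetical helper: prints a version suffix that ties a test kernel
# to a Launchpad bug, e.g. "+lp1729674".
lp_localversion() {
    printf '+lp%s\n' "$1"
}

# Possible use from a kernel source tree (assumed workflow, not taken
# from this bug report):
#   make -j"$(nproc)" LOCALVERSION="$(lp_localversion 1729674)" bindeb-pkg
```

With a suffix like this, `uname -r` on the test machine shows at a glance that a test build is running rather than an official package.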

Dave Chiluk (chiluk) wrote :

I should also mention that this should probably be pushed to linux-stable as well as mainline as this is a silent data corruption bug.

Kai-Heng Feng (kaihengfeng) wrote :

Sometimes I forget to set a version number. Sorry about that.

Currently I am still gathering some information from Dell/Realtek. I'll send a new patch upstream soon.

Dave Chiluk (chiluk) wrote :

That patch note makes it sound like there will be a hardware/firmware fix that will hopefully resolve this. If so, it is very unlikely that upstream will accept your patch, since the proper fix would be to upgrade the firmware. A preferable patch would log a big, prominent warning when a TB16 with the bad firmware revision is detected. The kernel people don't tend to like working around hardware issues that are resolvable through firmware updates.

Mario Limonciello (superm1) wrote :

This is not a hardware failure.

When a proper fix is developed, I'd expect it to come in the form of a patch to the xHCI driver that adjusts how the ASMedia host controller operates internally.

Applied to 4.14.14. Offload:

tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on

dd | sha1sum loop:

742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -
742462292c76189f63fc3e7af1acc9dec56c0a8d -

Ran for 10 minutes, so it looks like that patch works (doing around 90 Mbit/s of traffic).
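
The exact loop used above was not posted. A minimal sketch of such a `dd | sha1sum` loop, assuming the data source is a path that is actually read over the dock's interface on every pass, might look like the following. `hash_loop` and its arguments are hypothetical names.

```shell
#!/bin/sh
# hash_loop FILE [PASSES]
# Hypothetical helper: streams FILE through sha1sum PASSES times (default 6),
# prints each digest, and returns non-zero as soon as one pass disagrees
# with the first.
hash_loop() {
    src=$1
    passes=${2:-6}
    first=""
    i=1
    while [ "$i" -le "$passes" ]; do
        # sha1sum prints "<digest>  -" for stdin; keep only the digest
        sum=$(dd if="$src" bs=1M 2>/dev/null | sha1sum | cut -d' ' -f1)
        echo "$sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on pass $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

Note that repeated reads of the same file may be served from the local page cache rather than the network, so a test like this is only meaningful if each pass really pulls the data across the TB16 link.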

Changed in linux (Ubuntu Artful):
status: New → Fix Committed

On 4.15.4 I see a lot of:

Feb 21 15:43:31 localhost kernel: [18401.483078] pcieport 0000:00:1d.6: AER: Corrected error received: id=00ee
Feb 21 15:43:31 localhost kernel: [18401.483095] pcieport 0000:00:1d.6: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=00ee(Transmitter ID)
Feb 21 15:43:31 localhost kernel: [18401.483097] pcieport 0000:00:1d.6: device [8086:a11e] error status/mask=00001000/00002000
Feb 21 15:43:31 localhost kernel: [18401.483099] pcieport 0000:00:1d.6: [12] Replay Timer Timeout

Which may or may not be related. However, randomly, r8152 stops working entirely. Most recent dmesg:

Feb 21 15:43:42 localhost kernel: [18412.136941] ------------[ cut here ]------------
Feb 21 15:43:42 localhost kernel: [18412.136947] NETDEV WATCHDOG: enxa44cc8d0edff (r8152): transmit queue 0 timed out
Feb 21 15:43:42 localhost kernel: [18412.136969] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:323 dev_watchdog+0x215/0x220
Feb 21 15:43:42 localhost kernel: [18412.136972] Modules linked in: sg uas usb_storage rfcomm nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter ctr ccm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink cmac bnep binfmt_misc snd_usb_audio cdc_ether usbnet snd_usbmidi_lib r8152 snd_rawmidi snd_seq_device mii btusb btrtl uvcvideo btbcm btintel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 bluetooth videodev videobuf2_core ecdh_generic joydev mousedev hid_multitouch snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 i2c_designware_platform
Feb 21 15:43:42 localhost kernel: [18412.137037] i2c_designware_core iwlmvm input_leds i2c_hid mac80211 dell_smm_hwmon x86_pkg_temp_thermal crc32_pclmul iwlwifi crc32c_intel i915 snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec aesni_intel snd_hwdep aes_x86_64 snd_hda_core crypto_simd sha256_mb snd_pcm_oss glue_helper mcryptd snd_mixer_oss cryptd sha256_ssse3 snd_pcm snd_timer sha256_generic dell_smbios_wmi snd soundcore cfg80211 pcspkr int3400_thermal rtsx_pci acpi_thermal_rel intel_hid xhci_pci int3403_thermal processor_thermal_device mei_me xhci_hcd int340x_thermal_zone shpchp intel_lpss_pci mei intel_soc_dts_iosf intel_pch_thermal intel_lpss loop vhost_net tun vhost tap coretemp i2c_i801 kvm_intel kvm irqbypass uinput evdev nfsd ip_tables x_tables
Feb 21 15:43:42 localhost kernel: [18412.137101] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G U 4.15.4 #5
Feb 21 15:43:42 localhost kernel: [18412.137104] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.7.0 12/15/2017
Feb 21 15:43:42 localhost kernel: [18412.137108] RIP: 0010:dev_watchdog+0x215/0x220
Feb 21 15:43:42 localhost kernel: [18412.137112] RSP: 0018:ffff88087e443ea0 EFLAGS: 00010286
Feb 21 15:43:42 localhost kernel: [18412.137116] RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000103
Feb 21 15:43:42 localhost kernel: [18412.137119] RDX: 0000000080000103 RSI: ffffffff82063a3a RDI: 000...


Georgi Boiko (pandasauce) wrote :

This is an update to my October post in LP#1667750, which turned out to be a separate issue (1Gbps mode dropouts) on the same adapter.

Dell Precision 5520 with BIOS 1.7, using the TB16. This is on Ubuntu 16.04.3, kernel 4.13.0.

The issue is still present. I tried limiting the bandwidth using `ethtool -s eth0 speed 100 duplex full autoneg on` and also as described in this blog post: http://mark.koli.ch/slowdown-throttle-bandwidth-linux-network-interface and it *seems* to be making the issue less apparent, but still present.

$ for i in 1 2 3 4; do curl -s http://old-releases.ubuntu.com/releases/17.04/ubuntu-17.04-server-amd64.img -o $i.iso; md5sum $i.iso; done
2641b55ed2e203861fb6f642bb05b8f7 1.iso
63f41e8b8e4e5ad1909637dbd2efd849 2.iso
^C%

$ sudo ethtool -s eth0 speed 100 duplex full autoneg on

$ for i in 1 2 3 4; do curl -s http://old-releases.ubuntu.com/releases/17.04/ubuntu-17.04-server-amd64.img -o $i.iso; md5sum $i.iso; done
4672ce371fb3c1170a9e71bc4b2810b9 1.iso
4672ce371fb3c1170a9e71bc4b2810b9 2.iso
4672ce371fb3c1170a9e71bc4b2810b9 3.iso
4672ce371fb3c1170a9e71bc4b2810b9 4.iso

$ for i in 1 2 3 4; do curl -s http://old-releases.ubuntu.com/releases/17.04/ubuntu-17.04-server-amd64.img -o $i.iso; md5sum $i.iso; done
ed13e9c6c45f027f686000eccce42254 1.iso
4672ce371fb3c1170a9e71bc4b2810b9 2.iso
^C%

Next, I tried disabling offloading as described in LP#1667750. Keep in mind, the device still needs to be in 100Mbps mode or you will experience dropouts in addition to any packet corruption issues that you may run into.

$ sudo ethtool --offload eth0 tx off
Actual changes:
tx-checksumming: off
    tx-checksum-ipv4: off
    tx-checksum-ipv6: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp6-segmentation: off [requested on]

$ sudo ethtool --offload eth0 rx off

$ for i in 1 2 3 4 5 6; do curl -s http://old-releases.ubuntu.com/releases/17.04/ubuntu-17.04-server-amd64.img -o $i.iso; md5sum $i.iso; done
4672ce371fb3c1170a9e71bc4b2810b9 1.iso
4672ce371fb3c1170a9e71bc4b2810b9 2.iso
4672ce371fb3c1170a9e71bc4b2810b9 3.iso
4672ce371fb3c1170a9e71bc4b2810b9 4.iso
4672ce371fb3c1170a9e71bc4b2810b9 5.iso
4672ce371fb3c1170a9e71bc4b2810b9 6.iso

I left it to run over lunch at 25 loops to be sure and it's working fine. This weekend I may be able to test this on a 2017 XPS 9560 (non-DE) too.

Thanks for the workaround and looking forward to the patch making it to Ubuntu repos.
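
For anyone repeating the test above, it can help to confirm that the `ethtool --offload` calls took effect before starting a long download loop. The sketch below is a hypothetical helper (`checksum_offload_state` is not from this thread) that reads `ethtool -k <device>` output on stdin and prints the top-level rx/tx checksumming state. It assumes the `name: value` layout seen in the "Actual changes" output above.

```shell
#!/bin/sh
# checksum_offload_state
# Hypothetical helper: reads `ethtool -k <dev>` output on stdin and prints
# "rx=<state> tx=<state>" for the top-level checksumming features.
# Indented sub-features (e.g. tx-checksum-ipv4) are ignored, and trailing
# annotations such as "[fixed]" are dropped.
checksum_offload_state() {
    awk -F': *' '
        $1 == "rx-checksumming" { split($2, a, " "); rx = a[1] }
        $1 == "tx-checksumming" { split($2, a, " "); tx = a[1] }
        END { printf "rx=%s tx=%s\n", rx, tx }
    '
}
```

With the workaround applied, `ethtool -k eth0 | checksum_offload_state` would be expected to report both directions as `off`.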

Dave Chiluk (chiluk) wrote :

@pandasauce ... Fix Committed means it's in the git repository, but it has not completed testing or been integrated into the archives yet.

Also, please refrain from repeating things we already know in the thread or otherwise +1'ing or me-tooing. It just wastes developers' time that could be spent actually fixing the issue.

Dave Chiluk (chiluk) wrote :

Hasn't completed testing or been integrated into the archives.

Dave Chiluk (chiluk) on 2018-02-22
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Artful):
importance: Undecided → High
Dave Chiluk (chiluk) wrote :

@kmously I see that you marked this fix as Fix Committed in Artful, but I do not see it in the master-next branch of artful. I'm moving this back to In Progress in artful, as it does not appear to have been pushed to master-next for artful yet. Feel free to move it back to Fix Committed when you accept or merge the patch into master-next.

Changed in linux (Ubuntu Artful):
status: Fix Committed → In Progress
Dave Chiluk (chiluk) wrote :

Looks like this has been released with 4.15.0-9.10 which is available in bionic.

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Released
milestone: none → ubuntu-18.04
Georgi Boiko (pandasauce) wrote :

This also affects 16.04 (Xenial) but that isn't reflected in the ticket.

Luciano (luciano) wrote :

Hi. I'm a user of another distro (Gentoo), and found this bug while googling for a problem I'm having. I'm using a Realtek-based USB3 to RJ45 gigabit adapter. It plugs directly into my laptop (not through any sort of hub as with the Dell docks above), which is a Skylake-based Toshiba Radius P20W-C-103 with the following controller:

```
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
```

I am experiencing this on 4.9.79-r1, and also 4.14.22.

When I plug the device in, I get errors saying 'root hub lost power or was reset' unless I disable power management on USB hubs 3 and 4. If I disable PM using powertop, the device seems to work well. But as soon as I start heavy transfers (in my case a distributed compile), the network device stops responding.

The error messages that I'm receiving are very similar to what is posted above. This is the device coming up:

```
Feb 26 20:17:09 nizuc kernel: usb usb3: root hub lost power or was reset
Feb 26 20:17:09 nizuc kernel: usb usb4: root hub lost power or was reset
Feb 26 20:17:41 nizuc kernel: usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
Feb 26 20:17:41 nizuc kernel: usb 4-1: New USB device found, idVendor=0bda, idProduct=8153
Feb 26 20:17:41 nizuc kernel: usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
Feb 26 20:17:41 nizuc kernel: usb 4-1: Product: USB 10/100/1000 LAN
Feb 26 20:17:41 nizuc kernel: usb 4-1: Manufacturer: Realtek
Feb 26 20:17:41 nizuc kernel: usb 4-1: SerialNumber: 000001
Feb 26 20:17:41 nizuc kernel: usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
Feb 26 20:17:41 nizuc NetworkManager[2049]: <info> [1519676261.9009] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/5)
Feb 26 20:17:41 nizuc kernel: r8152 4-1:1.0 eth0: v1.09.9
Feb 26 20:17:42 nizuc mtp-probe[3673]: checking bus 4, device 2: "/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1"
Feb 26 20:17:42 nizuc mtp-probe[3673]: bus: 4, device: 2 was not an MTP device
Feb 26 20:17:42 nizuc upowerd[2168]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1
Feb 26 20:17:42 nizuc systemd-udevd[3676]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 26 20:17:42 nizuc kernel: r8152 4-1:1.0 enp1s0u1: renamed from eth0
Feb 26 20:17:42 nizuc NetworkManager[2049]: <info> [1519676262.2645] device (eth0): interface index 4 renamed iface from 'eth0' to 'enp1s0u1'
Feb 26 20:17:42 nizuc kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0u1: link is not ready
Feb 26 20:17:42 nizuc NetworkManager[2049]: <info> [1519676262.2829] device (enp1s0u1): state change: unmanaged -> unavailable (reason 'managed', internal state 'external')
Feb 26 20:17:42 nizuc upowerd[2168]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1/4-1:1.0
Feb 26 20:17:42 nizuc kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0u1: link is not ready
Feb 26 20:17:46 nizuc kernel: r8152 4-1:1.0 enp1s0u1: carrier on
Feb 26 20:17:46 nizuc kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0u1: link becomes ready
Feb 26 20:17:46 nizuc NetworkManager[2049]: <info> ...

Kai-Heng Feng (kaihengfeng) wrote :

Weird, somehow it didn't get pulled in for Xenial/Artful. I'll poke around to get the fix into the next kernel release.

@Luciano
Please file a separate bug via `ubuntu-bug linux`, thanks!
This bug is specific to the ASMedia xHC on the Dell TB16.

Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-artful
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Georgi Boiko (pandasauce) wrote :

After 6 iterations of ubuntu-17.04-server-amd64.img (4.2 gigs), I no longer see the corruption on either 4.13.0-38 or 4.15.0-13 from xenial-proposed. Thanks!

tags: added: verification-done-xenial
removed: verification-needed-xenial
Dave Chiluk (chiluk) on 2018-03-29
tags: added: verification-done-artful
removed: verification-needed-artful
Dave Chiluk (chiluk) wrote :

Ran same tests against 4.13.0-38 on artful.

Just curious, this only seems to be applied to the hwe and hwe-edge kernels for xenial. Is that a change in policy?

Even though I haven't attempted it, it looks like this should be pretty straightforward to apply to the 4.4 kernel stream.

Looks like this is more of a firmware issue with these docks and/or a driver issue with the 8152, so I'm throwing this back onto the queue where it was.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

I think I have the same issue with my laptop and dock (Dell TB16).
The laptop is new and runs Fedora 28. All firmware is up to date.

Ethernet works fine unless I transfer a large amount of data. The session (sftp, rsync or scp) is cut off abruptly after a few seconds. Nothing relevant appears in the system logs.

If I disable RX checksum offloading (as suggested above) with `ethtool --offload enp11s0u1u2 rx off`, everything works fine.

Tell me if you need more logs or information.

FYI, this commit ended up landing in relation to this. I would recommend backporting it.

https://github.com/torvalds/linux/commit/0b1655143df00ac5349f27b765b2ed13a3ac40ca

Hi Mario, thanks for the pointer. Fedora stable releases are currently on 4.16.15 so that fix should be in place. I've got a TB16 at home so I can also try to reproduce this on Fedora 28 this evening.

Marianne, adding the dmesg logs would be helpful. Thanks!

Kathryn Morgan (katamo) wrote :

Confirmed that the following command resolved the disconnects when transferring 25GB+ files over the TB16 ethernet device via both SCP and SFTP:
  `$ ethtool --offload enp14s0u1u2 rx off`

Models Tested:
  - 7720
  - 5520

Kernels Tested:
 - 4.14.xx

Observed Symptom
  Error Encountered when using scp or sftp:
  - ssh_dispatch_run_fatal: Connection to 10.10.10.36 port 22: message authentication code incorrect
  - lost connection

Unable to test >4.14 at this time

Mario Limonciello (superm1) wrote :

@Kat,

Can you please confirm which particular Ubuntu kernel you are running where you still need to run this command? As I understand it, this patch (which effectively does what that command does) has been backported into all the latest Ubuntu kernels, so if it's still happening, that is important information.

Dave Chiluk (chiluk) wrote :

@katamo
4.14.xx is not a supported Ubuntu kernel. I'm not sure where you pulled that kernel from, but it is not supportable.

At this point all Supported Ubuntu kernels and mainline 4.15+ have this fix.

@EVERYONE ELSE
If you think you are hitting this issue and are running the latest supported kernels, you are likely hitting a different issue, and should be opening a new bug. Please stop resurrecting this bug.
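
As a quick way to check the "mainline 4.15+" condition mentioned above, a version comparison along these lines could be used. This is only a sketch: `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and it says nothing about distribution kernels that carry the fix as a backport (for example the 4.13.0-38 and 4.4.0-119 Ubuntu kernels listed earlier in this report).

```shell
#!/bin/sh
# version_ge A B
# Hypothetical helper: succeeds if version string A sorts at or after B
# under GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Possible use (mainline kernels only; backported fixes are not detected):
#   if version_ge "$(uname -r)" 4.15; then
#       echo "kernel is 4.15 or newer"
#   fi
```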
