Kernel 4.15.0-36 network performance regression

Bug #1796895 reported by Calvin Cheng on 2018-10-09
This bug affects 8 people
Affects          Importance   Assigned to
linux (Ubuntu)   High         Joseph Salisbury
Bionic           High         Joseph Salisbury

Bug Description

Hardware:

HP DL380 Gen 7 server with Gigabit interface.
Ubuntu release: 18.04.1

Since upgrading from 4.15.0-33 to 4.15.0-36, I have noticed a dramatic slowdown in TCP over IPv6.

Set up: two servers, Server A in US on Ubuntu 18.04.1, Server B in Europe.

Test: Sending 400MB over TCP from Server B to Server A.

With 4.15.0-33 on Server A, the transfer completes in 24 seconds, at an average speed of 139 Mbps.

With 4.15.0-36 on Server A, the same amount of data takes 369 seconds to transfer, an average speed of less than 9 Mbps.
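
The reported speeds can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes "400MB" means 400 MiB (the report does not say), so the results only approximate the quoted figures.

```python
def average_mbps(num_bytes: int, seconds: float) -> float:
    """Average throughput in megabits per second (1 Mbit = 10**6 bits)."""
    return num_bytes * 8 / seconds / 1_000_000

# Assumption: "400MB" is read as 400 MiB; the report does not specify.
payload = 400 * 1024 * 1024

fast = average_mbps(payload, 24)    # transfer time reported on 4.15.0-33
slow = average_mbps(payload, 369)   # transfer time reported on 4.15.0-36
slowdown = 369 / 24                 # ratio of the two transfer times

print(f"{fast:.1f} Mbps, {slow:.1f} Mbps, {slowdown:.1f}x slower")
```

Whichever unit convention is used for the payload, the ratio of the transfer times is the same: the newer kernel is roughly 15 times slower.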

Looking at the output from tcpdump, I see that there is a big difference in the TCP window size.

With 4.15.0-33 the TCP window size is around 30000. But with 4.15.0-36, the window size is only 734.
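
As a rough illustration of why a small window hurts a long-distance transfer: a TCP sender can have at most one window of data in flight per round trip, so throughput is bounded by window size divided by round-trip time. The sketch below uses hypothetical window sizes and an assumed 100 ms round-trip time; none of these values come from the report, and the window values printed by tcpdump may be subject to window scaling, so they should not be plugged in directly.

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

# Hypothetical values, not taken from the report.
rtt = 0.100  # assumed 100 ms round-trip time between the two servers
for window in (64 * 1024, 4 * 1024):
    ceiling = max_throughput_mbps(window, rtt)
    print(f"window={window} bytes -> at most {ceiling:.2f} Mbps")
```

The ceiling scales linearly with the window, so a window that fails to grow keeps the connection slow regardless of the available link bandwidth.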

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1796895

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Van Stokes, Jr. (vstokes) wrote :

We had a similar issue with our MySQL server farm (seven dedicated MySQL servers). Replication was very slow, falling behind for no apparent reason. We were reaching 24 hours of lag! We never had that kind of lag before. A couple of minutes at the most.

Long story short, we rebooted to 4.15.0-34 and the servers caught up on replication within minutes!

References:
https://forums.mysql.com/read.php?26,669517
https://askubuntu.com/questions/1080911/linux-image-4-15-0-36-reduces-rsync-over-ssh-to-10-times-less/1082322#1082322

Calvin Cheng (calvincheng) wrote :

I can't use apport-collect since it requires an IPv4 connection to the internet.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Calvin Cheng (calvincheng) wrote :

Also want to add that once we downgraded back to 4.15.0-33 the problem went away. Rebooting back to 4.15.0-36 the problem reappears.

Changed in linux (Ubuntu Bionic):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
tags: added: bionic performing-bisect
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I can perform a kernel bisect to identify the commit that introduced this regression. Before starting the bisect, can you test the -proposed kernel?

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Joseph Salisbury (jsalisbury) wrote :

It might also be good to test the latest mainline kernel to see if this issue also exists upstream, if it is fixed upstream and/or if this regression is Ubuntu specific.

The mainline kernel is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc7

Some documentation regarding mainline is available here:
https://wiki.ubuntu.com/KernelMainlineBuilds

Shelby Cain (alyandon) wrote :

I'm seeing similar behavior on multiple physical hosts running 4.15.0-36-generic. The TCP window size appears to grow only very slightly (from an initial value of 224 to around 1400) over the entire length of the transfer.

Downgrading to 4.15.0-34-generic fixes the issue for me - TCP window size continues to grow until it is around 24k.

Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between 4.15.0-34 and 4.15.0-36. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
003ae88ae88d48643e71dc69c18d4eda598339d5

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1796895

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Calvin Cheng (calvincheng) wrote :

Just tested with the proposed build: the problem is fixed with the proposed kernel 4.15.0-37 on my server.

Does this mean the problem is Ubuntu specific? When will the proposed version be released? Do you still think it's necessary to test against the mainline?

Calvin Cheng (calvincheng) wrote :

Tested the first bisect kernel, built up to commit 003ae88ae88d48643e71dc69c18d4eda598339d5: the problem is still present.

Joseph Salisbury (jsalisbury) wrote :

Thanks for all the testing. If this bug is fixed in -proposed, then the fix is already applied to the Ubuntu kernel and will be released in the next round of updates on November 12th.

The fix will be in the 4.15.0-37 or newer kernel.

Can others affected by this bug confirm that the proposed kernel resolves the bug as well?

Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Shelby Cain (alyandon) wrote :

I installed the proposed kernel on a remote host and rebooted, but of course the host didn't come back online, for who knows what reason. I'll post an update in a few hours once I have a chance to get my hands on it.

Shelby Cain (alyandon) wrote :

For whatever reason, after installing linux-image-4.15.0-37-generic the system boots up fine but doesn't appear to initialize the motherboard's built-in Realtek network adapter, so there is no network connectivity to test.

Is there a missing dependency in the package that needs to be installed to ensure that network connectivity works?

Shelby Cain (alyandon) wrote :

OK, for anyone who isn't used to installing kernels from proposed: it helps to install linux-image-generic rather than linux-image-XXXXX-generic, so that the modules package is pulled in.

Tested and verified - network speeds are back to normal in -37.

For the record:

I've tracked this down to a pair of commits, the first of which landed in -35 (released in -36) and the second in -37.

The first:
https://lkml.org/lkml/2018/6/5/765
The second:
https://lkml.org/lkml/2018/6/24/161

In the interim state, it was casting a u64 to a u32, which truncates to the least-significant 32 bits.
This was, from what I could see using tcpdump and other tools, causing the TCP window size to truncate to a very small number, leading to hilariously slow network traffic. Further, it seems (although I haven't yet got the whole logic chain in my head) that the TCP algorithm could never get itself out of this state again, so the connection was then permanently stuck in a slow state.
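
To illustrate the kind of truncation described above, here is a minimal sketch that emulates a C-style cast from u64 to u32 by keeping only the low 32 bits. It is not the kernel code, and the input values are hypothetical, chosen only to show how a large 64-bit quantity can collapse to a tiny 32-bit one.

```python
U32_MASK = 0xFFFFFFFF

def truncate_to_u32(value: int) -> int:
    """Emulate a C cast from u64 to u32: keep only the low 32 bits."""
    return value & U32_MASK

# Hypothetical 64-bit intermediate results, not values from the kernel.
small_enough = 30_000           # fits in 32 bits, survives the cast unchanged
too_large = (1 << 32) + 1_000   # just past the 32-bit limit

print(truncate_to_u32(small_enough))
print(truncate_to_u32(too_large))
```

The first value is unchanged, while the second loses its high bits and comes out as a number far smaller than the original, which is how a window-size calculation could end up with a very small result.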

The two commits were originally from the one author, within a second of each other, back in Dec 2017. They really needed to be pulled in together.

Will Brown (sleepin) on 2018-10-21
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Will Brown (sleepin) wrote :

Sorry for the accidental status change. I can't revert. Not sure why I was able to do that. Verified fix in -37.

James Troup (elmo) on 2018-10-29
Changed in linux (Ubuntu Bionic):
status: Fix Released → Fix Committed
description: updated