large performance regression (~30-40%) in wifi with 19.10 / 5.3 kernel

Bug #1847892 reported by arQon
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Probably relevant: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1795116

Card is an RTL8723BE.

On 16.04 with the HWE stack, after 1795116 was fixed, performance was a stable 75-80Mb/s.

Linux 4.15.0-55-generic #60~16.04.2-Ubuntu SMP Thu Jul 4 09:03:09 UTC 2019 x86_64
Fri 26-Jul-19 12:28
sent 459,277,171 bytes received 35 bytes 9,278,327.39 bytes/sec

Linux 4.15.0-55-generic #60~16.04.2-Ubuntu SMP Thu Jul 4 09:03:09 UTC 2019 x86_64
Sat 27-Jul-19 01:23
sent 459,277,171 bytes received 35 bytes 10,320,836.09 bytes/sec

On 18.04, performance was still a stable 75-80Mb/s.

After updating to 19.10, performance is typically ~50Mb/s, or about a 37% regression.

$ iwconfig wlan0
wlan0 IEEE 802.11 ESSID:"**"
          Mode:Managed Frequency:2.442 GHz Access Point: 4C:60:DE:FB:A8:AB
          Bit Rate=150 Mb/s Tx-Power=20 dBm
          Retry short limit:7 RTS thr=2347 B Fragment thr:off
          Power Management:on
          Link Quality=59/70 Signal level=-51 dBm
          Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
          Tx excessive retries:0 Invalid misc:315 Missed beacon:0

$ ./wifibench.sh
Linux 5.3.0-13-generic #14-Ubuntu SMP Tue Sep 24 02:46:08 UTC 2019 x86_64
Sat 12-Oct-19 20:30
sent 459,277,171 bytes received 35 bytes 5,566,996.44 bytes/sec
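
For reference, wifibench.sh is nothing fancy: it stamps the kernel and time, then pushes a fixed ~460 MB file to the NAS over rsync and reports rsync's "sent ... bytes/sec" summary line. The actual script isn't attached here, but a minimal sketch of the idea (the host name and file path are placeholders) looks like this:

#!/bin/sh
# Stamp the kernel version and current time, matching the log format above.
uname -srvm
date '+%a %d-%b-%y %H:%M'
# Push a fixed test file to the NAS. -I forces a full retransfer even if the
# destination copy already matches; -v prints the "sent ... bytes/sec" summary.
rsync -vI "$HOME/wifibench/testfile.bin" nas:/tmp/ | grep '^sent'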

$ iwconfig wlan0
wlan0 IEEE 802.11 ESSID:"**"
          Mode:Managed Frequency:2.442 GHz Access Point: 4C:60:DE:FB:A8:AB
          Bit Rate=150 Mb/s Tx-Power=20 dBm
          Retry short limit:7 RTS thr=2347 B Fragment thr:off
          Power Management:on
          Link Quality=68/70 Signal level=-42 dBm
          Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
          Tx excessive retries:0 Invalid misc:315 Missed beacon:0

So no corrupted packets or other errors during that transfer.

$ ifconfig wlan0
wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.33 netmask 255.255.255.0 broadcast 192.168.1.255
        ether dc:85:de:e4:17:a3 txqueuelen 1000 (Ethernet)
        RX packets 56608204 bytes 79066485957 (79.0 GB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 21634510 bytes 8726094217 (8.7 GB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

No issues of any kind in the week that it's been up. Just terrible performance.

I'm painfully aware of all the module's parameters etc, and have tried them all, with no change in the results outside of typical wifi variance.
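
By "parameters" I mean the usual rtl8723be module options, set via a modprobe.d snippet along these lines (the values shown are just the combinations commonly suggested for this chip, and none of them make any difference here):

# /etc/modprobe.d/rtl8723be.conf - example values only
# ant_sel selects which antenna to use; fwlps and ips turn off the firmware
# and idle power-save modes; aspm disables PCIe power management for the card.
options rtl8723be ant_sel=2 fwlps=0 ips=0 aspm=0

The module has to be reloaded (or the machine rebooted) for a change to take effect:
$ sudo modprobe -r rtl8723be && sudo modprobe rtl8723be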

arQon (pf.arqon)
affects: ubuntu-mate → linux (Ubuntu)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1847892

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Would it be possible for you to do a kernel bisection?

First, find the last good -rc kernel and the first bad -rc kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Then,
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect good $(the good version you found)
$ git bisect bad $(the bad version found)
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If the issue still happens,
$ git bisect bad
Otherwise,
$ git bisect good
Repeat to "make -j`nproc` deb-pkg" until you find the commit that causes the regression.

arQon (pf.arqon)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
arQon (pf.arqon) wrote :

That machine doesn't have access to Launchpad, so until someone fixes the bugs (referenced in the other thread) that prevent "ubuntu-bug -c" from working, I can't provide that info.

Kai - this is a low-power HTPC, with very little disk space. Assuming it can even clone the kernel, it will likely take weeks for me to bisect the problem. I'll do what I can, but unless we get very lucky it'll be a long time before we have an answer. Remember, the LKG is the 18.04 kernel, which is by now nearly 18 months old. :(

Revision history for this message
arQon (pf.arqon) wrote :

The dist-upgrades have wiped out all the previous kernels, of course.
The only one left on the machine at all was the 5.0 from 19.04, and that's no good either. :(

Linux 5.0.0-29-generic #31-Ubuntu SMP Thu Sep 12 13:05:32 UTC 2019 x86_64
Wed 23-Oct-19 04:47
sent 459,277,171 bytes received 35 bytes 5,635,303.14 bytes/sec

The window is too large for me to bisect any time soon. The machine is inconveniently located for such things, but needs to stay there for the testing to be valid; and I don't expect to be able to average more than one build every few days.

I'll do what I can to at least narrow things down a little, but since it took more than 6 months to get the last regression fixed after I'd already provided the exact release that introduced it, I'm sure you can understand why I'm not too keen on burning days of my free time on this.

4.15.0-55 is the last *recorded* good, but I did test 18.04 from a USB prior to installing that and hit 80+, so at least one version of 4.18 is okay. I'll re-check that build first just in case, then try a Live 18.04.3. If that one's good, the range would only be 5.0-xx to 5.0-29, and I can probably get that covered in a few weeks.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Is this reproducible via iperf?
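
Something simple should do - for example, on a wired machine:
$ iperf -s
and on the affected machine:
$ iperf -c <server-ip> -t 10
Default TCP settings should be enough to show the difference.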

Revision history for this message
arQon (pf.arqon) wrote :

I expect so. I don't usually have a machine available to run as a server though, hence the preference for rsync.

If you're concerned that the NAS might be the bottleneck, don't be. That's a sensible point to raise, but it's GbE and saturates it wired. (To say nothing of the months during which the rsync consistently hit 80+).

If you think iperf might narrow the scope of where to look though, let me know and I'll try it next time I can dedicate a machine to it: shouldn't be more than a week.
(hmm - I could run one up in a VM at will, which would still have more than enough network performance for this, but I'm not sure I'd fully trust the results from that until I've got a baseline to compare against. I'll try to remember it when I boot the client off an older kernel).

Revision history for this message
arQon (pf.arqon) wrote :

Any specific params you want for iperf BTW?

I ran some basic tests against a VM after all: it might have lost a few %, but wireless is so slow that it's not going to make any meaningful difference.

[ ID] Interval Transfer Bandwidth Reads Dist(bin=16.0K)
[ 4] 0.0000-11.4354 sec 45.8 MBytes 33.6 Mbits/sec 22348 22341:6:0:1:0:0:0:0
[ 4] local 192.168.1.34 port 5001 connected with 192.168.1.33 port 45314
[ 4] 0.0000-10.3068 sec 68.6 MBytes 55.9 Mbits/sec 35906 35901:2:3:0:0:0:0:0
[ 4] local 192.168.1.34 port 5001 connected with 192.168.1.33 port 45316
[ 4] 0.0000-10.3491 sec 71.6 MBytes 58.1 Mbits/sec 35522 35512:2:4:4:0:0:0:0
[ 4] local 192.168.1.34 port 5001 connected with 192.168.1.33 port 45324
[ 4] 0.0000-10.3927 sec 75.1 MBytes 60.6 Mbits/sec 35410 35397:7:3:2:1:0:0:0
[ 4] local 192.168.1.34 port 5001 connected with 192.168.1.33 port 45326
[ 4] 0.0000-10.3910 sec 59.8 MBytes 48.2 Mbits/sec 30683 30675:3:4:1:0:0:0:0

The first run may well have had the wifi at low power: the client had just been woken up a couple of minutes earlier, but it tends to dial that back quite quickly. I left it in for completeness.

An rsync right after the last run was 6,906,424.15 bytes/sec, so ~58.65 Mb/s. I think that's a good indicator of being able to trust the historical data from it.

I may have the machine in here at some point in the next few days, and will test that too if so. For 1795116, the performance in that scenario (~6' and LOS) was fine: it wasn't that the driver was broken across the board, it just wasn't managing power properly so it fell apart if conditions weren't perfect.
(Not saying this is the same bug, just reminding myself what a result like that means).

I still need to boot into 16.04 / etc to try and find a working kernel. AFAICT 16.04.6 shipped with one that has the post-1795116 fix in it, so it should be okay off a USB. If not though I'll have to get creative, as there isn't a spare partition on that machine to install to.

My notes show 4.15.0-55 as the first version with the old bug fixed. It would be helpful if there was a reasonable way for me to just install that, since there are multiple dist-upgrades' worth of other changes since then.
While that specific kernel would be ideal from a testing standpoint, isn't there somewhere I can just grab the current Bionic kernel from, as a dpkg, and just install the damn thing without having to jump through any hoops? Surely there must be a better system in place already than randomly guessing at versions or doing builds on a system that is totally unsuitable for such tasks. :(
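
(For what it's worth, on an actual Bionic install the versioned packages do stay in the archive, so something like
$ sudo apt install linux-image-4.15.0-55-generic linux-modules-4.15.0-55-generic linux-modules-extra-4.15.0-55-generic
would presumably get that exact kernel back. That doesn't help on 19.10 though, where those packages aren't in the repos.)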

Revision history for this message
arQon (pf.arqon) wrote :

16.04.6 turned out to have 4.15.0-45, which is one of the known-broken releases. Unsurprisingly, it delivered the same poor results as 5.3.

I'm running low on sensible options here.

I no longer have the bootable 18.04 stick I used before, but I can create a new one easily enough (as long as I remember to use the .0 image and avoid the HWE stack). But it's not going to tell us anything we don't already know, and it's no use for either normal use or testing.

I could wipe the machine and install 18.04, which would cost me years of customisation and bugfixing. Since the bug is in the HWE kernels it would also be possible to test those separately, though it would need a lot of messing around each time.
This is sort-of tempting for other reasons anyway, but it'll take weeks of elapsed time to get everything repaired, which is time I won't be able to spend on this bug.

I can't really repartition it while doing that, unfortunately. It just has a very small SSD to boot off, and is heavily dependent on the LAN. I could probably JUST about carve out another very small partition to put 18.04 on, which would make the backwards-migration easier and leave me able to validate a fix for this at some point, but it's already seriously hurting for space at times.

I'll give it some thought. In the meantime, do you have a preference or any suggestions that might influence that decision?

Returning to your original bisection request: since Ubuntu generates its own kernels, if you can give me the mapping from the not-broken 4.15.0-55 to the equivalent starting point in Linus's tree, and likewise for the broken 5.0.0-29, I'll see about sacrificing a USB HDD to that machine for a while for it to build on. No promises, and it'll still take weeks to actually complete, but I'll do what I can.

Revision history for this message
arQon (pf.arqon) wrote :

Okay - the 18.04.3 release I tested in September, which was fine, has 5.0.0-23.

-29 is broken, as mentioned above. That's a pretty narrow window to work with.
I'd prefer it if someone from Canonical took it from here.
(Heck, there are probably few enough commits to that driver in that timeframe that I could find the bad one more easily just by browsing the source than getting that machine to build it).

Since it's the HWE stack that's broken, I could still install 18.04 from the image I have and switch back to the original Bionic kernel series, though it's anybody's guess if the same regression was merged into 4.15 again as well.
Having to do a fresh install sucks pretty hard, but I'd at least be able to pin the last unbroken kernel, whereas on 19.10 there aren't any working ones in the repos at all (AFAIK) so I'm basically screwed until a fix makes its way through the system or I brute-force an older one onto it. I'm still trying to decide if I really want to go down that road or not.

I found https://people.canonical.com/~kernel/info/kernel-version-map.html , but it's basically useless. The 4.15 kernels from the working -32, through the 7 months of buggy ones, and out the other side to -55 when it was fixed, are all just "4.15.18 mainline".
Maybe I'm missing something here, but without the tree that you guys are *actually building from* I don't see how me bisecting mainline is going to achieve anything. If that page is accurate it has nothing to do with mainline at all and the bug only exists in the Ubuntu tree in the first place, neh?

Revision history for this message
arQon (pf.arqon) wrote :

Had the affected machine in here (i.e. "where the router is") for other reasons, so I was able to check those conditions too. As expected, it's a power / antenna / etc issue:

Linux 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019 x86_64
Fri 08-Nov-19 03:29
sent 459,277,171 bytes received 35 bytes 7,467,922.05 bytes/sec

iwlist scan, WHILE the rsync was running (so it shouldn't be power-saving at all) with the router 6' away with clear LOS showed:
                    Frequency:2.422 GHz (Channel 3)
                    Quality=62/70 Signal level=-48 dBm

Looking at 1795116, I used to get 70/70 at -32 dBm upstairs and through half a dozen walls, so obviously this is a significant drop in link quality given the hugely better conditions.
Compared to 4.15.0-36 - which, bear in mind, was one of the BROKEN kernels - the performance is 7,467,922 / 11,167,414, or 66.9% of what it should be in that scenario.

The good news is that this difference suggests it's at least not a simple merge regression of 1795116. The bad news is, of course, that it'll need a new round of investigation to track down.

As I've said, I'm willing to set things up to help out, but I'm still waiting for you to provide me with the ACTUAL Ubuntu kernel tree to bisect against. In the meantime, I'll probably try a couple of the mainline builds since I can just dpkg those, but with no way to map the working Ubuntu ones to the mainline tree I'll never have a baseline to compare against in case something stupid has happened, like one of the antenna connectors coming loose.
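
For the mainline builds, the plan is just the usual one: download the amd64 .debs for a given version from its directory under https://kernel.ubuntu.com/~kernel-ppa/mainline/ , then
$ sudo dpkg -i linux-*.deb
and reboot into the new kernel. (Exact file names vary per version, so I won't list them here.)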

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please use the mainline kernel tree to do the bisection:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

If you really want to use the Ubuntu kernel to bisect, here's the tree:
https://code.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/

Revision history for this message
arQon (pf.arqon) wrote :

Thanks Kai.

Yes, I really want to use the Ubuntu kernel to bisect - or at least, I need the option to - because if the problem is coming from the Ubuntu patchset, I could spend weeks bisecting mainline and never find it. If I bisect the Ubuntu tree I'm guaranteed to find it, and the result can be mapped back to mainline if needed; we can't go the other way around.
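
Concretely, what I have in mind is roughly the following - the series repo and tag names are guesses on my part and need to be confirmed against the tree itself:

$ git clone https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/disco
$ cd disco
$ git tag -l 'Ubuntu-5.0.0-*'    # find the exact tags for the good -23 and bad -29 builds
$ git bisect start
$ git bisect good <tag for 5.0.0-23>
$ git bisect bad <tag for 5.0.0-29>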

The other critical aspect to having the Ubuntu tree available is that it gives me the Ubuntu 5.0.0-23 build as a sanity check, in case the problem is being caused elsewhere. Remember, this is *wifi* we're talking about: any number of pieces could be to blame, from a power outage resetting something in the router (20/40 coexistence, for example) to physical antennas in multiple devices, and so on. I need a way to validate beyond question that it IS the kernel that's at fault before I go any further with this, to avoid risking wasting everybody's time, including yours. :)

Anyway, now that I have that info I'll free up a Passport and get things underway. Thanks again.

Revision history for this message
arQon (pf.arqon) wrote :

Sorry Kai - I got swamped by real-life issues.

I finally got the time to look into this a few days ago, and when I checked whether a kernel update had already fixed things, it looks like one has (or at least mostly has) - peaks are back to 80+, so that's within wifi variance again.

It did show a strange, very severe falloff in a couple of tests - down to 40-ish for several seconds during a large file transfer, which is new. But I haven't been able to reproduce that reliably enough to test versions against it.

I'm set up to DO that testing if I can find a good way to repro that, or for the next time performance goes in the tank. For now though, this is going to have to sit in the "I'll keep an eye on it, but can't do anything right now" pile.

Thanks again for all your help.
