large performance regression (~20-40%) in wifi with 4.15.0-33 and later

Bug #1795116 reported by arQon
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

16.04 install using the HWE stack. after several weeks of uptime on -32, an update to -34 showed a major drop in wifi throughput, dependent solely on the kernel chosen:

$ uname -a && ./wifibench.sh
Linux brix 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Tue 25-Sep-18 06:07
PlatformSDK-SVR2003R2.iso
    429,840,384 100% 8.41MB/s 0:00:48 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 8,685,766.77 bytes/sec

$ uname -a && ./wifibench.sh
Linux brix 4.15.0-34-generic #37~16.04.1-Ubuntu SMP Tue Aug 28 10:44:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Tue 25-Sep-18 06:14
PlatformSDK-SVR2003R2.iso
    429,840,384 100% 4.83MB/s 0:01:24 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 5,028,601.81 bytes/sec

(the script is a simple rsync from a NAS. nothing about the NAS, the router, or the physical positions of the devices has changed, let alone in those 7 minutes.)
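For reference, the rsync summary lines above can be turned into Mb/s and a relative drop with a few lines of Python. This is only a sketch: the helper names are made up for illustration and are not part of wifibench.sh, and it assumes 8 bits per byte and 10^6 bits per Mb when converting rsync's bytes/sec figure.

```python
import re

# Hypothetical helpers for illustration; not part of the original wifibench.sh.
RATE_RE = re.compile(r"([\d,]+(?:\.\d+)?)\s+bytes/sec")


def rsync_rate(summary_line: str) -> float:
    """Return the bytes/sec figure from an rsync 'sent ... bytes/sec' line."""
    match = RATE_RE.search(summary_line)
    if match is None:
        raise ValueError("no bytes/sec figure found")
    return float(match.group(1).replace(",", ""))


def to_mbps(bytes_per_sec: float) -> float:
    """Convert bytes/sec to megabits/sec (assumes 1 Mb = 10^6 bits)."""
    return bytes_per_sec * 8 / 1_000_000


def percent_change(baseline: float, other: float) -> float:
    """Percentage change of 'other' relative to 'baseline'."""
    return (other - baseline) / baseline * 100


# Summary lines copied from the two runs above.
good = rsync_rate("sent 429,945,420 bytes received 35 bytes 8,685,766.77 bytes/sec")
bad = rsync_rate("sent 429,945,420 bytes received 35 bytes 5,028,601.81 bytes/sec")

print(f"-32: {to_mbps(good):.1f} Mb/s")
print(f"-34: {to_mbps(bad):.1f} Mb/s")
print(f"change: {percent_change(good, bad):+.1f}%")
```

For this particular pair of runs that works out to roughly 69.5 Mb/s on -32 versus 40.2 Mb/s on -34, a drop of about 42%.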

there is no interference, no other devices connected, etc.
prior to the transfer, the link quality is 62-64 and the signal similarly weaker because of power saving, but once packets are in flight it looks perfect:

$ iwconfig
wlan0 IEEE 802.11 ESSID:"****"
          Mode:Managed Frequency:2.452 GHz Access Point: ****
          Bit Rate=150 Mb/s Tx-Power=20 dBm
          Retry short limit:7 RTS thr=2347 B Fragment thr:off
          Power Management:on
          Link Quality=70/70 Signal level=10 dBm
          Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
          Tx excessive retries:0 Invalid misc:15 Missed beacon:0

the client is a J1900-based (baytrail) machine with an RTL8723BE wifi module. average performance while on -32 was consistently 70-80 Mb/s, over a period of several weeks. with -33 and later, it's generally 50-65 Mb/s.
(that machine has no access to launchpad. i'll try apport-cli over the weekend).

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1795116/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
arQon (pf.arqon) wrote :

obviously i've tested this with -32/-33/-34 a dozen or so times, and the performance regression shows up in every run on -33 and -34, so i've only copied that particular one. the best (i.e. "least bad for the bugged kernel") result so far was "only" -18%, and most runs are down by 25-30%.

affects: ubuntu → linux (Ubuntu)
arQon (pf.arqon) wrote :

unfortunately, the instructions on https://help.ubuntu.com/community/ReportingBugs don't seem to work:

>>
If this is to be added to an existing bug report, also use the -u option:

ubuntu-bug -c FILENAME.apport -u BUGNUMBER
<<

$ ubuntu-bug -c /mnt/nas/wifi.apport -u 1795116
Usage: ubuntu-bug [options] [symptom|pid|package|program path|.apport/.crash file]

ubuntu-bug: error: -u/--update-bug option cannot be used together with options for a new report

and since -c is the only option that allows me to specify an apport file, i'm not sure how to proceed from here...

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1795116

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
arQon (pf.arqon) wrote :

marked as confirmed per #3 and #4.

i have the apport file stashed away and can upload it as an attachment upon request if anyone's interested.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc6

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
arQon (pf.arqon) wrote :

sure, i'll try.

sidenote, https://wiki.ubuntu.com/KernelMainlineBuilds is missing any reference to kernel modules, which seem like they might be kind of important for bugs like this. is that a failure in the doc, or is the goal here just to test e.g. the tcp changes in -33 rather than any changes in the device driver itself?

arQon (pf.arqon) wrote :

ok - it's a failure in the doc, since the kernel image can't be installed without them.

also note that the headers package can't be installed on 16.04 because of a change in the ?libssl? dependency. (from memory: might be the wrong dependency).

that aside, the results with 4.19 are ... odd, so far. i need to test a bit more before committing to a call.

arQon (pf.arqon) wrote :

so, the methodology is: reboot, wait for things to settle (i.e. for the initial "performance" cpu period ubuntu uses to pass), then run a simple script that dumps out some diags and rsyncs that 400MB iso.

--

Linux brix 4.19.0-041900rc6-generic #201809301631 SMP Sun Sep 30 16:32:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.452 GHz (Channel 9)
          Link Quality=62/70 Signal level=-48 dBm
Wed 03-Oct-18 01:35
    429,840,384 100% 6.80MB/s 0:01:00 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 7,106,536.45 bytes/sec

---

Linux brix 4.15.0-34-generic #37~16.04.1-Ubuntu SMP Tue Aug 28 10:44:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.452 GHz (Channel 9)
          Link Quality=62/70 Signal level=-48 dBm
Wed 03-Oct-18 01:40
    429,840,384 100% 6.38MB/s 0:01:04 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 6,665,821.01 bytes/sec

---

Linux brix 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.452 GHz (Channel 9)
          Link Quality=70/70 Signal level=-32 dBm
Wed 03-Oct-18 01:43
    429,840,384 100% 8.14MB/s 0:00:50 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 8,348,455.44 bytes/sec

--

that was the best run 4.19 had, at -18%, out of about a dozen. as with 4.15.0-33 and later most were down by 25-30%. so, "still exists" it is.

here's where things get weird though...

---

Linux brix 4.19.0-041900rc6-generic #201809301631 SMP Sun Sep 30 16:32:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.447 GHz (Channel 8)
          Link Quality=66/70 Signal level=-44 dBm
Wed 03-Oct-18 01:12
    429,840,384 100% 8.07MB/s 0:00:50 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 8,348,455.44 bytes/sec

---

on channel 8(+4), 4.19 typically performed on par with -32 and older.
on channel 9(+5) though it consistently showed the same regression as -33. again, this is over about a dozen runs (and multiple reboots switching between kernels), not a one-time event.
-32 though consistently performed at 70+Mb/s averages regardless of channel.

i'm way out of my depth at this point, but that seemed a remarkable weirdness and worth pointing out. :)

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
arQon (pf.arqon) wrote :

> on channel 8(+4), 4.19 typically performed on par with -32 and older.

apparently only because of some fluke. the router switched to 4(+8) some time in the past couple of days, and the newer kernels are certainly sucking hard on that too:

---

Linux brix 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.427 GHz (Channel 4)
          Link Quality=70/70 Signal level=-36 dBm
Sun 07-Oct-18 03:30
    429,840,384 100% 9.00MB/s 0:00:45 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 9,449,350.66 bytes/sec

---

Linux brix 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.427 GHz (Channel 4)
          Link Quality=64/70 Signal level=-46 dBm
Sun 07-Oct-18 05:05
    429,840,384 100% 6.46MB/s 0:01:03 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 6,665,821.01 bytes/sec

---

this is a major regression. is there any activity on it upstream?

i can't imagine it's ALL wifi, or people would be screaming for blood, but it would be good to at least get it in front of larry finger (the rtl driver author) as a starting point.

arQon (pf.arqon) wrote :

> prior to the transfer [on the buggy versions] the link quality is 62-64 and the signal similarly weaker because of power saving, but once packets are in flight it looks perfect

-32 doesn't seem to have that problem: it's been 70/70 every time i've looked, even with the link speed showing as 15Mb/s rather than 150, indicating power saving. so i modprobed the driver with all the power-saving options (ips, fwlps, and aspm) set to 0 while running one of the broken kernels (i don't remember if it was -36 or the 4.19 build, sorry) to see if that made any difference, but AFAICT it didn't. (i.e. still around 55Mb/s, when -32 was hitting 70+ 5 mins before/after).

arQon (pf.arqon) wrote :

the machine is usually ~20 feet away from the router, through multiple walls and a floor. i moved it last night so it was 6 feet away with LOS, and the results from that are very interesting:

Linux brix 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.452 GHz (Channel 9)
          Link Quality=70/70 Signal level=-30 dBm
Sat 20-Oct-18 17:23
    429,840,384 100% 10.91MB/s 0:00:37 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 11,167,414.42 bytes/sec

and under the same ideal conditions:

Linux brix 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
          Current Frequency:2.452 GHz (Channel 9)
          Link Quality=70/70 Signal level=-32 dBm
Sat 20-Oct-18 17:29
    429,840,384 100% 9.10MB/s 0:00:45 (xfr#1, to-chk=0/1)
sent 429,945,420 bytes received 35 bytes 9,449,350.66 bytes/sec

so the newer kernel is MUCH better in that situation, whereas the old kernel essentially performs exactly the same.
the machine has a fairly weak cpu, and basically has one core pegged with iowait during these transfers, so optimisations (either in general for meltdown etc, or the stack, or in the driver specifically) could certainly account for that sort of speedup.

however, that only further highlights just how bad this regression is, because the driver is now nearly 20% faster in the abstract but still massively slower at range despite all that improvement.
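A quick sanity check on those numbers, using only bytes/sec figures already posted in this report (the -32 and -36 runs at range on channel 4, and the two line-of-sight runs on channel 9). This is a rough sketch: the dictionary layout and the function name are made up for illustration, and single runs are obviously noisy.

```python
# bytes/sec figures copied from the rsync summaries in this report.
# The keys and structure are illustrative only.
RUNS = {
    "at range (ch 4)": {"4.15.0-32": 9_449_350.66, "4.15.0-36": 6_665_821.01},
    "line of sight (ch 9)": {"4.15.0-32": 9_449_350.66, "4.15.0-36": 11_167_414.42},
}


def relative_change(old: float, new: float) -> float:
    """Percentage change of 'new' relative to 'old'."""
    return (new - old) / old * 100.0


for label, rates in RUNS.items():
    delta = relative_change(rates["4.15.0-32"], rates["4.15.0-36"])
    print(f"{label}: -36 vs -32 = {delta:+.1f}%")
```

On these figures -36 comes out about 18% faster than -32 with line of sight, but about 29% slower at range.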

it also explains why nobody would notice the regression during development.

i'm happy to test proposed fixes or provide more information, but i don't think there's anything more i can do at my end until somebody else steps in.

@joseph - do you have a tracking reference for upstream yet?

TIA

arQon (pf.arqon) wrote :

given the link quality observations in #11 and the behavior in #12, it's pretty clear that the problem is with the radio power management.

since it isn't improved at all via any of the PM settings, that suggests it's simply broken rather than overly-aggressive.

for reference, the head for the last good version of the driver is 6A56582B1FEECB841E329C4

arQon (pf.arqon) wrote :

Still hopelessly broken. :(
Throughput with the latest kernel was down to about 45Mb/s when I tested it a few days ago.

Kai-Heng Feng (kaihengfeng) wrote :

This should be fixed in latest bionic kernel, LP: #1788997.

arQon (pf.arqon) wrote :

Thanks, but that seems unlikely: I'm aware of the ant_sel issue on HP laptops etc, but this machine isn't one and has never benefitted from it. If it was using the wrong one of two antennae, it wouldn't hit 70/70 at 20ft away through walls, nor would the throughput be almost half of what it was with the good kernels when registering that signal quality.

I'll try it to see, of course, but if it does fix it then the driver's got some drastic bugs with its information reporting! :P

arQon (pf.arqon) wrote :

Although, I see what appears to be an unrelated (to ant_sel)

+ if (rtlpriv->cfg->ops->get_btc_status())
+         rtlpriv->btcoexist.btc_ops->btc_power_on_setting(rtlpriv);

added in that commit as well. I can't go digging into the source right now to see what that's doing, but since THIS bug reeks of bad power management you're hopefully right that the patch as a whole will also fix this issue.
I'll try it over the weekend - thanks again for the heads up.

arQon (pf.arqon) wrote :

Hooray, it seems that it has - but possibly only partially.

Peak and Avg throughput were both a few % down compared to -32, but well within the sort of variance wifi suffers from. High-70s for download is certainly good enough to use.

What's less encouraging, to the point of being an outright concern, is the UPLOAD rate. That has consistently peaked at 100Mb/s with -32 over the past several months, with averages not much lower, but remains massively worse in the current kernel: somewhere around 60Mb/s. I've only had time to run one batch of tests on it, so it could have been an extremely unlucky fluke, but like I say, it's worrying.
I'll do some more testing when I get the chance.

arQon (pf.arqon) wrote :

That bad run was an outlier. No idea what caused it, but cron'd tests over the past couple of days have all shown results similar to the pre-33 breakage.

So it looks like this is finally fixed, thanks. It's a shame testing didn't catch it, but understandable. It's a bit more worrying that the regression took seven months to get repaired, but at least it's all good now.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released