Wifi does down "crash" in Surface Pro 4

Bug #1730924 reported by Hicks
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linux
Fix Released
Medium
linux (Ubuntu)
Incomplete
Medium
Unassigned

Bug Description

I have a Surface Pro 4. The wifi works well in principle, but unfortunately it drops every x minutes. The only way to fix it I've found is to reboot the computer.
lsb_release -rd
Description: Ubuntu 17.10
Release: 17.10

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-16-generic 4.13.0-16.19
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
Uname: Linux 4.13.0-16-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: predatux 1537 F.... pulseaudio
CurrentDesktop: KDE
Date: Wed Nov 8 10:41:26 2017
HibernationDevice: RESUME=UUID=147af4ba-a4ce-41fe-a176-b36a1f6a590b
Lsusb:
 Bus 002 Device 002: ID 045e:090c Microsoft Corp.
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 045e:07e8 Microsoft Corp.
 Bus 001 Device 003: ID 1286:204c Marvell Semiconductor, Inc.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Microsoft Corporation Surface Pro 4
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic root=UUID=125200b0-7377-4985-a217-15503781a525 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-16-generic N/A
 linux-backports-modules-4.13.0-16-generic N/A
 linux-firmware 1.169
SourcePackage: linux
UpgradeStatus: Upgraded to artful on 2017-10-22 (16 days ago)
dmi.bios.date: 02/24/2017
dmi.bios.vendor: Microsoft Corporation
dmi.bios.version: 106.1624.768
dmi.board.name: Surface Pro 4
dmi.board.vendor: Microsoft Corporation
dmi.chassis.type: 9
dmi.chassis.vendor: Microsoft Corporation
dmi.modalias: dmi:bvnMicrosoftCorporation:bvr106.1624.768:bd02/24/2017:svnMicrosoftCorporation:pnSurfacePro4:pvrD0B08F1C03P38:rvnMicrosoftCorporation:rnSurfacePro4:rvr:cvnMicrosoftCorporation:ct9:cvr:
dmi.product.family: Surface
dmi.product.name: Surface Pro 4
dmi.product.version: D:0B:08F:1C:03P:38
dmi.sys.vendor: Microsoft Corporation

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Wifi looks fine at first after boot, but after hours of use or even idle, it goes into a unrecoverable state (unable to send and receive) until next reboot. dmesg logs shows lots of following messages:

[ 6822.586704] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586731] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6822.586899] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586904] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6827.925487] mwifiex_pcie 0000:02:00.0: 4296924600 : Tx timeout(#40), bss_type-num = 0-0
[ 6828.584671] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584687] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6828.584755] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584763] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587608] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587623] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587698] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587706] mwifiex_pcie 0000:02:00.0: failed to get signal information

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 197921
a more complete log when problem happens.

Attach a more complete log when problem happens.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Thanks for reporting the problem. This is a command timeout problem caused mostly due to a firmware bug.
Which firmware is being used here?

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

It's commit bbe4917 from linux-firmware git repository, not quite sure about the actual version.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :
Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 197961
log with firmware 15.68.7.p53

At first I though it's fixed, but I just got into similar situation. Log attached.

BTW, I also seems to have unstable connection when wifi is working. E.g. I open google.com in firefox with google's instant search feature enabled. While typing, The result usually not displayed and I have to refresh page several times to get web page displayed. The other laptop I have doesn't show the same problem so it's not network issue.

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

I have the same problem with mwifiex_pcie (latest git firmware) on Surface Pro 3

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 201031
A more complete log with debug flag set to 0xffffff

4.3.3 with firmware 15.68.7.p53

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

I have the same problem. In addition, using mwiflex_pcie on Surface Pro 4 seems to cause intermittent system crashes.

Using kernel 4.3.3-5 compiled from linux-source.

Revision history for this message
In , andrea (andrea-linux-kernel-bugs) wrote :

I have the same problem on my Microsoft Surface 3 (not Pro model), same issues (even though the rest of system is stable) and same dmesg printout.
Lspci lists the card on my tablet as: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless.

Revision history for this message
In , andrea (andrea-linux-kernel-bugs) wrote :

I have also noticed that the issues appear as soon as I start using the wireless card heavily (downloading files, apt update, etc.). As long as I just use ping, everything is fine.

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

It becomes much worse in linux 4.5 comparing with 4.4.2.

Is there any update on this issue?

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 209311
attachment-2572-0.html

The driver was updated recently:

Из git://git.marvell.com/mwifiex-firmware
   e92f8b3..a96efa0 master -> origin/master
Обновление e92f8b3..a96efa0
Fast-forward
 mrvl/pcie8897_uapsta.bin | Bin 803884 -> 816772 bytes
 mrvl/sd8897_uapsta.bin | Bin 780760 -> 794148 bytes

2016-03-15 0:25 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #11 from Weng Xuetian <email address hidden> ---
> It becomes much worse in linux 4.5 comparing with 4.4.2.
>
> Is there any update on this issue?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Actually I'm aware of that.

However, with latest firmware + linux 4.5, my system would simply freeze in a few minutes that I can't get any useful information for this bug.

With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi also enters bad state quite fast comparing with 4.4.2 or 4.3.3.

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

The only thing I notice so far is that:

With latest firmware + linux 4.3.3, I notice a lot of "mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0" when wifi enters bad state, which doesn't happen with 15.68.7.p53.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Hi Weng,

Can you provide complete log for the issue mentioned in comment#14?

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 209441
attachment-7733-0.html

My system freezes too with 4.5 kernel and latest and previous versions of
firmware.

2016-03-16 6:52 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #13 from Weng Xuetian <email address hidden> ---
> Actually I'm aware of that.
>
> However, with latest firmware + linux 4.5, my system would simply freeze
> in a
> few minutes that I can't get any useful information for this bug.
>
> With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi
> also
> enters bad state quite fast comparing with 4.4.2 or 4.3.3.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 210651
kernel 4.5 log with debug mask 0xfffffff

The complete log is too large(286MB), so I cut it to last 1 minute before PREP_CMD: FW is in bad state happens.

mwifiex_pcie 0000:02:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p66)

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I'm experiencing a variety of problems with this driver as well.

Using Linux v4.6-rc1-64-g6ddf37d, firmware 15.68.7.p66.

The first failure is an inability to connect. This is sporadic; it seems to happen about 1 time in 4 boots, but one time I had it three or four times in a row. I will attach a dmesg log with debug increased part way through. It seems as though this may be similar to the problem reported here:
  https://<email address hidden>/msg19911.html

The second failure is horrific performance, accompanied by constantly repeated mwifiex_process_sleep_confirm_resp: cmd size is 0 messages. The user experience is harder to quantify; you get absolutely terrible network performance (7-10% packet loss), and general instability. I'll attach a dmesg log for that as well, again with debug increased part way through. This is similar to what Weng Xuetian reported in Comment 14.

Finally, as I use this device, I find one consistent and persistent oddity. That is, even when everything is 'working', I get poor latency connecting to the device. That is, my latency when I ping 'out' is fine (< 10 ms, not much variation). But when I ping the device from a different system, I get highly variable latency. (Ranging from 10 ms to 4000 ms). No dropped packets, but horrible latency. You really notice it when you're ssh'd in :-/.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 211711
Log of the failure to connect.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 211721
Log of the cmd size 0 failure mode.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I've spent some time looking at the source code and trying to debug what is happening. I'm using the kvalo wireless-drivers-next pending branch, and I've applied recent patches sent to lkml that seem valuable; but the problems persist.

It seems as though in the 'normal' case we get an unusual number of 'max count reached while accessing sleep cookie' messages. The usleep calls in that case make me suspicious, and being slow to wake up would seem to be a good explanation of the performance we see. (That is, sending a packet is fast, but responding is slow). It also seems like we're in a rather startling loop of awake/sleep events; doing multiple spins within a millisecond, as far as I can tell.

In the disconnect case, I see an 'invalid cmd resp' error followed by 'There is no command but got a command response'.

I have now started booting with debug_mask=0xffffffff; I'll replace my logs with those, as they appear to be more complete.

For reference, here are related areas of the code:

Sleep message:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#377

There is no command:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#1637
(although intriguingly to my naive mind, that code block is related to the sleep check as well).

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 212611
Log of a failure to connect, with debug mask set from boot.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 212621
Log of a 'working' session, which has poor latency. debug_mask ffffffff from start.

38 comments hidden view all 160 comments
Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Hicks (hicks1gb) wrote :

Tested with "wicd" too.
Wicd crashes when wifi goes down

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Hicks (hicks1gb) wrote : Re: [Bug 1730924] Re: Wifi does down "crash" in Surface Pro 4

El 14/11/17 a las 17:54, Joseph Salisbury escribió:
> Did this issue start happening after an update/upgrade? Was there a
> prior kernel version where you were not having this particular problem?
No, always
>
> Would it be possible for you to test the latest upstream kernel? Refer
> to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
> v4.14 kernel[0].

Testing...
>
> If this bug is fixed in the mainline kernel, please add the following
> tag 'kernel-fixed-upstream'.
>
> If the mainline kernel does not fix this bug, please add the tag:
> 'kernel-bug-exists-upstream'.
>
> Once testing of the upstream kernel is complete, please mark this bug as
> "Confirmed".
>
>
> Thanks in advance.
>
> [0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14
>
>
> ** Changed in: linux (Ubuntu)
> Importance: Undecided => Medium
>
> ** Changed in: linux (Ubuntu)
> Status: Confirmed => Incomplete
>

Revision history for this message
Hicks (hicks1gb) wrote :

At the moment wifi still working... No crashes.

Revision history for this message
Hicks (hicks1gb) wrote :

More crashes! Problem still here.

Revision history for this message
Hicks (hicks1gb) wrote :

bug-exists-upstream

Revision history for this message
Hicks (hicks1gb) wrote :

Let's see, I just tested this patch, which supposedly solves the wifi controller instability, but it doesn't work for me. I'll leave the link for what it's worth.

https://github.com/EiNSTeiN-/mwifiex

Another detail I want to comment on is that every time the wifi is dropped, the surface is semi blocked. Terminal stops responding, I can't run any commands. Ctrl+c does not work either. Wicd also stops responding.
If I try to restart the terminal, it remains "waiting" for processes to close, which never finish closing.
I finally have to force the switch off.

Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Try the daily mainline kernel, http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

There's a new fix for random MAC scan, worth a try.

Revision history for this message
Hicks (hicks1gb) wrote :

Testing...

Revision history for this message
Hicks (hicks1gb) wrote :

At the moment the wifi has not fallen, but after a reboot of the system has been slow enough to start and gives me this error.
I'll keep trying to see if he can hold on.

Revision history for this message
Hicks (hicks1gb) wrote :

Still doesn't work. The wifi is down again, taking the rest of the system with it.

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Brad Figg (brad-figg)
tags: added: ubuntu-certified
Brad Figg (brad-figg)
tags: removed: ubuntu-certified
Brad Figg (brad-figg)
tags: added: cscc
104 comments hidden view all 160 comments
Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Hi Verdre and Ganapathi,

I own Surface Book 1 (Skylake, same as Surface Pro 4) and Surface 3 (non-pro, Intel Atom Cherry Trail) for daily usage
and recently I got Surface Pro 4 (with broken sensor) for debugging purposes.

env:
- freshly installed Fedora 31 Beta on an external USB drive and use it on SP4
- run `sudo dnf upgrade` and now `uname -r` shows `5.3.2-300.fc31.x86_64`

@Ganapathi
> I am doing different try outs, I will update as soon I could recreate these;

You mean you're having trouble recreating issues?
On my SP4:
- "cmd_timeout"
     This issue will occur after a second suspend like the dmesg log on Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off. So, I'm not sure why you cannot recreate this issue.

- "Firmware wakeup failed"
     This issue will randomly occur after a while when using with 5GHz AP+power_save on

- "WLAN FW already running! Skip FW dnld" after card reset
     As you can see on the dmesg log (Comment#91), after card reset, the firmware will not be re-downloaded on Surface devices. I think that re-downloading the firmware is the expected behavior.
You may recreate this issue by this command:

echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

After resetting, it shows these messages repeatedly on dmesg and wifi is not usable anymore:

kern :info : [ 175.478500] mwifiex_pcie 0000:02:00.0: cmd_wait_q terminated: -110
kern :info : [ 175.478510] mwifiex_pcie 0000:02:00.0: deleting the crypto keys

Maybe this is the simplest one to fix? and at least if resetting works, we can continue to use wifi.

To reset the card with firmware re-downloading, currently I have to call an ACPI method using acpi_call and remove/rescan the parent bridge of wifi (need to be installed) like this:

WIFI_PARENT="0000:00:1d.0" # SP4 specific
ACPI_WIFI_RST_PATH="\_SB.PCI0.RP09.PXSX.PRWF._RST" # SP4 specific
acall(){ echo "$1" | tee /proc/acpi/call >/dev/null && cat /proc/acpi/call;echo;} # wrapper

modprobe acpi_call
acall $ACPI_WIFI_RST_PATH
echo 1 | tee "/sys/bus/pci/devices/$WIFI_PARENT/remove"
sleep 1
echo 1 | tee /sys/bus/pci/rescan

---

By the way, I find 2.4GHz AP is much more stable with even power_save on. 5GHz AP with power_save on causes wifi stop without any message on dmesg.

It seems that showing APs (e.g. `nmcli d wifi list`) will temporarily fix the wifi stop.

This is another issue from what we're discussing now, though.

Best regards,
Tsuchiya, Yuto (kitakar5525)

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> acpi_call and remove/rescan the parent bridge of wifi (need to be installed)

Sorry, I made a mistake
acpi_call (need to be installed) and remove/rescan the parent bridge of wifi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Tsuchiya/Verdre,

>- "cmd_timeout"
> This issue will occur after a second suspend like the dmesg log on
> >Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off

1. Command timeout after second suspend could be recreated in my Surface 4;

2. Logs were similar to the one in #91 & #92 by Verdre;

3. It is found that the command, which timed out did not reach the firmware at all and am trying to understand more;

4. Once a command times out, driver dumps few scratch registers and also it triggers the firmware dump; But I fail to get these in my device(I don't know why, it is a different problem); So, I still fail to confirm we are hitting same issue;

Note that, when the dump is uploaded to user-space, dmesg will say:
"== mwifiex dump information to /sys/class/devcoredump end"

Could you please share the dump(along with dmesg), which is saved in below file(it is automatically deleted after 5 minutes):
********************************************
/sys/devices/virtual/devcoredump/devcd1/data
********************************************

>- "Firmware wakeup failed"
This could also be recreated for the first time now; I kept Surface 4 connected to AP and PING overnight; But I have no much detail yet;

I will comeback soon on remaining two concerns you raised in comment#95;

Regards,
Ganapathi

Regards,
Ganapathi

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> But I fail to get these in my device(I don't know why, it is a different
> problem); So, I still fail to confirm we are hitting same issue;

Yeah, the device_dump is also a problematic one on Surface devices.
It's reported that if_ops.device_dump() will cause lockups [1] (e.g. prevent system power off)

I also confirmed that by
1. adding debugfs entry for calling that function
2. and called the function via debugfs (echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump)
Sometimes I can get the device_dump log, sometimes I can't and it prevents system power off. I will post the patch which adds the debugfs entry to the next comment.

I will post the dump log (along with the dmesg log) as soon as I manage to get the logs after wifi malfunctioning.

By the way, the user manual on drivers/net/wireless/marvell/mwifiex/README describes that device_dump can be obtained by `cat fw_dump` but the debugfs entry does not exist anymore.

References:
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730924/comments/99
Comment #99 : Bug #1730924 “Wifi does down “crash” in Surface Pro 4” : Bugs : linux package : Ubuntu

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Created attachment 285391
[PATCH] mwifiex: debugfs: add entry for device_dump

This is a patch to add debugfs entry for device_dump.
Just to test device_dump functionality on Surface devices.

Usage:

  echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi and Tsuchiya,

here's a dmesg and a firmware dump of the command timeout issue. It wasn't hard to get it, the devcoredump entry appeared as soon as "mwifiex dump information to /sys/class/devcoredump end" was written to the log.

Regards,
Jonas

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285445
dmesg of command timeout issue

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285447
firmware dump of command timeout issue

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

>> 1. Command timeout after second suspend could be recreated in my Surface 4;
more specific observation on this(you could cross check, if possible):
1. timeout happens after a resume, if we see below message in dmesg:
   "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

2. if device is connected before the 1st suspend, no issue is seen after 1st resume;

3. if device is connected before 2nd suspend(i.e after 1st suspend_resume), then issue is seen after 2nd resume

4. if device is not at all connected during suspend_resume stress test, no issue is seen

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi verdre,

>> went back to a clean kernel and tried disabling L1.2 to fix the command
>> timeout. While that indeed fixed the command timeout issue, the "Firmware
>> wakeup
failed" issue wasn't fixed.

could you share your changes, with which command timeout is not seen;

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 285537
0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume

Hi Tsuchiya/Verdre,

Could please give a try with attached change;

Regards,
Ganapathi

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to Ganapathi Bhat from comment #105)
> Created attachment 285537 [details]
> 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume
>
> Hi Tsuchiya/Verdre,
>
> Could please give a try with attached change;
>
> Regards,
> Ganapathi

I'm wondering if it has any effect on any system. PCI core *does* the same (*) for you already. Does it mean we have some race there?

*) See implementation of pci_pm_resume().

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Andy,

>PCI core *does* the same (*) for you already.

OK; I see, with this change, command timeout is not recreated; without this change, just after second iteration I see the issue(as given in comment#103);

>Does it mean we have some race there?
I need to understand more on this; I was just trying this change because, different open source wireless drivers have this in their resume handler;

Regards,
Ganapathi

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> 1. timeout happens after a resume, if we see below message in dmesg:
> "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

For me, it says instead:
`mwifiex_pcie 0000:02:00.0: Refused to change power state, currently in D3`

> 2. if device is connected before the 1st suspend, no issue is seen after 1st
> resume;
> 3. if device is connected before 2nd suspend(i.e after 1st suspend_resume),
> then issue is seen after 2nd resume

Yes. However, the "cmd_timeout" will sometimes also occur after 1st resume for me. Also in this case, the message from mwifiex is the same as No. 1.

> 4. if device is not at all connected during suspend_resume stress test, no
> issue is seen

For me, the "cmd_timeout" will occur regardless of AP connection status/rfkill status/power_save status.
What changed by No. 4 is that, it will not cause the "lockup" (e.g. not preventing system shutdown)

Also, I tried the patch 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume on comment#105.
- So far, the wifi is still working after several suspend.
- I remember I've done exactly the same thing before and it did not
  work before. Hmm…?
- The same message as No. 1 will still be printed on dmesg after
  the patch.
- It may still not connect to AP automatically because it fails to find APs
  (`nmcli d wifi list` shows empty). This is easily fixed by reloading "mwifiex_pcie" module

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

since you can now reproduce the issues, are there any updates on a proper fix in the firmware or the kernel driver?

Regards,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

I am yet to try on my side, the reason for why change in comment#105 don't cause the issue;

Please let know your inputs on the change shared in comment#105; We will improve the change considering different observations(comment#106, #108)

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from https://bugzilla.kernel.org/show_bug.cgi?id=109681#c105 for some time now,

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 286459
attachment-14890-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 286461
attachment-14943-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Sorry, last comment was sent accidentally...

Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from comment#105 for some time now, and I'm still seeing the "Refused to change power state, currently in D3" messages after resuming from the second suspend.

It seems like the "Firmware wakeup failed" issue is not happening anymore with this change, but the "cmd timeout" issue is definitely still happening, also without suspending at all, so I'm seeing all the same things as Tsuchiya in comment#108.

Thank you,
Jonas

Revision history for this message
In , ganapathi.bhat (ganapathi.bhat-linux-kernel-bugs) wrote :

Hi Verdre,

>>but the "cmd timeout" issue is definitely still happening

you could share the dmesg + firmware dump;

also, we find below discussions; could you give a try with the script below:
https://github.com/jakeday/linux-surface/issues/456

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 286651
attachment-16293-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Thank you for looking into this issue. Reporting current status.

== "mwifiex crash after suspend"

This issue might be the Surface devices specific problem? Not sure.
I know some ways to prevent at least mwifiex crashing:

1. Disabling d3cold for wifi device (even doable on mainline kernels):

        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.3/0000:03:00.0/d3cold_allowed # on SB1
        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.0/0000:02:00.0/d3cold_allowed # on SP4

2. Your patch "0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume"

3. Disable bridge_d3 on mwifiex_pcie_probe()

        pdev->bus->self->bridge_d3 = false;

but with those ways, as I mentioned before (Comment 108), it sometimes not connect to AP automatically because it fails to find APs (`nmcli d wifi list` shows empty). This is easily fixed by reloading the "mwifiex_pcie" module.

Patch for this issue we (linux-surface community) are currently using is, based this patch from sebanc:
- https://github.com/jakeday/linux-surface/issues/420#issuecomment-544615371
Actual commit is here:
- https://github.com/linux-surface/kernel/commit/c8da40c0538a1528f3efd85cd91d9e314c85dd19

The patch from sebanc does the following things:
- modify mwifiex_pcie_suspend/mwifiex_pcie_resume functions to fix crash after suspend.
  It seems that the function mwifiex_pcie_suspend2 is from mwifiex_pcie_remove
  and the function mwifiex_pcie_resume2 is largely from mwifiex_pcie_probe.
  So, I think that it behaves in the same way as we unload modules before
  suspend and re-load modules after suspend.
- disable bridge_d3 to fix crash after suspend
- disable "auto deep sleep" (I think not needed for this issue, but introduced
  hoping to fix other issues)

I think the patch is too aggressive to be upstreamed as they are, but please use it as a reference.

The patch also fixes S0ix on KBL/KBL-R devices (such as Surface Book 2, Surface Pro 5, Surface Laptop 1 or later). SKL devices such as Surface Pro 4 or Surface Book 1 cannot achieve S0ix anyway because of other reasons outside of mwifiex.

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :
Download full text (3.3 KiB)

== "mwifiex crash on idle/using"

I heard this issue only from Surface 3 and SP5 owners.
It can be fixed by disabling ASPM L1 state. I think this is the hardest issue to fix. So, I hope at least we could fix the mwifiex reset feature (see below).

== "mwifiex reset feature broken"

I think this issue is common on all devices that use the 88W8897 chip.
You can reproduce this issue by the following command:

        echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

Expected: mwifiex is usable after this reset.
Actual: mwifiex is broke after this reset with following dmesg output

        $ echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset
        mwifiex_pcie 0000:03:00.0: Resetting per request
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 3
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex...
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: WLAN FW already running! Skip FW dnld
        mwifiex_pcie 0000:03:00.0: WLAN FW is active
        mwifiex_pcie 0000:03:00.0: Unknown api_id: 4
        mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID]
        mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 15
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        [...]

Also, does anyone know what is the expected behavior on function level reset? Should firmware be re-downloaded? As you can see on the dmesg log, fw is still active after the reset. ("WLAN FW already running! Skip FW dnld")

== "power_save causes connection instability at least on 5GHz APs"

"connection instability" here means not crash, but networking stop, like ping not responding.

I think this issue is c...

Read more...

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Also, ff someone knows devices (other than Surface) that use the same 88W8897 chip (or at least similar PCIe chip like 88W8997), let me know. I want to get the device as a reference.

Thank you.

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

*if someone knows devices

Revision history for this message
In , ganapathi.bhat (ganapathi.bhat-linux-kernel-bugs) wrote :

Hi Tsuchiya,

>>comment#118:

>>Also, does anyone know what is the expected behavior on function level reset?

note that FLR does require support in the device; 88W8897(PCIE) do not support this;

[we will respond to remaining comments soon]

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi everyone,

a few months have passed, any news about this bug?

Regards,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 288579
attachment-30359-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 288581
attachment-30418-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

You-Sheng Yang (vicamo)
tags: added: hwe-networking-wifi
Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi everyone,

so Tsuchiya and I have been looking into this quite a lot for the last few weeks and we've identified several issues and tried to find the underlying reasons as far as possible.

We hope this information will be helpful for you in indentifying issues with the firmware and the driver so you don't have to take a shot in the dark. Because most issues and their reproducers are already extensively explained in this bugreport, we've listed what we suspect to be some of the root causes for those issues here:

1) The LTR (Latency Tolerance Reporting) messages sent by the firmware might be incorrect. With the newest firmware, the card always reports the maximum latency, which prevents the System from entering the lowest C-States (pc10 and slp_s0). I'm not sure whether that actually prevents ASPM L1.2 from working, because the command timeouts I see when ASPM L1.2 is enabled (other devices see them only with (Surface 3) L1 or (Surface Pro 6) L0s) are still happening.

2) Performing a PCI function level reset of the card with firmware redownloading only works when power-cycling the card by switching it to D3cold and then to D0 again, see [1], which implements this. As mentioned in that commit, this is a quirk that's also done by Windows and seems to be specific to only Surface devices.

3) For avoid a crash of the firmware when resuming from suspend, it's important to disable D3 for the PCI bridge, see [2]. This is another quirk that also seems to be applied by Windows.

4) There's also a bluetooth powersaving issue, which makes the bluetooth device never enter USB the suspend state as long as no LE device is connected: That's because (assuming the linux bluetooth stack works correctly) the firmware never stops sending interrupts with LE advertising reports, which can easily be confirmed by using the btmon utility and by disabling Bluetooth LE functionality in /etc/bluettoth/main.conf.

[1] https://github.com/kitakar5525/linux-surface-kernel/commit/6eba36073e85a64200f2c28e2a6ba0dbeb120bf2

[2] https://github.com/kitakar5525/linux-surface-kernel/commit/4606706679bf81ef41e701fc5e4d5b87adb983dd

Thank you!

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to verdre from comment #125)
> Hi everyone,
>
> so Tsuchiya and I have been looking into this quite a lot for the last few
> weeks and we've identified several issues and tried to find the underlying
> reasons as far as possible.

I think, especially after acquisition, Marvell is not capable to fix this. Perhaps we simple have to move on, i.e. promote your excellent work to upstream. I can help to assure wireless maintainer that this can be accepted even w.o. Marvell involvement.

Let's simple proceed to fix this very annoying issues.

Revision history for this message
In , pali (pali-linux-kernel-bugs) wrote :

Hello!

Tsuchiya & Verdre: Do you have some fixes for instability issues when card is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO wifi chips.

I described some problems related to power save mode, nl80211/cfg80211 layer and mwifiex driver in email which I sent to linux-wireless mailing list: https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

But I have not got any response about it yet and I think that without fixing those issues at cfg80211 layer we cannot fix mwifiex driver for power save mode.

Andy, are you still able to help us with (generic/cfg80211) wireless part?

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 290043
attachment-26841-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

(In reply to Pali Rohár from comment #127)
> Hello!

Hello!

> Tsuchiya & Verdre: Do you have some fixes for instability issues when card
> is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO
> wifi chips.

Unfortunately, not yet. I'm using 5GHz AP with just power_save off. Maybe
all what we can do about power_save and firmware crashing is submitting
a journal log and devcoredump. I'll attach them in the next comment.

> I described some problems related to power save mode, nl80211/cfg80211 layer
> and mwifiex driver in email which I sent to linux-wireless mailing list:
> https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

Thank you for letting us know!

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Created attachment 290067
mwifiex crash log on SB1 with ps_on, 5GHzAP

- power_save on
- connected to 5GHz AP
- connection instability (ping not responding)
- firmware crashing eventually

Journal log and devcoredump are included in the archive. I kept sending ping to 192.168.1.1 and the output was redirected to kmsg to show connection stability
state.

    $ ping 192.168.1.1 | sudo tee /dev/kmsg
    [...]
    64 bytes from 192.168.1.1: icmp_seq=5798 ttl=255 time=3.81 ms
    64 bytes from 192.168.1.1: icmp_seq=5799 ttl=255 time=3.50 ms
    64 bytes from 192.168.1.1: icmp_seq=5800 ttl=255 time=7.16 ms
    #
    # connection stopped suddenly
    #
    From 192.168.1.59 icmp_seq=5830 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5831 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5832 Destination Host Unreachable
    [...]
    From 192.168.1.59 icmp_seq=6019 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6020 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6025 Destination Host Unreachable
    #
    # connection started working again
    #
    64 bytes from 192.168.1.1: icmp_seq=5801 ttl=255 time=230664 ms
    64 bytes from 192.168.1.1: icmp_seq=5802 ttl=255 time=229660 ms
    64 bytes from 192.168.1.1: icmp_seq=5803 ttl=255 time=228647 ms
    [...]
    64 bytes from 192.168.1.1: icmp_seq=6082 ttl=255 time=7.24 ms
    64 bytes from 192.168.1.1: icmp_seq=6083 ttl=255 time=80.0 ms
    64 bytes from 192.168.1.1: icmp_seq=6084 ttl=255 time=6.73 ms
    #
    # firmware crashed suddenly
    #
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    [...]

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Hi Ganapathi,

We can't do anything about power_save and firmware crashing. So, I
attached journal log and devcoredump when using with power_save on and
5GHz AP, containing 1) connection instability and 2) firmware crashing
eventually.

I hope it helps in debugging.

Thank you.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

*** Bug 195183 has been marked as a duplicate of this bug. ***

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

I have reassigned this to Jonas because it seems he knows much better the state of affairs with this crappy hardware/firmware than vendor does. Jonas, if you are not comfortable, tell me what to do with this bug report (and AFAIU the v5.15+ behaves much more stable nowadays, correct?).

Changed in linux:
status: Confirmed → Incomplete
Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Thanks Andy, indeed with new kernels all the stability issues should be gone.

What remains unsolved are all the firmware bugs without workarounds (see the list I put together in this email: https://lore<email address hidden>/) and the patch to allow s0ix sleep during runtime, I'm not sure how/if I should follow up on that: https://<email address hidden>/T/#u

Anyway, given that this bug is specifically about stability, I'd say we can finally close it.

Changed in linux:
status: Incomplete → Fix Released
Displaying first 40 and last 40 comments. View all 160 comments or add a comment.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.