Wifi does down "crash" in Surface Pro 4

Bug #1730924 reported by Hicks
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linux
Fix Released
Medium
linux (Ubuntu)
Incomplete
Medium
Unassigned

Bug Description

I have a Surface Pro 4. The wifi works well in principle, but unfortunately it drops every x minutes. The only way to fix it I've found is to reboot the computer.
lsb_release -rd
Description: Ubuntu 17.10
Release: 17.10

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-16-generic 4.13.0-16.19
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
Uname: Linux 4.13.0-16-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: predatux 1537 F.... pulseaudio
CurrentDesktop: KDE
Date: Wed Nov 8 10:41:26 2017
HibernationDevice: RESUME=UUID=147af4ba-a4ce-41fe-a176-b36a1f6a590b
Lsusb:
 Bus 002 Device 002: ID 045e:090c Microsoft Corp.
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 045e:07e8 Microsoft Corp.
 Bus 001 Device 003: ID 1286:204c Marvell Semiconductor, Inc.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Microsoft Corporation Surface Pro 4
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic root=UUID=125200b0-7377-4985-a217-15503781a525 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-16-generic N/A
 linux-backports-modules-4.13.0-16-generic N/A
 linux-firmware 1.169
SourcePackage: linux
UpgradeStatus: Upgraded to artful on 2017-10-22 (16 days ago)
dmi.bios.date: 02/24/2017
dmi.bios.vendor: Microsoft Corporation
dmi.bios.version: 106.1624.768
dmi.board.name: Surface Pro 4
dmi.board.vendor: Microsoft Corporation
dmi.chassis.type: 9
dmi.chassis.vendor: Microsoft Corporation
dmi.modalias: dmi:bvnMicrosoftCorporation:bvr106.1624.768:bd02/24/2017:svnMicrosoftCorporation:pnSurfacePro4:pvrD0B08F1C03P38:rvnMicrosoftCorporation:rnSurfacePro4:rvr:cvnMicrosoftCorporation:ct9:cvr:
dmi.product.family: Surface
dmi.product.name: Surface Pro 4
dmi.product.version: D:0B:08F:1C:03P:38
dmi.sys.vendor: Microsoft Corporation

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Wifi looks fine at first after boot, but after hours of use or even idle, it goes into a unrecoverable state (unable to send and receive) until next reboot. dmesg logs shows lots of following messages:

[ 6822.586704] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586731] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6822.586899] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586904] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6827.925487] mwifiex_pcie 0000:02:00.0: 4296924600 : Tx timeout(#40), bss_type-num = 0-0
[ 6828.584671] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584687] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6828.584755] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584763] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587608] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587623] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587698] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587706] mwifiex_pcie 0000:02:00.0: failed to get signal information

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 197921
a more complete log when problem happens.

Attach a more complete log when problem happens.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Thanks for reporting the problem. This is a command timeout problem caused mostly due to a firmware bug.
Which firmware is being used here?

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

It's commit bbe4917 from linux-firmware git repository, not quite sure about the actual version.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :
Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 197961
log with firmware 15.68.7.p53

At first I though it's fixed, but I just got into similar situation. Log attached.

BTW, I also seems to have unstable connection when wifi is working. E.g. I open google.com in firefox with google's instant search feature enabled. While typing, The result usually not displayed and I have to refresh page several times to get web page displayed. The other laptop I have doesn't show the same problem so it's not network issue.

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

I have the same problem with mwifiex_pcie (latest git firmware) on Surface Pro 3

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 201031
A more complete log with debug flag set to 0xffffff

4.3.3 with firmware 15.68.7.p53

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

I have the same problem. In addition, using mwiflex_pcie on Surface Pro 4 seems to cause intermittent system crashes.

Using kernel 4.3.3-5 compiled from linux-source.

Revision history for this message
In , andrea (andrea-linux-kernel-bugs) wrote :

I have the same problem on my Microsoft Surface 3 (not Pro model), same issues (even though the rest of system is stable) and same dmesg printout.
Lspci lists the card on my tablet as: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless.

Revision history for this message
In , andrea (andrea-linux-kernel-bugs) wrote :

I have also noticed that the issues appear as soon as I start using the wireless card heavily (downloading files, apt update, etc.). As long as I just use ping, everything is fine.

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

It becomes much worse in linux 4.5 comparing with 4.4.2.

Is there any update on this issue?

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 209311
attachment-2572-0.html

The driver was updated recently:

Из git://git.marvell.com/mwifiex-firmware
   e92f8b3..a96efa0 master -> origin/master
Обновление e92f8b3..a96efa0
Fast-forward
 mrvl/pcie8897_uapsta.bin | Bin 803884 -> 816772 bytes
 mrvl/sd8897_uapsta.bin | Bin 780760 -> 794148 bytes

2016-03-15 0:25 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #11 from Weng Xuetian <email address hidden> ---
> It becomes much worse in linux 4.5 comparing with 4.4.2.
>
> Is there any update on this issue?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Actually I'm aware of that.

However, with latest firmware + linux 4.5, my system would simply freeze in a few minutes that I can't get any useful information for this bug.

With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi also enters bad state quite fast comparing with 4.4.2 or 4.3.3.

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

The only thing I notice so far is that:

With latest firmware + linux 4.3.3, I notice a lot of "mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0" when wifi enters bad state, which doesn't happen with 15.68.7.p53.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Hi Weng,

Can you provide complete log for the issue mentioned in comment#14?

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 209441
attachment-7733-0.html

My system freezes too with 4.5 kernel and latest and previous versions of
firmware.

2016-03-16 6:52 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #13 from Weng Xuetian <email address hidden> ---
> Actually I'm aware of that.
>
> However, with latest firmware + linux 4.5, my system would simply freeze
> in a
> few minutes that I can't get any useful information for this bug.
>
> With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi
> also
> enters bad state quite fast comparing with 4.4.2 or 4.3.3.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , wengxt (wengxt-linux-kernel-bugs) wrote :

Created attachment 210651
kernel 4.5 log with debug mask 0xfffffff

The complete log is too large(286MB), so I cut it to last 1 minute before PREP_CMD: FW is in bad state happens.

mwifiex_pcie 0000:02:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p66)

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I'm experiencing a variety of problems with this driver as well.

Using Linux v4.6-rc1-64-g6ddf37d, firmware 15.68.7.p66.

The first failure is an inability to connect. This is sporadic; it seems to happen about 1 time in 4 boots, but one time I had it three or four times in a row. I will attach a dmesg log with debug increased part way through. It seems as though this may be similar to the problem reported here:
  https://<email address hidden>/msg19911.html

The second failure is horrific performance, accompanied by constantly repeated mwifiex_process_sleep_confirm_resp: cmd size is 0 messages. The user experience is harder to quantify; you get absolutely terrible network performance (7-10% packet loss), and general instability. I'll attach a dmesg log for that as well, again with debug increased part way through. This is similar to what Weng Xuetian reported in Comment 14.

Finally, as I use this device, I find one consistent and persistent oddity. That is, even when everything is 'working', I get poor latency connecting to the device. That is, my latency when I ping 'out' is fine (< 10 ms, not much variation). But when I ping the device from a different system, I get highly variable latency. (Ranging from 10 ms to 4000 ms). No dropped packets, but horrible latency. You really notice it when you're ssh'd in :-/.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 211711
Log of the failure to connect.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 211721
Log of the cmd size 0 failure mode.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I've spent some time looking at the source code and trying to debug what is happening. I'm using the kvalo wireless-drivers-next pending branch, and I've applied recent patches sent to lkml that seem valuable; but the problems persist.

It seems as though in the 'normal' case we get an unusual number of 'max count reached while accessing sleep cookie' messages. The usleep calls in that case make me suspicious, and being slow to wake up would seem to be a good explanation of the performance we see. (That is, sending a packet is fast, but responding is slow). It also seems like we're in a rather startling loop of awake/sleep events; doing multiple spins within a millisecond, as far as I can tell.

In the disconnect case, I see an 'invalid cmd resp' error followed by 'There is no command but got a command response'.

I have now started booting with debug_mask=0xffffffff; I'll replace my logs with those, as they appear to be more complete.

For reference, here are related areas of the code:

Sleep message:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#377

There is no command:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#1637
(although intriguingly to my naive mind, that code block is related to the sleep check as well).

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 212611
Log of a failure to connect, with debug mask set from boot.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 212621
Log of a 'working' session, which has poor latency. debug_mask ffffffff from start.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Disabling power management (via a Network Manager dispatcher.d script) seems to help the first problem (latency issues). But it leaves the sporadic failures to connect.

A hack to prevent power management from ever starting seems to resolve those failures as well, although it's hard to claim that as true; it could be I was simply not patient enough to reboot enough times to reproduce the failure.

ping latency and dropped packets can still suffer, so there is clearly something else still wrong, but I'm hoping this is a clue to help understand the issue.

I'll attach the hack.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 212631
Hack to prevent power saving from ever starting

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 214061
Debug patch for sleep issue

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

We could not recreate below error message on our reference platform.

[ 16.506296] mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0
[ 16.506513] mwifiex_pcie 0000:02:00.0: max count reached while accessing sleep cookie

Could you apply attached debug patch and share dmesg log for the issue?

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 214081
Log with new debug patch; this is a case where we cannot connect.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 214091
Case where we get just a few slow pings, but it otherwise works.

Interestingly, the constant problem I was having (most packets delayed) seems to have softened, for no discernable reason. But I do still see a few slow packets (e.g. replies in 80-300 ms range), and in the log, I do see a few of the 'max count reached while accessing sleep cookie' messages.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 214141
debug sleep issue changes

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Please try attached patch. It includes a fix and debug messages.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 214161
Run with dbg3

This does seem to fix the lack of connection, although again, it's tough to claim that for sure. But I did get 8 or 9 clean connections.

I do still get hangs at reboot and shutdown sporadically; I don't have a strong sense of whether my hack eliminated those or not. (And they seem to happen after file systems are unmounted so I don't have an easy way to capture them.).

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Thanks for the tests.
Please share the logs for the observed issues.

Revision history for this message
In , duyhieu.bui (duyhieu.bui-linux-kernel-bugs) wrote :

Created attachment 214431
mwifiex on Surface Pro 3

I have a Surface pro 3 model and I had the same problems. I started booting my system with debug flag for mwifiex module with wireless-next kernel yesterday. I have some logs and I hope that they will help you find the problems.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

My apologies; I'm traveling for work, and am away from my normal work environment, which is making it harder for me to be thorough and responsive. However, what I can tell with the tablet and my laptop is that the DBG3 patch seems to improve matters, but it does not appear to be a fix.

I spent some time rebooting the tablet again and again. I eventually was able to get a failure to connect. That was without the debug_mask=0xffffffff; but I will attach that log regardless. I will try to reproduce it with debug on.

I also get frequent kernel panics at reboot or shutdown. They seem to come after the file system is unmounted; I can't seem to find the logs in /var/log or using journalctl -k -b -1. Forgive me if there is kernel dev 101 I am forgetting to properly capture those.

I will try to attach a photograph of one of the kernel panics; the stack trace, at least, appears to be legible.

I do not seem to get the kernel panics on reboot with a 4.4 kernel using my hack.

Full disclosure: I do not have a 'pure' 4.6 tree; I'm using the wireless-next tree, and I have one or two of the interesting mwifiex patches from lkml applied. I will also try to reproduce these tests with a pure 4.6 tree.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 214471
dmesg log of a failure to connect

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 214481
Screenshot of sample kernel panic with dbg3

Revision history for this message
In , duyhieu.bui (duyhieu.bui-linux-kernel-bugs) wrote :

Created attachment 214581
FW in bad state

With the wireless-next kernel and the power saving patch and the debug sleep patch, sometimes I got this error very often. The driver says the firmware is in bad state.

Revision history for this message
In , duyhieu.bui (duyhieu.bui-linux-kernel-bugs) wrote :

Created attachment 214681
Multiple "FW in bad state"

I got multiple "FW in bad state" errors with the latest wifi-next kernel.

Revision history for this message
In , fc (fc-linux-kernel-bugs) wrote :

I am also experiencing `PREP_CMD: FW in reset state` from mwifiex_pcie on a Surface Book, often after long idle periods.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Studying some of the logs, and the logic change in DBG3, a potential clue is that in pcie.c, around line 1610, there is logic something like:
  skb_trim(skb, rx_len)
  skb_pull(skb, INTF_HEADER_LEN)
  if (...) {
    skb_push(skb, INTF_HEADER_LEN)
  }

It appears logical that the skb_push and skb_pull are meant to balance each other. However, I see from studying logs that it we get cases where rx_len is 0. In that case, the skb_pull will be a NOP, and the skb_push will move the data pointer, potentially incorrectly.

I have one skb_under_crash that seems likely to be related to that.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I have a hypothesis as to what I am seeing.

That is, with a fair bit of debug logging, I found that two interrupts in a row could arrive, even if the handler for the first one sent a mask of 0 to the PCIE_HOST_INT_MASK port. That combined with a unique situation involving the while loop in mwifiex_process_pcie_int(). That is, it seems that the flow can be:
  Interrupt #1 (set int_status)
  Start processing interrupt #1. Start by clearing int_status. While processing, interrupt #2 arrives, once again setting int_status. After interrupt #1 is processed, the while loop manually checks the interrupt status on the card and reloops. That causes it to fully process the second interrupt *without clearing int_status*.
  Now we re-enter, see that int_status is set to 0x4, and try to process a command that is no longer there. That is a 0 length command, and we get into a lot of trouble.

I have found that if I comment out the recheck (and effectively disable the while loop) here:
  https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#2238

I get stable results. It's not clear if that's a correct fix. And, frankly, my results are sporadic enough that I'm going to need some time before I believe it's a real result. But I thought I'd share what seems like a promising lead.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 215931
Diff that shows problem, and has current working solution

The attached diff is the patch set I was running with (against wireless-drivers-next/master; HEAD is e1ca790c8a32c0c77b9d89089ac7e73b72c2adfc).

Parts of it are the hack to prevent power saving. Parts of it are debug messages that seem to exacerbate the problem. That is, with these messages in, I get a fairly reproducible problem set.

Finally, there is an #ifdef; JPW_EXPERIMENT_1 that 'fixes' the problems for me.

Revision history for this message
In , viz (viz-linux-kernel-bugs) wrote :

I have tried the patch from Jeremy White (but without disabling power-saving feature) on Linux 4.5.3 (Gentoo Source and latest firmware 15.68.7.p66) and it runs for several hours without any problem, thanks.

Revision history for this message
In , fc (fc-linux-kernel-bugs) wrote :

Thanks Jeremy, I applied the patch on top of tigerite's kernel (https://launchpad.net/~tigerite/+archive/ubuntu/kernel) on top of 4.4.6-3-surface. I will run it for a while on my Surface Book and see if that solves the problem.

It's here in case someone wants to save some time: https://github.com/EiNSTeiN-/mwifiex

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Hi Jeremy,

I got the race condition you described in comment #42. It’s not expected as per our design. Here is the flow.

1) Interrupt is received over PCIe interface
2) Interrupt handler does following things.
    a) Read and clear the interrupt register, copy the interrupt bitmap to "adapter->int_status"
    b) Disable interrupts and exit.

3) Other thread will call mwifiex_process_int_status() to serve the interrupts in while loop.

4) At the end of loop, we read interrupt register for any further interrupts and serve them.

5) We exit from while loop if there are no more interrupts.

6) Interrupts are enabled while exiting mwifiex_process_int_status()

Basically we don't expect interrupt handler being called while serving the interrupts. The reason is interrupts are disabled at that time.

Looks like mwifiex_pcie_disable_host_int() call in mwifiex_interrupt_status() didn't take an effect for some reason.

Can you confirm by adding debug message if you receive an interrupt when it's disabled?

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

The fix from Jeremy is ok and doesn't have any side-effects. It removes the code which was added for improving the performance.

I am curious to know why interrupt is received even after disabling it which caused a race condition.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 216101
Log of failing session

The attached log should reflect the wireless next tree, with the patch I provided applied. (Except that I've left the bottom part of the while loop enabled, so as to reproduce the failure).

I believe the problem case starts at time index 30.893034.

(I had instrumented the 0 byte command failure; so I generally search for '0 byte cmd' and then look backwards to hunt for the reason we got a 0 byte command).

Again, my sense of the events is:
  30.893034 Event (0x00000008) interrupt arrives, we disable
  30.893300 We detect a command response, reloop, and start processing
  30.893300 Command (0x00000004) interrupt arrives, we stash it into
            adapter->int_status
  30.893314 We re-enable (but too late; damage is done)
  30.893374 We 'pop' the already processed command from adapter->int_status,
            and think it's a 0 byte cmd response

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 216161
attachment-8673-0.html

I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
After some time of usage I am unable to turn of wifi or change the AP.
The dmesg log:

[ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432
[ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432

2016-05-12 16:31 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #48 from Jeremy White <email address hidden> ---
> Created attachment 216101
> --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
> Log of failing session
>
> The attached log should reflect the wireless next tree, with the patch I
> provided applied. (Except that I've left the bottom part of the while loop
> enabled, so as to reproduce the failure).
>
> I believe the problem case starts at time index 30.893034.
>
> (I had instrumented the 0 byte command failure; so I generally search for
> '0
> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
> command).
>
> Again, my sense of the events is:
> 30.893034 Event (0x00000008) interrupt arrives, we disable
> 30.893300 We detect a command response, reloop, and start processing
> 30.893300 Command (0x00000004) interrupt arrives, we stash it into
> adapter->int_status
> 30.893314 We re-enable (but too late; damage is done)
> 30.893374 We 'pop' the already processed command from
> adapter->int_status,
> and think it's a 0 byte cmd response
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , anton (anton-linux-kernel-bugs) wrote :

Created attachment 216261
attachment-11994-0.html

Well, I have tested it for 3 days and it seems to work fine

2016-05-12 20:08 GMT+03:00 Anton Anikin <email address hidden>:

> I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
> After some time of usage I am unable to turn of wifi or change the AP.
> The dmesg log:
>
> [ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
> [ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
>
>
> 2016-05-12 16:31 GMT+03:00 <email address hidden>:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>>
>> --- Comment #48 from Jeremy White <email address hidden> ---
>> Created attachment 216101
>> --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
>> Log of failing session
>>
>> The attached log should reflect the wireless next tree, with the patch I
>> provided applied. (Except that I've left the bottom part of the while
>> loop
>> enabled, so as to reproduce the failure).
>>
>> I believe the problem case starts at time index 30.893034.
>>
>> (I had instrumented the 0 byte command failure; so I generally search for
>> '0
>> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
>> command).
>>
>> Again, my sense of the events is:
>> 30.893034 Event (0x00000008) interrupt arrives, we disable
>> 30.893300 We detect a command response, reloop, and start processing
>> 30.893300 Command (0x00000004) interrupt arrives, we stash it into
>> adapter->int_status
>> 30.893314 We re-enable (but too late; damage is done)
>> 30.893374 We 'pop' the already processed command from
>> adapter->int_status,
>> and think it's a 0 byte cmd response
>>
>> --
>> You are receiving this mail because:
>> You are on the CC list for the bug.
>>
>
>
>
>
>

Revision history for this message
In , fc (fc-linux-kernel-bugs) wrote :

I have been running a version of the patch with the power management changes removed (i.e. power management turned on) to see if it made a difference. I had several instances of unusable wifi (until next reboot) in 2 days. So disabling power management definitely seems to be part of the solution.

Revision history for this message
In , david.lopez (david.lopez-linux-kernel-bugs) wrote :

Created attachment 218101
Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)

I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I use the package linux-surfacepro3 4.6-1 (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error message in dmesg which didn't appear in previous versions of the package.

I don't know if this error message could be of interest.

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

(In reply to David López from comment #52)
> Created attachment 218101 [details]
> Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)
>
> I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I
> use the package linux-surfacepro3 4.6-1
> (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes
> kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error
> message in dmesg which didn't appear in previous versions of the package.
>
> I don't know if this error message could be of interest.

That's not related or (as far as I can tell) especially harmful. If you look at the referenced line of code (net/wireless/core.c, line 363) you see it's a message generated because we have a set_tx_power function without a get_tx_power function. (The second message is the same; but for get_antenna). I suspect you're seeing it because you've updated your driver and/or firmware.

Revision history for this message
In , stephenjust (stephenjust-linux-kernel-bugs) wrote :

There is some new firmware available from Marvell [1]. So far, I've had no issues with it, but I have only tried it for a few days. Perhaps this new firmware makes any patch unnecessary?

I've actually managed to dig up some release notes in ChromiumOS's commit logs that suggest the newer firmware may help with stability [2].

[1] http://git.marvell.com/?p=mwifiex-firmware.git;a=tree;f=mrvl;hb=HEAD
[2] https://chromium.googlesource.com/chromiumos/third_party/marvell/

Revision history for this message
In , cfeck (cfeck-linux-kernel-bugs) wrote :

The Marvell firmware from link [1] in (previous) comment #54 does not help avoiding the random disconnects on my Surface Pro 3 with openSUSE Tumbleweed kernel 4.6.3. Not sure if this kernel carries any relevant patches compared to upstream.

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

I have just submitted patches to resolve original issue reported in this bug. i.e. below error messages followed by command timeout.

[ 11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp

https://patchwork.kernel.org/patch/9232091/
https://patchwork.kernel.org/patch/9232093/

Revision history for this message
In , duyhieu.bui (duyhieu.bui-linux-kernel-bugs) wrote :

(In reply to Amitkumar Karwar from comment #56)
> I have just submitted patches to resolve original issue reported in this
> bug. i.e. below error messages followed by command timeout.
>
> [ 11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
> [ 11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
>
> https://patchwork.kernel.org/patch/9232091/
> https://patchwork.kernel.org/patch/9232093/

It seems more stable but the 'CMD_RESP' problem persists on my system (Surface Pro 3 running Linux 4.7.0).

    7.720645] mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 7.808716] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p77)
[ 7.808721] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.7.p77)
[ 7.812614] mwifiex_pcie 0000:01:00.0 eth0: renamed from mlan0
[ 9.696401] mwifiex_pcie 0000:01:00.0: info: trying to associate to 'n0wifi' bssid 4c:e6:76:71:25:1b
[ 9.726026] mwifiex_pcie 0000:01:00.0: info: associated to bssid 4c:e6:76:71:25:1b successfully
[ 1145.239295] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1146.012946] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 1624.849858] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1625.236366] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1626.009534] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 3988.474123] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 4c:e6:76:71:25:1b: reason code 3
[ 3988.474151] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:00:00:00:00:00: reason code 3

Revision history for this message
In , duyhieu.bui (duyhieu.bui-linux-kernel-bugs) wrote :

Created attachment 227261
Firmware in bad state after sleep on Surface pro 3

I got the previous errors (firmware in bad state, 'CMD_RESP') after sleep with the new patch. I tried to reset the pci card/rmmod and modprobe the module again without success.

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

I'm now running a 4.7 kernel which has Amitkumar's patches from 2016-07-15 included. I get errors like this:

Sep 25 11:22:14 hat kernel: [ 1058.526154] mwifiex_pcie 0000:02:00.0: CMD_RESP: cmd 0x10f error, result=0x2

and the driver crashes after a couple of hours.

With Jeremy White's patch from 2016-05-11 above, the driver is completely stable.

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

To clarify my last comment:

With JW:s patch, I *still* get the CMD_RESP error (which I assume is harmless) but the driver doesn't crash.

Could this have something to do with the following change?:

--- a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c
+++ a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c
@@ -2237,7 +2237,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
   if (ret)
    return -1;

- if (priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
+ if (0 && priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
    /* Enable IEEE PS by default */
    priv->adapter->ps_mode = MWIFIEX_802_11_POWER_MODE_PSP;
    ret = mwifiex_send_cmd(priv,
@@ -2300,7 +2300,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
  if (ret)
   return -1;

- if (!disable_auto_ds &&
+ if (0 && !disable_auto_ds &&
      first_sta && priv->adapter->iface_type != MWIFIEX_USB &&
      priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
   /* Enable auto deep sleep */

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

This is the behavior for 4.8.0 (without any patches):

The driver works for 25 s. Then I get the message:

Oct 6 09:34:15 hat wpa_supplicant[1959]: wlp2s0: CTRL-EVENT-SCAN-FAILED ret=-16 retry=1

every second.

If I shut down network-manager and reload the module, it again works for 25 s, followed by the above message. I also cannot scan for any wireless networks when the message starts, although network communication works.

No sign, yet, of the CMD_RESP error which I've reported about above for previous kernels.

Should I, perhaps, report this as a new bug?

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

I can now report that I've tested the kernel from Linus master branch. It seems like the patches by Amitkumar Karwar from July which went into the mainline kernel yesterday finally makes the mwifiex driver stable on MS Surface Pro 4, so I suggest that this bug report can now be closed.

Revision history for this message
In , kj.kuhns (kj.kuhns-linux-kernel-bugs) wrote :

Mikael,

What version of both the kernel and mwifiex are you using.

I am using mwifiex 1.0 (15.68.7.p77)
and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)

if the console ever goes into a suspend state, I end up with what you were seeing: "PREP_CMD: FW in reset state"

Thanks,
Kevin

Revision history for this message
In , cfeck (cfeck-linux-kernel-bugs) wrote :

I can confirm that with newest firmware and kernel 4.10, the connection is stable on a SP3. I do get constant 5% CPU usage in a kworker/uN:N process caused by the mwifiex driver, though. Not sure if this is expected, or worth a separate bug report.

Revision history for this message
In , cfeck (cfeck-linux-kernel-bugs) wrote :

With kernel 4.11.0, the issue is back.

[ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to '<routername>' bssid <ID>
[ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID> successfully
[ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
[ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Revision history for this message
In , mikael (mikael-linux-kernel-bugs) wrote :

Created attachment 256641
no-power-save

Dear Kevin,

Sorry for taking so long to respond.

I have got this error in many different kernels. I still get it in 4.11.0.
But I also know what causes it: The current linux driver doesn't handle
power management properly. So, it is necessary to insert if-up rules which
switches off power management. In my case, where the interface is governed
by NetworkManager, I wrote a file
/etc/NetworkManager/dispatcher.d/no-power-save which I attach to this email.

With this fix, my mwifiex driver is now fully stable under kernel 4.11.0.

Best regards,
Mikael

On Fri, Feb 17, 2017 at 6:24 PM, <email address hidden>
wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> Kevin (<email address hidden>) changed:
>
> What |Removed |Added
> ------------------------------------------------------------
> ----------------
> CC| |<email address hidden>
>
> --- Comment #63 from Kevin (<email address hidden>) ---
> Mikael,
>
> What version of both the kernel and mwifiex are you using.
>
> I am using mwifiex 1.0 (15.68.7.p77)
> and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)
>
> if the console ever goes into a suspend state, I end up with what you were
> seeing: "PREP_CMD: FW in reset state"
>
> Thanks,
> Kevin
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Christoph

(In reply to Christoph from comment #65)
> With kernel 4.11.0, the issue is back.
>
> [ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Can you please share the firmware dump after the above issue occur.

1. To manually trigger the dump you can use below command:

cat /sys/kernel/debug/mwifiex/mlan0/device_dump

2. When the dump is complete dmesg will log "mwifiex firmware dump end".

3. Collect the dump with below script:
-------------------------------------------------------------
#!/bin/bash
/sbin/ethtool --set-dump mlan0 0
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/ITCM.log

/sbin/ethtool --set-dump mlan0 1
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/DTCM.log

/sbin/ethtool --set-dump mlan0 2
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/SQRAM.log

/sbin/ethtool --set-dump mlan0 3
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/IRAM.log
-------------------------------------------------------------

4. Share the /tmp/*.log files collected above for further debugging.

5. Also, please share the dmesg logs generated after you have run the above command.

Thanks,
Ganapathi

Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Hicks (hicks1gb) wrote :

Tested with "wicd" too.
Wicd crashes when wifi goes down

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Hicks (hicks1gb) wrote : Re: [Bug 1730924] Re: Wifi does down "crash" in Surface Pro 4

El 14/11/17 a las 17:54, Joseph Salisbury escribió:
> Did this issue start happening after an update/upgrade? Was there a
> prior kernel version where you were not having this particular problem?
No, always
>
> Would it be possible for you to test the latest upstream kernel? Refer
> to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
> v4.14 kernel[0].

Testing...
>
> If this bug is fixed in the mainline kernel, please add the following
> tag 'kernel-fixed-upstream'.
>
> If the mainline kernel does not fix this bug, please add the tag:
> 'kernel-bug-exists-upstream'.
>
> Once testing of the upstream kernel is complete, please mark this bug as
> "Confirmed".
>
>
> Thanks in advance.
>
> [0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14
>
>
> ** Changed in: linux (Ubuntu)
> Importance: Undecided => Medium
>
> ** Changed in: linux (Ubuntu)
> Status: Confirmed => Incomplete
>

Revision history for this message
Hicks (hicks1gb) wrote :

At the moment wifi still working... No crashes.

Revision history for this message
Hicks (hicks1gb) wrote :

More crashes! Problem still here.

Revision history for this message
Hicks (hicks1gb) wrote :

bug-exists-upstream

Revision history for this message
Hicks (hicks1gb) wrote :

Let's see, I just tested this patch, which supposedly solves the wifi controller instability, but it doesn't work for me. I'll leave the link for what it's worth.

https://github.com/EiNSTeiN-/mwifiex

Another detail I want to comment on is that every time the wifi is dropped, the surface is semi blocked. Terminal stops responding, I can't run any commands. Ctrl+c does not work either. Wicd also stops responding.
If I try to restart the terminal, it remains "waiting" for processes to close, which never finish closing.
I finally have to force the switch off.

Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
Hicks (hicks1gb) wrote :
Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

I see this issue on a newer Surface Pro 2017, with kernel 4.14, firmware 15.68.7.p119, and the symptoms are very familiar. Unfortunately, disabling power save does not seem to be as persistent a cure; it improves matters, but does not resolve the issue.

I've attempted to capture a dump, but I am being told that --set-dump mlan0 0 has no such device, and --set-dump wlp1s0 0 returns no such device.

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Jeremy,

Thanks for the intimation.

>>I've attempted to capture a dump, but I am being told that --set-dump mlan0 0
>>has no such device, and --set-dump wlp1s0 0 returns no such device.

Ok. Firmware dump is written into below file:
/sys/devices/virtual/devcoredump/devcd1/data

Can you please check if this file is generated for you. If generated kindly share the same.

Regards,
Ganapathi

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

Created attachment 260791
dmesg log and data file from firmware dump

Attached is a tarball with a dmesg log and the /sys/devices/virtual/devcoredump/devcd1/data
 file resulting after requesting a firmware dump.

Note that this is a 4.14 kernel, modified for Surface Pro use, but with the mwifiex module identical to the current tip of the 4.14 tree. (This tree to be precise https://github.com/jakeday/linux-surface).

I also get this result with a variety of other kernels; it seems specific to this hardware + firmware, and perhaps to this particular use case. (It is an odd case; a modified Fedora Core 23 stack.)

Nicely, I now get fairly consistent behavior. That is, upon first launch, the wifi stack is very laggy - you'll get ping times of 43, 100, 243, 79, and so on (when you'd expect steady 1.43). That shows up in interactive ssh sessions. Further, after a modest amount of use (usually just a few meg will do it), the wireless will stop working. You also can no longer successfully reboot or poweroff; you get a hang at kernel shutdown time.

The attached came from a session with a clean boot, debug_mask 0xffffffff, and about 3M of the kernel source downloaded before the wifi stopped functioning.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Try the daily mainline kernel, http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

There's a new fix for random MAC scan, worth a try.

Revision history for this message
Hicks (hicks1gb) wrote :

Testing...

Revision history for this message
Hicks (hicks1gb) wrote :

At the moment the wifi has not fallen, but after a reboot of the system has been slow enough to start and gives me this error.
I'll keep trying to see if he can hold on.

Revision history for this message
Hicks (hicks1gb) wrote :

Still doesn't work. The wifi is down again, taking the rest of the system with it.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

Any updates on this from Marvell side?

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Jeremy,

Sorry for the delay. I checked the logs you shared. Here it is a scan command(0x107) timeout. But the firmware dump did not give any hints for firmware going to bad state. Moreover the scratch registers dumps tell us that the timed out command 0x107 did not reach the firmware at all. We are checking firmware dumps and will share the updates soon.

On the other hand, are you not getting below error:

> With kernel 4.11.0, the issue is back.
>
> [ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Thanks,
Ganapathi

Revision history for this message
In , jwhite (jwhite-linux-kernel-bugs) wrote :

(In reply to Ganapathi Bhat from comment #72)
> Hi Jeremy,
>
> Sorry for the delay. I checked the logs you shared. Here it is a scan
> command(0x107) timeout. But the firmware dump did not give any hints for
> firmware going to bad state. Moreover the scratch registers dumps tell us
> that the timed out command 0x107 did not reach the firmware at all. We are
> checking firmware dumps and will share the updates soon.

Thanks, I look forward to the update.

>
>
> On the other hand, are you not getting below error:

No, I don't recall seeing that error recently. Note that was Christoph that reported that, not I. I haven't tried with 4.11.

Cheers,

Jeremy

Revision history for this message
In , hugh-coleman (hugh-coleman-linux-kernel-bugs) wrote :

I am seeing that error message on a Surface Pro 2017. The connection appears stable regardless of powersaving but after suspending I get:
[ 361.092570] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed

Followed by an endless loop of:
[ 366.098914] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 366.098926] mwifiex_pcie 0000:01:00.0: scan failed: -1

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
P.I.Julius (pijulius) wrote :
  • dump Edit (1.1 MiB, application/octet-stream)

Hi Guys,

I'm suffering this same problem on Surface Pro 2017 and would really like to find a solution as using Linux is the only option for me to not throw it out through the window as it has a corrupted memory an so can only use Linux which allows me to skip the affected area using memmap.

Unfortunately this problem forces me to reboot around 10 times a day with light usage so it makes it hard to enjoy the tablet.

Googling found this case and would like to attach the core dump from /sys/class/devcoredump maybe it will be helpful, also can duplicate it pretty easily so if help is needed re testing I'm here and finally would also offer a $100 reward for a solution if I may, I know its not that much but as gesture of appreciation hopefully it will be received with kindness.

Thank you guys in advance for the work on this!

Regards,
Julius

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please attach `sudo lspci -vvnn`.

Revision history for this message
P.I.Julius (pijulius) wrote :

For sure, please see it attached below.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Have you tried kernel parameter `pcie_aspm=off`? There are some PCIe WiFi modules don't like ASPM L0s.

Revision history for this message
P.I.Julius (pijulius) wrote :

Yes, just tried it and unfortunately no help, same coredump (let me know if you want one attached again) after playing a bit of video from youtube :(

Revision history for this message
In , mail (mail-linux-kernel-bugs) wrote :

I owned a Surface Book and now own a Surface Book 2. And both were/are afffected by this. Still with kernel 4.20-rc5 the disconnect problem persists after long heavy-usage. Does Ganapathi have some updates yet?

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Chatty,

Do you mean you see the error in #Comment74? (after suspending)?

Regards,
Ganapathi

Revision history for this message
P.I.Julius (pijulius) wrote :

Hi Guys,

Just a note that may be helpful for tracking this down here is what I found out lately.

If in "drivers/net/wireless/marvell/mwifiex/cmdevt.c" file, "mwifiex_cmd_timeout_func(struct timer_list *t)" function I comment out these:

//if (adapter->if_ops.device_dump)
// adapter->if_ops.device_dump(adapter);

//if (adapter->if_ops.card_reset)
// adapter->if_ops.card_reset(adapter);

The wifi will still drop with the timeout command but at least the mwifiex_pcie module wont freeze and so you will be able to restart or even put the pc into sleep and after a resume simply continue where you left off without having to do a hard restart by long pressing power button.

It seems the card_reset(adapter) function is the one freezing the module and so freezing the wifi, so will try to see what's there but if any of you have any idea how we could reset the adapter and even (just for testing why not) always reload the firmware and see if that fixes it as it seems the resume does the same thing and that makes things work again.

Regards,
Julius

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

What kernel version did you use with `pcie_aspm=off`?
There's bug in older kernel that `pcie_aspm=off` doesn't turn ASPM off.

Please use mainline kernel with that kernel parameter.

Revision history for this message
P.I.Julius (pijulius) wrote :

Hi Kai-Heng Feng,

I see you wrote me twice with the same suggestion and GOOD SOOOO as I have given it up to easily on the first try but I can say now that GOT IT WORKING!!!

So here is everything I found out and did to make it work on Fedora 29 (Kernel 4.18.11)

First of all of course rebuild the whole kernel with all the patches from here:
https://github.com/jakeday/linux-surface

FYI: it now supports Surface Pro 5 2017 (including battery and buttons once qzed's patch is accepted too)

Added pcie_aspm=off to the kernel (NOT Needed, instead please keep reading on) command but unfortunately that didn't do anything so started digging and realized drivers/pci/pcie/aspm.c doesn't even compiles when building the kernel so there is no code to check this command parameter for the kernel anyway.

Then dug around some more and found I don't even have "/sys/module/pcie_aspm/parameters/policy" file not even the directory "/sys/module/pcie_aspm" so it seems the kernel for whatever reason (as couldn't find PCIE_ASPM config related things in make menuconfig either).

So wanted to make sure ASPM is indeed turned on for the wireles so I did:
lspci -vvv

and found this device:
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless

with the following:
LnkCtl: ASPM L0s L1 Enabled;

Now I started to google around on how to disable ASPM on device basis as it seems these values are set from the firmware itself (if I understood things correctly) and found baybal's script (which enables ASPM on dell devices):
https://gist.github.com/baybal/b499fc5811a7073df0c03ab8da4be904

and instead of enabling I changed it to disable on the ethernet controller and that did the trick!!! Now doing lspci -s 01:00.0 -vvv shows me this instead:
LnkCtl: ASPM Disabled;

and since then I can play a YouTube video in 1440p and it never freezes (and was freezing like in first minute before that) also realized that if I changed to 720p it was dying out later bit did die out eventually but now with ASPM off even 1440p works just fine and plays trough the whole video.

I'm attaching the shell script that I quickly modified from baybal's code, it needs to be run every time on startup (not on resume as far as I could see/tested) and then all should be fine just please make sure you verify that your Marvell network device is indeed at 01:00.0 and if not just edit the file and change this.

So will keep on testing but it seems this fixed the problem and all working just great now so thanks again Kai-Heng Feng for the pointers if you give me the details I'm ready to send you over the small reward that I offered for help/solution and it seems yours was the one so just let me know the PayPal address and it will be on it's way.

Thanks again and hopefully this will be it and all the details/script will be helpful to the others too.

Regards,
Julius

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Sorry for the late reply.
The next step is to find out if we need to disable both ASPM L0s and ASPM L1, or just disable one of them can solve our issue.

Brad Figg (brad-figg)
tags: added: ubuntu-certified
Brad Figg (brad-figg)
tags: removed: ubuntu-certified
Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi,

this still seems to be an issue on a lot of Surface devices and I'm not sure if it was ever solved.

I own a Surface Pro 2017 and I'm seeing the same issues as in #Comment74, the issue happens randomly after some time (sometimes hours, sometimes minutes) and is easily reproducible by suspending and resuming multiple times. As long as you're not suspending the device, a workaround is to manually set PCIe ASPM to L0-only for the wifi card.

I'm running a 5.3 kernel right now, are there any patches I could try?

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to verdre from comment #77)
> Hi,
>
> this still seems to be an issue on a lot of Surface devices and I'm not sure
> if it was ever solved.
>
> I own a Surface Pro 2017 and I'm seeing the same issues as in #Comment74,
> the issue happens randomly after some time (sometimes hours, sometimes
> minutes) and is easily reproducible by suspending and resuming multiple
> times. As long as you're not suspending the device, a workaround is to
> manually set PCIe ASPM to L0-only for the wifi card.
>
> I'm running a 5.3 kernel right now, are there any patches I could try?

It seems a firmware issue and Marvell simple doesn't care. Perhaps power of social networks can help.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

Reassign directly to developer from Marvell.

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

I've been looking into this for the last few days and it turns out the issue is definitely caused by PCI link power management: Using the CONFIG_PCIEASPM_DEBUG config option of the kernel I was able to track the issue down to the PCI ASPM L1.2 power saving state (set link_state of the parent PCI bridge to 15 or 111 to enable everything except L1.2 [1]).

Not sure how to continue at this point, since PCIe ASPM is implemented in hardware and debugging would require special equipment, I really hope someone at Marvell steps up and has a look into this. It might also be possible that L1.2 is disabled on Windows, but I haven't found tools to read the L1 substates from the PCI configuration on Windows yet.

Another interesting fact is that the random command timeouts I'm looking into here only seem to happen on some Surface devices using the 8897 chip via PCI, that is the Pro 4, Pro 2017, the Pro 6, Laptop 2 (see comments in [2]), but not on the Surface Book 2 (according to owners of the device). There are also multiple reports of the same issue on completely different devices to be found on the internet ([3], [4]), but those never got any attention by Marvell developers.

[1] https://github.com/torvalds/linux/blob/master/drivers/pci/pcie/aspm.c#L28
[2] https://github.com/jakeday/linux-surface/issues/163
[3] https://www.spinics.net/lists/linux-wireless/msg159776.html
[4] https://www.spinics.net/lists/linux-wireless/msg175943.html

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Andy/Verdre,

Last time I started working on this, I failed to fix due to different reasons; From past few days I see still the issue is same as given in comment#74;

>Using the CONFIG_PCIEASPM_DEBUG config option of the kernel I was able to
>track the issue down to the PCI ASPM L1.2 power saving state (set link_state
>of the parent PCI bridge to 15 or 111 to enable everything except L1.2 [1]).

This is very good help for us to continue; I will try recreate this again by flashing latest Fedora on MS - Surface 4 and share the updates here soon;

Thanks for this;

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

thank you for picking this up again. To reproduce the issue it should be enough to make sure L1.2 is enabled and can be entered (all other L-states are enabled), that should be the case by default when using Fedora 30. The issue might then appear at any time, although it seems to happen more quickly when downloading a file.

Additionally, the reason why the "pcie_aspm=off" kernel cmdline parameter doesn't work for people is that the kernel follows ACPI rules here [1] and refuses to take over control of the ASPM configuration because ACPI requires ASPM to be enabled.

Also I assume the reason why some people were only seeing this issue after suspending is that the Surface firmware seems to set the ASPM bits to enabled after the device was suspended for some time (I assume it happens when the EC was in deep sleep) by itself, so if the distro or kernel version handles ASPM differently and maybe disables it by default, it would definitely be enabled after suspending and the issue would start to appear.

Note that the errors don't have to begin with "Firmware wakeup failed" as described in Comment#74. Since it seems like we're dealing with a complete loss of communication with the card here, the error might as well be a command timeout.

If there's anything else I can do to help resolving this issue let me know!

[1] https://github.com/torvalds/linux/blob/master/drivers/acpi/pci_root.c#L464

Thanks,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

Could you give inputs on how did you get Fedora 30 installed on Surface 4; I created the image using 'dd' and tried to boot from it;It would get to grub, after I was getting the _ cursor forever;

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

> Could you give inputs on how did you get Fedora 30 installed on Surface 4; I
> created the image using 'dd' and tried to boot from it;It would get to grub,
> after I was getting the _ cursor forever;

That's weird, not sure why that wouldn't be working, if you get to grub you probably did everything correctly... I'm using a Surface Pro 2017, which is the Pro 4's successor though. Maybe make sure secure-boot (and maybe the TPM, too) is disabled?

Also it seems like someone had the same problem as you with a Pro 4 and Fedora, so you might just want to try Ubuntu: https://www.reddit.com/r/SurfaceLinux/comments/7y4k29/success_fedora_27_on_a_surface_pro_4/

Or you could try out Fedora 31 beta, which should have a 5.3 kernel: https://download.fedoraproject.org/pub/fedora/linux/releases/test/31_Beta/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-31_Beta-1.1.iso

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

>Or you could try out Fedora 31 beta, which should have a 5.3 kernel
Thanks for this, it worked for me;

Now, I try to recreate like below(power save enabled):
1. connect to a known AP
2. ping to AP for few seconds
3. suspend(use rtcwake) and resume
4. goto #1

I do not see the error in comment#74(or a command/TX timeout yet), but let me try it for longer time;

>To reproduce the issue it should be enough to make sure L1.2 is enabled and
>can be entered
I could not yet confirm this; Do you have any inputs on how to confirm this setting is as per above?

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

I'm making progress with figuring out potential workarounds for this issue right now, seems like the "Firmware wakeup failed" error on resume indeed is caused by another issue (that might still be related to PCI connection/ASPM issues) than the command timeout: I wasn't seeing the "Firmware wakeup failed" error at all so far, it only started appearing after I went back to a clean kernel and tried disabling L1.2 to fix the command timeout. While that indeed fixed the command timeout issue, the "Firmware wakeup failed" issue wasn't fixed. Turns out I had at least two changes in my modified kernel sources that (combined) fix the "Firmware wakeup failed" issue, I'm trying to isolate those changes right now.

Your testing setup sounds correct and should trigger the "Firmware wakeup failed" issue (it happens after only two suspend cycles here), maybe make sure that S0Ix sleep is actually reached during suspend by checking if /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us and /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us are not 0.

The command timeout should happen quite quickly (less than 5 minutes) when downloading data, although I can't say anything for devices other than my Pro 2017, the occurrences might be a lot rarer on other devices.

You can check which L-states and substates are enabled by executing "lspci -vvv" as root. The L0s and L1 state can be found under "Capabilities: Express (v2) Endpoint -> LnkCtl" and the L1-substates can be found under "Capabilities: L1 PM Substates -> L1SubCtl1" (enabled substates are marked with a "+" after the state, disabled ones with a "-"). Or you can compile your kernel with CONFIG_PCIEASPM_DEBUG and read the value from /sys/bus/pci/devices/PARENT_BRIDGE/power/link_state

Thanks,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

>I'm making progress with figuring out potential workarounds for this issue
>right now, seems like the "Firmware wakeup failed" error on resume indeed is
>caused by another issue (that might still be related to PCI connection/ASPM
>issues) than the command timeout
OK; Thanks for this; I read it as 'these two issues have different root causes';

>The L0s and L1 state can be found under "Capabilities: Express (v2) Endpoint
>-> LnkCtl"
---
LnkCtl: ASPM L0s L1 Enabled;
---

>the L1-substates can be found under "Capabilities: L1 PM Substates ->
>L1SubCtl1"
---
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
---

>make sure L1.2 is enabled and can be entered
it is enabled(above output of lspci -vvv); but is there a seperate check to know if it could be *entered*;

>maybe make sure that S0Ix sleep is actually reached during suspend by checking
cat /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
cat: /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us: No such file or directory

cat /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us
3565055

>Your testing setup sounds correct and should trigger the "Firmware wakeup
>failed" issue
>The command timeout should happen quite quickly

I am doing different try outs, I will update as soon I could recreate these;

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

> OK; Thanks for this; I read it as 'these two issues have different root
> causes';

Not sure if that's the case since we enable link power saving after every suspend again.

> it is enabled(above output of lspci -vvv); but is there a seperate check to
> know if it could be *entered*;

I don't think there is, maybe there are unofficial debugging tools by Intel or related manufacturers to read statistics about those states from the chipset.

> cat: /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us: No
> such file or directory

Sounds like the processor used in the Pro 4 isn't new enough, from the cpu residency it sounds like S0Ix is working. You can read more about those states here: https://01.org/blogs/qwang59/2018/how-achieve-s0ix-states-linux

Also I figured out what the fix for the "Firmware wakeup failed" errors I wrote about was: Sleeping for about a second before writing to the PCI power state register, so not really a fix, just a workaround...

If you fail to reproduce the issue on the Pro 4 maybe you can get a Pro 2017 and try it on that?

Thanks,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

Could you kindly share the dmesg logs, when the issue is recreated; You could use below method to collect the log:

1. dmesg -c
2. echo 0xffffffff > /sys/kernel/debug/mwifiex/mlan0/debug_mask
3. while true; do dmesg -c >> dmesg.txt; done &
4. run your use case
5. share dmesg.txt once you could hit the issue

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

getting the logs turned out to be harder than I expected, with the while-loop method the issue didn't happen (while the download speed of wget, which I used to reproduce the issue, sometimes dropped to near 0, it always managed to recover and went back up again).

Since the systemd-journal stores the kernel messages anyway, I'll attach a complete journald kernel log and the dmesg output after the issue happened.

Thanks,
Jonas

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285333
dmesg of command timeout issue

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285335
systemd journal kernel log from boot until command timeout issue

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

Thanks for above log;

1. Could you confirm above issue is seen only in connected state:
[a] say, you keep scanning for AP but do not connect to any AP;
[b] along with #[a], you do suspend and resume;
[c] after a specific time(you could try different duration, based on how fast you use to see the issue) you try to connect to any AP;
[d] do we see #[c] pass(could connect) or fail; if it does pass, then would it fail after few more suspend and resume;

2. could you kindly try the same use case by disabling power_save(if not tried already); You could just use below command before connection:

iw dev mlan0 set power_save off

Let me know, if am not clear;

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

>2. could you kindly try the same use case by disabling power_save
I mean, your use case(not #1)

Regards,
Ganapathi

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Hi Verdre and Ganapathi,

I own Surface Book 1 (Skylake, same as Surface Pro 4) and Surface 3 (non-pro, Intel Atom Cherry Trail) for daily usage
and recently I got Surface Pro 4 (with broken sensor) for debugging purposes.

env:
- freshly installed Fedora 31 Beta on an external USB drive and use it on SP4
- run `sudo dnf upgrade` and now `uname -r` shows `5.3.2-300.fc31.x86_64`

@Ganapathi
> I am doing different try outs, I will update as soon I could recreate these;

You mean you're having trouble recreating issues?
On my SP4:
- "cmd_timeout"
     This issue will occur after a second suspend like the dmesg log on Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off. So, I'm not sure why you cannot recreate this issue.

- "Firmware wakeup failed"
     This issue will randomly occur after a while when using with 5GHz AP+power_save on

- "WLAN FW already running! Skip FW dnld" after card reset
     As you can see on the dmesg log (Comment#91), after card reset, the firmware will not be re-downloaded on Surface devices. I think that re-downloading the firmware is the expected behavior.
You may recreate this issue by this command:

echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

After resetting, it shows these messages repeatedly on dmesg and wifi is not usable anymore:

kern :info : [ 175.478500] mwifiex_pcie 0000:02:00.0: cmd_wait_q terminated: -110
kern :info : [ 175.478510] mwifiex_pcie 0000:02:00.0: deleting the crypto keys

Maybe this is the simplest one to fix? and at least if resetting works, we can continue to use wifi.

To reset the card with firmware re-downloading, currently I have to call an ACPI method using acpi_call and remove/rescan the parent bridge of wifi (need to be installed) like this:

WIFI_PARENT="0000:00:1d.0" # SP4 specific
ACPI_WIFI_RST_PATH="\_SB.PCI0.RP09.PXSX.PRWF._RST" # SP4 specific
acall(){ echo "$1" | tee /proc/acpi/call >/dev/null && cat /proc/acpi/call;echo;} # wrapper

modprobe acpi_call
acall $ACPI_WIFI_RST_PATH
echo 1 | tee "/sys/bus/pci/devices/$WIFI_PARENT/remove"
sleep 1
echo 1 | tee /sys/bus/pci/rescan

---

By the way, I find 2.4GHz AP is much more stable with even power_save on. 5GHz AP with power_save on causes wifi stop without any message on dmesg.

It seems that showing APs (e.g. `nmcli d wifi list`) will temporarily fix the wifi stop.

This is another issue from what we're discussing now, though.

Best regards,
Tsuchiya, Yuto (kitakar5525)

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> acpi_call and remove/rescan the parent bridge of wifi (need to be installed)

Sorry, I made a mistake
acpi_call (need to be installed) and remove/rescan the parent bridge of wifi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Tsuchiya/Verdre,

>- "cmd_timeout"
> This issue will occur after a second suspend like the dmesg log on
> >Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off

1. Command timeout after second suspend could be recreated in my Surface 4;

2. Logs were similar to the one in #91 & #92 by Verdre;

3. It is found that the command, which timed out did not reach the firmware at all and am trying to understand more;

4. Once a command times out, driver dumps few scratch registers and also it triggers the firmware dump; But I fail to get these in my device(I don't know why, it is a different problem); So, I still fail to confirm we are hitting same issue;

Note that, when the dump is uploaded to user-space, dmesg will say:
"== mwifiex dump information to /sys/class/devcoredump end"

Could you please share the dump(along with dmesg), which is saved in below file(it is automatically deleted after 5 minutes):
********************************************
/sys/devices/virtual/devcoredump/devcd1/data
********************************************

>- "Firmware wakeup failed"
This could also be recreated for the first time now; I kept Surface 4 connected to AP and PING overnight; But I have no much detail yet;

I will comeback soon on remaining two concerns you raised in comment#95;

Regards,
Ganapathi

Regards,
Ganapathi

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> But I fail to get these in my device(I don't know why, it is a different
> problem); So, I still fail to confirm we are hitting same issue;

Yeah, the device_dump is also a problematic one on Surface devices.
It's reported that if_ops.device_dump() will cause lockups [1] (e.g. prevent system power off)

I also confirmed that by
1. adding debugfs entry for calling that function
2. and called the function via debugfs (echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump)
Sometimes I can get the device_dump log, sometimes I can't and it prevents system power off. I will post the patch which adds the debugfs entry to the next comment.

I will post the dump log (along with the dmesg log) as soon as I manage to get the logs after wifi malfunctioning.

By the way, the user manual on drivers/net/wireless/marvell/mwifiex/README describes that device_dump can be obtained by `cat fw_dump` but the debugfs entry does not exist anymore.

References:
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730924/comments/99
Comment #99 : Bug #1730924 “Wifi does down “crash” in Surface Pro 4” : Bugs : linux package : Ubuntu

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Created attachment 285391
[PATCH] mwifiex: debugfs: add entry for device_dump

This is a patch to add debugfs entry for device_dump.
Just to test device_dump functionality on Surface devices.

Usage:

  echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi and Tsuchiya,

here's a dmesg and a firmware dump of the command timeout issue. It wasn't hard to get it, the devcoredump entry appeared as soon as "mwifiex dump information to /sys/class/devcoredump end" was written to the log.

Regards,
Jonas

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285445
dmesg of command timeout issue

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Created attachment 285447
firmware dump of command timeout issue

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

>> 1. Command timeout after second suspend could be recreated in my Surface 4;
more specific observation on this(you could cross check, if possible):
1. timeout happens after a resume, if we see below message in dmesg:
   "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

2. if device is connected before the 1st suspend, no issue is seen after 1st resume;

3. if device is connected before 2nd suspend(i.e after 1st suspend_resume), then issue is seen after 2nd resume

4. if device is not at all connected during suspend_resume stress test, no issue is seen

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi verdre,

>> went back to a clean kernel and tried disabling L1.2 to fix the command
>> timeout. While that indeed fixed the command timeout issue, the "Firmware
>> wakeup
failed" issue wasn't fixed.

could you share your changes, with which command timeout is not seen;

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 285537
0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume

Hi Tsuchiya/Verdre,

Could please give a try with attached change;

Regards,
Ganapathi

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to Ganapathi Bhat from comment #105)
> Created attachment 285537 [details]
> 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume
>
> Hi Tsuchiya/Verdre,
>
> Could please give a try with attached change;
>
> Regards,
> Ganapathi

I'm wondering if it has any effect on any system. PCI core *does* the same (*) for you already. Does it mean we have some race there?

*) See implementation of pci_pm_resume().

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Andy,

>PCI core *does* the same (*) for you already.

OK; I see, with this change, command timeout is not recreated; without this change, just after second iteration I see the issue(as given in comment#103);

>Does it mean we have some race there?
I need to understand more on this; I was just trying this change because, different open source wireless drivers have this in their resume handler;

Regards,
Ganapathi

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

> 1. timeout happens after a resume, if we see below message in dmesg:
> "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

For me, it says instead:
`mwifiex_pcie 0000:02:00.0: Refused to change power state, currently in D3`

> 2. if device is connected before the 1st suspend, no issue is seen after 1st
> resume;
> 3. if device is connected before 2nd suspend(i.e after 1st suspend_resume),
> then issue is seen after 2nd resume

Yes. However, the "cmd_timeout" will sometimes also occur after 1st resume for me. Also in this case, the message from mwifiex is the same as No. 1.

> 4. if device is not at all connected during suspend_resume stress test, no
> issue is seen

For me, the "cmd_timeout" will occur regardless of AP connection status/rfkill status/power_save status.
What changed by No. 4 is that, it will not cause the "lockup" (e.g. not preventing system shutdown)

Also, I tried the patch 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume on comment#105.
- So far, the wifi is still working after several suspend.
- I remember I've done exactly the same thing before and it did not
  work before. Hmm…?
- The same message as No. 1 will still be printed on dmesg after
  the patch.
- It may still not connect to AP automatically because it fails to find APs
  (`nmcli d wifi list` shows empty). This is easily fixed by reloading "mwifiex_pcie" module

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

since you can now reproduce the issues, are there any updates on a proper fix in the firmware or the kernel driver?

Regards,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Hi Verdre,

I am yet to try on my side, the reason for why change in comment#105 don't cause the issue;

Please let know your inputs on the change shared in comment#105; We will improve the change considering different observations(comment#106, #108)

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from https://bugzilla.kernel.org/show_bug.cgi?id=109681#c105 for some time now,

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 286459
attachment-14890-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 286461
attachment-14943-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Sorry, last comment was sent accidentally...

Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from comment#105 for some time now, and I'm still seeing the "Refused to change power state, currently in D3" messages after resuming from the second suspend.

It seems like the "Firmware wakeup failed" issue is not happening anymore with this change, but the "cmd timeout" issue is definitely still happening, also without suspending at all, so I'm seeing all the same things as Tsuchiya in comment#108.

Thank you,
Jonas

Revision history for this message
In , ganapathi.bhat (ganapathi.bhat-linux-kernel-bugs) wrote :

Hi Verdre,

>>but the "cmd timeout" issue is definitely still happening

you could share the dmesg + firmware dump;

also, we find below discussions; could you give a try with the script below:
https://github.com/jakeday/linux-surface/issues/456

Regards,
Ganapathi

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 286651
attachment-16293-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Thank you for looking into this issue. Reporting current status.

== "mwifiex crash after suspend"

This issue might be the Surface devices specific problem? Not sure.
I know some ways to prevent at least mwifiex crashing:

1. Disabling d3cold for wifi device (even doable on mainline kernels):

        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.3/0000:03:00.0/d3cold_allowed # on SB1
        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.0/0000:02:00.0/d3cold_allowed # on SP4

2. Your patch "0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume"

3. Disable bridge_d3 on mwifiex_pcie_probe()

        pdev->bus->self->bridge_d3 = false;

but with those ways, as I mentioned before (Comment 108), it sometimes not connect to AP automatically because it fails to find APs (`nmcli d wifi list` shows empty). This is easily fixed by reloading the "mwifiex_pcie" module.

Patch for this issue we (linux-surface community) are currently using is, based this patch from sebanc:
- https://github.com/jakeday/linux-surface/issues/420#issuecomment-544615371
Actual commit is here:
- https://github.com/linux-surface/kernel/commit/c8da40c0538a1528f3efd85cd91d9e314c85dd19

The patch from sebanc does the following things:
- modify mwifiex_pcie_suspend/mwifiex_pcie_resume functions to fix crash after suspend.
  It seems that the function mwifiex_pcie_suspend2 is from mwifiex_pcie_remove
  and the function mwifiex_pcie_resume2 is largely from mwifiex_pcie_probe.
  So, I think that it behaves in the same way as we unload modules before
  suspend and re-load modules after suspend.
- disable bridge_d3 to fix crash after suspend
- disable "auto deep sleep" (I think not needed for this issue, but introduced
  hoping to fix other issues)

I think the patch is too aggressive to be upstreamed as they are, but please use it as a reference.

The patch also fixes S0ix on KBL/KBL-R devices (such as Surface Book 2, Surface Pro 5, Surface Laptop 1 or later). SKL devices such as Surface Pro 4 or Surface Book 1 cannot achieve S0ix anyway because of other reasons outside of mwifiex.

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :
Download full text (3.3 KiB)

== "mwifiex crash on idle/using"

I heard this issue only from Surface 3 and SP5 owners.
It can be fixed by disabling ASPM L1 state. I think this is the hardest issue to fix. So, I hope at least we could fix the mwifiex reset feature (see below).

== "mwifiex reset feature broken"

I think this issue is common on all devices that use the 88W8897 chip.
You can reproduce this issue by the following command:

        echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

Expected: mwifiex is usable after this reset.
Actual: mwifiex is broke after this reset with following dmesg output

        $ echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset
        mwifiex_pcie 0000:03:00.0: Resetting per request
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 3
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex...
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: WLAN FW already running! Skip FW dnld
        mwifiex_pcie 0000:03:00.0: WLAN FW is active
        mwifiex_pcie 0000:03:00.0: Unknown api_id: 4
        mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID]
        mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 15
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        [...]

Also, does anyone know what is the expected behavior on function level reset? Should firmware be re-downloaded? As you can see on the dmesg log, fw is still active after the reset. ("WLAN FW already running! Skip FW dnld")

== "power_save causes connection instability at least on 5GHz APs"

"connection instability" here means not crash, but networking stop, like ping not responding.

I think this issue is c...

Read more...

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Also, ff someone knows devices (other than Surface) that use the same 88W8897 chip (or at least similar PCIe chip like 88W8997), let me know. I want to get the device as a reference.

Thank you.

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

*if someone knows devices

Revision history for this message
In , ganapathi.bhat (ganapathi.bhat-linux-kernel-bugs) wrote :

Hi Tsuchiya,

>>comment#118:

>>Also, does anyone know what is the expected behavior on function level reset?

note that FLR does require support in the device; 88W8897(PCIE) do not support this;

[we will respond to remaining comments soon]

Regards,
Ganapathi

Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi everyone,

a few months have passed, any news about this bug?

Regards,
Jonas

Revision history for this message
In , gbhat (gbhat-linux-kernel-bugs) wrote :

Created attachment 288579
attachment-30359-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 288581
attachment-30418-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

You-Sheng Yang (vicamo)
tags: added: hwe-networking-wifi
Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Hi everyone,

so Tsuchiya and I have been looking into this quite a lot for the last few weeks and we've identified several issues and tried to find the underlying reasons as far as possible.

We hope this information will be helpful for you in indentifying issues with the firmware and the driver so you don't have to take a shot in the dark. Because most issues and their reproducers are already extensively explained in this bugreport, we've listed what we suspect to be some of the root causes for those issues here:

1) The LTR (Latency Tolerance Reporting) messages sent by the firmware might be incorrect. With the newest firmware, the card always reports the maximum latency, which prevents the System from entering the lowest C-States (pc10 and slp_s0). I'm not sure whether that actually prevents ASPM L1.2 from working, because the command timeouts I see when ASPM L1.2 is enabled (other devices see them only with (Surface 3) L1 or (Surface Pro 6) L0s) are still happening.

2) Performing a PCI function level reset of the card with firmware redownloading only works when power-cycling the card by switching it to D3cold and then to D0 again, see [1], which implements this. As mentioned in that commit, this is a quirk that's also done by Windows and seems to be specific to only Surface devices.

3) For avoid a crash of the firmware when resuming from suspend, it's important to disable D3 for the PCI bridge, see [2]. This is another quirk that also seems to be applied by Windows.

4) There's also a bluetooth powersaving issue, which makes the bluetooth device never enter USB the suspend state as long as no LE device is connected: That's because (assuming the linux bluetooth stack works correctly) the firmware never stops sending interrupts with LE advertising reports, which can easily be confirmed by using the btmon utility and by disabling Bluetooth LE functionality in /etc/bluettoth/main.conf.

[1] https://github.com/kitakar5525/linux-surface-kernel/commit/6eba36073e85a64200f2c28e2a6ba0dbeb120bf2

[2] https://github.com/kitakar5525/linux-surface-kernel/commit/4606706679bf81ef41e701fc5e4d5b87adb983dd

Thank you!

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

(In reply to verdre from comment #125)
> Hi everyone,
>
> so Tsuchiya and I have been looking into this quite a lot for the last few
> weeks and we've identified several issues and tried to find the underlying
> reasons as far as possible.

I think, especially after acquisition, Marvell is not capable to fix this. Perhaps we simple have to move on, i.e. promote your excellent work to upstream. I can help to assure wireless maintainer that this can be accepted even w.o. Marvell involvement.

Let's simple proceed to fix this very annoying issues.

Revision history for this message
In , pali (pali-linux-kernel-bugs) wrote :

Hello!

Tsuchiya & Verdre: Do you have some fixes for instability issues when card is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO wifi chips.

I described some problems related to power save mode, nl80211/cfg80211 layer and mwifiex driver in email which I sent to linux-wireless mailing list: https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

But I have not got any response about it yet and I think that without fixing those issues at cfg80211 layer we cannot fix mwifiex driver for power save mode.

Andy, are you still able to help us with (generic/cfg80211) wireless part?

Revision history for this message
In , akarwar (akarwar-linux-kernel-bugs) wrote :

Created attachment 290043
attachment-26841-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at <email address hidden>

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

(In reply to Pali Rohár from comment #127)
> Hello!

Hello!

> Tsuchiya & Verdre: Do you have some fixes for instability issues when card
> is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO
> wifi chips.

Unfortunately, not yet. I'm using 5GHz AP with just power_save off. Maybe
all what we can do about power_save and firmware crashing is submitting
a journal log and devcoredump. I'll attach them in the next comment.

> I described some problems related to power save mode, nl80211/cfg80211 layer
> and mwifiex driver in email which I sent to linux-wireless mailing list:
> https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

Thank you for letting us know!

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Created attachment 290067
mwifiex crash log on SB1 with ps_on, 5GHzAP

- power_save on
- connected to 5GHz AP
- connection instability (ping not responding)
- firmware crashing eventually

Journal log and devcoredump are included in the archive. I kept sending ping to 192.168.1.1 and the output was redirected to kmsg to show connection stability
state.

    $ ping 192.168.1.1 | sudo tee /dev/kmsg
    [...]
    64 bytes from 192.168.1.1: icmp_seq=5798 ttl=255 time=3.81 ms
    64 bytes from 192.168.1.1: icmp_seq=5799 ttl=255 time=3.50 ms
    64 bytes from 192.168.1.1: icmp_seq=5800 ttl=255 time=7.16 ms
    #
    # connection stopped suddenly
    #
    From 192.168.1.59 icmp_seq=5830 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5831 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5832 Destination Host Unreachable
    [...]
    From 192.168.1.59 icmp_seq=6019 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6020 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6025 Destination Host Unreachable
    #
    # connection started working again
    #
    64 bytes from 192.168.1.1: icmp_seq=5801 ttl=255 time=230664 ms
    64 bytes from 192.168.1.1: icmp_seq=5802 ttl=255 time=229660 ms
    64 bytes from 192.168.1.1: icmp_seq=5803 ttl=255 time=228647 ms
    [...]
    64 bytes from 192.168.1.1: icmp_seq=6082 ttl=255 time=7.24 ms
    64 bytes from 192.168.1.1: icmp_seq=6083 ttl=255 time=80.0 ms
    64 bytes from 192.168.1.1: icmp_seq=6084 ttl=255 time=6.73 ms
    #
    # firmware crashed suddenly
    #
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    [...]

Revision history for this message
In , kitakar (kitakar-linux-kernel-bugs) wrote :

Hi Ganapathi,

We can't do anything about power_save and firmware crashing. So, I
attached journal log and devcoredump when using with power_save on and
5GHz AP, containing 1) connection instability and 2) firmware crashing
eventually.

I hope it helps in debugging.

Thank you.

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

*** Bug 195183 has been marked as a duplicate of this bug. ***

Revision history for this message
In , andy.shevchenko (andy.shevchenko-linux-kernel-bugs) wrote :

I have reassigned this to Jonas because it seems he knows much better the state of affairs with this crappy hardware/firmware than vendor does. Jonas, if you are not comfortable, tell me what to do with this bug report (and AFAIU the v5.15+ behaves much more stable nowadays, correct?).

Changed in linux:
status: Confirmed → Incomplete
Revision history for this message
In , verdre (verdre-linux-kernel-bugs) wrote :

Thanks Andy, indeed with new kernels all the stability issues should be gone.

What remains unsolved are all the firmware bugs without workarounds (see the list I put together in this email: https://lore<email address hidden>/) and the patch to allow s0ix sleep during runtime, I'm not sure how/if I should follow up on that: https://<email address hidden>/T/#u

Anyway, given that this bug is specifically about stability, I'd say we can finally close it.

Changed in linux:
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.