Wifi does down "crash" in Surface Pro 4

Bug #1730924 reported by Hicks on 2017-11-08
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Medium
linux (Ubuntu)
Medium
Unassigned

Bug Description

I have a Surface Pro 4. The wifi works well in principle, but unfortunately it drops every x minutes. The only way to fix it I've found is to reboot the computer.
lsb_release -rd
Description: Ubuntu 17.10
Release: 17.10

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-16-generic 4.13.0-16.19
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
Uname: Linux 4.13.0-16-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: predatux 1537 F.... pulseaudio
CurrentDesktop: KDE
Date: Wed Nov 8 10:41:26 2017
HibernationDevice: RESUME=UUID=147af4ba-a4ce-41fe-a176-b36a1f6a590b
Lsusb:
 Bus 002 Device 002: ID 045e:090c Microsoft Corp.
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 045e:07e8 Microsoft Corp.
 Bus 001 Device 003: ID 1286:204c Marvell Semiconductor, Inc.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Microsoft Corporation Surface Pro 4
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic root=UUID=125200b0-7377-4985-a217-15503781a525 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-16-generic N/A
 linux-backports-modules-4.13.0-16-generic N/A
 linux-firmware 1.169
SourcePackage: linux
UpgradeStatus: Upgraded to artful on 2017-10-22 (16 days ago)
dmi.bios.date: 02/24/2017
dmi.bios.vendor: Microsoft Corporation
dmi.bios.version: 106.1624.768
dmi.board.name: Surface Pro 4
dmi.board.vendor: Microsoft Corporation
dmi.chassis.type: 9
dmi.chassis.vendor: Microsoft Corporation
dmi.modalias: dmi:bvnMicrosoftCorporation:bvr106.1624.768:bd02/24/2017:svnMicrosoftCorporation:pnSurfacePro4:pvrD0B08F1C03P38:rvnMicrosoftCorporation:rnSurfacePro4:rvr:cvnMicrosoftCorporation:ct9:cvr:
dmi.product.family: Surface
dmi.product.name: Surface Pro 4
dmi.product.version: D:0B:08F:1C:03P:38
dmi.sys.vendor: Microsoft Corporation

Wifi looks fine at first after boot, but after hours of use or even idle, it goes into a unrecoverable state (unable to send and receive) until next reboot. dmesg logs shows lots of following messages:

[ 6822.586704] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586731] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6822.586899] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586904] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6827.925487] mwifiex_pcie 0000:02:00.0: 4296924600 : Tx timeout(#40), bss_type-num = 0-0
[ 6828.584671] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584687] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6828.584755] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584763] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587608] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587623] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587698] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587706] mwifiex_pcie 0000:02:00.0: failed to get signal information

Created attachment 197921
a more complete log when problem happens.

Attach a more complete log when problem happens.

Thanks for reporting the problem. This is a command timeout problem caused mostly due to a firmware bug.
Which firmware is being used here?

It's commit bbe4917 from linux-firmware git repository, not quite sure about the actual version.

Created attachment 197961
log with firmware 15.68.7.p53

At first I though it's fixed, but I just got into similar situation. Log attached.

BTW, I also seems to have unstable connection when wifi is working. E.g. I open google.com in firefox with google's instant search feature enabled. While typing, The result usually not displayed and I have to refresh page several times to get web page displayed. The other laptop I have doesn't show the same problem so it's not network issue.

I have the same problem with mwifiex_pcie (latest git firmware) on Surface Pro 3

Created attachment 201031
A more complete log with debug flag set to 0xffffff

4.3.3 with firmware 15.68.7.p53

I have the same problem. In addition, using mwiflex_pcie on Surface Pro 4 seems to cause intermittent system crashes.

Using kernel 4.3.3-5 compiled from linux-source.

I have the same problem on my Microsoft Surface 3 (not Pro model), same issues (even though the rest of system is stable) and same dmesg printout.
Lspci lists the card on my tablet as: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless.

I have also noticed that the issues appear as soon as I start using the wireless card heavily (downloading files, apt update, etc.). As long as I just use ping, everything is fine.

It becomes much worse in linux 4.5 comparing with 4.4.2.

Is there any update on this issue?

Created attachment 209311
attachment-2572-0.html

The driver was updated recently:

Из git://git.marvell.com/mwifiex-firmware
   e92f8b3..a96efa0 master -> origin/master
Обновление e92f8b3..a96efa0
Fast-forward
 mrvl/pcie8897_uapsta.bin | Bin 803884 -> 816772 bytes
 mrvl/sd8897_uapsta.bin | Bin 780760 -> 794148 bytes

2016-03-15 0:25 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #11 from Weng Xuetian <email address hidden> ---
> It becomes much worse in linux 4.5 comparing with 4.4.2.
>
> Is there any update on this issue?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Actually I'm aware of that.

However, with latest firmware + linux 4.5, my system would simply freeze in a few minutes that I can't get any useful information for this bug.

With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi also enters bad state quite fast comparing with 4.4.2 or 4.3.3.

The only thing I notice so far is that:

With latest firmware + linux 4.3.3, I notice a lot of "mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0" when wifi enters bad state, which doesn't happen with 15.68.7.p53.

Hi Weng,

Can you provide complete log for the issue mentioned in comment#14?

Created attachment 209441
attachment-7733-0.html

My system freezes too with 4.5 kernel and latest and previous versions of
firmware.

2016-03-16 6:52 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #13 from Weng Xuetian <email address hidden> ---
> Actually I'm aware of that.
>
> However, with latest firmware + linux 4.5, my system would simply freeze
> in a
> few minutes that I can't get any useful information for this bug.
>
> With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi
> also
> enters bad state quite fast comparing with 4.4.2 or 4.3.3.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Created attachment 210651
kernel 4.5 log with debug mask 0xfffffff

The complete log is too large(286MB), so I cut it to last 1 minute before PREP_CMD: FW is in bad state happens.

mwifiex_pcie 0000:02:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p66)

I'm experiencing a variety of problems with this driver as well.

Using Linux v4.6-rc1-64-g6ddf37d, firmware 15.68.7.p66.

The first failure is an inability to connect. This is sporadic; it seems to happen about 1 time in 4 boots, but one time I had it three or four times in a row. I will attach a dmesg log with debug increased part way through. It seems as though this may be similar to the problem reported here:
  https://<email address hidden>/msg19911.html

The second failure is horrific performance, accompanied by constantly repeated mwifiex_process_sleep_confirm_resp: cmd size is 0 messages. The user experience is harder to quantify; you get absolutely terrible network performance (7-10% packet loss), and general instability. I'll attach a dmesg log for that as well, again with debug increased part way through. This is similar to what Weng Xuetian reported in Comment 14.

Finally, as I use this device, I find one consistent and persistent oddity. That is, even when everything is 'working', I get poor latency connecting to the device. That is, my latency when I ping 'out' is fine (< 10 ms, not much variation). But when I ping the device from a different system, I get highly variable latency. (Ranging from 10 ms to 4000 ms). No dropped packets, but horrible latency. You really notice it when you're ssh'd in :-/.

Created attachment 211711
Log of the failure to connect.

Created attachment 211721
Log of the cmd size 0 failure mode.

I've spent some time looking at the source code and trying to debug what is happening. I'm using the kvalo wireless-drivers-next pending branch, and I've applied recent patches sent to lkml that seem valuable; but the problems persist.

It seems as though in the 'normal' case we get an unusual number of 'max count reached while accessing sleep cookie' messages. The usleep calls in that case make me suspicious, and being slow to wake up would seem to be a good explanation of the performance we see. (That is, sending a packet is fast, but responding is slow). It also seems like we're in a rather startling loop of awake/sleep events; doing multiple spins within a millisecond, as far as I can tell.

In the disconnect case, I see an 'invalid cmd resp' error followed by 'There is no command but got a command response'.

I have now started booting with debug_mask=0xffffffff; I'll replace my logs with those, as they appear to be more complete.

For reference, here are related areas of the code:

Sleep message:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#377

There is no command:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#1637
(although intriguingly to my naive mind, that code block is related to the sleep check as well).

Created attachment 212611
Log of a failure to connect, with debug mask set from boot.

Created attachment 212621
Log of a 'working' session, which has poor latency. debug_mask ffffffff from start.

21 comments hidden view all 101 comments

Thanks Jeremy, I applied the patch on top of tigerite's kernel (https://launchpad.net/~tigerite/+archive/ubuntu/kernel) on top of 4.4.6-3-surface. I will run it for a while on my Surface Book and see if that solves the problem.

It's here in case someone wants to save some time: https://github.com/EiNSTeiN-/mwifiex

Hi Jeremy,

I got the race condition you described in comment #42. It’s not expected as per our design. Here is the flow.

1) Interrupt is received over PCIe interface
2) Interrupt handler does following things.
    a) Read and clear the interrupt register, copy the interrupt bitmap to "adapter->int_status"
    b) Disable interrupts and exit.

3) Other thread will call mwifiex_process_int_status() to serve the interrupts in while loop.

4) At the end of loop, we read interrupt register for any further interrupts and serve them.

5) We exit from while loop if there are no more interrupts.

6) Interrupts are enabled while exiting mwifiex_process_int_status()

Basically we don't expect interrupt handler being called while serving the interrupts. The reason is interrupts are disabled at that time.

Looks like mwifiex_pcie_disable_host_int() call in mwifiex_interrupt_status() didn't take an effect for some reason.

Can you confirm by adding debug message if you receive an interrupt when it's disabled?

The fix from Jeremy is ok and doesn't have any side-effects. It removes the code which was added for improving the performance.

I am curious to know why interrupt is received even after disabling it which caused a race condition.

Created attachment 216101
Log of failing session

The attached log should reflect the wireless next tree, with the patch I provided applied. (Except that I've left the bottom part of the while loop enabled, so as to reproduce the failure).

I believe the problem case starts at time index 30.893034.

(I had instrumented the 0 byte command failure; so I generally search for '0 byte cmd' and then look backwards to hunt for the reason we got a 0 byte command).

Again, my sense of the events is:
  30.893034 Event (0x00000008) interrupt arrives, we disable
  30.893300 We detect a command response, reloop, and start processing
  30.893300 Command (0x00000004) interrupt arrives, we stash it into
            adapter->int_status
  30.893314 We re-enable (but too late; damage is done)
  30.893374 We 'pop' the already processed command from adapter->int_status,
            and think it's a 0 byte cmd response

Created attachment 216161
attachment-8673-0.html

I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
After some time of usage I am unable to turn of wifi or change the AP.
The dmesg log:

[ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432
[ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432

2016-05-12 16:31 GMT+03:00 <email address hidden>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #48 from Jeremy White <email address hidden> ---
> Created attachment 216101
> --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
> Log of failing session
>
> The attached log should reflect the wireless next tree, with the patch I
> provided applied. (Except that I've left the bottom part of the while loop
> enabled, so as to reproduce the failure).
>
> I believe the problem case starts at time index 30.893034.
>
> (I had instrumented the 0 byte command failure; so I generally search for
> '0
> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
> command).
>
> Again, my sense of the events is:
> 30.893034 Event (0x00000008) interrupt arrives, we disable
> 30.893300 We detect a command response, reloop, and start processing
> 30.893300 Command (0x00000004) interrupt arrives, we stash it into
> adapter->int_status
> 30.893314 We re-enable (but too late; damage is done)
> 30.893374 We 'pop' the already processed command from
> adapter->int_status,
> and think it's a 0 byte cmd response
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Created attachment 216261
attachment-11994-0.html

Well, I have tested it for 3 days and it seems to work fine

2016-05-12 20:08 GMT+03:00 Anton Anikin <email address hidden>:

> I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
> After some time of usage I am unable to turn of wifi or change the AP.
> The dmesg log:
>
> [ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
> [ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
>
>
> 2016-05-12 16:31 GMT+03:00 <email address hidden>:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>>
>> --- Comment #48 from Jeremy White <email address hidden> ---
>> Created attachment 216101
>> --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
>> Log of failing session
>>
>> The attached log should reflect the wireless next tree, with the patch I
>> provided applied. (Except that I've left the bottom part of the while
>> loop
>> enabled, so as to reproduce the failure).
>>
>> I believe the problem case starts at time index 30.893034.
>>
>> (I had instrumented the 0 byte command failure; so I generally search for
>> '0
>> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
>> command).
>>
>> Again, my sense of the events is:
>> 30.893034 Event (0x00000008) interrupt arrives, we disable
>> 30.893300 We detect a command response, reloop, and start processing
>> 30.893300 Command (0x00000004) interrupt arrives, we stash it into
>> adapter->int_status
>> 30.893314 We re-enable (but too late; damage is done)
>> 30.893374 We 'pop' the already processed command from
>> adapter->int_status,
>> and think it's a 0 byte cmd response
>>
>> --
>> You are receiving this mail because:
>> You are on the CC list for the bug.
>>
>
>
>
>
>

I have been running a version of the patch with the power management changes removed (i.e. power management turned on) to see if it made a difference. I had several instances of unusable wifi (until next reboot) in 2 days. So disabling power management definitely seems to be part of the solution.

Created attachment 218101
Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)

I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I use the package linux-surfacepro3 4.6-1 (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error message in dmesg which didn't appear in previous versions of the package.

I don't know if this error message could be of interest.

(In reply to David López from comment #52)
> Created attachment 218101 [details]
> Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)
>
> I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I
> use the package linux-surfacepro3 4.6-1
> (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes
> kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error
> message in dmesg which didn't appear in previous versions of the package.
>
> I don't know if this error message could be of interest.

That's not related or (as far as I can tell) especially harmful. If you look at the referenced line of code (net/wireless/core.c, line 363) you see it's a message generated because we have a set_tx_power function without a get_tx_power function. (The second message is the same; but for get_antenna). I suspect you're seeing it because you've updated your driver and/or firmware.

There is some new firmware available from Marvell [1]. So far, I've had no issues with it, but I have only tried it for a few days. Perhaps this new firmware makes any patch unnecessary?

I've actually managed to dig up some release notes in ChromiumOS's commit logs that suggest the newer firmware may help with stability [2].

[1] http://git.marvell.com/?p=mwifiex-firmware.git;a=tree;f=mrvl;hb=HEAD
[2] https://chromium.googlesource.com/chromiumos/third_party/marvell/

The Marvell firmware from link [1] in (previous) comment #54 does not help avoiding the random disconnects on my Surface Pro 3 with openSUSE Tumbleweed kernel 4.6.3. Not sure if this kernel carries any relevant patches compared to upstream.

I have just submitted patches to resolve original issue reported in this bug. i.e. below error messages followed by command timeout.

[ 11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp

https://patchwork.kernel.org/patch/9232091/
https://patchwork.kernel.org/patch/9232093/

(In reply to Amitkumar Karwar from comment #56)
> I have just submitted patches to resolve original issue reported in this
> bug. i.e. below error messages followed by command timeout.
>
> [ 11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
> [ 11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
>
> https://patchwork.kernel.org/patch/9232091/
> https://patchwork.kernel.org/patch/9232093/

It seems more stable but the 'CMD_RESP' problem persists on my system (Surface Pro 3 running Linux 4.7.0).

    7.720645] mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 7.808716] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p77)
[ 7.808721] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.7.p77)
[ 7.812614] mwifiex_pcie 0000:01:00.0 eth0: renamed from mlan0
[ 9.696401] mwifiex_pcie 0000:01:00.0: info: trying to associate to 'n0wifi' bssid 4c:e6:76:71:25:1b
[ 9.726026] mwifiex_pcie 0000:01:00.0: info: associated to bssid 4c:e6:76:71:25:1b successfully
[ 1145.239295] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1146.012946] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 1624.849858] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1625.236366] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1626.009534] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 3988.474123] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 4c:e6:76:71:25:1b: reason code 3
[ 3988.474151] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:00:00:00:00:00: reason code 3

Created attachment 227261
Firmware in bad state after sleep on Surface pro 3

I got the previous errors (firmware in bad state, 'CMD_RESP') after sleep with the new patch. I tried to reset the pci card/rmmod and modprobe the module again without success.

I'm now running a 4.7 kernel which has Amitkumar's patches from 2016-07-15 included. I get errors like this:

Sep 25 11:22:14 hat kernel: [ 1058.526154] mwifiex_pcie 0000:02:00.0: CMD_RESP: cmd 0x10f error, result=0x2

and the driver crashes after a couple of hours.

With Jeremy White's patch from 2016-05-11 above, the driver is completely stable.

To clarify my last comment:

With JW:s patch, I *still* get the CMD_RESP error (which I assume is harmless) but the driver doesn't crash.

Could this have something to do with the following change?:

--- a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c
+++ a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c
@@ -2237,7 +2237,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
   if (ret)
    return -1;

- if (priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
+ if (0 && priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
    /* Enable IEEE PS by default */
    priv->adapter->ps_mode = MWIFIEX_802_11_POWER_MODE_PSP;
    ret = mwifiex_send_cmd(priv,
@@ -2300,7 +2300,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
  if (ret)
   return -1;

- if (!disable_auto_ds &&
+ if (0 && !disable_auto_ds &&
      first_sta && priv->adapter->iface_type != MWIFIEX_USB &&
      priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
   /* Enable auto deep sleep */

This is the behavior for 4.8.0 (without any patches):

The driver works for 25 s. Then I get the message:

Oct 6 09:34:15 hat wpa_supplicant[1959]: wlp2s0: CTRL-EVENT-SCAN-FAILED ret=-16 retry=1

every second.

If I shut down network-manager and reload the module, it again works for 25 s, followed by the above message. I also cannot scan for any wireless networks when the message starts, although network communication works.

No sign, yet, of the CMD_RESP error which I've reported about above for previous kernels.

Should I, perhaps, report this as a new bug?

I can now report that I've tested the kernel from Linus master branch. It seems like the patches by Amitkumar Karwar from July which went into the mainline kernel yesterday finally makes the mwifiex driver stable on MS Surface Pro 4, so I suggest that this bug report can now be closed.

Mikael,

What version of both the kernel and mwifiex are you using.

I am using mwifiex 1.0 (15.68.7.p77)
and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)

if the console ever goes into a suspend state, I end up with what you were seeing: "PREP_CMD: FW in reset state"

Thanks,
Kevin

I can confirm that with newest firmware and kernel 4.10, the connection is stable on a SP3. I do get constant 5% CPU usage in a kworker/uN:N process caused by the mwifiex driver, though. Not sure if this is expected, or worth a separate bug report.

With kernel 4.11.0, the issue is back.

[ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to '<routername>' bssid <ID>
[ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID> successfully
[ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
[ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Created attachment 256641
no-power-save

Dear Kevin,

Sorry for taking so long to respond.

I have got this error in many different kernels. I still get it in 4.11.0.
But I also know what causes it: The current linux driver doesn't handle
power management properly. So, it is necessary to insert if-up rules which
switches off power management. In my case, where the interface is governed
by NetworkManager, I wrote a file
/etc/NetworkManager/dispatcher.d/no-power-save which I attach to this email.

With this fix, my mwifiex driver is now fully stable under kernel 4.11.0.

Best regards,
Mikael

On Fri, Feb 17, 2017 at 6:24 PM, <email address hidden>
wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> Kevin (<email address hidden>) changed:
>
> What |Removed |Added
> ------------------------------------------------------------
> ----------------
> CC| |<email address hidden>
>
> --- Comment #63 from Kevin (<email address hidden>) ---
> Mikael,
>
> What version of both the kernel and mwifiex are you using.
>
> I am using mwifiex 1.0 (15.68.7.p77)
> and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)
>
> if the console ever goes into a suspend state, I end up with what you were
> seeing: "PREP_CMD: FW in reset state"
>
> Thanks,
> Kevin
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>

Hi Christoph

(In reply to Christoph from comment #65)
> With kernel 4.11.0, the issue is back.
>
> [ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Can you please share the firmware dump after the above issue occur.

1. To manually trigger the dump you can use below command:

cat /sys/kernel/debug/mwifiex/mlan0/device_dump

2. When the dump is complete dmesg will log "mwifiex firmware dump end".

3. Collect the dump with below script:
-------------------------------------------------------------
#!/bin/bash
/sbin/ethtool --set-dump mlan0 0
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/ITCM.log

/sbin/ethtool --set-dump mlan0 1
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/DTCM.log

/sbin/ethtool --set-dump mlan0 2
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/SQRAM.log

/sbin/ethtool --set-dump mlan0 3
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/IRAM.log
-------------------------------------------------------------

4. Share the /tmp/*.log files collected above for further debugging.

5. Also, please share the dmesg logs generated after you have run the above command.

Thanks,
Ganapathi

82 comments hidden view all 101 comments
Hicks (hicks1gb) wrote :
Hicks (hicks1gb) wrote :

Tested with "wicd" too.
Wicd crashes when wifi goes down

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Hicks (hicks1gb) wrote :
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete

El 14/11/17 a las 17:54, Joseph Salisbury escribió:
> Did this issue start happening after an update/upgrade? Was there a
> prior kernel version where you were not having this particular problem?
No, always
>
> Would it be possible for you to test the latest upstream kernel? Refer
> to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
> v4.14 kernel[0].

Testing...
>
> If this bug is fixed in the mainline kernel, please add the following
> tag 'kernel-fixed-upstream'.
>
> If the mainline kernel does not fix this bug, please add the tag:
> 'kernel-bug-exists-upstream'.
>
> Once testing of the upstream kernel is complete, please mark this bug as
> "Confirmed".
>
>
> Thanks in advance.
>
> [0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14
>
>
> ** Changed in: linux (Ubuntu)
> Importance: Undecided => Medium
>
> ** Changed in: linux (Ubuntu)
> Status: Confirmed => Incomplete
>

Hicks (hicks1gb) wrote :

At the moment wifi still working... No crashes.

Hicks (hicks1gb) wrote :

More crashes! Problem still here.

Hicks (hicks1gb) wrote :

bug-exists-upstream

Hicks (hicks1gb) wrote :

Let's see, I just tested this patch, which supposedly solves the wifi controller instability, but it doesn't work for me. I'll leave the link for what it's worth.

https://github.com/EiNSTeiN-/mwifiex

Another detail I want to comment on is that every time the wifi is dropped, the surface is semi blocked. Terminal stops responding, I can't run any commands. Ctrl+c does not work either. Wicd also stops responding.
If I try to restart the terminal, it remains "waiting" for processes to close, which never finish closing.
I finally have to force the switch off.

Hicks (hicks1gb) wrote :
Hicks (hicks1gb) wrote :
72 comments hidden view all 101 comments

I see this issue on a newer Surface Pro 2017, with kernel 4.14, firmware 15.68.7.p119, and the symptoms are very familiar. Unfortunately, disabling power save does not seem to be as persistent a cure; it improves matters, but does not resolve the issue.

I've attempted to capture a dump, but I am being told that --set-dump mlan0 0 has no such device, and --set-dump wlp1s0 0 returns no such device.

Hi Jeremy,

Thanks for the intimation.

>>I've attempted to capture a dump, but I am being told that --set-dump mlan0 0
>>has no such device, and --set-dump wlp1s0 0 returns no such device.

Ok. Firmware dump is written into below file:
/sys/devices/virtual/devcoredump/devcd1/data

Can you please check if this file is generated for you. If generated kindly share the same.

Regards,
Ganapathi

Created attachment 260791
dmesg log and data file from firmware dump

Attached is a tarball with a dmesg log and the /sys/devices/virtual/devcoredump/devcd1/data
 file resulting after requesting a firmware dump.

Note that this is a 4.14 kernel, modified for Surface Pro use, but with the mwifiex module identical to the current tip of the 4.14 tree. (This tree to be precise https://github.com/jakeday/linux-surface).

I also get this result with a variety of other kernels; it seems specific to this hardware + firmware, and perhaps to this particular use case. (It is an odd case; a modified Fedora Core 23 stack.)

Nicely, I now get fairly consistent behavior. That is, upon first launch, the wifi stack is very laggy - you'll get ping times of 43, 100, 243, 79, and so on (when you'd expect steady 1.43). That shows up in interactive ssh sessions. Further, after a modest amount of use (usually just a few meg will do it), the wireless will stop working. You also can no longer successfully reboot or poweroff; you get a hang at kernel shutdown time.

The attached came from a session with a clean boot, debug_mask 0xffffffff, and about 3M of the kernel source downloaded before the wifi stopped functioning.

73 comments hidden view all 101 comments
Kai-Heng Feng (kaihengfeng) wrote :

Try the daily mainline kernel, http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

There's a new fix for random MAC scan, worth a try.

Hicks (hicks1gb) wrote :

Testing...

Hicks (hicks1gb) wrote :

At the moment the wifi has not fallen, but after a reboot of the system has been slow enough to start and gives me this error.
I'll keep trying to see if he can hold on.

Hicks (hicks1gb) wrote :

Still doesn't work. The wifi is down again, taking the rest of the system with it.

71 comments hidden view all 101 comments

Any updates on this from Marvell side?

Hi Jeremy,

Sorry for the delay. I checked the logs you shared. Here it is a scan command(0x107) timeout. But the firmware dump did not give any hints for firmware going to bad state. Moreover the scratch registers dumps tell us that the timed out command 0x107 did not reach the firmware at all. We are checking firmware dumps and will share the updates soon.

On the other hand, are you not getting below error:

> With kernel 4.11.0, the issue is back.
>
> [ 7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [ 7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Thanks,
Ganapathi

(In reply to Ganapathi Bhat from comment #72)
> Hi Jeremy,
>
> Sorry for the delay. I checked the logs you shared. Here it is a scan
> command(0x107) timeout. But the firmware dump did not give any hints for
> firmware going to bad state. Moreover the scratch registers dumps tell us
> that the timed out command 0x107 did not reach the firmware at all. We are
> checking firmware dumps and will share the updates soon.

Thanks, I look forward to the update.

>
>
> On the other hand, are you not getting below error:

No, I don't recall seeing that error recently. Note that was Christoph that reported that, not I. I haven't tried with 4.11.

Cheers,

Jeremy

I am seeing that error message on a Surface Pro 2017. The connection appears stable regardless of powersaving but after suspending I get:
[ 361.092570] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed

Followed by an endless loop of:
[ 366.098914] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 366.098926] mwifiex_pcie 0000:01:00.0: scan failed: -1

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
P.I.Julius (pijulius) wrote :
  • dump Edit (1.1 MiB, application/octet-stream)

Hi Guys,

I'm suffering this same problem on Surface Pro 2017 and would really like to find a solution as using Linux is the only option for me to not throw it out through the window as it has a corrupted memory an so can only use Linux which allows me to skip the affected area using memmap.

Unfortunately this problem forces me to reboot around 10 times a day with light usage so it makes it hard to enjoy the tablet.

Googling found this case and would like to attach the core dump from /sys/class/devcoredump maybe it will be helpful, also can duplicate it pretty easily so if help is needed re testing I'm here and finally would also offer a $100 reward for a solution if I may, I know its not that much but as gesture of appreciation hopefully it will be received with kindness.

Thank you guys in advance for the work on this!

Regards,
Julius

Kai-Heng Feng (kaihengfeng) wrote :

Please attach `sudo lspci -vvnn`.

P.I.Julius (pijulius) wrote :

For sure, please see it attached below.

Kai-Heng Feng (kaihengfeng) wrote :

Have you tried kernel parameter `pcie_aspm=off`? There are some PCIe WiFi modules don't like ASPM L0s.

P.I.Julius (pijulius) wrote :

Yes, just tried it and unfortunately no help, same coredump (let me know if you want one attached again) after playing a bit of video from youtube :(

I owned a Surface Book and now own a Surface Book 2. And both were/are afffected by this. Still with kernel 4.20-rc5 the disconnect problem persists after long heavy-usage. Does Ganapathi have some updates yet?

Hi Chatty,

Do you mean you see the error in #Comment74? (after suspending)?

Regards,
Ganapathi

P.I.Julius (pijulius) wrote :

Hi Guys,

Just a note that may be helpful for tracking this down here is what I found out lately.

If in "drivers/net/wireless/marvell/mwifiex/cmdevt.c" file, "mwifiex_cmd_timeout_func(struct timer_list *t)" function I comment out these:

//if (adapter->if_ops.device_dump)
// adapter->if_ops.device_dump(adapter);

//if (adapter->if_ops.card_reset)
// adapter->if_ops.card_reset(adapter);

The wifi will still drop with the timeout command but at least the mwifiex_pcie module wont freeze and so you will be able to restart or even put the pc into sleep and after a resume simply continue where you left off without having to do a hard restart by long pressing power button.

It seems the card_reset(adapter) function is the one freezing the module and so freezing the wifi, so will try to see what's there but if any of you have any idea how we could reset the adapter and even (just for testing why not) always reload the firmware and see if that fixes it as it seems the resume does the same thing and that makes things work again.

Regards,
Julius

Kai-Heng Feng (kaihengfeng) wrote :

What kernel version did you use with `pcie_aspm=off`?
There's bug in older kernel that `pcie_aspm=off` doesn't turn ASPM off.

Please use mainline kernel with that kernel parameter.

P.I.Julius (pijulius) wrote :

Hi Kai-Heng Feng,

I see you wrote me twice with the same suggestion and GOOD SOOOO as I have given it up to easily on the first try but I can say now that GOT IT WORKING!!!

So here is everything I found out and did to make it work on Fedora 29 (Kernel 4.18.11)

First of all of course rebuild the whole kernel with all the patches from here:
https://github.com/jakeday/linux-surface

FYI: it now supports Surface Pro 5 2017 (including battery and buttons once qzed's patch is accepted too)

Added pcie_aspm=off to the kernel (NOT Needed, instead please keep reading on) command but unfortunately that didn't do anything so started digging and realized drivers/pci/pcie/aspm.c doesn't even compiles when building the kernel so there is no code to check this command parameter for the kernel anyway.

Then dug around some more and found I don't even have "/sys/module/pcie_aspm/parameters/policy" file not even the directory "/sys/module/pcie_aspm" so it seems the kernel for whatever reason (as couldn't find PCIE_ASPM config related things in make menuconfig either).

So wanted to make sure ASPM is indeed turned on for the wireles so I did:
lspci -vvv

and found this device:
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless

with the following:
LnkCtl: ASPM L0s L1 Enabled;

Now I started to google around on how to disable ASPM on device basis as it seems these values are set from the firmware itself (if I understood things correctly) and found baybal's script (which enables ASPM on dell devices):
https://gist.github.com/baybal/b499fc5811a7073df0c03ab8da4be904

and instead of enabling I changed it to disable on the ethernet controller and that did the trick!!! Now doing lspci -s 01:00.0 -vvv shows me this instead:
LnkCtl: ASPM Disabled;

and since then I can play a YouTube video in 1440p and it never freezes (and was freezing like in first minute before that) also realized that if I changed to 720p it was dying out later bit did die out eventually but now with ASPM off even 1440p works just fine and plays trough the whole video.

I'm attaching the shell script that I quickly modified from baybal's code, it needs to be run every time on startup (not on resume as far as I could see/tested) and then all should be fine just please make sure you verify that your Marvell network device is indeed at 01:00.0 and if not just edit the file and change this.

So will keep on testing but it seems this fixed the problem and all working just great now so thanks again Kai-Heng Feng for the pointers if you give me the details I'm ready to send you over the small reward that I offered for help/solution and it seems yours was the one so just let me know the PayPal address and it will be on it's way.

Thanks again and hopefully this will be it and all the details/script will be helpful to the others too.

Regards,
Julius

Displaying first 40 and last 40 comments. View all 101 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.