8086:0085 Intel wifi frequent disconnects

Reported by Pete Graner on 2012-05-29
This bug affects 21 people
Affects              Status   Importance   Assigned to   Milestone
linux (Arch Linux)   New      Undecided    Unassigned
linux (Ubuntu)                High         Unassigned

Bug Description

Several people at Linaro Connect with ThinkPads and Intel wifi are experiencing frequent disconnects and have to kill wpa_supplicant in order to reconnect.

This seems to be related to having multiple APs with the same SSID in close proximity; see this reference:

https://mail.gnome.org/archives/networkmanager-list/2012-February/msg00135.html

I've had this happen on my x220 with both the precise and the "Q" kernel. I've upgraded my box to Q.

Here is a snippet of the log file when it happens:

May 29 10:43:18 zorak wpa_supplicant[1155]: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:00:00:00:00:00 reason=3

ProblemType: Bug
DistroRelease: Ubuntu 12.10
Package: linux-image-3.4.0-3-generic 3.4.0-3.8
ProcVersionSignature: Ubuntu 3.4.0-3.8-generic 3.4.0
Uname: Linux 3.4.0-3-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.25.
ApportVersion: 2.1-0ubuntu1
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: PCH [HDA Intel PCH], device 0: CONEXANT Analog [CONEXANT Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: pgraner 1844 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'PCH'/'HDA Intel PCH at 0xf1520000 irq 47'
   Mixer name : 'Intel CougarPoint HDMI'
   Components : 'HDA:14f1506e,17aa21da,00100002 HDA:80862805,80860101,00100000'
   Controls : 26
   Simple ctrls : 8
Card29.Amixer.info:
 Card hw:29 'ThinkPadEC'/'ThinkPad Console Audio Control at EC reg 0x30, fw unknown'
   Mixer name : 'ThinkPad EC (unknown)'
   Components : ''
   Controls : 1
   Simple ctrls : 1
Card29.Amixer.values:
 Simple mixer control 'Console',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
Date: Tue May 29 18:05:11 2012
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=113c954a-7297-4617-b9e3-a64076beef2c
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
MachineType: LENOVO 4286CTO
ProcEnviron:
 TERM=xterm
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.4.0-3-generic root=UUID=3d880a9e-f17b-4e2e-9499-6273ef2e37fb ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.4.0-3-generic N/A
 linux-backports-modules-3.4.0-3-generic N/A
 linux-firmware 1.80
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
StagingDrivers: mei
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/07/2011
dmi.bios.vendor: LENOVO
dmi.bios.version: 8DET50WW (1.20 )
dmi.board.asset.tag: Not Available
dmi.board.name: 4286CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr8DET50WW(1.20):bd07/07/2011:svnLENOVO:pn4286CTO:pvrThinkPadX220:rvnLENOVO:rn4286CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 4286CTO
dmi.product.version: ThinkPad X220
dmi.sys.vendor: LENOVO

Pete Graner (pgraner) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Joey Stanford (joey) on 2012-05-29
tags: added: linaro
Andy Whitcroft (apw) wrote :

Another possible trigger is the channels that the APs are on. I note from the logs that this machine is using the World regulatory domain, but you are in a locality where other frequencies may be enabled. If the channels listed for access points in the set are above 10 (ish) then you may hit issues. If so you may be able to resolve this using the regulatory framework. Try:

    iw reg set HK

You may have to remove your wifi driver to be able to do this. This may fix active scans for those channels. Let us know if this helps any. Also please include the output of iwlist scanning from when the issue is occurring.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key kernel-key
Seth Forshee (sforshee) wrote :

Andy: iwlwifi implements its own custom world domain, so the world domain it's using is fairly inclusive. The only channels that seem to be missing are ones specific to Japan and the DFS channels in the 5GHz band.

One thing I do see in the logs are some attempts to roam from one AP to another in the ESS, but connecting to the new AP times out. In fact I see a ton of these timeouts trying to connect to APs in the linaroconnect network in general, on different APs at various frequencies.

Joey Stanford (joey) wrote :

I ack the roaming by Seth. We have multiple 802.11n APs with the same SSID here that are WPA2 encrypted but on different channels. Those affected go into a roam loop. When they connect to a non-WPA2 single SSID AP they have no problem. I've seen this on three different Intel setups across multiple brands (so it's not "just a thinkpad thing").

It's also worth noting that it is happening on several non-US laptops as well. We do have APs in use on channel 1, 6, and 11. To test Andy's theory we moved the 11s to 10. However the hotel's wireless has APs up to 13. The APs for reference are a mix of enGenius and Ruckus. In one test area I removed the enGenius and it actually caused more laptops to go into the roaming loop because there were more APs of roughly the same signal strength (as measured via wpa_gui) to choose from.

I didn't save the wireshark capture but if you'd like to see that I'll do another one tomorrow if we have any affected.

On Tue, May 29, 2012 at 02:00:08PM -0000, Joey Stanford wrote:
> It's also worth noting that it is happening on several non-US laptops
> as well. We do have APs in use on channel 1, 6, and 11. To test Andy's
> theory we moved the 11s to 10. However the hotel's wireless has APs up
> to 13.

I don't think it's a regulatory problem. iwlwifi should be allowing
everything in the 2GHz aside from channel 14 until either a country IE
is received from the AP or the user manually sets the regulatory domain.

Joey Stanford (joey) wrote :

Also for confirmation this only happens on Linux (and all those variants affected are Ubuntu). No problems reported on mobile phones, Windows, or Mac.

Indeed doesn't seem to be caused by reg domain, and I'm certain it's not just Ubuntu but just any Linux ;)

So, some things I can see from the start are three different reason codes for disconnects, which confirm that it's due to the number of APs:

reason 3 -- DEAUTH_LEAVING -- put simply, roaming to a different AP.
reason 2 -- PREV_AUTH_NOT_VALID -- previous authentication for the AP roamed *from* can't be reused. It may be on a different channel, etc.
reason 4 -- DISASSOC_DUE_TO_INACTIVITY -- (AFAIUI) the other end doesn't respond.
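
For anyone sifting through their own syslog, here is a minimal sketch that pulls the reason code out of a wpa_supplicant CTRL-EVENT-DISCONNECTED line (like the one in the bug description) and maps it to the names listed above. The function name and the lookup table are only illustrative, and the table covers just the three codes mentioned in this comment.

```python
import re

# Only the three reason codes discussed in this comment; anything else
# is reported as UNKNOWN rather than guessed at.
REASONS = {
    2: "PREV_AUTH_NOT_VALID",
    3: "DEAUTH_LEAVING",
    4: "DISASSOC_DUE_TO_INACTIVITY",
}

DISCONNECT_RE = re.compile(
    r"CTRL-EVENT-DISCONNECTED bssid=(?P<bssid>[0-9a-fA-F:]{17}) "
    r"reason=(?P<reason>\d+)"
)


def parse_disconnect(line):
    """Return (bssid, code, name), or None if the line is not a disconnect event."""
    m = DISCONNECT_RE.search(line)
    if m is None:
        return None
    code = int(m.group("reason"))
    return m.group("bssid"), code, REASONS.get(code, "UNKNOWN")


# The log line from the bug description, joined onto one line.
line = ("May 29 10:43:18 zorak wpa_supplicant[1155]: wlan0: "
        "CTRL-EVENT-DISCONNECTED bssid=00:00:00:00:00:00 reason=3")
print(parse_disconnect(line))
```

Feeding each syslog line through parse_disconnect and tallying the codes should show quickly which of the three cases dominates on a given machine.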

These put together make me believe that there is definitely an issue brought forward by how aggressively NetworkManager tries to stay on the strongest AP. However, there are also ways to mitigate this particular issue: things would likely go a lot smoother if authentication was central (WPA enterprise/RADIUS, or no encryption).

There is most likely also an issue with the iwlwifi driver itself, since we can see an oops message in the WifiSyslog file attached.

Would it be possible to also attach debugging output for wpasupplicant? That can be sent directly to syslog as well, and we now ship a script to help enable wpasupplicant debugging:

sudo python /usr/lib/NetworkManager/debug-helper.py --wpa debug

Thanks!

Seth Forshee (sforshee) wrote :

Joey: Do you know if it only happens with Intel wireless, or is this also being seen with other vendors' wireless chipsets?

Joey Stanford (joey) wrote :

Hi,

@Matthieu Yes I completely forgot there is one Debian laptop here that is affected. Also the problems exhibiting here appear to mostly be reason code 2 and 3. On most of the reason code 2s killing wpa_supplicant was a workaround. If we have the issue again today I'll get a few folks to add the wpa debug to the bug.

@Seth So far it's 100% Intel boards. I haven't seen any realtek boards affected yet. However I need to check because I heard mention just as we were shutting down that it may also be affecting two of the Linaro dev boards from different manufacturers. I'll circle with Matt W. when we start in 3 hours and then update the bug.

Matt Waddel (mwaddel) wrote :

Syslog with --wpa debug from a Samsung Origen platform. Attached whole log, important info at bottom.

Joey Stanford (joey) wrote :

Here is my syslog. My own case is a little different. Mine normally works until I come out of suspend and then it goes through a bunch of retries and eventually I can connect. This happens about 80% of the time. 10% it doesn't connect at all and 10% I have no problems.

Last few minutes of the /var/log/syslog with --debug on for wpa-supplicant. Sony Vaio Z with Intel wifi 5100, US model.

Loïc Minier (lool) wrote :

Joey, your log shows:
May 30 09:48:38 interpol kernel: [43632.669251] cfg80211: Calling CRDA for country: BO
how come you're using Bolivia as your regulatory domain?

Danilo, could you run:
sudo udevadm monitor --environment kernel
in one window, and:
sudo iw reg set HK
in another and see whether this generates some events?

I was getting frequent disconnects this afternoon, realized that I was still using the wrong regulatory domain and ran the above "iw" command; it seems stable since but perhaps I'm lucky.

On Wed, May 30, 2012 at 3:24 PM, Loïc Minier <email address hidden> wrote:
> Joey, your log shows:
> May 30 09:48:38 interpol kernel: [43632.669251] cfg80211: Calling CRDA for country: BO
> how come you're using Bolivia as your regulatory domain?

BO unrestricts the driver and I've done that for testing. My default
was 00 and Pete had set to HK. So I had been using this for testing
to see if it indeed made any difference. And it doesn't.

Matt: The only thing I see that strikes me as odd in your logs is that at some point network manager is being killed by SIGABRT. There's also a lot of futzing around with keys, maybe something to do with temporal keys?

danilo: The only thing I see in your logs is a dhcp failure when trying to roam.

Joey: The probe request failures are something I see a lot in Pete's logs too. My best guess right now is that iwlwifi has some issue that's causing problems in associating, probably exacerbated by the noisy environment. The connection drops when network manager tries to roam but then experiences problems trying to (re?)associate with the new AP. Your errors after suspend may just be another manifestation of the problem.

I'll try to reproduce problems associating using iwlwifi. Of course my environment isn't going to be nearly as noisy as Linaro connect.

James Tunnicliffe (dooferlad) wrote :

Seems like it is my turn today to suffer from wireless problems at Connect. Have attached log. Yesterday I had a couple of problems, but turning off the wireless radio and then back on seemed to help. Today it didn't. Seems to be quite transient - I had no problems earlier this morning in the same location.

Joey Stanford (joey) wrote :

@loic yesterday after testing I reset to HK and it's still the same here.

@seth attached a quick screenshot of the environment where it happens for your reference.

Nicola Scendoni (scendoni) wrote :

Hi all,

the problem seems to me related to this iwlwifi bug:

http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2338

There are also some patches attached.
What do you think about that?

I cannot reproduce the bug "every day", but I have hit it sometimes. When I reproduce it 2-3 times in one day I'll check if the patch solves the problem.

Seth Forshee (sforshee) wrote :

Nicola: Thanks for the link. I see two patches there.

The first one fixes a memory corruption in the wireless device's SRAM that, according to the commit message, would result in the message "Queue 2 stuck for 10000ms" in dmesg. I'm not seeing that in anyone's logs, so I suspect the patch isn't going to help.

The second one sounds more like the association failures seen in some of the logs, but the commit that is identified as introducing the problem (cfg80211: use compare_ether_addr on MAC addresses instead of memcmp) wasn't merged to Linus's tree until 3.5-rc1. No released Ubuntu kernel is using 3.5 yet, so the patch there is not going to help either.

tags: added: precise
summary: - Intel wifi frequent disconnects
+ 8086:0085 Intel wifi frequent disconnects
eschulte (schulte-eric) wrote :

Just want to say that this is also affecting Arch linux users (there have been a couple of reports on IRC).

no longer affects: linux (Arch Linux)
eschulte (schulte-eric) wrote :

For what it's worth, I'm getting "Reason 6" messages for my frequent disconnects.
Here's the output of dmesg, Arch linux 3.4.4-1-ARCH on an x220.

[ 5733.501542] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
[ 5733.536486] cfg80211: Calling CRDA to update world regulatory domain
[ 5734.878213] wlan0: authenticate with 00:1b:63:2d:38:2e
[ 5734.898206] wlan0: send auth to 00:1b:63:2d:38:2e (try 1/3)
[ 5734.904580] wlan0: authenticated
[ 5734.917586] wlan0: associate with 00:1b:63:2d:38:2e (try 1/3)
[ 5734.920720] wlan0: RX AssocResp from 00:1b:63:2d:38:2e (capab=0x431 status=0 aid=3)
[ 5734.920731] wlan0: associated
[ 7546.043076] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
[ 7546.155348] cfg80211: Calling CRDA for country: US
[ 7547.487474] wlan0: authenticate with 00:1b:63:2d:38:2e
[ 7547.507562] wlan0: send auth to 00:1b:63:2d:38:2e (try 1/3)
[ 7547.514132] wlan0: authenticated
[ 7547.526886] wlan0: associate with 00:1b:63:2d:38:2e (try 1/3)
[ 7547.529938] wlan0: RX AssocResp from 00:1b:63:2d:38:2e (capab=0x431 status=0 aid=1)
[ 7547.529949] wlan0: associated
[11130.589374] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
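
In case it helps others compare logs, here is a small sketch that extracts the deauthentication lines from dmesg output in the format above and prints the time between them. The regular expression is written against exactly the lines shown in this comment; other kernels may word the message differently, so treat it as an assumption.

```python
import re

# Matches lines such as:
# [ 5733.501542] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
DEAUTH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] wlan0: deauthenticated from "
    r"(?P<bssid>[0-9a-f:]{17}) \(Reason: (?P<reason>\d+)\)"
)


def deauth_events(dmesg_text):
    """Return a list of (timestamp, bssid, reason) for each deauth line."""
    events = []
    for text_line in dmesg_text.splitlines():
        m = DEAUTH_RE.search(text_line)
        if m:
            events.append(
                (float(m.group("ts")), m.group("bssid"), int(m.group("reason")))
            )
    return events


def gaps(events):
    """Seconds elapsed between consecutive deauth events."""
    times = [ts for ts, _, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# A few lines copied from the dmesg output above.
SAMPLE = """\
[ 5733.501542] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
[ 5734.920731] wlan0: associated
[ 7546.043076] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
[11130.589374] wlan0: deauthenticated from 00:1b:63:2d:38:2e (Reason: 6)
"""

events = deauth_events(SAMPLE)
for gap in gaps(events):
    print("%.0f s (~%.0f min) between deauths" % (gap, gap / 60))
```

For the three deauth lines above the gaps come out at roughly 30 and 60 minutes.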

Joseph Salisbury (jsalisbury) wrote :

Can folks affected by this bug please test the latest Quantal kernel and report back if there is still an issue?

tags: removed: kernel-key
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Seth Forshee (sforshee) wrote :

Just a note on one possible source of these kinds of problems.

The upstream 3.4-rc1 kernel included a commit (7e79a39 iwlwifi: use valid TX/RX antenna from hw_params) that caused problems with the tx power in iwlwifi and could lead to these "authentication to xx:xx:xx:xx:xx:xx timed out" types of messages. This is fixed by "a5fdde2 iwlwifi: fix TX power antenna access" in 3.5-rc3 and in the 3.4.4 stable kernel. This wouldn't affect precise but could be a cause of problems when running quantal.

This problem should be fixed for quantal in 3.5.0-1.1, so if you're testing the quantal kernel be sure you're testing that version or later.

Joseph Salisbury (jsalisbury) wrote :

Can folks affected by this bug please test the latest Quantal kernel and report back if there is still an issue?

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Joey Stanford (joey) wrote :

This one is hard to test because you need to be in an area with a lot of APs, something we only find at conferences and not at home. Hopefully it'll get some air time at Plumbers and LinuxCon next week by someone who was affected.

On Wed, Aug 15, 2012 at 06:28:52PM -0000, Joey Stanford wrote:
> This one is hard to test because you need to be in an area with a lot of
> APs, something we only find at conferences and not at home. Hopefully
> it'll get some air time at Plumbers and LinuxCon next week by someone
> who was affected.

I'll be at plumbers, so I'll be sure to bring some equipment to test
with. Some of the upstream devs should be around too on account of
kernel summit, which could prove useful.

Seth

Changed in linux (Ubuntu):
status: Invalid → Incomplete
madbiologist (me-again) wrote :

I noticed that the original reporter has a:

Network controller [0280]: Intel Corporation Centrino Advanced-N 6205 [8086:0085] (rev 34)
 Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN [8086:1311]

Is bug #1039856 relevant here?

Lorant Nemeth (loci) wrote :

madbiologist:

As the new uc have been pushed back to -updates, it's now running on my laptop and I'm still experiencing the problem. In my case iwlist reports ~15 APs with the same ESSID. We're using WPA2 Enterprise and in my case it's not only dropping off and reconnecting, but I have to disable the wireless adapter via NetworkManager and reconnect it again, to be able to reconnect.
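
To make the "how many APs share one ESSID" check repeatable, here is a rough sketch that counts BSSIDs per ESSID in saved iwlist scanning output. The sample text, the ESSID names and the MAC addresses are all made up for illustration, and the parser only looks at the "Cell NN - Address:" and ESSID lines.

```python
import re
from collections import defaultdict

CELL_RE = re.compile(r"Cell \d+ - Address: ([0-9A-Fa-f:]{17})")
ESSID_RE = re.compile(r'ESSID:"(.*)"')


def bssids_by_essid(scan_text):
    """Group the BSSIDs found in iwlist scan output by their ESSID."""
    result = defaultdict(list)
    current = None  # BSSID of the cell currently being read
    for text_line in scan_text.splitlines():
        cell = CELL_RE.search(text_line)
        if cell:
            current = cell.group(1)
            continue
        essid = ESSID_RE.search(text_line)
        if essid and current is not None:
            result[essid.group(1)].append(current)
            current = None
    return dict(result)


# Hypothetical scan output; none of these values come from this bug.
SAMPLE = """\
wlan0     Scan completed :
          Cell 01 - Address: 02:00:00:00:00:01
                    Channel:1
                    ESSID:"example-net"
          Cell 02 - Address: 02:00:00:00:00:02
                    Channel:6
                    ESSID:"example-net"
          Cell 03 - Address: 02:00:00:00:00:03
                    Channel:11
                    ESSID:"other"
"""

for essid, bssids in bssids_by_essid(SAMPLE).items():
    print("%s: %d AP(s)" % (essid, len(bssids)))
```

Saving the scan to a file (for example with sudo iwlist wlan0 scanning > scan.txt) and passing its contents to bssids_by_essid gives the per-ESSID count without scrolling through the raw output.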

Dror Cohen (drorcohen) wrote :

I'm using 12.10 kernel 3.5.0-17-generic x86_64 and seeing 2 ESSID with the same name. Using channel 6. Bug is still very active. would love to test and help in any way I can

Dror Cohen (drorcohen) wrote :

also tried with 3.5.0-18-generic - still can't connect my wifi

Eryn O'Neil (eryn-oneil) wrote :

I'm also having trouble staying connected on my office wireless, which has a handful of APs set up for seamless roaming throughout the building. A lot of these symptoms and logs are familiar to me.

I reported my issue here: https://bugs.launchpad.net/dell-sputnik/+bug/1091372

I'm away from the office until after the new year, but I'm happy to supply logs or do some debugging when I'm back. We've been banging our heads against this issue for months, so I'm really keen to help get it fixed.

Eryn O'Neil (eryn-oneil) wrote :

So you don't have to click through to my other ticket:

I'm using Ubuntu 12.04 precise kernel 3.2.0-35-generic x86_64

01:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24)
 Subsystem: Intel Corporation Centrino Advanced-N 6235 AGN
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 44
 Region 0: Memory at d0400000 (64-bit, non-prefetchable) [size=8K]
 Capabilities: <access denied>
 Kernel driver in use: iwlwifi
 Kernel modules: iwlwifi

Johannes Strom (jstrom) wrote :

I have a similar issue, with an Intel Ultimate-N chipset on a Lenovo T410 in an enterprise WPA setup with multiple routers. I installed linux-backports-modules-cw-3.6-precise-generic to get the latest drivers. In general this seems to reduce the connection dropping I was experiencing, but it is still a bit quirky. For example, network manager sometimes gets confused and no longer shows a network name even though the network is connected. Also, the connection still occasionally hangs, though I can't be certain this is a driver issue -- could be on the router end.

~$ uname -a
Linux april-103 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

dmesg before the upgrade:
[ 1814.888217] wlan0: direct probe to 0a:0b:0d:0e:68:0c (try 1/3)
[ 1815.084266] wlan0: direct probe to 0a:0b:0d:0e:68:0c (try 2/3)
[ 1815.284159] wlan0: direct probe to 0a:0b:0d:0e:68:0c (try 3/3)
[ 1815.484061] wlan0: direct probe to 0a:0b:0d:0e:68:0c timed out

(Now after the upgrade, dmesg is filled with compat wireless debugging statements, however)

Hope this helps!

Daniel Fernandes (gadgetdevil) wrote :

I am currently having the same problem. I am sitting within close proximity of three access points, and my Intel 6300 Ultimate card is doing a "roaming dance". I am on a Lenovo X230 running the kernel: 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Sometimes the dance succeeds, and I don't drop my IP address, and sometimes the dance fails, and Network Manager gives me the "disconnected" message. How can we lower the roaming aggressiveness threshold? I know this is possible in Windows.

Changed in linux (Ubuntu):
status: Incomplete → Invalid

Just verified that with 11n disabled the connection becomes stable. My wifi card is an "Intel Wireless-N 1030 BGN".

I've added "options iwlwifi 11n_disable=1" to a file /etc/modprobe.d/personal-opts.conf

as recommended here:

https://bugs.launchpad.net/dell-sputnik/+bug/1091372
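
To confirm the option actually took effect after reloading the module or rebooting, something like the sketch below can read the value back from sysfs. It assumes the driver exposes the parameter as /sys/module/iwlwifi/parameters/11n_disable, which may not hold on every kernel; the helper name and its sysfs_root argument are only for illustration.

```python
import os


def read_module_param(param, module="iwlwifi", sysfs_root="/sys/module"):
    """Return the parameter value as a string, or None if it cannot be read."""
    path = os.path.join(sysfs_root, module, "parameters", param)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Module not loaded, parameter not exposed, or not readable.
        return None


value = read_module_param("11n_disable")
if value is None:
    print("iwlwifi not loaded, or 11n_disable not exposed in sysfs")
else:
    print("11n_disable =", value)
```

A value of 1 here would match the line added to /etc/modprobe.d/personal-opts.conf.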
