ath9k module causing MIC challenge failures

Bug #580753 reported by Jack Johnson
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Incomplete
High
Manoj Iyer

Bug Description

With an access point configured for WPA (TKIP), boot/associate six ASUS eeePC 1005HA running either UNR 9.10 or UNE 10.04 simultaneously (within a 2-3 second window) and one of them will cause a MIC challenge failure, where, by protocol, everyone associated to the access point will be disconnected for approximately 60 seconds.

While troubleshooting, determined the Xandros 1.8 image that ships on the ASUS eeePC 1005HA uses a different driver altogether. Simultaneously associate six of those while a UNR 9.10 or UNE 10.04 client is associated and the UNR/UNE client will cause a MIC challenge failure, I suspect from having such a tight cycle of group rekeying, based on the logs:

May 13 13:05:44 asd-laptop wpa_supplicant[732]: WPA: Group rekeying completed with 00:1b:63:2b:46:75 [GTK=TKIP]
May 13 13:05:44 asd-laptop NetworkManager: <info> (wlan0): supplicant connection state: completed -> group handshake
May 13 13:05:44 asd-laptop NetworkManager: <info> (wlan0): supplicant connection state: group handshake -> completed
May 13 13:05:46 asd-laptop wpa_supplicant[732]: WPA: Group rekeying completed with 00:1b:63:2b:46:75 [GTK=TKIP]
May 13 13:05:46 asd-laptop NetworkManager: <info> (wlan0): supplicant connection state: completed -> group handshake
May 13 13:05:46 asd-laptop NetworkManager: <info> (wlan0): supplicant connection state: group handshake -> completed
May 13 13:05:46 asd-laptop wpa_supplicant[732]: Michael MIC failure detected

Retested by reassociating the six Xandros clients without the UNR/UNE client and no MIC failure occurs.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-21-generic 2.6.32-21.32
Regression: No
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-21.32-generic 2.6.32.11+drm33.2
Uname: Linux 2.6.32-21-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC269 Analog [ALC269 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC269 Analog [ALC269 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: asd 1577 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf7db8000 irq 22'
   Mixer name : 'Realtek ALC269'
   Components : 'HDA:10ec0269,104383ce,00100004'
   Controls : 12
   Simple ctrls : 7
Date: Fri May 14 14:40:36 2010
HibernationDevice: RESUME=UUID=64a6054c-eac5-4a34-bf57-03c85e705e6c
InstallationMedia: Ubuntu-Netbook 10.04 "Lucid Lynx" - Release i386 (20100429.4)
MachineType: ASUSTeK Computer INC. 1005HA
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-21-generic root=UUID=725ef15d-dd13-4d7f-b968-1b58bf967cb9 ro quiet splash
ProcEnviron:
 LANG=en_US.utf8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
dmi.bios.date: 08/26/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0801
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: 1005HA
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: x.xx
dmi.chassis.asset.tag: 0x00000000
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTeK Computer INC.
dmi.chassis.version: x.x
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0801:bd08/26/2009:svnASUSTeKComputerINC.:pn1005HA:pvrx.x:rvnASUSTeKComputerINC.:rn1005HA:rvrx.xx:cvnASUSTeKComputerINC.:ct10:cvrx.x:
dmi.product.name: 1005HA
dmi.product.version: x.x
dmi.sys.vendor: ASUSTeK Computer INC.

Revision history for this message
Jack Johnson (knapjack) wrote :
Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi Jack,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Jack Johnson (knapjack) wrote :

Confirmed bug exists in kerne 2.6.34-999-generic build date 201005121008.

tags: removed: needs-upstream-testing
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Jack Johnson (knapjack) wrote :

I was reluctant to believe it would affect the outcome, but I can't deny the data. With feedback from ASUS, updated the BIOS to the latest version (1401) and initial tests with kernel build 2.6.34-999-generic are looking good. Will do wider tests tomorrow against UNR 9.10 (30-60 machines) and likely next week against UNE 10.04 to confirm.

Revision history for this message
Jack Johnson (knapjack) wrote :

Tested using 30 machines updated to the latest BIOS version, first with kernel 2.6.31-19-generic (UNR 9.10), then with 2.6.34-999-generic (UNE 10.04), still generating MIC failures.

tags: added: kernel-needs-review kernel-net
Revision history for this message
Jack Johnson (knapjack) wrote :

I have not been able to reproduce the bug with our Broadcom-based devices, but I'm open to ideas. Does anyone believe it's possible the bug is with wpa_supplicant?

Revision history for this message
Jack Johnson (knapjack) wrote :
Download full text (7.1 KiB)

Continuing to test 2.6.34-999-generic. Had a client generate a MIC failure after several hours of continuous connectivity.

Most interesting is cross-correlating the logs between the access point (in this case an Apple AirPort) and the client that caused the MIC failure.

Access point:

May 26 13:10:01 Severity:5 Disassociated with station 1c:4b:d6:a3:82:31
May 26 13:10:01 Severity:5 Rotated TKIP group key.
May 26 13:10:02 Severity:5 Associated with station 1c:4b:d6:a3:82:31
May 26 13:10:02 Severity:5 Installed unicast CCMP key for supplicant 1c:4b:d6:a3:82:31
May 26 13:10:02 Severity:1 MIC integrity error (reported from STA) src_addr=1c:4b:d6:a3:82:31

Client:

May 26 13:10:01 asd-laptop avahi-daemon[637]: last message repeated 2 times
May 26 13:10:01 asd-laptop kernel: [13398.500221] No probe response from AP 00:1b:63:2b:46:75 after 500ms, disconnecting.
May 26 13:10:01 asd-laptop wpa_supplicant[671]: CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys
May 26 13:10:01 asd-laptop kernel: [13398.560196] cfg80211: All devices are disconnected, going to restore regulatory settings
May 26 13:10:01 asd-laptop kernel: [13398.560222] cfg80211: Restoring regulatory settings
May 26 13:10:01 asd-laptop kernel: [13398.560242] cfg80211: Calling CRDA to update world regulatory domain
May 26 13:10:01 asd-laptop NetworkManager: <info> (wlan1): supplicant connection state: completed -> disconnected
May 26 13:10:01 asd-laptop kernel: [13398.571339] cfg80211: World regulatory domain updated:
May 26 13:10:01 asd-laptop kernel: [13398.571354] (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
May 26 13:10:01 asd-laptop kernel: [13398.571370] (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
May 26 13:10:01 asd-laptop kernel: [13398.571384] (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
May 26 13:10:01 asd-laptop kernel: [13398.571397] (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
May 26 13:10:01 asd-laptop kernel: [13398.571411] (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
May 26 13:10:01 asd-laptop kernel: [13398.571424] (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
May 26 13:10:01 asd-laptop NetworkManager: <info> (wlan1): supplicant connection state: disconnected -> scanning
May 26 13:10:02 asd-laptop wpa_supplicant[671]: Trying to associate with 00:1b:63:2b:46:75 (SSID='ittesting' freq=2452 MHz)
May 26 13:10:02 asd-laptop NetworkManager: <info> (wlan1): supplicant connection state: scanning -> associating
May 26 13:10:02 asd-laptop kernel: [13399.798514] wlan1: authenticate with 00:1b:63:2b:46:75 (try 1)
May 26 13:10:02 asd-laptop kernel: [13399.808652] wlan1: authenticated
May 26 13:10:02 asd-laptop kernel: [13399.808712] wlan1: associate with 00:1b:63:2b:46:75 (try 1)
May 26 13:10:02 asd-laptop kernel: [13399.812547] wlan1: RX AssocResp from 00:1b:63:2b:46:75 (capab=0x431 status=0 aid=1)
May 26 13:10:02 asd-laptop kernel: [13399.812559] wlan1: associated
May 26 13:10:02 asd-laptop wpa_supplicant[671]: Associated with 00:1b:63:2b:46:75
May 26 13:10:02 asd-laptop kernel: [13399.822750] cfg80211: Calling CRDA for country: US
May 26 13:10:0...

Read more...

Revision history for this message
Jack Johnson (knapjack) wrote :

Based on some info from the ath9k-devel mailing list ( http://<email address hidden>/msg02724.html ), confirmed that power management is off on the interface.

Revision history for this message
Jack Johnson (knapjack) wrote :

Completed 16-hour test with four clients and no MIC failures. Will do another 16-hour test with nine clients.

Revision history for this message
Jack Johnson (knapjack) wrote :

Test with nine clients generated a MIC failure in 2 hours 37 minutes. Retesting.

Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Jack Johnson (knapjack) wrote :

Retest with nine clients generated a MIC failure in 3 hours 55 minutes.

Revision history for this message
Jouni Malinen (jkmaline) wrote :

Regarding comment #7: The AP did not detect Michael MIC failure in that case at May 26 13:10:02; only one of the associated stations did. As such, this by itself should not trigger TKIP countermeasures. In fact, the AP would not be detecting Michael MIC failures in this kind of setup if all the associated stations were using CCMP as pairwise cipher since the AP would not be receiving any TKIP frames.

Since none of the other associated stations reported the failure and the timing of the error report matches with reassociation, I would assume something goes wrong with TKIP group key configuration at the station that reported the failure. As far as wpa_supplicant is concerned, the actual failure detection event comes from the driver and wpa_supplicant is only forwarding it to the AP.

Are all the failures similar? I.e., are they only reported immediately after a disassociation-reassociation sequence? The AP seems to be configured to rekey the TKIP group key whenever a station disconnects and that may be the key trigger for the behavior here.

Would it be possible to get more verbose wpa_supplicant debug log from a station that reports the Michael MIC failure? I would like to see both the previous association and EAPOL handshakes and the sequence of EAPOL frames and events from wpa_supplicant during the disassociation-reassociation-failure report sequence.

Revision history for this message
Jack Johnson (knapjack) wrote :

Thanks, Jouni. I see you're correct regarding the countermeasures not being triggered in that log.

There are actually two different scenarios that may or may not be related: the MIC failures when multiple stations are associated within a tight window, and the periodic MIC failures when stations are connected for long periods of time (which may increase in frequency with an increase in clients).

The AP log from comment #7 is an example of the second scenario, and looking at last night's run, we're coming up on 15.5 hours with three MIC failures and no countermeasures triggered.

Here's a snippet from the AP log from the first scenario (tight window association):

May 26 16:00:15 Severity:5 Associated with station 1c:4b:d6:a3:75:60
May 26 16:00:15 Severity:5 Installed unicast CCMP key for supplicant 1c:4b:d6:a3:75:60
May 26 16:00:16 Severity:5 Associated with station 1c:4b:d6:55:e8:01
May 26 16:00:16 Severity:5 Installed unicast CCMP key for supplicant 1c:4b:d6:55:e8:01
May 26 16:00:16 Severity:1 MIC integrity error (reported from STA) src_addr=1c:4b:d6:55:e8:01
May 26 16:00:16 Severity:5 Disassociated with station 1c:4b:d6:55:e8:01
May 26 16:00:17 Severity:5 Rotated TKIP group key.
May 26 16:00:20 Severity:5 Deauthenticating with station 1c:4b:d6:55:e8:01 (reserved 2).
May 26 16:00:20 Severity:5 Disassociated with station 1c:4b:d6:55:e8:01
May 26 16:00:23 Severity:5 Associated with station 1c:4b:d6:55:ef:bf
May 26 16:00:23 Severity:5 Installed unicast CCMP key for supplicant 1c:4b:d6:55:ef:bf
May 26 16:00:23 Severity:1 MIC integrity error (reported from STA) src_addr=1c:4b:d6:55:ef:bf
May 26 16:00:23 Severity:1 Start TKIP countermeasures mode.
May 26 16:00:23 Severity:5 Deauthenticating with station 1c:4b:d6:55:ef:bf (reserved 14).
May 26 16:00:23 Severity:5 Deauthenticating with station 1c:4b:d6:a3:75:60 (reserved 14).
May 26 16:00:23 Severity:5 Disassociated with station 1c:4b:d6:55:ef:bf
May 26 16:00:23 Severity:5 Disassociated with station 1c:4b:d6:a3:75:60
May 26 16:00:23 Severity:5 Rotated TKIP group key.
May 26 16:01:23 Severity:1 End TKIP countermeasures mode.

I think especially if you look at 1c:4b:d6:55:ef:bf, you'll see the CCMP pairwise key for the station, but the group key is clearly still TKIP, and as you noted the TKIP group key seems to be rotated whenever a station comes and goes. But the station can still screw up the group key rotation and trigger the TKIP countermeasures. We're seeing this behavior on our Cisco APs as well, but the Apple AirPort is handy for the test.

I believe all the failures are similar and that the variance we've seen has to do only with the number of stations. The population suffers from random disassociation/reassociation (either the no probe response or the MIC failure), which then triggers a group rekey, which increases the odds of another random disassociation/reassociation in another member of the population. Someone with better math than me could probably work out how this plays out as the population increases, but we've seen it's pretty painful in groups of 30.

I'll jump on the wpa_supplicant debug verbosity right now.

Revision history for this message
Jouni Malinen (jkmaline) wrote :

Thanks. Based on the information here so far, this looks likely to be a race condition in either ath9k or mac80211 which would end up in triggering incorrect Michael MIC failure event immediately after the group key is configured (i.e., a broadcast frame being received from the AP at more or less the exact same time when the key is being set).

Quick review of the current implementation did not result in me finding any clear problems in this area. As such, I would like to be able to figure out how to reproduce this without having to use a large number of client devices and to wait for hours.. If that turns out to be too difficult, another option would be in adding some more debugging code into the kernel code to figure out what exactly is the place where this event is generated and under what conditions. Would you be willing to run such a test if needed (i.e., to patch the kernel source code and build your own kernel)?

Revision history for this message
Jack Johnson (knapjack) wrote :

Thanks Jouni.

Absolutely. It's been a while since I've done a kernel patch and compile, but I suspect it's like riding a bike.

I have not had a "natural" MIC failure yet this morning, but I've also been thinking about the most graceful way to generate it. I suspect with maybe three clients, two of which just cycle disassociation/reassociation, I'm betting the third will fail with fifteen minutes or so. Headed to lunch, and will then attempt to tickle the bug.

Revision history for this message
Jack Johnson (knapjack) wrote :

Worked like a charm. Cycled the interface on two stations once every 30 seconds before going to lunch, about 15 MIC failures in the last hour and ten or so.

Station 1c:4b:d6:a3:7a:b4 generated three of them. Syslog from station attached, with log from AP to follow. First failure about 13:32 localtime.

Revision history for this message
Jack Johnson (knapjack) wrote :

Log from Apple AirPort.

Revision history for this message
Jouni Malinen (jkmaline) wrote :

Thanks. For some reason, this issue does not seem to show up on my main laptop, but once I started testing with an older (and quite a bit slower) laptop, I was able to reproduce the issue. I'm now able to reproduce it at will within couple of minutes by using just that single laptop and a continuous stream of broadcast data packets from the AP, so that will hopefully make it much easier to figure out what exactly is causing the bogus Michael MIC failure reports.

Revision history for this message
Jouni Malinen (jkmaline) wrote :

It looks like the issue is in ath9k reporting a one of the received broadcast data frames (likely the first one after the group keys have been set or are in the process of being set) as a decrypted frame even when the data is not really correctly decrypted. mac80211 will then process the frame and verify the Michael MIC value. This validation fails because of bogus data in the frame.

ath9k key configuration and RX processing needs some further research to figure out whether there is a clear issue somewhere or whether some kind of workaround is needed to avoid reporting incorrect frames as decrypted when a frame is received at the same time as the key is set.

tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
Andy Whitcroft (apw)
tags: removed: kernel-candidate
Revision history for this message
Jack Johnson (knapjack) wrote :

Problem seems to be resolved in the 2010-06-17 snapshot of compat-wireless.

http://wireless.kernel.org/download/compat-wireless-2.6/compat-wireless-2010-06-17.tar.bz2

Revision history for this message
Manoj Iyer (manjo) wrote :

Can you please try the package linux-backports-modules-wireless-2.6.32-23-generic and report if that fixes you problem ?

Changed in linux (Ubuntu):
assignee: nobody → Manoj Iyer (manjo)
status: Triaged → Incomplete
Revision history for this message
Jack Johnson (knapjack) wrote :

After a long run with the snapshot, finally getting around to testing the backport (linux-backports-modules-wireless-2.6.32-24-generic now). Will probably take a couple of weeks to complete testing.

Revision history for this message
rew (r-e-wolff) wrote :

FYI,
I'm getting VERY similar errors on my raspberry pi with an
Bus 001 Device 004: ID 0bda:8176 Realtek Semiconductor Corp. RTL8188CUS 802.11n WLAN Adapter
To me this sounds as if the "ath" driver is not at fault, but something else.
The fact that the older laptop DOES suffer this problem and the newer one not, may implicate slower processors, and that in turn correlates nicely with that the rpi has this problem.

Here is my syslog:

Sep 24 08:04:41 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Group rekeying completed with 40:4a:03:c0:bb:cf [GTK=TKIP]
Sep 24 08:16:49 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Group rekeying completed with 40:4a:03:c0:bb:cf [GTK=TKIP]
Sep 24 08:20:36 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Group rekeying completed with 40:4a:03:c0:bb:cf [GTK=TKIP]
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Group rekeying completed with 40:4a:03:c0:bb:cf [GTK=TKIP]
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: Michael MIC failure detected
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Sending EAPOL-Key Request (error=1 pairwise=0 ptk_set=1 len=99)
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: Michael MIC failure detected
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: WPA: Sending EAPOL-Key Request (error=1 pairwise=0 ptk_set=1 len=99)
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: TKIP countermeasures started
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:00:00:00:00:00 reason=0
Sep 24 08:25:46 raspberrypi wpa_supplicant[1042]: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:00:00:00:00:00 reason=0
Sep 24 08:25:47 raspberrypi ifplugd(wlan0)[1034]: Link beat lost.

Revision history for this message
rew (r-e-wolff) wrote :

Oops. Forgot to mention:
Linux version 3.2.27+ (dc4@dc4-arm-01) (gcc version 4.7.2 20120731 (prerelease) (crosstool-NG linaro-1.13.1+bzr2458 - Linaro GCC 2012.08) ) #114 PREEMPT Tue Sep 4 00:15:33 BST 2012

Revision history for this message
Jack Johnson (knapjack) wrote :

Sorry Manoj, I just saw that I never replied with the final results.

We had linux-backports-modules-wireless-2.6.32-24-generic in production on a few thousand devices with the AR9285 chipset through probably December 2012 with no issues.

Thanks again,

-Jack

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.