Cable Ethernet conn. "die" with Atheros Network Card in Ubuntu 13.04

Bug #1175091 reported by Stefano Pecchenino
This bug affects 10 people
Affects                   Status        Importance  Assigned to      Milestone
Fedora                    Invalid       Medium
openSUSE                  Fix Released  Medium
network-manager (Ubuntu)  Fix Released  Undecided   daniel patricio

Bug Description

In Ubuntu 13.04, the wired (cable) Ethernet connection stops working.
The connection works for about 2-3 minutes and then "dies".
Unplugging and replugging the cable sometimes brings the connection back.

I suspect the problem is in the Atheros driver.
In Ubuntu 12.10 the network works perfectly.
The bug has been confirmed by many other users.

The problem seems to occur more often when using a static IP (network configuration type: manual).

Wi-Fi works perfectly (that card is also an Atheros).

My Ethernet card is an Atheros AR8152.

command : lshw -C network

  *-network
       description: Wireless interface
       product: AR9285 Wireless Network Adapter (PCI-Express)
       vendor: Atheros Communications Inc.
       physical id: 0
       bus info: pci@0000:0c:00.0
       logical name: wlan0
       version: 01
       serial: 90:a4:de:4a:7a:28
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list ethernet physical wireless
       configuration: broadcast=yes driver=ath9k driverversion=3.8.0-19-generic firmware=N/A ip=10.173.129.175 latency=0 link=yes multicast=yes wireless=IEEE 802.11bgn
       resources: irq:17 memory:f69f0000-f69fffff
  *-network
       description: Ethernet interface
       product: AR8152 v2.0 Fast Ethernet
       vendor: Qualcomm Atheros
       physical id: 0
       bus info: pci@0000:09:00.0
       logical name: eth0
       version: c1
       serial: 78:2b:cb:f0:56:35
       capacity: 100Mbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=atl1c driverversion=1.0.1.1-NAPI latency=0 link=no multicast=yes port=twisted pair
       resources: irq:47 memory:f68c0000-f68fffff ioport:df00(size=128)

Without a fix for this, 13.04 is unusable.

Thanks
Stefano Pecchenino
<email address hidden>

ProblemType: Bug
DistroRelease: Ubuntu 13.04
Package: network-manager 0.9.8.0-0ubuntu6
ProcVersionSignature: Ubuntu 3.8.0-19.29-generic 3.8.8
Uname: Linux 3.8.0-19-generic i686
ApportVersion: 2.9.2-0ubuntu8
Architecture: i386
CRDA:
 country IT:
  (2402 - 2482 @ 40), (N/A, 20)
  (5170 - 5250 @ 40), (N/A, 20)
  (5250 - 5330 @ 40), (N/A, 20), DFS
  (5490 - 5710 @ 40), (N/A, 27), DFS
Date: Wed May 1 10:23:15 2013
IfupdownConfig:
 # interfaces(5) file used by ifup(8) and ifdown(8)
 auto lo
 iface lo inet loopback
InstallationDate: Installed on 2013-04-30 (0 days ago)
InstallationMedia: Ubuntu 13.04 "Raring Ringtail" - Release i386 (20130424)
IpRoute:
 default via 10.173.128.1 dev wlan0 proto static
 10.173.128.0/20 dev wlan0 proto kernel scope link src 10.173.129.175 metric 9
 169.254.0.0/16 dev wlan0 scope link metric 1000
MarkForUpload: True
NetworkManager.state:
 [main]
 NetworkingEnabled=true
 WirelessEnabled=true
 WWANEnabled=true
 WimaxEnabled=true
ProcEnviron:
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=it_IT.UTF-8
 SHELL=/bin/bash
SourcePackage: network-manager
UpgradeStatus: No upgrade log present (probably fresh install)
nmcli-con:
 NAME UUID TYPE TIMESTAMP TIMESTAMP-REAL AUTOCONNECT READONLY DBUS-PATH
 FASTWEB-PEKKE e24676af-073a-4071-8e25-a3dc1c022421 802-11-wireless 1367396634 mer 01 mag 2013 10:23:54 CEST yes no /org/freedesktop/NetworkManager/Settings/1
 ADSL Ufficio 8db2aff6-e740-4f53-81de-e260b2c883db 802-3-ethernet 0 never no no /org/freedesktop/NetworkManager/Settings/0
nmcli-dev:
 DEVICE TYPE STATE DBUS-PATH
 eth0 802-3-ethernet unavailable /org/freedesktop/NetworkManager/Devices/1
 wlan0 802-11-wireless connected /org/freedesktop/NetworkManager/Devices/0
nmcli-nm:
 RUNNING VERSION STATE NET-ENABLED WIFI-HARDWARE WIFI WWAN-HARDWARE WWAN
 running 0.9.8.0 connected enabled enabled enabled enabled disabled

Revision history for this message
In , Tschaefer (tschaefer) wrote :

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0

Atheros Communications Inc. AR8152 v2.0 Fast Ethernet (rev c1)
on an ASUS Eee PC R11XC; it loads the atl1c module.

It works only for a short time. Then the network stops suddenly, without any message.

Only ifconfig shows that something is wrong:

          RX packets:547 errors:0 dropped:123 overruns:123 frame:123
          TX packets:656 errors:0 dropped:0 overruns:0 carrier:2
          collisions:0 txqueuelen:1000
          RX bytes:314606 (307.2 Kb) TX bytes:104503 (102.0 Kb)

Reproducible: Always

Steps to Reproduce:
Shortly after startup, or shortly after plugging in the Ethernet cable.

Revision history for this message
In , Tschaefer (tschaefer) wrote :

PCI-ID is

1969:2062

Revision history for this message
In , Tschaefer (tschaefer) wrote :

same behavior/problem

with

3.9.0-rc3-next-20130320-1-vanilla #1 SMP Wed Mar 20 07:04:59 UTC 2013 (3e90b55) i686 i686 i386 GNU/Linux

http://download.opensuse.org/repositories/Kernel:/linux-next/standard/

Revision history for this message
In , Kast (b-m-kast) wrote :

I have the same problem with AR8152 v2.0 Fast Ethernet (rev c1) (same PCI ID) on Lenovo IdeaPad G570 laptop.

Tested on kernels:

* 3.7.10-1.1-default (openSUSE 12.3 x86 Live KDE) - fails
* 3.8.6 [can't remember exact version] (Fedora 18 x86) - fails
* 3.7.x (Fedora 17 x86) - fails
* 3.6.11 (Fedora 17 x86) - WORKS

So it seems that something bad happened between 3.6.x and 3.7.x.
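The test matrix above can be turned into a quick check of the regression window. This is a minimal sketch in Python using only the versions listed in this comment; the helper names are made up for illustration, and the entry reported only as "3.7.x" is treated as 3.7.0, which is an assumption.

```python
# Results copied from the list above: (kernel version, worked?)
RESULTS = [
    ("3.7.10", False),
    ("3.8.6", False),
    ("3.7.0", False),   # reported only as "3.7.x"; 3.7.0 is an assumption
    ("3.6.11", True),
]


def version_key(version):
    """Turn '3.6.11' into (3, 6, 11) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (newest working version, oldest failing version)."""
    good = [v for v, ok in results if ok]
    bad = [v for v, ok in results if not ok]
    return max(good, key=version_key), min(bad, key=version_key)


last_good, first_bad = regression_window(RESULTS)
print("last good:", last_good, "first bad:", first_bad)
```

Comparing tuples of integers rather than raw strings matters here: as strings, "3.7.10" would sort before "3.7.9".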

Also, there are Atheroses (is that a correct word?) completely unaffected by this bug, all of them using atl1c module:

* AR8151 v2.0 Gigabit Ethernet (rev c0) [1969:1083], non-branded desktop PC
* AR8132 Fast Ethernet (rev c0) [can't remember PCI ID], ASUS UL20N laptop

Here's a (seemingly similar) bug in Red Hat bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=901811

Some people there reported problems with non-Atheros wired cards as well as wireless ones, and some symptoms persisted after downgrading the kernel to 3.6.x, but it is possible that those problems are caused by different bugs.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
Revision history for this message
Kast (b-m-kast) wrote :

I have the same problem with exactly the same network card (tested with 13.04 i386 live DVD).

This problem seems to appear during transition from kernel 3.6.x to 3.7. It does not seem to be distribution dependent because the same bug affects at least Fedora (https://bugzilla.redhat.com/show_bug.cgi?id=901811) and openSUSE (https://bugzilla.novell.com/show_bug.cgi?id=812116).

Changed in fedora:
importance: Undecided → Unknown
status: New → Unknown
Changed in opensuse:
importance: Undecided → Unknown
status: New → Unknown
Revision history for this message
Kast (b-m-kast) wrote :

The bug affects amd64 kernels too.

$ uname -a
Linux ubuntu 3.8.0-19-generic #29-Ubuntu SMP Wed Apr 17 18:16:28 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Changed in opensuse:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
Kast (b-m-kast) wrote :

Seems to be fixed in kernel 3.9.2. I'll post more info if this problem shows up again.

Revision history for this message
In , Kast (b-m-kast) wrote :

The bug seems to be fixed in kernel 3.9.2 (here: http://download.opensuse.org/repositories/Kernel:/stable/standard/ )

Revision history for this message
In , Tschaefer (tschaefer) wrote :

The Problem is still there, also in Kernel 3.9.2-1.g04040b9-desktop

Revision history for this message
In , Kast (b-m-kast) wrote :

Yep, false alarm, sorry. The bug is not fixed yet.

It's interesting that after a clean reboot (NOT a hibernate/resume cycle) the network adapter can work indefinitely as long as it is used constantly (I ran a flood ping from another machine in my home network while browsing the Web, and it worked for 3 hours non-stop). It may be related to some kind of power saving / device suspension state.

As far as I know, problems start once a message like this appears in dmesg:

atl1c 0000:07:00.0: irq 46 for MSI/MSI-X

Revision history for this message
In , Kast (b-m-kast) wrote :

> As far as I know, problems start once a message like this appears in dmesg:

> atl1c 0000:07:00.0: irq 46 for MSI/MSI-X

Sorry, not like this. I mean this message appears when I try to 'revive' the device by unplugging the cable temporarily.

Revision history for this message
Stefano Pecchenino (pekke) wrote :

Kernel 3.9.2 Generic (the latest kernel for Ubuntu) doesn't solve the problem.
I went back to kernel 3.8.0.22 (the latest official kernel for Ubuntu 13.04).

Revision history for this message
Stefano Pecchenino (pekke) wrote :

Why is this bug unassigned?
Is anyone on the Ubuntu staff working on this?

information type: Public → Public Security
Revision history for this message
Peter (peter-weiss) wrote :

From https://bugzilla.novell.com/show_bug.cgi?id=812116#add_comment

    [...]
    Also, there are Atheroses (is that a correct word?) completely
    unaffected by this bug, all of them using atl1c module:
    [...]

This is not true. This bug is also seen on a Sony Vaio notebook with

~:1> sudo lspci -vs 05:00.0
05:00.0 Ethernet controller: Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet (rev c0)
        Subsystem: Sony Corporation Device 9081
        Flags: bus master, fast devsel, latency 0, IRQ 51
        Memory at f5400000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at 9000 [size=128]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [58] Express Endpoint, MSI 00
        Capabilities: [6c] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [180] Device Serial Number ff-d9-ea-04-f0-bf-97-ff
        Kernel driver in use: atl1c

~:1> ifconfig eth0
eth0 Link encap:Ethernet HWaddr f0:bf:97:d9:ea:04
          inet addr:192.168.92.29 Bcast:192.168.92.255 Mask:255.255.255.0
          inet6 addr: fe80::f2bf:97ff:fed9:ea04/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:36766 errors:0 dropped:859 overruns:859 frame:859
          TX packets:30358 errors:0 dropped:0 overruns:0 carrier:12
          collisions:0 txqueuelen:1000
          RX bytes:38925673 (38.9 MB) TX bytes:4821993 (4.8 MB)

~:1>

Peter

Revision history for this message
Kast (b-m-kast) wrote :

But there *are* models driven by atl1c which do not have this problem. AR8132 is an example (PCI ID 1043:14e5). It works for hours with no troubles.

Revision history for this message
cllee 李嘉陵 (lchialing) wrote :

T

Revision history for this message
cllee 李嘉陵 (lchialing) wrote :

Sorry about the incomplete post previously. I'd like to confirm that the same thing happened to me. It is clearly a problem with the NIC driver.

- The problem does NOT appear in 12.04.
- When the NIC died, other computers on the network were still connecting to the Internet fine.
- Unplug the cable, plug it back in, and the network comes up and you can resume the download, but the connection dies again soon after.

Wifi would work fine, but of course it is much slower. I use the network cable when I am downloading large files. As a result, nightly apt-get updates now are over Wifi.

Revision history for this message
Arie Skliarouk (skliarie) wrote :

Same problem on Lenovo g570:
# uname -a
Linux cmlap21 3.8.0-25-generic #37-Ubuntu SMP Thu Jun 6 20:47:07 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.04
Release: 13.04
Codename: raring

The network worked perfectly under ubuntu 12.10.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

As per comment#6, it seems that the network interface fails after an s3/s4 resume.
What is the output of "ethtool -i $INTERFACE"? Could you please provide the /var/log/pm-*.log files and a "dmesg" log that covers the network failure? Thanks!

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Tested again with

3.7.10-1.16-desktop #1 SMP PREEMPT Fri May 31 20:21:23 UTC 2013 (97c14ba) i686 i686 i386 GNU/Linux

dmesg after plugging in the cable:

[ 138.411153] atl1c 0000:04:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
[ 138.411269] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

It works for a few minutes. Then it stops, and then it starts working again.

dmesg now shows:

[ 138.411153] atl1c 0000:04:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
[ 138.411269] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 376.440058] atl1c 0000:04:00.0: irq 47 for MSI/MSI-X
[ 376.440261] atl1c 0000:04:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>

I did not unplug or replug the cable!

The requested ethtool output:

driver: atl1c
version: 1.0.1.0-NAPI
firmware-version:
bus-info: 0000:04:00.0
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

ifconfig output:

eth0 Link encap:Ethernet HWaddr 10:BF:48:A1:EA:D7
          inet addr:xxx.187.148.158 Bcast:xxx.187.148.255 Mask:255.255.255.128
          inet6 addr: fe80::12bf:48ff:fea1:ead7/64 Scope:Link
          inet6 addr: 2xxx:4ca0:4f01:1:12bf:48ff:fea1:ead7/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:16936 errors:0 dropped:639 overruns:639 frame:639
          TX packets:6652 errors:0 dropped:0 overruns:0 carrier:1
          collisions:0 txqueuelen:1000
          RX bytes:24172954 (23.0 Mb) TX bytes:567353 (554.0 Kb)

ip -s link output

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 10:bf:48:a1:ea:d7 brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    51485511 38177 0 0 639 186
    TX: bytes packets errors dropped carrier collsns
    1264645 13891 0 0 1 0

I did not suspend the netbook.

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=546876)
one of the wanted pm-files

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=546877)
the second pm-file

Revision history for this message
In , Tschaefer (tschaefer) wrote :

still same with

Linux eeepc.site 3.10.0-1.g3dcd746-desktop #1 SMP PREEMPT Mon Jul 1 13:38:11 UTC 2013 (3dcd746) i686 i686 i386 GNU/Linux

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Could you attach the full "dmesg" log, please? Thanks!

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=546951)
dmesg boot, before plug in the ethernet cable

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=546953)
dmesg boot (complete), after plugin the ethernet cable (plug in only once, no physical disconnect here)

Revision history for this message
In , Dchang-s (dchang-s) wrote :

The pm log looks normal. However, there are many dropped packets and overruns in the RX path.

RX packets:547 errors:0 dropped:123 overruns:123 frame:123
RX packets:16936 errors:0 dropped:639 overruns:639 frame:639

Packets were dropped probably because of low memory, and receiver overruns usually occur when packets come in faster than the kernel can service the last interrupt.
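To compare counters like these across reports, the RX line can be parsed mechanically. The following is only a sketch: it assumes the old net-tools `ifconfig` layout quoted above, and the function names are illustrative rather than part of any tool mentioned in this bug.

```python
import re

# Matches the RX summary line printed by old-style (net-tools) ifconfig.
RX_LINE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)


def parse_rx(line):
    """Extract the RX counters from one ifconfig line as a dict of ints."""
    match = RX_LINE.search(line)
    if match is None:
        raise ValueError("not an ifconfig RX line: %r" % line)
    names = ("packets", "errors", "dropped", "overruns", "frame")
    return dict(zip(names, map(int, match.groups())))


def overrun_ratio(counters):
    """Overruns as a fraction of received packets."""
    return counters["overruns"] / counters["packets"]


# The two lines quoted in this comment.
samples = [
    "RX packets:547 errors:0 dropped:123 overruns:123 frame:123",
    "RX packets:16936 errors:0 dropped:639 overruns:639 frame:639",
]
for line in samples:
    rx = parse_rx(line)
    print(rx["packets"], rx["overruns"], round(overrun_ratio(rx), 3))
```

In both samples the dropped, overruns and frame counters are identical, which suggests a single underlying event is being counted three times rather than three independent failure modes.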

Could you post more statistics of the ethernet:
# ip -s -s link show eth0; ethtool -S eth0; ifconfig eth0

and "cat /proc/interrupts", thank you!

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Please also provide the output of "lspci -nvvv -s 4:0.0", it would be good to get the output after the network fail. Thanks!

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=547372)
output of ip -s link; ethtool -S ; ifconfig; lspci -nvvv -s and proc/interrupts

Revision history for this message
In , Tschaefer (tschaefer) wrote :

lspci -nvvv -s 4:0.0

causes a new error message in dmesg:

[ 622.585019] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
[ 659.267384] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
[ 701.525868] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
[ 1225.302013] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
[ 1230.786970] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

(In reply to comment #19)
> lspci -nvvv -s 4:0.0
>
> causes a new error massage at dmesg:
>
>
> [ 622.585019] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware
> bug on this device. Contact the card vendor for a firmware update.
> [ 659.267384] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware
> bug on this device. Contact the card vendor for a firmware update.
> [ 701.525868] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware
> bug on this device. Contact the card vendor for a firmware update.
> [ 1225.302013] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware
> bug on this device. Contact the card vendor for a firmware update.
> [ 1230.786970] atl1c 0000:04:00.0: vpd r/w failed. This is likely a firmware
> bug on this device. Contact the card vendor for a firmware update.

I think we can ignore this message for now.

Could you please try passing the kernel parameter "pcie_aspm=off"?

Revision history for this message
In , Tschaefer (tschaefer) wrote :

cat /proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-3.7.10-1.16-desktop root=UUID=c223292b-e338-4a62-90b7-58181d074d51 resume=/dev/disk/by-id/ata-Samsung_SSD_840_Series_S19HNEAD207957J-part1 splash=silent quiet showopts pcie_aspm=off

The problem is still there.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Thanks for your feedback.

There is already an upstream issue that matches this one: https://bugzilla.kernel.org/show_bug.cgi?id=54021
I will update the status once I have further findings, thanks!

penalvch (penalvch)
tags: added: needs-kernel-logs
Revision history for this message
Timmie (timmie) wrote :

Hello, how do I send you a kernel log?

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Hi,

I created a kmp based on the v3.6 vanilla kernel. Could you please give it a try? You can download the package from:
http://download.opensuse.org/repositories/home:/david_chang:/bnc812116_atl1c/openSUSE_12.3/i586/atl1c-kmp-desktop-v3.6_1.0.1.0_NAPI_k3.7.10_1.1-1.1.i586.rpm
Thank you!

Revision history for this message
In , Tschaefer (tschaefer) wrote :

I installed the package. And I rebooted my system.

The problem is still there.

modinfo atl1c
filename: /lib/modules/3.7.10-1.16-desktop/weak-updates/updates/atl1c.ko
version: 1.0.1.0-NAPI
license: GPL
description: Qualcom Atheros 100/1000M Ethernet Network Driver
author: Qualcomm Atheros Inc., <email address hidden>
author: Jie Yang
srcversion: D7DE85CCFA0BF493AEB7839
alias: pci:v00001969d00001083sv*sd*bc*sc*i*
alias: pci:v00001969d00001073sv*sd*bc*sc*i*
alias: pci:v00001969d00002062sv*sd*bc*sc*i*
alias: pci:v00001969d00002060sv*sd*bc*sc*i*
alias: pci:v00001969d00001062sv*sd*bc*sc*i*
alias: pci:v00001969d00001063sv*sd*bc*sc*i*
depends:
vermagic: 3.7.10-1.1-desktop SMP preempt mod_unload modversions 686

ll /lib/modules/3.7.10-1.16-desktop/weak-updates/updates/atl1c.ko
lrwxrwxrwx 1 root root 48 25. Jul 13:45 /lib/modules/3.7.10-1.16-desktop/weak-updates/updates/atl1c.ko -> /lib/modules/3.7.10-1.1-desktop/updates/atl1c.ko

ll /lib/modules/3.7.10-1.1-desktop/updates/atl1c.ko
-rw-r--r-- 1 root root 921822 25. Jul 11:40 /lib/modules/3.7.10-1.1-desktop/updates/atl1c.ko

So I think the rpm/installation was right.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Yes! I think your rpm installation is correct! Thanks!

It looks like the v3.6 driver does not change the behaviour. However, I don't think testing the driver alone is enough; maybe we should test by changing the whole kernel, since it may contain other fixes in networking or other related subsystems.

Actually, there is already a bisection result that may be related to the issue (from: https://bugzilla.kernel.org/show_bug.cgi?id=54021#c14)

commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
Author: Eric Dumazet <email address hidden>
Date: Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag

    We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

    - Better filling of space (the ending hole overhead is less an issue)

    - Less calls to page allocator or accesses to page->_count

    - Could allow struct skb_shared_info futures changes without major
      performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

So it may work with a v3.6 kernel, since the commit (69b08f6 net: use bigger pages in __netdev_all) was introduced in v3.7-rc1. And it looks like the hardware has a problem with some memory ranges?

There is some progress upstream: http://marc.info/?t=137485734400001&r=1&w=2
I've backported the patch. Please help test whether it works.

http://download.opensuse.org/repositories/home:/david_chang:/bnc812116_atl1c/openSUSE_12.3/i586/atl1c-kmp-desktop-1.0.1.0_NAPI_k3.7.10_1.1-1.1.i586.rpm

Thank you!

Revision history for this message
In , Tschaefer (tschaefer) wrote :

The patch from comment#26 seems to solve the problem.

Of course one negative test is not very reliable.

Revision history for this message
In , Dchang-s (dchang-s) wrote :

Thank you for your report and testing!

I will put the fix into openSUSE 12.3 kernel.

Revision history for this message
In , Bpoirier (bpoirier) wrote :

7b70176 atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring
 [PATCH] atl1c: use custom skb allocator

This is somewhat speculative.
Since "69b08f6 net: use bigger pages in __netdev_alloc_frag
(v3.7-rc1)" skbs allocated via netdev_alloc_skb() with len roughly < PAGE_SIZE
can have a head that crosses page boundaries. The theory is that the hardware
doesn't support this.

However, as pointed out by Eric Dumazet, the mtu can be more than 4k.
(MAX_JUMBO_FRAME_SIZE = 6122) Does it work then?

According to my math, an mtu >= 1643 will lead to rx_frag_size
>= 4096.
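For readers who want to see where a figure like 1643 can come from, here is a reconstruction in Python. It is a sketch under stated assumptions, not necessarily the calculation the author did: every constant below (header overhead, NET_SKB_PAD, cache line size, the size of struct skb_shared_info, the default buffer length) is an assumed value for an x86 kernel of that era, and the rounding steps are modelled on how an skb head is typically sized.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def next_pow2(value):
    """Smallest power of two that is >= value (value must be >= 1)."""
    return 1 << (value - 1).bit_length()


# All constants are assumptions chosen for illustration.
ETH_OVERHEAD = 14 + 4 + 4       # Ethernet header + FCS + VLAN tag
DEFAULT_RX_BUF = 1514 + 4 + 4   # buffer length assumed for standard MTUs
NET_SKB_PAD = 64                # assumed headroom reserved in front of the data
CACHE_LINE = 64                 # assumed alignment unit
SHINFO_SIZE = 320               # assumed sizeof(struct skb_shared_info)


def rx_frag_size(mtu):
    """Hypothetical per-buffer allocation size for a given MTU."""
    if mtu > DEFAULT_RX_BUF:
        buf_len = round_up(mtu + ETH_OVERHEAD, 8)
    else:
        buf_len = DEFAULT_RX_BUF
    head = (round_up(buf_len + NET_SKB_PAD, CACHE_LINE)
            + round_up(SHINFO_SIZE, CACHE_LINE))
    return next_pow2(head)


# Smallest MTU (up to the jumbo limit quoted above) that needs a full 4096 bytes.
threshold = min(m for m in range(1, 6123) if rx_frag_size(m) >= 4096)
print(rx_frag_size(1500), rx_frag_size(1642), rx_frag_size(1643), threshold)
```

Under these assumptions the threshold lands on the same MTU quoted in the comment, but a different cache line size or structure size would shift it.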

Thomas, out of curiosity, would you mind testing with an mtu between 1643 and
6122?

Something like:
ip link set mtu 5000 dev eth0 # on the remote host as well, important!
ping -s4900 -c1000 -f -M do $host
ip -s link show dev eth0

If that shows errors, can you test again with mtu = 1643, just to see if I got
my math right ;) You may do these tests with the stock 12.3 driver or the
updated kmp, I don't think it matters.

Thanks.

Revision history for this message
In , Tschaefer (tschaefer) wrote :

RTNETLINK answers: Invalid argument

MTUs higher than 1500 are not accepted.

Maybe the reason is that this device supports only 100 Mb/s and no jumbo frames.

So I was not able to verify your nice calculation.

Revision history for this message
In , Bpoirier (bpoirier) wrote :

Ah, indeed. Only the 1GB controllers l1c, l1d, l1d_2 support jumbos. Sorry I
didn't pay attention to that.

drivers/net/ethernet/atheros/atl1c/atl1c_main.c:534
 /* Fast Ethernet controller doesn't support jumbo packet */

That's good news in fact because it means that it's always possible to
allocate buffers so that they don't straddle a page boundary for this
hardware.
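The point about page boundaries can be illustrated with a small simulation. This is a simplified model, not the kernel allocator: it assumes fragments are carved back to back from the start of one 32768-byte block (the size quoted in the commit message above) with 4096-byte pages, and the fragment size 1920 is just an arbitrary non-power-of-two example.

```python
PAGE_SIZE = 4096
BLOCK_SIZE = 32768  # "bigger pages" size quoted in the commit message


def straddling_fragments(frag_size, block_size=BLOCK_SIZE, page_size=PAGE_SIZE):
    """Offsets of back-to-back fragments that cross a page boundary."""
    crossing = []
    offset = 0
    while offset + frag_size <= block_size:
        first_page = offset // page_size
        last_page = (offset + frag_size - 1) // page_size
        if first_page != last_page:
            crossing.append(offset)
        offset += frag_size
    return crossing


print(straddling_fragments(1920))  # arbitrary size that is not a power of two
print(straddling_fragments(2048))  # power of two <= PAGE_SIZE: nothing crosses
```

When the fragment size is a power of two no larger than the page size, every fragment starts at a multiple of its own size, so it always fits inside a single page.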

The "theory" remains speculative however. Could you please run one more test?
Install the kmp at http://download.opensuse.org/repositories/home:/benjamin_poirier:/branches:/home:/david_chang:/bnc812116_atl1c/openSUSE_12.3/
modinfo atl1c should report
srcversion: 09F7FD879EA87758649C33B
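Checking the srcversion by eye is error-prone with strings this long, so a small comparison helper may be useful. This is a sketch only: the sample text is hand-assembled from values that appear in this thread rather than real output from one machine, and `modinfo_field` is a hypothetical helper name.

```python
def modinfo_field(text, field):
    """Return the first value of `field` in modinfo-style 'key: value' text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == field:
            return value.strip()
    return None


EXPECTED_SRCVERSION = "09F7FD879EA87758649C33B"

# Hand-assembled sample; real input would be the output of `modinfo atl1c`.
sample = """\
filename: /lib/modules/3.7.10-1.16-desktop/weak-updates/updates/atl1c.ko
version: 1.0.1.0-NAPI
srcversion: 09F7FD879EA87758649C33B
"""

found = modinfo_field(sample, "srcversion")
print("match" if found == EXPECTED_SRCVERSION else "MISMATCH: %s" % found)
```

On a live system the text could be obtained with `subprocess.run(["modinfo", "atl1c"], capture_output=True, text=True).stdout` and passed to the same function.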

Then run as root:
 cat /sys/kernel/debug/tracing/trace_pipe > /tmp/output1
The command will continue to run.

Use the adapter as you normally do to reproduce the bug. When that has
happened, please check (in another terminal) that
 cat /sys/kernel/debug/tracing/tracing_on
contains "0".
Then kill the first cat trace_pipe command and attach the resulting
file /tmp/output1 to the bug.

If you can repeat this 2-3 times in different output files it'll be even
better!

Thanks.

Changed in opensuse:
status: Confirmed → Incomplete
Revision history for this message
Stefano Pecchenino (pekke) wrote :

Sorry, but what about the bug in Ubuntu?
This report was opened for an Ubuntu bug... is no one looking at the Ubuntu network-manager bug?

Only Fedora or SUSE?

I have Ubuntu 13.04 and the network does not work over the cable connection.

Thank you
Pekke

Revision history for this message
In , Bpoirier (bpoirier) wrote :

Hi Thomas, any chance you can collect the trace with the kmp as described in
comment 31?
If so, please note that I've made a small modification to it and tracing will
not stop at the first overflow (ie. ignore the stuff about checking
"tracing_on", just kill the `cat trace_pipe` once the bug has reproduced). The
new module's srcversion is 0E185B665C23586985D5320.
If you'd rather not do this test, no worries - you've already been quite
helpful - but please let me know, I'll go ahead right away and commit the
patch that David identified.

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Sorry for the delay. I am in holiday with weak internet connectivity. (wireless)

I could/would do the test, but not this week.

Revision history for this message
In , Bpoirier (bpoirier) wrote :

(In reply to comment #33)
> Sorry for the delay. I am in holiday with weak internet connectivity.
> (wireless)
>
> I could/would do the test, but not this week.

Oh, thank you and enjoy your holiday!

I'll commit the patch now anyways since it was accepted upstream. I'll also
leave the bug open so we can try and get a little more info about the cause of
these failures when it's convenient for you.

---

Introduced by "69b08f6 net: use bigger pages in __netdev_alloc_frag
(v3.7-rc1)"

Fixed by "7b70176 atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring
(v3.11-rc4)"

openSUSE-12.3
 patches.fixes/atl1c-Fix-misuse-of-netdev_alloc_skb-in-refilling-rx.patch

Revision history for this message
Timmie (timmie) wrote :

@Ubuntu Kernel Team:
How does it happen that a driver gets broken when it was working in earlier kernel versions without any trouble?
(I know that it's not Ubuntu's fault... but how could pre-release QA be improved here?)

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=552466)
wanted logfile

Revision history for this message
In , Bpoirier (bpoirier) wrote :

Created an attachment (id=552666)
trace analysis script

Thank you for this trace.
It shows that many buffers in fact straddle page boundaries but that does not
systematically lead to a stall.

zcat c35-trace.gz | ./analyze.py
[...]
rfd 5 mapped, page left 1856
rfd 7 recv, page left ?
rfd 6 mapped, page left 64
[...]
rfd 5 recv, page left ?
rfd 4 mapped, page left 2240
overflow after rfd 5 (repeated 530 times)
[...]

The trace exhibits one stall, and it happens right before reading from
a receive buffer that was mapped to within 64 bytes of a page boundary. This
is the smallest distance within the entire trace (the second smallest being
192). I'm wondering if the stall also depends on how much data the card
actually writes into the buffer (i.e. the stall happens when the card actually
writes across a page). I've updated the kmp to also trace this information.
If you can give it another spin, it'll be much appreciated, for science's
sake!

The kmp can be installed from the same location as comment 31. New srcversion
is 057EBD10FCF6584249086C0.
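The "smallest distance" observation above can be checked mechanically on output in this format. The attached analyze.py is not reproduced in this thread, so the following is an independent sketch that only assumes the "rfd N mapped, page left M" line format shown in the excerpt; the function name is illustrative.

```python
import re

MAPPED = re.compile(r"rfd (\d+) mapped, page left (\d+)")


def smallest_page_left(lines):
    """Return (page_left, rfd) for the mapped buffer closest to a page end."""
    best = None
    for line in lines:
        match = MAPPED.search(line)
        if match:
            entry = (int(match.group(2)), int(match.group(1)))
            if best is None or entry < best:
                best = entry
    return best


# Lines copied from the excerpt above; "recv" lines are ignored.
excerpt = [
    "rfd 5 mapped, page left 1856",
    "rfd 7 recv, page left ?",
    "rfd 6 mapped, page left 64",
    "rfd 5 recv, page left ?",
    "rfd 4 mapped, page left 2240",
]
print(smallest_page_left(excerpt))
```

Running it over a full trace instead of the excerpt would only require passing an open file object, since the function iterates line by line.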

Revision history for this message
In , Tschaefer (tschaefer) wrote :

I am sorry. I am not able to reproduce the error with this test-module.
I don't see any stall nor dropped packets in the interface-statistics.

Revision history for this message
In , Bpoirier (bpoirier) wrote :

That's odd. It does not contain a workaround of any sort. It's mostly the same
tracepoints as the module from comment 31 but with more information. Could you
check that the srcversion for the running module (cat
/sys/module/atl1c/srcversion) matches what's in comment 36? If so and the
problem still does not reproduce, please supply a trace anyhow. It won't show
that the card stalls when writing across page boundaries, but it will show
that it does not stall when not doing such writes (I expect). Thanks again.

Revision history for this message
In , Tschaefer (tschaefer) wrote :

Created an attachment (id=552961)
the wanted log-file, while the failure occurs

Sporadic errors are sometimes hard to reproduce; this time it was easier.

Changed in opensuse:
status: Incomplete → Confirmed
Revision history for this message
Stefano Pecchenino (pekke) wrote :

Is no one looking at this bug for Ubuntu?
I cannot install 13.04 because of this bug... :(

Revision history for this message
In , Triffterer (triffterer) wrote :

I just wanted to add that I can also confirm this problem.
I checked the current default and desktop flavor of the openSUSE 12.3 kernel (3.7.10-1.16) on my Asus P53E laptop.

My LAN NIC is an
03:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8151 v2.0 Gigabit Ethernet [1969:1083] (rev c0)
powered by the atl1c kernel driver.

Interestingly, the problem disappears when I disable my WLAN interface (02:00.0 Network controller: Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) (rev 01)) by unloading the ath9k driver or unchecking "enable wireless" in the network manager (the discussion in the kernel bugzilla (see comment 25) indicates that there seems to be a problem if another network driver is active in addition to atl1c).

Revision history for this message
alex cuellar (puca-sunqu) wrote :

I have the same problem too; it disconnects constantly.

I use Ubuntu 13.04.

Revision history for this message
In , Bpoirier (bpoirier) wrote :

Created an attachment (id=555784)
updated analysis script for trace from comment 39

(In reply to comment #39)
> Created an attachment (id=552961) [details]
> the wanted log-file, while the failure occurs
>
> sporadic errors are some times hard to reproduce, this time it was easier

Thank you for this second trace and sorry for my long delay in analyzing it.

cat c39-trace | ./analyze.py
rfd 114 recv, frame len 1510 / 192 (-1318 left before page end)
rfd 113 mapped, page left 2496
rfd 115 recv, frame len 1298 / 192 (-1106 left before page end)
rfd 114 mapped, page left 1984
rfd 116 recv, frame len 1298 / 3008 (1710 left before page end)
rfd 115 mapped, page left 2240
rfd 117 recv, frame len 1298 / 448 (-850 left before page end)
[...]
rfd 470 mapped, page left 1984
rfd 472 recv, frame len 102 / 192 (90 left before page end)
rfd 471 mapped, page left 3264
rfd 473 recv, frame len 94 / 2496 (2402 left before page end)
rfd 472 mapped, page left 64
[...]
rfd 471 recv, frame len 1298 / 3264 (1966 left before page end)
rfd 470 mapped, page left 192
overflow (repeated 198 times) after rfd 471, rrd 472 frame len 102 / 64 (-38 left before page end)

It shows that the driver can in fact receive into some buffers that straddle
page boundaries and can write frames large enough to cross this boundary,
without crashing. However, once again the crash happens when writing into a
buffer that starts 64B before the page end. Like in the trace from comment 35,
that is the smallest distance present in the trace.

These traces show that the adapter crashes in a more constrained situation
(writing across the page boundary to a buffer that starts 64B before the end
of a page) than the initial hypothesis (writing to any buffer that crosses a
page boundary).
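
The arithmetic behind the trace lines above can be sketched in Python. This is not the attached analyze.py; it is a minimal illustration that assumes the second number on each "frame len A / B" line is the number of bytes left in the page at the start of the buffer, which matches the margins printed in the trace (for example 192 - 1510 = -1318). The helper names are hypothetical.

```python
import re

# Matches the "frame len A / B (C left before page end)" part of a trace line.
FRAME_RE = re.compile(r"frame len (\d+) / (\d+) \((-?\d+) left before page end\)")


def parse_frame(line):
    """Return (frame_len, page_left, margin) for a recv/overflow line, else None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    frame_len, page_left, margin = (int(g) for g in match.groups())
    return frame_len, page_left, margin


def crosses_page(frame_len, page_left):
    """True when the frame does not fit in the bytes left before the page end."""
    return frame_len > page_left


def smallest_page_left(lines):
    """Smallest page_left over all parseable lines, or None if there are none."""
    values = [parsed[1] for parsed in map(parse_frame, lines) if parsed]
    return min(values) if values else None


# Three lines copied from the trace output in this comment.
sample = [
    "rfd 114 recv, frame len 1510 / 192 (-1318 left before page end)",
    "rfd 116 recv, frame len 1298 / 3008 (1710 left before page end)",
    "overflow (repeated 198 times) after rfd 471, rrd 472 frame len 102 / 64 (-38 left before page end)",
]

for line in sample:
    frame_len, page_left, margin = parse_frame(line)
    print(frame_len, page_left, margin, crosses_page(frame_len, page_left))
print("smallest page_left:", smallest_page_left(sample))
```

On these three lines the first frame crosses the page boundary without a crash, while the overflow line is the one whose buffer starts only 64 bytes before the page end.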

In , Bpoirier (bpoirier) wrote :

(In reply to comment #40)
> I just wanted to add that I can also confirm this problem.
> I checked the current default and desktop flavor of the openSUSE 12.3 kernel
> (3.7.10-1.16) on my Asus P53E laptop.
>
> My LAN NIC is an
> 03:00.0 Ethernet controller [0200]: Atheros Communications Inc. AR8151 v2.0
> Gigabit Ethernet [1969:1083] (rev c0)
> powered by the atl1c kernel driver.
>

Tobias,

All reports of this bug so far affect the AR8152 variant of this NIC, while
you report an AR8151. Does ifconfig report "overruns" after a problematic
episode?
The patch that fixes the problem on AR8152 has been added to the 12.3 branch
of the openSUSE kernel repository but there hasn't been a kernel update
released since. You can test it by using the kmp from comment 26:
http://download.opensuse.org/repositories/home:/david_chang:/bnc812116_atl1c/openSUSE_12.3/

or using the kotd packages:
http://kernel.opensuse.org/packages/openSUSE-12.3

In , Triffterer (triffterer) wrote :

Hi Benjamin,

the symptoms on my NIC are the same as described in the other comments.

I made the following test to isolate the problem:

I pinged my desktop PC from my laptop and ran Wireshark on both computers for monitoring. When the problem occurred, I could still see the incoming echo request and the outgoing echo reply on the desktop PC, but the echo reply never showed up in Wireshark on the laptop. Instead, in the ifconfig output, the dropped, overruns and frame counters in the RX line started counting up until I removed the LAN cable for a few seconds and plugged it back in.

I am going to check the mentioned kernel package, but I will be quite busy this week so it will take some time until I have a possibility to do the test.
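
The counter check described above could be automated roughly as follows. This is only a sketch: it assumes the classic net-tools ifconfig layout for the RX line ("RX packets:N errors:N dropped:N overruns:N frame:N"), and the two sample lines and their values are invented for illustration, not taken from an affected machine.

```python
import re

# Counters that were seen counting up during a problematic episode.
RX_FIELDS = ("dropped", "overruns", "frame")


def parse_rx_line(line):
    """Turn 'RX packets:1 errors:0 ...' into a dict of integer counters."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


def rising_counters(before, after):
    """Names of the watched RX counters that increased between two snapshots."""
    old, new = parse_rx_line(before), parse_rx_line(after)
    return [name for name in RX_FIELDS if new.get(name, 0) > old.get(name, 0)]


# Hypothetical snapshots taken before and during an episode.
before = "RX packets:1200 errors:0 dropped:0 overruns:0 frame:0"
after = "RX packets:1200 errors:37 dropped:37 overruns:37 frame:37"

print(rising_counters(before, after))
```

If all three names come back while the packet count stays flat, that matches the symptom reported here.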

Stefano Pecchenino (pekke) wrote :

SOLVED !!!!!

wget -c kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc4-saucy/linux-headers-3.11.0-031100rc4_3.11.0-031100rc4.201308041735_all.deb

wget -c kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc4-saucy/linux-headers-3.11.0-031100rc4-generic_3.11.0-031100rc4.201308041735_i386.deb

wget -c kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc4-saucy/linux-image-3.11.0-031100rc4-generic_3.11.0-031100rc4.201308041735_i386.deb

sudo dpkg -i *.deb
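
After installing and rebooting, one way to confirm that the new kernel is the one actually running is to compare its release string against 3.11. The sketch below is an illustration only: the (3, 11) threshold is an assumption based on the mainline build installed above, and the function names are hypothetical.

```python
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '3.8.0-19-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def at_least(release, minimum=(3, 11)):
    """True when the release is at or above the assumed fixed version."""
    return kernel_major_minor(release) >= minimum


# Release strings as they appear in this report and in the package names above.
print(at_least("3.8.0-19-generic"))
print(at_least("3.11.0-031100rc4-generic"))
```

On a live system the release string can be obtained with `uname -r` or with `platform.release()` from the Python standard library.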

In , Christopher Goss (9e9o1ko8b2f5xpiibgscj-chris) wrote :

I have an identical problem on a Toshiba laptop with the AR9285 chipset, on openSUSE 12.3 with all 32-bit PAE desktop kernels.

Symptoms: Ethernet connects normally but stops functioning soon after (1-2 minutes).

Background : This module was marked as "EXPERIMENTAL" in kernels prior to 3.9.x
(http://cateee.net/lkddb/web-lkddb/ATL1C.html)

Note: The same problem is reported on Ubuntu's Launchpad bug tracker, where Ubuntu users report that upgrading to the 3.11 kernel fixes it. (https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1175091)

In , Bpoirier (bpoirier) wrote :

Christopher, AR9285 is a wireless chipset. Can you attach the output of `lspci
-vvnn`, `ip -s -s link` and `dmesg` after you experience problems?

The comments about kernel update in Comment 42 still apply. The best way to
test the patched driver is by using the kmp from comment 26.

In , Christopher Goss (9e9o1ko8b2f5xpiibgscj-chris) wrote :

Apologies, the Ethernet controller is an AR8151, per lspci:

03:00.0 Network controller: Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) (rev 01)

04:00.0 Ethernet controller: Atheros Communications Inc. AR8151 v2.0 Gigabit Ethernet (rev c0)

Thanks to David Chang, I can confirm this patch fixes the problem for me.
Dmesg shows nothing remarkable post-patch. (Sorry, I didn't check pre-patch.)
I have now switched to the daily kernel builds to test.
No regressions so far.

Sincere thanks to the hard-working devs at SUSE/openSUSE.

In , Christopher Goss (9e9o1ko8b2f5xpiibgscj-chris) wrote :

dmesg output pre patch (sorry)

chris@linux-ojpk:~> dmesg
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.7.10-58.gae46e8d-desktop (geeko@buildhost) (gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux) ) #1 SMP PREEMPT Fri Sep 13 12:38:23 UTC 2013 (ae46e8d)
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000ace3efff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000ace3f000-0x00000000acebefff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000acebf000-0x00000000acfbefff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000acfbf000-0x00000000acffefff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000acfff000-0x00000000acffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000ad000000-0x00000000af9fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feb00000-0x00000000feb03fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed10000-0x00000000fed19fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ffd80000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000014fdfffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.7 present.
[ 0.000000] DMI: TOSHIBA Satellite L730/Base Board Product Name, BIOS 2.50 06/26/2012
[ 0.000000] e820: update [mem 0x00000000-0x0000ffff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x14fe00 max_arch_pfn = 0x1000000
[ 0.000000] MTRR default type: uncachable
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 000000000 mask F80000000 write-back
[ 0.000000] 1 base 080000000 mask FC0000000 write-back
[ 0.000000] 2 base 0AD000000 mask FFF000000 uncachable
[ 0.000000] 3 base 0AE000000 mask FFE000000 uncachable
[ 0.000000] 4 base 0B0000000 mask FF0000000 uncachable
[ 0.000000] 5 base 0FFC00000 mask FFFC00000 write-protect
[ 0.000000] 6 base 100000000 mask FC0000000 write-back
[ 0....

In , Triffterer (triffterer) wrote :

Hi all,

sorry for the delay.

I have tested the kernel module from comment 26 for about one week, and the problem has not occurred again since its installation, so I think that the patch solves the problem.

In , Jcheung (jcheung) wrote :

Hi David,

Please check when you can submit the patch.

Changed in opensuse:
status: Confirmed → In Progress
In , Jcheung (jcheung) wrote :

I checked with David: the commit is already in the SP3 maintenance kernel.

In , Jcheung (jcheung) wrote :

Sorry, my mistake: it should be the openSUSE 12.3 maintenance kernel.

In , Swamp-a (swamp-a) wrote :

openSUSE-SU-2013:1971-1: An update that solves 34 vulnerabilities and has 19 fixes is now available.

Category: security (moderate)
Bug References: 799516,801341,802347,804198,807153,807188,807471,808827,809906,810144,810473,811882,812116,813733,813889,814211,814336,814510,815256,815320,816668,816708,817651,818053,818561,821612,821735,822575,822579,823267,823342,823517,823633,823797,824171,824295,826102,826350,826374,827749,827750,828119,828191,828714,829539,831058,831956,832615,833321,833585,834647,837258,838346
CVE References: CVE-2013-0914,CVE-2013-1059,CVE-2013-1819,CVE-2013-1929,CVE-2013-1979,CVE-2013-2141,CVE-2013-2148,CVE-2013-2164,CVE-2013-2206,CVE-2013-2232,CVE-2013-2234,CVE-2013-2237,CVE-2013-2546,CVE-2013-2547,CVE-2013-2548,CVE-2013-2634,CVE-2013-2635,CVE-2013-2851,CVE-2013-2852,CVE-2013-3222,CVE-2013-3223,CVE-2013-3224,CVE-2013-3226,CVE-2013-3227,CVE-2013-3228,CVE-2013-3229,CVE-2013-3230,CVE-2013-3231,CVE-2013-3232,CVE-2013-3233,CVE-2013-3234,CVE-2013-3235,CVE-2013-3301,CVE-2013-4162
Sources used:
openSUSE 12.3 (src): kernel-docs-3.7.10-1.24.1, kernel-source-3.7.10-1.24.1, kernel-syms-3.7.10-1.24.1

In , Jcheung (jcheung) wrote :

The patches have already been pushed to the maintenance channel for users to update.

Changed in opensuse:
status: In Progress → Fix Released
Changed in network-manager (Ubuntu):
assignee: nobody → daniel patricio (danipizr55)
Changed in fedora:
importance: Unknown → Medium
status: Unknown → Invalid
Changed in network-manager (Ubuntu):
status: Confirmed → Fix Released