linux-image-2.6.28-11-generic r8169 timeout

Bug #378907 reported by Cody Pisto
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-2.6.28-11-generic

Unfortunately this isn't easy to reproduce, although it seems to happen when the network interface is under heavy load.

The network interface will "stall" and in dmesg I get:

[ 9837.988021] ------------[ cut here ]------------
[ 9837.988029] WARNING: at /build/buildd/linux-2.6.28/net/sched/sch_generic.c:226 dev_watchdog+0x219/0x230()
[ 9837.988035] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[ 9837.988039] Modules linked in: input_polldev video output lp parport snd_hda_intel iTCO_wdt iTCO_vendor_support snd_pcm intel_agp serio_raw pcspkr agpgart usbhid snd_timer snd soundcore snd_page_alloc r8169 mii fbcon tileblit font bitblit softcursor
[ 9837.988081] Pid: 0, comm: swapper Not tainted 2.6.28-11-generic #42-Ubuntu
[ 9837.988086] Call Trace:
[ 9837.988096] [<c0139ab0>] warn_slowpath+0x60/0x80
[ 9837.988105] [<c012c6fc>] ? enqueue_entity+0x13c/0x360
[ 9837.988113] [<c01568f3>] ? getnstimeofday+0x53/0x110
[ 9837.988121] [<c0119a00>] ? setup_APIC_eilvt_mce+0x0/0x30
[ 9837.988128] [<c0159b1a>] ? clockevents_program_event+0x9a/0x150
[ 9837.988135] [<c01568f3>] ? getnstimeofday+0x53/0x110
[ 9837.988143] [<c02cb03d>] ? strlcpy+0x1d/0x60
[ 9837.988151] [<c04312f2>] ? netdev_drivername+0x32/0x40
[ 9837.988157] [<c0445e49>] dev_watchdog+0x219/0x230
[ 9837.988164] [<c01568f3>] ? getnstimeofday+0x53/0x110
[ 9837.988171] [<c0119a73>] ? lapic_next_event+0x13/0x20
[ 9837.988178] [<c0159b1a>] ? clockevents_program_event+0x9a/0x150
[ 9837.988186] [<c0143b00>] run_timer_softirq+0x130/0x200
[ 9837.988192] [<c0445c30>] ? dev_watchdog+0x0/0x230
[ 9837.988198] [<c0445c30>] ? dev_watchdog+0x0/0x230
[ 9837.988206] [<c013f197>] __do_softirq+0x97/0x170
[ 9837.988212] [<c0152ca6>] ? hrtimer_interrupt+0x186/0x1b0
[ 9837.988218] [<c0152af9>] ? ktime_get+0x19/0x40
[ 9837.988225] [<c013f2cd>] do_softirq+0x5d/0x60
[ 9837.988231] [<c013f445>] irq_exit+0x55/0x90
[ 9837.988238] [<c011a07b>] smp_apic_timer_interrupt+0x5b/0x90
[ 9837.988245] [<c0105318>] apic_timer_interrupt+0x28/0x30
[ 9837.988252] [<c010b012>] ? mwait_idle+0x42/0x50
[ 9837.988258] [<c010285d>] cpu_idle+0x6d/0xd0
[ 9837.988265] [<c04fe6fe>] start_secondary+0xbe/0xf0
[ 9837.988270] ---[ end trace 1684b2f04d15bc47 ]---
[ 9838.005062] r8169: eth0: link up

The stalls also happen without the watchdog timeout warning; in that case dmesg shows only:
[44518.005399] r8169: eth0: link up
[137104.006414] r8169: eth0: link up
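To tell the two kinds of stall apart when going through a long dmesg dump, something like the following could help. It is a minimal sketch, not part of the original report: the function name `r8169_events` is hypothetical, and it assumes the `[seconds.micro]` timestamp prefix and the exact message wording shown in the logs above.

```python
import re

# Matches lines such as "[ 9838.005062] r8169: eth0: link up".
# Assumption: dmesg timestamp prefix and message text as shown in this report.
LINK_UP = re.compile(r"\[\s*(\d+\.\d+)\]\s+r8169: (\w+): link up")
# Matches "[ 9837.988035] NETDEV WATCHDOG: eth0 (r8169): transmit timed out".
WATCHDOG = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NETDEV WATCHDOG: (\w+) \(r8169\): transmit timed out"
)


def r8169_events(lines):
    """Return (timestamp, interface, kind) tuples for watchdog and link-up lines."""
    events = []
    for line in lines:
        m = WATCHDOG.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), "watchdog"))
            continue
        m = LINK_UP.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), "link-up"))
    return events
```

A "link-up" event with no "watchdog" event just before it corresponds to the silent stalls above. Note that the link-up messages printed during boot (around 9.9 s in the additional info below) would also be reported and need to be ignored by hand.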

I have verified this is not a hub/switch/cabling issue, and have reproduced it on two machines with identical r8169-based 10/100/1000 PCIe NICs.

Additional info:

[ 3.755127] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 3.755169] r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 3.755205] r8169 0000:01:00.0: setting latency timer to 64
[ 3.755453] r8169 0000:01:00.0: irq 2299 for MSI/MSI-X
[ 9.926920] r8169: eth0: link up
[ 9.926940] r8169: eth0: link up

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)

/proc/interrupts:
2299: 161483691 0 0 0 PCI-MSI-edge eth0

The machines in question are the MSI Wind 100 Nettops.

Revision history for this message
Cody Pisto (cpisto) wrote :

Related note: using the Linux driver supplied by Realtek for r8168-based NICs, available on this page:

http://www.realtek.com.tw/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false

LINUX driver for kernel 2.6.x and 2.4.x (Support x86 and x64)
8.012.00 2009/5/5

eliminates the error entirely.

Revision history for this message
Thomas Baum (thomas-thomba) wrote :

This NIC is also used on Intel's Atom N270-based board D945GSEJT, which I use for my home server. I can reproduce that behavior easily by copying large files from the server to the server.

Revision history for this message
mgoewe (michael-goewe) wrote :

Same behaviour on my NAS box with a D945GCLF2 board. After switching back to 2.6.27-14-server #1 SMP, the box ran flawlessly again.

Revision history for this message
Monarch (niels-monarch) wrote :

I have the same board (D945GSEJT) and had the same issue. I finally solved the problem by installing the newest kernel from the karmic repository.

http://www.monarch.de/wordpress/?p=101

Revision history for this message
Thomas Baum (thomas-thomba) wrote :

Just a guess: perhaps this bug is related to the recently fixed issue mentioned in http://www.securityfocus.com/bid/35281.

Revision history for this message
Daniel Benamy (dbenamy) wrote :

I'm seeing this too.
Possibly related to https://bugs.launchpad.net/ubuntu/hardy/+bug/141343.

Revision history for this message
Thomas Baum (thomas-thomba) wrote :

Having updated to Karmic (kernel 2.6.31-16-generic) I can't reproduce this issue.

Revision history for this message
Monarch (niels-monarch) wrote :

Yes, the problem is rectified starting from 2.6.31-1

Revision history for this message
Daniel Benamy (dbenamy) wrote :

I also haven't seen it since upgrading to Karmic 9.10.

Revision history for this message
Ed Stone (svcallisto37) wrote :

My r8168 NIC quit working with the update prior to 2.6.31.16 (Karmic 9.10) and hasn't worked since. I cannot get r8168.015.00 to compile either. I have tried every suggestion from every thread I could find, with no luck. I can post all the data anyone wants to see, but it looks like all the others I have seen. Any help would be appreciated.

Revision history for this message
Thomas Baum (thomas-thomba) wrote :

I can't confirm Ed's report. I have kernel 2.6.31-17-generic-pae up and running here, and I haven't had any problems since I updated to Karmic.

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi Cody,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/releases/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 378907

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
arnox (arno-grbac) wrote :

This bug is present in 10.04. I have a 4-drive 1.5GB mdadm RAID5. When I try to copy about 380 GB worth of images from the main file server, the computer eventually restarts or locks up (anywhere after 100 GB). It's fun waiting two hours for the RAID to resync, but at least it comes back up.

I tried using apport-collect, but that didn't work, so here are the details:

torq kernel: [ 0.000000] Linux version 2.6.32-24-generic (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #39-Ubuntu SMP Wed Jul 28 05:14:15 UTC 2010 (Ubuntu 2.6.32-24.39-generic 2.6.32.15+drm33.5)
torq kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-2.6.32-24-generic root=/dev/md0 ro nodmraid quiet splash

syslog:

...[a lot of nothing out of the ordinary], and then:

22:38:22 torq kernel: [14413.554256] r8169: eth0: link up
22:38:25 torq kernel: [14416.603732] r8169: eth0: link up
22:39:32 torq kernel: [14482.908563] r8169: eth0: link up
22:41:42 torq kernel: [14613.724158] r8169: eth0: link up

...about 20 more of the above, and then nothing more.
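To see how quickly the resets come before the machine goes down, the intervals between the syslog link-up lines can be computed. This is a minimal sketch, not from the original report: `link_up_gaps` is a hypothetical helper, and it assumes each line starts with an `HH:MM:SS` time (date prefix stripped, as in the excerpt) and that the log does not cross midnight.

```python
from datetime import datetime


def link_up_gaps(lines, iface="eth0"):
    """Seconds between consecutive 'r8169: <iface>: link up' syslog lines."""
    marker = f"r8169: {iface}: link up"
    times = []
    for line in lines:
        if marker in line:
            # Assumption: the first field is an HH:MM:SS wall-clock time.
            times.append(datetime.strptime(line.split()[0], "%H:%M:%S"))
    # Pair each timestamp with the next one; midnight rollover is not handled.
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
```

For the four lines quoted above this gives gaps of 3, 67 and 130 seconds, which agree with the kernel's own `[seconds]` timestamps on the same lines.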

After the reboot, mdadm resyncs the array.

Revision history for this message
Matti Airas (mairas) wrote :

The bug is still present in both maverick and natty beta 2 on an ASUS G53JW. I don't get any call traces, but the repeated "r8169: eth0: link up" messages. When transferring large files over a gigabit ethernet, on maverick the computer locks up within a few seconds; on natty, it may last up to 10-20 minutes. Transfer rates are poor, about 2 Mbit/s, due to the frequent resets. The realtek-provided r8168 driver works perfectly.

Revision history for this message
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix