sky2 driver "tx timeout" with large uploads

Bug #114019 reported by jcfp
Affects                        Status     Importance  Assigned to  Milestone
Linux                          Invalid    Medium
linux (Ubuntu)                 Invalid    Undecided   Unassigned
linux-source-2.6.20 (Ubuntu)   Won't Fix  Undecided   Unassigned

Bug Description

Binary package hint: linux-source-2.6.20

With kernel 2.6.20-15 on feisty, the sky2 module (version 1.13) suffers from "tx timeout" errors when doing larger uploads. The problem is reliably reproducible; it usually takes several GB of upload (anywhere from 4 to 20) to trigger it. Typical upload speeds here when the error happens are in the 40-50 Mbit/s range, which is less than half the available uplink capacity. I don't know if it depends on upload speed (I can't really do any low-speed tests). I haven't noticed any problems on downloads, regardless of size and speed.

Don't know if this bug existed prior to this kernel or driver version; in dapper and edgy I used the sk98lin driver instead which always worked fine. Since that is no longer an option in feisty (at least for the time being, see bug #114012), I have had no choice but to start using the sky2 driver and quickly ran into this problem.

When the error happens, the interface is automatically disabled and (successfully) re-enabled by the driver, but this of course still kills all connections (including vpn/ppp/etc) on the interface. The following is logged to syslog:

kernel: [35559.052000] NETDEV WATCHDOG: eth1: transmit timed out
kernel: [35559.052000] sky2 eth1: tx timeout
kernel: [35559.052000] sky2 eth1: transmit ring 413 .. 390 report=413 done=413
kernel: [35559.052000] sky2 eth1: disabling interface
kernel: [35559.056000] sky2 eth1: enabling interface
kernel: [35559.056000] sky2 eth1: ram buffer 48K
kernel: [35560.740000] sky2 eth1: Link is up at 100 Mbps, full duplex, flow control rx

as well as similar ones, where only the 'transmit ring' line is different, like:
kernel: [37680.744000] sky2 eth1: transmit ring 341 .. 318 report=341 done=341
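
For anyone collecting these reports from syslog, the "transmit ring" lines can be pulled apart with a short script. This is only a minimal sketch based on the two log lines quoted above; the field names `first` and `second` are placeholders chosen here for the two numbers around "..", not names taken from the driver.

```python
import re

# Matches the "transmit ring" line that sky2 logs after a tx timeout.
# The pattern is derived only from the log lines quoted in this report.
TX_RING_RE = re.compile(
    r"sky2 (?P<iface>\S+): transmit ring (?P<first>\d+) \.\. (?P<second>\d+) "
    r"report=(?P<report>\d+) done=(?P<done>\d+)"
)


def parse_tx_ring(line):
    """Return the interface name and ring counters from one log line, or None."""
    match = TX_RING_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Keep the interface name as text; convert every counter to int.
    return {k: (v if k == "iface" else int(v)) for k, v in fields.items()}


sample = "kernel: [35559.052000] sky2 eth1: transmit ring 413 .. 390 report=413 done=413"
print(parse_tx_ring(sample))
```

Running this over a whole log makes it easier to compare the counters across several timeouts.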

# lspci -vvnn -s 04:00.0
04:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller [11ab:4362] (rev 15)
        Subsystem: Micro-Star International Co., Ltd. Marvell 88E8053 Gigabit Ethernet Controller (MSI) [1462:058c]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at fdafc000 (64-bit, non-prefetchable) [size=16K]
        Region 2: I/O ports at 7c00 [size=256]
        [virtual] Expansion ROM at fd900000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
                Address: 0000000000000000 Data: 0000
        Capabilities: [e0] Express Legacy Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s unlimited, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 2
                Link: Latency L0s <256ns, L1 unlimited
                Link: ASPM Disabled RCB 128 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x1

I have noticed there are several other bugs regarding this driver, such as bug #83009 (includes a "soft lockup" that doesn't happen here), bug #68338 (task against 2.6.20 was marked rejected) and bug #37784 (talks of problems on tiny uploads and requires manual interaction to restore connectivity), but none of them seem identical to this.

Revision history for this message
jcfp (jcfp) wrote :

Still exists with 2.6.20-16

Revision history for this message
jcfp (jcfp) wrote :

It appears that with 2.6.20-16 the connection just hangs without anything being printed to syslog or dmesg, and rmmod/modprobe of the sky2 module is required to get the device working again (I waited over half an hour after ppp/vpn connections stopped working). In 2.6.20-15 it would always print the "tx timeout" messages within minutes after the connection failed, and automatically reset itself without human intervention.

Revision history for this message
jcfp (jcfp) wrote :

Half an hour wasn't enough :(
Today the driver restarted the device after 40 minutes when the problem happened once again.

Revision history for this message
In , gbailey (gbailey-linux-kernel-bugs) wrote :

Most recent kernel where this bug did not occur: Unknown

Distribution: CentOS 4.5

Hardware Environment: Intel server board SE7320VP21

02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 18)
        Subsystem: Intel Corporation: Unknown device 3466
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at deefc000 (64-bit, non-prefetchable) [size=16K]
        I/O ports at b800 [size=256]
        Expansion ROM at deec0000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
        Capabilities: [e0] Express Legacy Endpoint IRQ 0

Software Environment: CentOS 4.5 install + "vanilla" kernel 2.6.23-rc4

Problem Description:

Discovered while attempting to troubleshoot:
https://bugzilla.redhat.com/show_bug.cgi?id=228733

I'm trying to understand the "tx timeout" messages, and how to reproduce them. In my test environment, I have 2 servers, each of which has a sky2 Marvell NIC connected to a switch as "eth0".

On server "B", I type "nc -l -p 3409 > /dev/null"

On server "A", I type "nc server-B 3409 < /dev/zero"

I see lots of traffic from A->B, as would be expected. If I shut down eth0 on server "B" using "ifdown eth0", wait a few seconds, and then re-enable eth0 on server "B" using "ifup eth0", I see the following in "dmesg" output on server B:

sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
ip_tables: (C) 2000-2006 Netfilter Core Team

As expected... The problem is that server B can occasionally end up in a state where it is unable to ping or access the local subnet anymore. Both "mii-tool" and "ethtool eth0" show a link present.

If I perform "ifdown eth0; ifup eth0" on server B, it doesn't help anything.
If I unload the sky2 module, then things clear up and I'm back on the network again.

I'm curious about this testcase because the symptom seems to match the earlier "tx timeout" messages; the driver tried to re-enable itself after a timeout, but it's still not able to see any traffic.

Steps to reproduce:

See "Problem Description" above. While traffic is continuously being transmitted from server "A" to server "B", shut down the network interface on server "B", and then start it again. Monitoring RX traffic on server "B" will indicate when it is no longer receiving the bytes sent from server "A".
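
One way to do the RX monitoring on server "B" is to read the receive byte counter from /proc/net/dev twice and compare. The following is a rough sketch, not part of the original test setup; the helper names and the sample text are made up for illustration.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Return the received-byte counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        # Interface lines look like "  eth0: <rx bytes> <rx packets> ...";
        # the two header lines contain no colon and are skipped.
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            return int(counters.split()[0])
    raise KeyError(iface)


def rx_stalled(before, after):
    """True if the counter did not move between two samples."""
    return after == before


# Illustrative sample only; real input would come from open("/proc/net/dev").read().
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  123456     789    0    0    0     0          0         0    65432     321    0    0    0     0       0          0
"""

print(rx_bytes(SAMPLE, "eth0"))
```

In practice one would take two readings a few seconds apart while the nc transfer is running and treat an unchanged counter as a sign that the interface has stopped receiving.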

Revision history for this message
In , stephen (stephen-linux-kernel-bugs) wrote :

CentOS has an older version of the driver; please update to the latest version from 2.6.22.6 or 2.6.23-rc4. There were several bugs that caused tx timeouts (hung chip), and a problem that led to PHY clock issues.

Revision history for this message
In , gbailey (gbailey-linux-kernel-bugs) wrote :

The kernel version I encountered this on is 2.6.23-rc4, as marked in the bug report; that is why I wrote 'CentOS 4.5 install + "vanilla" kernel 2.6.23-rc4' under "Software Environment".

Revision history for this message
jcfp (jcfp) wrote :

Still happens with 2.6.20-16.31; driver only resets itself approximately half an hour after failure.

Revision history for this message
In , stephen (stephen-linux-kernel-bugs) wrote :

Please enable the sky2 debugfs kernel configuration option.
Mount debugfs somewhere (e.g. /debug).
Hang the system, then capture the sky2 state (cat /debug/sky2/eth0 >savefile).
It will show the status of the IRQ and receive/transmit.

Revision history for this message
In , gbailey (gbailey-linux-kernel-bugs) wrote :

Rebuilt 2.6.23-rc5 with SKY2_DEBUG. I've reproduced the issue where ifdown/ifup does not reset the interface properly.

# cat /debug/sky2/eth0
IRQ src=0 mask=c000001d control=0
Status ring (empty)
Tx ring pending=24...24 report=24 done=24

Rx ring hw get=169 put=169 last=1023
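
To compare several such dumps (for example a working state against a hung one), the key=value pairs can be collected into a dictionary. This is a sketch that assumes the layout shown in the dump above; the function name is hypothetical, and values are kept as strings because the format is only known from this one sample.

```python
import re


def parse_sky2_debugfs(text):
    """Collect key=value pairs from the IRQ, Tx ring and Rx ring lines."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(("IRQ", "Tx ring", "Rx ring")):
            section = line.split()[0].lower()  # "irq", "tx" or "rx"
            for key, value in re.findall(r"(\w+)=(\S+)", line):
                state[f"{section}_{key}"] = value
    return state


DUMP = """\
IRQ src=0 mask=c000001d control=0
Status ring (empty)
Tx ring pending=24...24 report=24 done=24

Rx ring hw get=169 put=169 last=1023
"""

for name, value in sorted(parse_sky2_debugfs(DUMP).items()):
    print(name, value)
```

Two parsed dumps can then be compared key by key to see which counters stopped moving.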

Revision history for this message
In , tony (tony-linux-kernel-bugs) wrote :

I can confirm that we can reproduce this issue (or one nearly identical to it). We are using the current stable 2.6.22.6 kernel on a system with a Marvell 88E8055 (Panasonic Toughbook CF-74).

To reproduce it, we can open any kind of persistent socket connection (such as an Apache SSL connection using a browser) and then yank the cable. We wait a bit and pop the cable back in and the driver is dead. We can't ping in or out until we down the interface, remove and reinsert the sky2 driver and bring the interface back up.

I will be happy to provide any info or test any patches you provide.

Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

I'm having trouble here too. Using an Ubuntu Gutsy kernel:

Linux andy-desktop 2.6.22-10-generic #1 SMP Wed Aug 22 07:42:05 GMT 2007 x86_64 GNU/Linux

I'm not getting tx timeouts AFAIK. I'm not getting any driver crash dumps either. I'm just having connection issues. I'm not transferring anything big. I will be browsing the web, then all of a sudden the interface will get in some type of corrupted state where nothing works. Sometimes ifdown/ifup will do it, sometimes it will not. Sometimes dhclient works, sometimes not. Unloading sky2 and reloading it *always* fixes the problem, indicating some type of issue with the "current state" of the driver. Maybe a variable not getting cleared/etc but I can only guess.

Sometimes ifdown/ifup will work and then it will only work for about a minute. Redoing ifdown/ifup will make sky2 work for another few hours (it's like refilling your gas tank, just on a smaller level ;)).

Sometimes I will get Destination Host Unreachable from pinging my router, sometimes ping says nothing at all.

I tried with the modprobe sky2 debug=16 option but the log output looks not much different from when the adapter is working. And, I haven't caught it just when it stopped working, yet. I have only turned on my monitor to notice that my net wasn't working and then dumped a few logs of it. In any case, I don't think they're helpful but if you need them I will gladly post them.

Most importantly, this is a regression from 2.6.20. I hope this can get fixed and if so I'll notify those at Ubuntu and get this into the kernel and hopefully an exception for it if necessary.

Ubuntu bug link: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/138611

Changed in linux:
status: Unknown → Fix Released
Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

I fixed my problems by using 2.6.23-rc6.

Revision history for this message
In , nhorman (nhorman-linux-kernel-bugs) wrote :

Interesting, the only thing that went in between rc5 and rc6 was the restore multicast list on resume, which while potentially applicable, doesn't sound like it addresses the whole of the problem. Does rc5 fix the problem for you as well?

Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

Sorry for the misunderstanding.

I fixed my problems by upgrading from the Ubuntu Linux 2.6.22-11 kernel to the vanilla 2.6.23-rc6 kernel. I hadn't even tried any other 2.6.23 yet. I'm thinking the Ubuntu kernel has a problem due to mismatched or partially backported patches, at least in my case.

Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

Created attachment 13006
debugfs sky2

Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

Created attachment 13007
debugfs sky2 (when it did work)

Revision history for this message
In , xt.knight (xt.knight-linux-kernel-bugs) wrote :

I am still having issues with 2.6.23-rc6 and rc8, but it took a while for them to begin happening again. I attached two debugfs logs of sky2.

Changed in linux:
status: Unknown → Confirmed
Revision history for this message
In , rf (rf-linux-kernel-bugs) wrote :

I'm running SuSe 10.3 and with an updated kernel (2.6.23.1-164-default) the problem remains.
The interface is listed as "sky2 0000:02:00.0: v1.18 addr 0xd5020000 irq 17 Yukon-EC (0xb6) rev 1"
I only run 100 Mbit to a switch. I use it on a media server, and unfortunately, after a few hours of reasonably heavy use streaming media, the interface dies; then, 3-4 hours later, the machine crashes.
If I get to the machine before it dies, I can restart the interface, but as others report, it lasts for a shorter time.
When restarting it, "ifstatus" reports it as up in the failed mode; doing an "ifdown" and "ifup" restarts it.
ifup reports: "device: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 19)"
I see nothing in dmesg when the interface dies.

Revision history for this message
In , stephen (stephen-linux-kernel-bugs) wrote :

There is a problem on Yukon-EC that causes the receive fifo to hang.
There is workaround code in 2.6.23 that is supposed to detect and fix it.

The problem also only occurs if there is no flow control. The sky2
autonegotiates to enable flow control, but some hardware doesn't support
flow control or has it disabled.

Revision history for this message
In , rf (rf-linux-kernel-bugs) wrote :

Thanks.
Unfortunately the log reports:
kernel: [ 982.916325] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
So I'm not sure it's limited to the case when flow control is on.
I noticed some threads earlier this year where you tried flow control off. Is that worth trying again with latest release, if so how?

Revision history for this message
In , rf (rf-linux-kernel-bugs) wrote :

I can repeat the failure by trying to copy about 20G of files over a Samba connection from a Windows box. I can never get past 5G before it fails. So perhaps I can do some debugging for you?

Revision history for this message
In , stephen (stephen-linux-kernel-bugs) wrote :

Is this the same bug as the original report, or is the bug becoming a tar baby for all the possible "my sky2 has hung" reports?

The original report said the problem was reproducible after up/down. It was not one
of the "my box hangs under load" problems.

Changed in linux:
status: Confirmed → Incomplete
Revision history for this message
In , rf (rf-linux-kernel-bugs) wrote :

Sorry, no, to avoid raising another bug on sky2 this was the nearest I could find.
Sky2 hangs under load, that's the problem. Very repeatable.
I've now compiled and switched to the Marvell driver sk98lin, and that gives me no problems...

Revision history for this message
In , andree182 (andree182-linux-kernel-bugs) wrote :

Tried to find the bug source, but couldn't ;-( I used ubuntu 2.6.24 sources, placed the 2.6.22 (ubuntu) sky2.[ch] (ver. 1.18) files into the tree and applied the

[NET]: Make NAPI polling independent of struct net_device objects.
+
[NET]: Nuke SET_MODULE_OWNER macro.

patches (from git). Then I built the module and did an rmmod/modprobe, but nothing changed - the sky2 still fails with "sky2 eth0: rx error ..." in the dmesg.

Thus I guess the error could be somewhere else (maybe the napi polling isn't working quite right?), or maybe... I guess I'm gonna try to really find the bug...

Revision history for this message
In , ryan.roth (ryan.roth-linux-kernel-bugs) wrote :

I have consistently had the same issue reported above, where the kernel reports the following and the interface does not work. It seems to work fine the first time you bring up the interface, but if you do an ifdown/ifup you get the following message, but no connection.

"sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both"

Revision history for this message
Marlon Cisternas Milla (mcisternas-deactivatedaccount) wrote :

Thanks for taking the time to report this bug. It has remained idle for some time. Can you still confirm this bug? Does the issue persist with kernel 2.6.24-19?

Please don't forget to answer.

Revision history for this message
Launchpad Janitor (janitor) wrote : This bug is now reported against the 'linux' package

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this bug to the new "linux" package. However, development has already begun for the upcoming Intrepid Ibex 8.10 release. It would be helpful if you could test the upcoming release and verify if this is still an issue - http://www.ubuntu.com/testing . If the issue still exists, please update this report by changing the Status of the "linux" task from "Incomplete" to "New". We appreciate your patience and understanding as we make this transition. Thanks!

Revision history for this message
jcfp (jcfp) wrote :

No longer using Ubuntu on machines with this hardware. Too bad a fully functional driver gets crippled and even removed in favour of one in beta state.

Changed in linux:
status: Incomplete → Invalid
Revision history for this message
In , alan (alan-linux-kernel-bugs) wrote :

Closing out old bugs

Changed in linux:
status: Incomplete → Invalid
Changed in linux:
importance: Unknown → Medium
Revision history for this message
In , ucelsanicin (ucelsanicin-linux-kernel-bugs) wrote :

------8<-------
 1 size_t fwrite(const void * __restrict ptr, size_t size,
 2 size_t nmemb, register FILE * __restrict stream)
 3 {
 4 size_t retval;
 5 __STDIO_AUTO_THREADLOCK_VAR;
 6
 7 > __STDIO_AUTO_THREADLOCK(stream);
 8
 9 retval = fwrite_unlocked(ptr, size, nmemb, stream);
10
11 __STDIO_AUTO_THREADUNLOCK(stream);
12
13 return retval;
14 }
------>8-------

Here, we are at line 7. Using the "next" command leads nowhere. However,
setting a breakpoint on line 9 and issuing "continue" works.

Looking at the assembly instructions reveals that we're dealing with the
critical section entry code [1] that should never be interrupted, in this
case by the debugger's implicit breakpoints:

------8<-------
  ...
1 add_s r0,r13,0x38
2 mov_s r3,1
3 llock r2,[r0] <-.
4 brne.nt r2,0,14 --. |
5 scond r3,[r0] | |
6 bne -10 --|--'
7 brne_s r2,0,84 <-'
  ...
------>8-------

Lines 3 until 5 (inclusive) are supposed to be executed atomically. Therefore,
GDB should never (implicitly) insert a breakpoint on lines 4 and 5, else the
program will try to acquire the lock again by jumping back to line 3 and
gets stuck in an infinite loop.

The solution is to make GDB aware of these patterns so it inserts breakpoints
after the sequence -- line 6 in this example.
