sky2 transmit timeout and soft lockup detected on CPU#0!

Bug #83009 reported by Leon van der Ree on 2007-02-03
Affects: Linux. Status: Fix Released. Importance: High.
Affects: linux (Ubuntu). Importance: Undecided. Assigned to: Unassigned.
Affects: linux-source-2.6.20 (Ubuntu). Importance: Medium. Assigned to: Unassigned.

Bug Description

My system seems to crash after a while, with kernel 2.6.20-6-generic

I think it is related to something with the network, see this log:

Feb 3 07:47:48 leon-desktop dhclient: DHCPACK from 192.168.0.100
Feb 3 07:47:48 leon-desktop dhclient: bound to 192.168.0.13 -- renewal in 251 seconds.
Feb 3 07:51:59 leon-desktop dhclient: DHCPREQUEST on eth1 to 192.168.0.100 port 67
Feb 3 07:52:41 leon-desktop last message repeated 5 times
Feb 3 07:53:38 leon-desktop last message repeated 4 times
Feb 3 07:54:10 leon-desktop last message repeated 2 times
Feb 3 07:54:15 leon-desktop kernel: [20220.995201] NETDEV WATCHDOG: eth1: transmit timed out
Feb 3 07:54:15 leon-desktop kernel: [20220.995206] sky2 eth1: tx timeout
Feb 3 07:54:15 leon-desktop kernel: [20220.995212] sky2 eth1: transmit ring 194 .. 171 report=196 done=196
Feb 3 07:54:15 leon-desktop kernel: [20220.995214] sky2 status report lost?
Feb 3 07:54:24 leon-desktop kernel: [20230.263750] BUG: soft lockup detected on CPU#0!
Feb 3 07:54:24 leon-desktop kernel: [20230.263776] [softlockup_tick+156/240] softlockup_tick+0x9c/0xf0
Feb 3 07:54:24 leon-desktop kernel: [20230.263794] [update_process_times+51/128] update_process_times+0x33/0x80
Feb 3 07:54:24 leon-desktop kernel: [20230.263802] [smp_apic_timer_interrupt+112/128] smp_apic_timer_interrupt+0x70/0x80
Feb 3 07:54:24 leon-desktop kernel: [20230.263809] [apic_timer_interrupt+40/48] apic_timer_interrupt+0x28/0x30
Feb 3 07:54:24 leon-desktop kernel: [20230.263820] [_spin_lock_bh+18/32] _spin_lock_bh+0x12/0x20
Feb 3 07:54:24 leon-desktop kernel: [20230.263827] [<f8aa66f5>] sky2_tx_timeout+0xf5/0x1c0 [sky2]
Feb 3 07:54:24 leon-desktop kernel: [20230.263843] [dev_watchdog+0/208] dev_watchdog+0x0/0xd0
Feb 3 07:54:24 leon-desktop kernel: [20230.263848] [dev_watchdog+193/208] dev_watchdog+0xc1/0xd0
Feb 3 07:54:24 leon-desktop kernel: [20230.263854] [run_timer_softirq+303/416] run_timer_softirq+0x12f/0x1a0
Feb 3 07:54:24 leon-desktop kernel: [20230.263861] [<f8898f02>] usb_hcd_irq+0x22/0x60 [usbcore]
Feb 3 07:54:24 leon-desktop kernel: [20230.263890] [__do_softirq+130/256] __do_softirq+0x82/0x100
Feb 3 07:54:24 leon-desktop kernel: [20230.263899] [do_softirq+85/96] do_softirq+0x55/0x60
Feb 3 07:54:24 leon-desktop kernel: [20230.263905] [smp_apic_timer_interrupt+117/128] smp_apic_timer_interrupt+0x75/0x80
Feb 3 07:54:24 leon-desktop kernel: [20230.263910] [apic_timer_interrupt+40/48] apic_timer_interrupt+0x28/0x30
Feb 3 07:54:24 leon-desktop kernel: [20230.263921] [mwait_idle_with_hints+70/96] mwait_idle_with_hints+0x46/0x60
Feb 3 07:54:24 leon-desktop kernel: [20230.263930] [cpu_idle+73/208] cpu_idle+0x49/0xd0
Feb 3 07:54:24 leon-desktop kernel: [20230.263936] [start_kernel+863/1056] start_kernel+0x35f/0x420
Feb 3 07:54:24 leon-desktop kernel: [20230.263943] [unknown_bootoption+0/608] unknown_bootoption+0x0/0x260
Feb 3 07:54:24 leon-desktop kernel: [20230.263953] =======================
Feb 3 07:57:45 leon-desktop kernel: [20431.473423] Core dump to |/usr/share/apport/apport.29154 pipe failed
Feb 3 08:17:43 leon-desktop -- MARK --
Feb 3 08:37:43 leon-desktop -- MARK --
Feb 3 08:57:44 leon-desktop -- MARK --
Feb 3 09:17:44 leon-desktop -- MARK --
Feb 3 09:37:44 leon-desktop -- MARK --
Feb 3 09:57:44 leon-desktop -- MARK --
Feb 3 10:17:44 leon-desktop -- MARK --
Feb 3 10:37:45 leon-desktop -- MARK --
Feb 3 10:57:45 leon-desktop -- MARK --
Feb 3 11:11:03 leon-desktop syslogd 1.4.1#20ubuntu3: restart.

I don't know what else I have to report so please ask me if you need more info.

Lauri Kotilainen (rytmis) wrote :

Happens to me on 2.6.20-8.

Feb 16 07:25:58 reactor kernel: [ 628.004000] BUG: soft lockup detected on CPU#0!
Feb 16 07:25:58 reactor kernel: [ 628.004000] [softlockup_tick+156/240] softlockup_tick+0x9c/0xf0
Feb 16 07:25:58 reactor kernel: [ 628.004000] [update_process_times+51/128] update_process_times+0x33/0x80
Feb 16 07:25:58 reactor kernel: [ 628.004000] [smp_apic_timer_interrupt+112/128] smp_apic_timer_interrupt+0x70/0x80
Feb 16 07:25:58 reactor kernel: [ 628.004000] [apic_timer_interrupt+40/48] apic_timer_interrupt+0x28/0x30
Feb 16 07:25:58 reactor kernel: [ 628.004000] [try_to_del_timer_sync+7/80] try_to_del_timer_sync+0x7/0x50
Feb 16 07:25:58 reactor kernel: [ 628.004000] [del_timer_sync+14/32] del_timer_sync+0xe/0x20
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a130bf>] MlmeAuthReqAction+0x5f/0x1e0 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a07b58>] MlmeQueueFull+0x28/0x40 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a07ccc>] MlmeEnqueue+0xbc/0xf0 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [do_timer+564/2080] do_timer+0x234/0x820
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a129b0>] AuthTimeout+0x0/0x40 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a0ced3>] CntlWaitAuthProc+0x73/0x110 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a129b0>] AuthTimeout+0x0/0x40 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a047de>] StateMachinePerformAction+0x1e/0x30 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [<f8a09481>] MlmeHandler+0x151/0x170 [rt61]
Feb 16 07:25:58 reactor kernel: [ 628.004000] [run_timer_softirq+303/416] run_timer_softirq+0x12f/0x1a0
Feb 16 07:25:58 reactor kernel: [ 628.004000] [timer_interrupt+89/176] timer_interrupt+0x59/0xb0
Feb 16 07:25:58 reactor kernel: [ 628.004000] [__do_softirq+130/256] __do_softirq+0x82/0x100
Feb 16 07:25:58 reactor kernel: [ 628.004000] [do_softirq+85/96] do_softirq+0x55/0x60
Feb 16 07:25:58 reactor kernel: [ 628.004000] [smp_apic_timer_interrupt+117/128] smp_apic_timer_interrupt+0x75/0x80
Feb 16 07:25:58 reactor kernel: [ 628.004000] [apic_timer_interrupt+40/48] apic_timer_interrupt+0x28/0x30
Feb 16 07:25:58 reactor kernel: [ 628.004000] [default_idle+0/96] default_idle+0x0/0x60
Feb 16 07:25:58 reactor kernel: [ 628.004000] [native_safe_halt+2/16] native_safe_halt+0x2/0x10
Feb 16 07:25:58 reactor kernel: [ 628.004000] [default_idle+61/96] default_idle+0x3d/0x60
Feb 16 07:25:58 reactor kernel: [ 628.004000] [cpu_idle+73/208] cpu_idle+0x49/0xd0
Feb 16 07:25:58 reactor kernel: [ 628.004000] [start_kernel+863/1056] start_kernel+0x35f/0x420
Feb 16 07:25:58 reactor kernel: [ 628.004000] [unknown_bootoption+0/608] unknown_bootoption+0x0/0x260

Changed in linux-source-2.6.20:
status: Unconfirmed → Confirmed
Kyle McMartin (kyle) wrote :

Could you also attach the output of "lsmod" after you see this problem?

Thanks!
 Kyle

Changed in linux-source-2.6.20:
assignee: nobody → kyle
status: Confirmed → Needs Info
Lauri Kotilainen (rytmis) wrote :

I'll see what I can do about it after work today, but I think that it's going to be a wee bit difficult, since the lockup literally locks my system up.

I don't think that any new modules are loaded when the bug manifests, so I could save the lsmod output and then try to reproduce it?

Yes, lsmod before the hang should be okay.

Also have you observed any pattern to the hangs? Does it always occur at roughly the same time after booting? Are you doing a lot of network I/O just before or whilst the hang takes place? Any information along those lines would be helpful.
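One rough way to look for a pattern is to pull the kernel uptime stamp out of each soft-lockup report in the syslog. This is only a minimal sketch, assuming the log format seen in the traces above; the function name lockup_uptimes and the log path in the usage note are illustrative, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel uptime (seconds since boot) that was
# logged with each "BUG: soft lockup detected" line in the given log file.
# Assumes syslog lines of the form:
#   <date> <host> kernel: [<uptime>] BUG: soft lockup detected on CPU#0!
lockup_uptimes() {
    sed -n 's/.*kernel: \[ *\([0-9][0-9.]*\)\] BUG: soft lockup detected.*/\1/p' "$1"
}
```

Running something like `lockup_uptimes /var/log/syslog` after a few hangs would show whether the lockups cluster around a similar uptime or are spread out.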

Lauri Kotilainen (rytmis) wrote :

OK, so here goes. I've been futzing around with my wireless NIC recently, and looks to me like the rt61(pci) module might have something to do with the hang (I'm just guesstimating but the presence of the driver seems to correlate with the hangs).

The lsmod is taken some fifteen minutes prior to the hangup -- it's hard to cause reliably, very intermittent but seems to happen most often when apt or update-manager are doing something.

Andreas Simon (andreas-w-simon) wrote :

This seems to be a bug in the sky2 network driver.

I have the same on a Gigabyte P695-DS4 rev3.3 motherboard with a Marvell 8056 chip. The higher the network traffic, the higher the probability of a crash.

Andreas Simon (andreas-w-simon) wrote :

I am removing the "needs info" status because the last request was answered.

Changed in linux-source-2.6.20:
status: Needs Info → Confirmed
Tim Gardner (timg-tpi) on 2007-03-23
Changed in linux-source-2.6.20:
assignee: kyle → ubuntu-kernel-team
importance: Undecided → Medium

I reported similar behaviour in Bug #87320.

I have a Gigabyte GA-965P-DS3 motherboard with a gigabit network interface - using the sky2 driver.

Changed in linux:
status: Unknown → In Progress
Lauri Kotilainen (rytmis) wrote :

Doesn't seem to be exclusively a sky2 issue since as far as I can tell, I've never run that specific driver (in fact, I suspect the culprit to be the rt61(pci) driver in my case).

Lauri, it could also be that your problem is a different one. At least some lockups seem to be closely related to sky2 and network traffic.

Anyway, in the 3 days I have been running linux-image-2.6.20-13-generic version 2.6.20-13.21, I haven't encountered any new lockups. Maybe the problem is already fixed in that new kernel? Do other people still have lockups with the latest kernel version?

If yes, then hopefully the next Ubuntu kernel will fix them. The changelog [1] of Ben Collins's git tree contains several updates for the sky2 driver.

[1] http://git.kernel.org/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=shortlog

Tim Gardner (timg-tpi) wrote :

Sky2 upstream updates will appear in the 2.6.20-14 kernel. Check the daily ISO at http://cdimage.ubuntu.com/daily-live/current on April 2.

Darren Albers (dalbers) wrote :

I was copying around 4 gigs of files over SMB and experienced the sky2 hang today. I am running 2.6.20-14

In the meantime I encountered several soft lockups with linux image 2.6.20-14.22 too. It seems this bug is not fixed by the latest sky2 upstream changes.

I am currently testing the Yukon driver (sk98lin) from Marvell, which can be downloaded from http://www.marvell.com/drivers/driverDisplay.do?dId=153&pId=36
It looks rock stable so far, but it's too early to say that the soft-lockup bug is not there.

Tim Gardner (timg-tpi) wrote :

I backported sky2 from 2.6.21-rc6. Let's see if it helps:

cd /lib/modules/2.6.20-14-generic/kernel/drivers/net
sudo wget -O sky2.ko http://people.ubuntu.com/~rtg/sky2.ko.2.6.20-14
sudo modprobe -r sky2
sudo modprobe sky2

Please attach dmesg output if you continue to have failures.

Tim Gardner (timg-tpi) wrote :

Uh, be sure to save your original sky2.ko before running 'wget'.
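The backup step could be sketched roughly as follows. This is an illustration only: the function name backup_module and the .orig suffix are assumptions made here, not something taken from the instructions above.

```shell
#!/bin/sh
# Hypothetical helper: keep a copy of the stock module before overwriting it.
# backup_module DIR NAME copies DIR/NAME to DIR/NAME.orig.
# An existing backup is never overwritten, so the first (stock) copy survives
# even if the helper is run again after the module has been replaced.
backup_module() {
    dir=$1
    name=$2
    if [ ! -f "$dir/$name" ]; then
        echo "no $name in $dir" >&2
        return 1
    fi
    if [ -e "$dir/$name.orig" ]; then
        echo "backup already present: $dir/$name.orig" >&2
        return 0
    fi
    cp -p "$dir/$name" "$dir/$name.orig"
}
```

On the affected machine this would be run as root before the wget step, for example `backup_module /lib/modules/2.6.20-14-generic/kernel/drivers/net sky2.ko`. Copying sky2.ko.orig back over sky2.ko and reloading the module should then restore the stock driver.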

Tim Gardner (timg-tpi) on 2007-04-12
Changed in linux-source-2.6.20:
assignee: ubuntu-kernel-team → timg-tpi
Darren Albers (dalbers) wrote :

I just grabbed it, I can generally trigger it with 4-5 gigs back and forth over smb or SSH so I will try that now.

Thank you!

Darren Albers (dalbers) wrote :

Unfortunately it still seemed to crash, after copying about 9 gigs back and forth I lost all network connectivity. Removing the sky2 driver and adding it back brought it back to life.

Is there a chance it is related to the cheap switch I am using? I have an off-the-shelf D-Link gigabit switch. I can probably swap it out for a Cisco3550 (though that would only be 100 Mbit) or another no-name SOHO switch to test.

An odd note, I was tailing syslog, kernel.log, and messages and saw no errors. My connection just died...

Tim Gardner (timg-tpi) wrote :

It would be interesting to change switches, though a 100Mbit switch is much less stressful. Also make sure your network adapter is well ventilated.

Darren Albers (dalbers) wrote :

I can take a look at grabbing a new gig switch tomorrow. The NIC is in a Mac Mini, but the unit itself is well ventilated.

DrCore (launchpad-drsdre) wrote :

It would be weird if it was related to network equipment behind the NIC. Before Feisty (i.e. on Edgy), network connectivity worked basically flawlessly.

I am usually able to get connectivity back after unplugging and replugging the physical connection. The network-manager will detect this and will retrieve a DHCP lease.

Tim, thank you for your work. Sadly, like Darren, I too encountered soft-lockups with that sky2.ko.2.6.20-14 (as with a complete vanilla 2.6.21-rc6 kernel). It's slightly better with this version. Removing the module and reloading it again makes the network work again. With earlier versions of the driver I often had to reboot because nearly everything segfaulted after a soft-lockup (unable to handle kernel paging request at virtual address xxxxxxxx).

Also I tested Marvell's non-free open-source upstream driver for several days and stressed it with downloading several Linux distro images while keeping the CPU high with compile jobs and folding@home. No softlockups, no nothing. That driver seems to be stable. If anyone wants to install it (download link in a comment above): the first line of the install.sh script needs to be changed from
#!/bin/sh
to
#!/bin/bash
to make it work. You also need the kernel headers package installed for your kernel, e.g. linux-headers-2.6.20-14-generic, and a link from /usr/src/linux to these headers, for example
$ sudo ln -s /usr/src/linux-headers-2.6.20-14-generic /usr/src/linux
After that, the install.sh script should work fine. Choose option 1 for 'installation'; it will afterwards offer to either remove or disable (rename) the existing sky2 driver. Both options should work.
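The two preparation steps (shebang change and headers link) could be scripted roughly like this. It is only a sketch: the function names fix_shebang and link_headers are made up for illustration, and `sed -i` and `ln -n` assume the GNU tools that Ubuntu ships.

```shell
#!/bin/sh
# Hypothetical helper: if the first line of FILE is exactly "#!/bin/sh",
# rewrite it to "#!/bin/bash"; any other first line is left untouched.
fix_shebang() {
    sed -i '1s|^#!/bin/sh$|#!/bin/bash|' "$1"
}

# Hypothetical helper: point SRC_ROOT/linux at the headers for RELEASE,
# replacing an existing symlink if there is one.
link_headers() {
    src_root=$1   # normally /usr/src
    release=$2    # e.g. the output of uname -r
    ln -sfn "$src_root/linux-headers-$release" "$src_root/linux"
}
```

Typical use (as root) would be `fix_shebang install.sh` in the unpacked driver directory, followed by `link_headers /usr/src "$(uname -r)"`, and then running install.sh as described.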

Tim Gardner (timg-tpi) on 2007-04-20
Changed in linux-source-2.6.20:
assignee: timg-tpi → ubuntu-kernel-team

Hello, I had the same problem with a Marvell ethernet controller.
Before reading this bug thread I upgraded from Edgy to Feisty, but the problem still occurred every few hours.
I installed the driver from http://www.marvell.com/drivers/driverDisplay.do?dId=153&pId=36 as described by Andreas Simon.
The system uptime is now about 4 days. No errors have occurred since installing the driver.
I think this is the solution for Marvell controllers until the sky2 bug is fixed.

Controller-Info:

03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22)
        Subsystem: Giga-byte Technology Marvell 88E8053 Gigabit Ethernet Controller (Gigabyte)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at f8000000 (64-bit, non-prefetchable) [size=16K]
        Region 2: I/O ports at 9000 [size=256]
        [virtual] Expansion ROM at f9300000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [5c] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
                Address: 0000000000000000 Data: 0000
        Capabilities: [e0] Express Legacy Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s unlimited, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 2048 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
                Link: Latency L0s <256ns, L1 unlimited
                Link: ASPM Disabled RCB 128 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x1

root@...:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Link detected: yes

docunext (docunext-staff) wrote :

I am experiencing similar problems. I downloaded Marvell's sk98lin driver, compiled and installed without problems besides those already mentioned (headers symbolic link and #!/bin/bash).

Perhaps related is my inability to resume from S3 via wake on lan using other ethernet card - nvidia gige on mainboard. WOL works from S5 with said card, and resume from keyboard works from S3, with no LAN connectivity. When using sky2 driver, resume from S3 from keyboard would result in no interfaces being detected at all, whereas with sk98lin both are detected, but unusable.

2.6.20-16-generic (root@terranova) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4))

eth0: forcedeth.c: subsystem: 01019:1b51 bound to 0000:00:0a.0

Ethernet controller: Marvell Technology Group Ltd. 88E8053
eth1: Marvell Yukon 88E8053 Gigabit Ethernet Controller

Seems to be working well when I pump lots of iperf through it:

          inet6 addr: fe80::200:5aff:fe00:305/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2592311 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5390320 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:181493629 (173.0 MiB) TX bytes:3873149222 (3.6 GiB)
          Interrupt:20 Memory:fdbfc000-0

Changed in linux:
status: In Progress → Invalid
Darren Albers (dalbers) wrote :

The upstream bug report is now being shown as Resolved Code_Fix. Is this something we can see backported to 2.6.20 or do we need to wait until Gutsy to get the fix in?

Thanks!

At least the sky2-tx-kick.patch applies fine to the 2.6.20 kernel. But I can't test it, because currently I don't have access to a sky2 network controller.

Changed in linux:
status: Invalid → Fix Released
eze80 (ezequiel-pozzo) wrote :

Hello, I'm experiencing a similar problem with my 88E8053

I read there's a fix released. How do I apply the patch? I have Ubuntu Feisty up to date and using the linux kernel 2.6.20-16-generic.

I never noticed this problem until recently, when I started transferring big files over Samba or using VNC on my LAN. That's probably because the data transfer rates involved now are higher than the ones I get when using the internet.

Ezequiel

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this bug to the new "linux" package. However, development has already begun for the upcoming Intrepid Ibex 8.10 release. It would be helpful if you could test the upcoming release and verify if this is still an issue - http://www.ubuntu.com/testing . If the issue still exists, please update this report by changing the Status of the "linux" task from "Incomplete" to "New". We appreciate your patience and understanding as we make this transition. Thanks!

I found the 'transmit timed out' issue solved totally by moving from firmware 1.9 to 2.2 on my 88E8053; others have reported success with firmware updates to their 8056 NICs also. Contact your motherboard vendor for the update.

You can find the current firmware version using: http://www.marvell.com/drivers/files/yukondg_v6.53.4.3.zip

I posted more info about this at:
http://marc.info/?l=linux-netdev&m=121967539303140&w=2

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Stefan Bader (smb) wrote :

According to the last comment about updating the firmware solving the problem, I am closing this bug.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux:
importance: Unknown → High