Random pauses when transferring data at gigabit speeds with forcedeth driver

Bug #130075 reported by Christian Iversen
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
linux (Ubuntu) | Fix Released | Undecided | Unassigned | (none)
linux-source-2.6.20 (Ubuntu) | Won't Fix | Undecided | Unassigned | (none)
linux-source-2.6.22 (Ubuntu) | Fix Released | Medium | Unassigned | (none)

Bug Description

Binary package hint: linux-source-2.6.20

More or less randomly, I'm seeing this in the kernel log:

...
[11994.316856] eth0: too many iterations (6) in nv_nic_irq.
[12020.522236] eth0: too many iterations (6) in nv_nic_irq.
[12179.254035] eth0: too many iterations (6) in nv_nic_irq.
[12180.435612] eth0: too many iterations (6) in nv_nic_irq.
[12194.004482] eth0: too many iterations (6) in nv_nic_irq.
[12310.463206] eth0: too many iterations (6) in nv_nic_irq.
[12310.475195] eth0: too many iterations (6) in nv_nic_irq.
[12431.413466] eth0: too many iterations (6) in nv_nic_irq.
...

I'm transferring (nfs+rsync) some large files (16GB data total) between a client and server both running forcedeth.

I am in fact seeing this error on BOTH machines, and the messages are definitely linked to the pauses. That would make sense if it's some kind of IRQ issue requiring a restart of the card.

The two machines are both amd64 dual-core, but they are otherwise different: completely different motherboards, expansion cards, etc.

I can't say that I understand the message completely (what exactly is iterated over?), but it could simply be that the forcedeth driver has too small a timeout, buffer, or some such limit. These timeouts only seem to happen at gigabit speeds, which top out at about 50-60 MB/s on my systems.
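
A quick way to check whether the driver exposes a tunable limit for this loop is to list its module parameters (a sketch; the exact parameter set depends on the forcedeth version shipped with the kernel):

# list forcedeth's tunable module parameters and their descriptions
modinfo -p forcedeth
# count how many times the message has appeared so far
dmesg | grep -c 'too many iterations'

On recent kernels modinfo should list a max_interrupt_work parameter (the maximum number of events handled per interrupt), which looks like the iteration bound the message refers to.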

Revision history for this message
Christian Iversen (chrivers) wrote :

Please disregard the comment "more or less randomly". It should be "more or less randomly when the NIC is heavily utilized".

Revision history for this message
Christian Iversen (chrivers) wrote :

Additionally, this bug may be related to #107215.

Revision history for this message
Brian Murray (brian-murray) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. The issue that you reported is one that should be reproducible with the live environment of the Desktop CD of the development release - Gutsy Gibbon. It would help us greatly if you could test with it so we can work on getting it fixed in the actively developed kernel. You can find out more about the development release at http://www.ubuntu.com/testing/ . Thanks again and we appreciate your help.
Additionally, are both systems using the forcedeth driver?
Could you please add the full output of 'sudo lspci -vvn' as an attachment to your bug report?
Could you please add the full contents of your kernel log?

Changed in linux-source-2.6.22:
assignee: nobody → brian-murray
status: New → Incomplete
Revision history for this message
David Burgess (apt-get) wrote :

I'm getting this message in Edubuntu Feisty AMD64 while doing tests with iperf on a gigabit link. Both iperf client and server are using the forcedeth driver on identical NICs/motherboards. The failing driver is on the machine running the OS mentioned above; I've had no trouble on the second machine, which runs Ubuntu Server 7.04 AMD64.

I've found some potentially related threads in other forums:

http://lists.debian.org/debian-amd64/2006/08/msg00274.html
http://www.nvnews.net/vbulletin/showthread.php?t=57791&page=11

Here's the command that causes it for me:

iperf -c 192.168.0.195 -i 2 -f mbps -t 30 -d -P 16

-P 8 doesn't seem to cause a problem.

Here's the partial output of dmesg when the link goes down:

[ 1235.537521] eth1: too many iterations (6) in nv_nic_irq.
[ 1247.641652] NETDEV WATCHDOG: eth1: transmit timed out
[ 1247.641657] eth1: Got tx_timeout. irq: 00000036
[ 1247.641660] eth1: Ring at 7b5ec000: next 2597442 nic 2597186
[ 1247.641662] eth1: Dumping tx registers
[ 1247.641667] 0: 00000036 000000ff 00000003 017203ca 00000000 00000000 00000000 00000000

I tried bringing down the link, reloading the forcedeth module and bringing the link back up, but then dmesg | tail gives:

[ 1301.525982] eth1: forcedeth.c: subsystem: 01043:816a bound to 0000:00:14.0
[ 1301.553024] eth1: no link during initialization.
[ 1301.553263] ADDRCONF(NETDEV_UP): eth1: link is not ready

although the interface is connected and lit up.

As per a recommendation in one of the threads linked above, I tried adding the following option to the forcedeth driver at boot, but I haven't succeeded in getting it to take effect:

"options forcedeth max_interrupt_work=20"

I've also attached the output of sudo lspci -vvn.
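
For reference, the way I understand module options are normally applied (a sketch, assuming the option file /etc/modprobe.d/forcedeth; note that reloading the module briefly takes the interface down):

# put the option where modprobe will pick it up on the next load
echo "options forcedeth max_interrupt_work=20" | sudo tee /etc/modprobe.d/forcedeth
# reload the driver so the new option is read
sudo modprobe -r forcedeth && sudo modprobe forcedeth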

Revision history for this message
BullCreek (jeff-openenergy) wrote :

I just ran into this bug and it almost cost me data with the latest Feisty! For some reason, when the forcedeth ethernet adapter locked up, it also locked up two out of six of my SATA ports (I'm presuming because of some PCI bus issue). Luckily it was RAID 6, so the array is still functional, but it could have been a disaster if it was RAID 5. Similar situation to the above - I was copying a bunch of large media files over to another machine over gigabit NFS (and being pleasantly surprised at the 70+ MB/s I was seeing without jumbo frames or any real tuning) before it locked up.

Will max_interrupt_work=20 get rid of the problem or just mask it, and what is the best way to configure this fix on Feisty? I don't want this to happen again, as rebuilding the whole 3+ TB array is a very time-consuming operation.

I've attached my dmesg, syslog and lspci -vvn output. The syslog items of interest that show what happened, and in what order, start around 10:33:57.
I hope someone comes up with a fix - I can try it on Gutsy, although the last daily build I downloaded wouldn't even boot. When is Tribe 6 or another pseudo-tested build going to come out?

Revision history for this message
BullCreek (jeff-openenergy) wrote : Re: [Bug 130075] Re: Random pauses when transferring data at gigabit speeds with forcedeth driver

FWIW, I downloaded the Gutsy git kernel, built it, and replaced the forcedeth.ko on my Feisty systems with the newer one. I still get a few of the:

eth0: too many iterations (6) in nv_nic_irq messages

but it doesn't appear to hang anymore under heavy load. I'll be stress testing it today by copying about 1.5TB of data over NFS, so wish me luck.
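
For anyone wanting to try the same thing, this is the general shape of swapping in a rebuilt forcedeth.ko (a sketch only; it assumes you have the matching linux-headers package installed, a directory containing the newer forcedeth.c with a one-line Makefile reading "obj-m := forcedeth.o", and that the source actually compiles against the running kernel's headers):

# build the module out-of-tree against the running kernel
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# back up the stock module, install the new one, refresh module dependencies
sudo cp /lib/modules/$(uname -r)/kernel/drivers/net/forcedeth.ko ~/forcedeth.ko.stock
sudo cp forcedeth.ko /lib/modules/$(uname -r)/kernel/drivers/net/
sudo depmod -a
# reload the driver (this takes the interface down briefly)
sudo modprobe -r forcedeth && sudo modprobe forcedeth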

Revision history for this message
BullCreek (jeff-openenergy) wrote :

An hour or so into the transfer, using the Gutsy forcedeth.ko, the machine locked up and eth0 stopped responding, just like it was doing with the default Feisty version of the module. This is a fairly serious bug, so I hope someone can escalate it so it gets looked at and fixed at least for Gutsy, if not Feisty. I notice that the version of the driver in Feisty is 59, the version in Gutsy is 60, and the latest version available from NVIDIA is 62.

The nvidia package is available from here:

http://www.nvidia.com/object/linux_nforce_1.23.html

I'm considering dropping that version in, and rebuilding my gutsy packages to see what happens. Thoughts?
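
To check which forcedeth version a given machine is actually running (a sketch; it assumes the ethtool package is installed and the interface is eth0):

# report driver name, version and bus info for the interface
sudo ethtool -i eth0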

Changed in linux-source-2.6.22:
assignee: brian-murray → nobody
importance: Undecided → Medium
status: Incomplete → Confirmed
Revision history for this message
BullCreek (jeff-openenergy) wrote :

I've played around with it a bit more, and even with options max_interrupt_work=20 set in /etc/modprobe.d/forcedeth, it still happens. The only additional thing to report is that I happened upon an easy way to reproduce the hangup every time:

1. Login to machine1 which has a gigabit ethernet card in it and is attached to a gigabit switch. Run "iperf -s"

2. Login to a machine that has the gigabit forcedeth adapter in it - also hooked up to a gigabit switch. Run "iperf -c machine1 -d".

The command in step two generates a lot of traffic, since it tells iperf to test speeds in both directions at once. On both my forcedeth systems it will lock up the interface almost immediately, requiring a full reboot (i.e. /etc/init.d/networking restart has no effect). Note that the problem only occurs under heavy load - if you just run iperf -c machine1 without the -d option, it usually won't lock up.

Should we report this bug to the kernel mailing list? I see that back in August someone reported similar behavior in 2.6.22.1 but said that adding the forcedeth.max_interrupt_work=20 option to their boot line fixed it (FWIW, I tried that and just got an "invalid option" error with Feisty). Here is the thread:

http://lkml.org/lkml/2007/8/5/92

Does anyone know how to tell if the options in modprobe.d are in effect? dmesg doesn't show anything and lsmod doesn't show any flags. I have tried putting the appropriate line in /boot/grub/menu.lst, in /etc/modprobe.d/options, and in /etc/modprobe.d/forcedeth, and as best as I can tell none of them has had any effect.
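
The closest checks I can think of are the following (a sketch - they only show whether the configuration is visible, not whether the driver actually honoured the value):

# dump the modprobe configuration that is actually in effect, filtered to forcedeth
modprobe -c | grep forcedeth
# if the driver exports the parameter to sysfs (not all parameters are exported), read the live value
cat /sys/module/forcedeth/parameters/max_interrupt_work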

P.S. I'm now wishing I had spent a little more and bought Intel boards with Intel NICs.

Revision history for this message
David Burgess (apt-get) wrote :

iperf -c machine1 -d doesn't lock mine up, but adding "-P 2" (any number of parallel threads) will do it every time, although not necessarily right away.

And yes, this makes the forcedeth driver virtually useless on a gigabit network. My findings were the same: only a reboot will bring the interface back online.

db

Revision history for this message
Steen "Miravlix" Poulsen (miravlix) wrote :

You're actually fighting two issues here.

The "too many iterations" part is harmless and can be ignored; it has nothing to do with why the system crashes.

Your system is crashing because an IRQ gets disabled. When an IRQ gets disabled, anything using it stops talking to the kernel; if that's your hard disk controller, the machine isn't going to last long before it dies.

Disable SMP to fix the problem...

Now, SMP is not the true problem here, but when you disable SMP you also remove things like MSI/MSI-X, APIC, IO-APIC and other SMP-related features that are *known* to be broken on many motherboards. The reason I don't say to disable the sub-features individually is that SMP on those same motherboards might be just as broken without MSI/MSI-X and IO-APIC enabled, and can still crash.

Bottom line: many cheap motherboards + SMP = crap. (Especially Nvidia-based ones, though I don't have information pointing the finger at the Nvidia chipset itself; from the limited information I have, it's the motherboard vendors that have implemented the Nvidia chipset badly and broken it.)
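
If you want to see the disabled-IRQ scenario for yourself, one thing to check when the hang hits (a sketch; interface names and IRQ numbers will differ per machine) is whether the NIC's interrupt counter stops climbing and what else shares that IRQ line:

# show which IRQ line(s) the NIC uses and what shares them
grep eth /proc/interrupts
# watch the counters; if the kernel has disabled the IRQ they stop increasing
watch -n1 'grep eth /proc/interrupts'

The kernel also normally logs lines like "irq N: nobody cared" and "Disabling IRQ #N" when it shuts an interrupt line off, so those are worth grepping dmesg for.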

Revision history for this message
BullCreek (jeff-openenergy) wrote :

I tried to boot with the nosmp option to test this, and 2.6.20-16-generic just hangs shortly into the boot (no output on any of the virtual consoles). Any ideas on how to work around that? I don't really see nosmp as a viable solution, but I'm willing to at least validate it as an option for some. If this is indeed a problem with nForce motherboards, I think it probably needs to get more attention and documentation - I certainly was never aware of it, and sort of assumed that boards this plentiful couldn't be bad.

FWIW, I do notice that Asus uses Broadcom LAN rather than Nvidia's on their AMD server boards that otherwise use the rest of the nForce chipset. I wonder if this is why? Miravlix, do you have any links to threads where this has been discussed in a public forum before, or are these just your personal findings?

Revision history for this message
David Burgess (apt-get) wrote :

Do I understand correctly that this will be an issue with my motherboard as long as I am using more than one core of a multicore CPU?

How in the world did my motherboard make it onto AMD's list of recommended motherboards for a dual-core AMD CPU?

And how can Nvidia market this as their stable "Business Platform"? I'm feeling cheated. If Steen's post is correct, then my motherboard clearly doesn't do what it claims to do.

And if this is truly a hardware issue then we should see the problem occurring in other SMP-using OSes too, right? Has anybody seen problems or reports of problems with this in Windows, for example?

This is disappointing. I thought I did my research.

db

Revision history for this message
BullCreek (jeff-openenergy) wrote :

> And how can Nvidia market this as their stable "Business Platform"?
> I'm feeling cheated. If Steen's post is correct, then my motherboard
> clearly doesn't do what it claims to do.

I'm ordering a couple of these to see if just disabling the nforce NICs on these boards will result in a stable system under heavy NFS loads. I'll let everyone know how it works out.

http://www.newegg.com/Product/Product.asp?item=N82E16833166015

If it doesn't work - I'm considering returning the boards and procs for Intel - although that will be a lot of trouble for all parties involved.

Revision history for this message
David Burgess (apt-get) wrote :

I just returned one of these (link below) for RMA. Now I'm wondering if my problems with it were actually due to this bug. I suspect not entirely, though, as this card sometimes didn't work right off a fresh boot.

http://www.intel.com/network/connectivity/products/pro1000pt_desktop_adapter.htm

Revision history for this message
BullCreek (jeff-openenergy) wrote :

I've been testing with the latest Gutsy on these boards, and the moderately good news is that with 2.6.22-12-generic it seems to take much longer for the machine to lock up. Using Feisty's 2.6.20 kernel and the aforementioned iperf test, it would hang almost instantly. With Gutsy, it takes 30-60 minutes - but it still hangs.

Revision history for this message
BullCreek (jeff-openenergy) wrote :

More testing with these boards and latest Gutsy (2.6.22-14-generic x86) yields the following observations above and beyond those previously reported:

1. It still locks up under Gutsy latest (I can't tell any change between 2.6.22-12 and 2.6.22-14). Someone should probably report this as a problem to the kernel mailing list or whoever maintains the forcedeth driver.

2. It isn't hardware, though. I booted a XenSource 4.0.1 install on the system (partly to test a different distro, and partly to see if they had support for these nForce boards, because SuperMicro makes a dual-socket Barcelona board based on the big-brother workstation version of this NF570 chipset that looks fairly tasty, except it has the same dual nForce LAN setup). The Xen kernel gives the same spurious "too many iterations" messages but, unlike Ubuntu, refused to lock up even after hours of serious abuse. XenSource's kernel is a stripped-down version of 2.6.18 based on CentOS 4.4, I believe.

3. FWIW, the cheap PCIe Marvell 88E8053-based NICs from Rosewill mentioned earlier do seem to work reliably in Gutsy (although the requisite sky2 module is a nightmare on other platforms, including Xen). I've transferred terabytes of data both ways with them via NFS in Gutsy with no problem other than high CPU usage (see next item).

4. Neither the Marvell nor the nForce hardware can hold a candle to Intel as far as CPU offloading goes. For troubleshooting this problem, I used an old 1.8 GHz P4 (single core) system running Feisty with an Intel 82547EI gigabit adapter on the motherboard - it never goes above 50% CPU usage serving or pulling at 1 Gbps - whereas the Marvell and nForce solutions routinely use most of what a 2.1 GHz dual-core Athlon 64 X2 system has to offer, just to run iperf! I know I could make this CPU usage go down by enabling jumbo frames across the board, but that introduces a whole other list of compatibility problems I don't want to face.

Long story short: you get what you pay for, I guess. It's a shame, because the 6-channel SATA2 controller on these nForce 570 boards seems to perform quite nicely and reliably using mdadm and RAID 5 or RAID 6. If the dual nForce LAN worked, the board would be quite a steal at $80, plus another $80 or so for a fast Athlon X2 processor - but unfortunately, it doesn't.

Revision history for this message
David Burgess (apt-get) wrote :

I have to agree that this doesn't look like a hardware problem. If it were, then how would one explain different kernels behaving differently?

Parenthetically, I have posted some throughput benchmarks for a handful of NICs on the m0n0wall forum, for anyone interested. And no, I have never seen the nForce GbE lose its interrupt or seize up for any reason in my m0n0walls, even under full bench load - but then again, m0n0wall doesn't use SMP either.

http://forum.m0n0.ch/index.php/topic,875.0.html

db

Revision history for this message
DrCore (launchpad-drsdre) wrote :

The same problem here, running the latest Gutsy (2.6.22-14-generic, fully patched), standard config with a fixed IP. While watching an HD x264 stream (a continuous stream of 1 GB over 45 minutes), the network crashed without being able to recover (neither /etc/init.d/networking restart nor ifdown eth0; ifup eth0 helped). Only a reboot worked.

In the crashlog attachment you can see it dumping the tx ring/registers.

Revision history for this message
DrCore (launchpad-drsdre) wrote :
Revision history for this message
nikosapi (nikosapi) wrote :

I'm running Gutsy with all the latest updates (Ubuntu 2.6.22-14.46-generic) and I experience a similar problem. When I try to transfer large files (~1 GB) from a Samba share, the PC will do one of two things: either completely lock up, or the NIC will still work but with a huge amount of latency (1 to 2 seconds on a 100 Mbit local network).
I've tried adding 'max_interrupt_work=20 msi=0 msix=0' to the module's options but all this does is allow me to transfer one or two extra gigabytes of data before exhibiting the aforementioned behavior.

Revision history for this message
Alexander Gruber (freakyjoe00) wrote :

For me the error no longer occurs when I load the module with

options forcedeth max_interrupt_work=200

but it seems to be ignored during boot, so I have to do the following in /etc/rc.local:

rmmod forcedeth && modprobe forcedeth && /etc/init.d/networking restart

now iperf runs at full gigabit speed without any errors!
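
A guess as to why the option is ignored at boot (an assumption on my part, not something I've verified): forcedeth may be loaded from the initramfs, which carries its own copy of the modprobe configuration, so the image needs to be regenerated after editing the options file:

# rebuild the initramfs for the running kernel so it picks up the new /etc/modprobe.d entry
sudo update-initramfs -u

After a reboot the rc.local reload should no longer be necessary, if that was indeed the cause.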

Changed in linux-source-2.6.20:
status: New → Confirmed
Revision history for this message
Rodrigo Azevedo (rodrigoams) wrote :

I have the same problem both with NFS or PVFS high usage.
Disable SMP isn't possible! Actually I have 2 solution:

1) options forcedeth max_interrupt_work=200;
2) ro quiet splash noapic apic=off in grub's menu.lst.

The second still shows the message "too many iterations (6) in nv_nic_irq" but I get full bandwidth.
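
For anyone unsure where the second option goes, it is appended to the kernel line of the boot entry in /boot/grub/menu.lst (a sketch; the kernel version and root=UUID=... placeholder will differ on your system):

title Ubuntu 7.10, kernel 2.6.22-14-generic
root (hd0,0)
kernel /boot/vmlinuz-2.6.22-14-generic root=UUID=... ro quiet splash noapic apic=off
initrd /boot/initrd.img-2.6.22-14-generic

Alternatively, if your menu.lst has the automagic "# kopt=" line, add "noapic apic=off" there and run sudo update-grub so the change survives kernel upgrades.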

Revision history for this message
Brian Murray (brian-murray) wrote :

I am assigning this bug to the 'ubuntu-kernel-team' per their bug policy. For future reference you can learn more about their bug policy at https://wiki.ubuntu.com/KernelTeamBugPolicies .

Changed in linux-source-2.6.20:
assignee: nobody → ubuntu-kernel-team
Changed in linux-source-2.6.22:
assignee: nobody → ubuntu-kernel-team
Revision history for this message
David Burgess (apt-get) wrote :

The fact that Rodrigo fixed this by using the "noapic" boot option led me to do some more reading on APIC, and I am convinced there is a reasonable chance that this problem is due to poor APIC support on my motherboard - which, if true, could also validate Miravlix's theory.

Unfortunately I haven't been able to test the noapic option, as my board is away on RMA; I will test it when it returns. I've read that other users, having disabled APIC on their boards, were able to re-enable it after a BIOS update.

I have also learned that in time, many great things come from the able kernel team. Here's to hoping.

db

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Hardy Heron Alpha2 release will be coming out soon (around Dec 20). It will have an updated version of the kernel. It would be great if you could test with this new release if this issue still exists. I'll be sure to update this report when Alpha2 is available. Thanks!

Changed in linux:
status: New → Incomplete
Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hardy Heron Alpha2 was recently released. It contains an updated version of the kernel. You can download and try the new Hardy Heron Alpha2 release from http://cdimage.ubuntu.com/releases/hardy/alpha-2/ . You should be able to then test the new kernel via the LiveCD. If you can, please verify if this bug still exists or not and report back your results. General information regarding the release can also be found here: http://www.ubuntu.com/testing/hardy/alpha2 . Thanks!

Revision history for this message
BullCreek (jeff-openenergy) wrote :

I haven't had time to verify the i386 Desktop build of Alpha2, but I can confirm that the AMD64 server build of Alpha2 no longer exhibits this problem. I copied 3TB of data over NFS at full speed with no problems.

----- "Leann Ogasawara" <email address hidden> wrote:
> Hardy Heron Alpha2 was recently released. It contains an updated
> version of the kernel. You can download and try the new Hardy Heron
> Alpha2 release from http://cdimage.ubuntu.com/releases/hardy/alpha-2/
> .
> You should be able to then test the new kernel via the LiveCD. If
> you
> can, please verify if this bug still exists or not and report back
> your
> results. General information regarding the release can also be found
> here: http://www.ubuntu.com/testing/hardy/alpha2 . Thanks!
>
> --
> Random pauses when transferring data at gigabit speeds with forcedeth
> driver
> https://bugs.launchpad.net/bugs/130075
> You received this bug notification because you are a direct
> subscriber
> of the bug.

Revision history for this message
Rachel Greenham (rachel-strangenoises) wrote :

I'm getting this problem too, on Athlon64 Gutsy. I get the message:

eth0: too many iterations (6) in nv_nic_irq.

And the network hangs a little while after I attempt a large transfer operation across the network.

The *different* thing is, I've had this machine for months, running Feisty, then Gutsy, and indeed Gentoo originally, and I only started getting this problem today. In fact last night I pulled about 800GB off it to another gigabit-equipped machine without any problems. The problems occurred when it came to pushing that data back onto the Ubuntu machine.

Of course, something happened in the interim: there was a hard disk upgrade and a reinstallation to a RAID 5 configuration. The installation was done with the AMD64 alternate install CD, selecting only a commandline install, whereas previously it had been done with the AMD64 desktop CD. I also doubled the memory, to 4GB. I didn't use Expert Options during the install, so anything I could have broken only by selecting that shouldn't be broken. :-) Finally, I added a PCI SATA card.

When the machine was up and running again I started the transfer going to copy the stuff back, that I'd pulled off the night before. After about 22GB on the first attempt and only 11GB on the second (before which I had removed the additional PCI SATA card, thinking that the most likely cause of new IRQ issues), the network interface stopped working, and I got the "Too many iterations" error in dmesg.

Reading through the above comments, I've tried adding "noapic" to the default options in grub, and so far it seems to be working: 60GB into the copy as I type this and no apparent issues. If it doesn't work, I'll try the Hardy Heron Alpha 2 system. However, it's worth noting that the system as of yesterday did *not* have noapic set and never showed this problem, and in both cases the kernel version was the same: 2.6.22-14-generic.

The "other machine" in these copy operations is an iMac Core Duo with Leopard, and file transfer is taking place using Appletalk/Netatalk (the latter built with SSL support). The data is being copied from (and was last night copied to) an external drive connected via Firewire 400.

The Ubuntu machine with this problem has an Asus M2N-VM DH mainboard, which is nForce430-based, and an Athlon64 x2 5200+ (2.6GHz). The RAID 5 is set up on four 500GB drives connected to the onboard 4-port nv-sata interface. There is currently one other hard drive outside the RAID connected to the JMicron (AHCI) sata interface.

(What's the downside of disabling APIC, by the way? I note that the estimated time to complete the operation seems to be 9 hours now, whereas it was 7 hours before the network crashed. For that matter, does setting the "noapic" option even do anything on a dual-core system? top is still reporting activity on both cores...)

Revision history for this message
ccc1 (cllccl-deactivatedaccount) wrote :

I can confirm this bug in Gutsy.
I'm using an Asus M2N-VM DVI mainboard. It suffices to pull some data from a fast FTP server and my box locks up completely. These lockups DON'T occur using the Gutsy live CD, nor a 7.04 live CD, so it looks like the bug was introduced by some update.

Revision history for this message
ccc1 (cllccl-deactivatedaccount) wrote :

OK... my problem seems to be a hardware issue, since many people are having this problem under XP as well.

see for example: http://vip.asus.com/forum/view.aspx?SLanguage=en-us&id=20071028232147828&board_id=1&model=M2N-VM%20DVI&page=1&count=44

Revision history for this message
Rachel Greenham (rachel-strangenoises) wrote :

Hm, well, since the last post I had another install of Gutsy. (Don't ask - mostly to apply the lessons learned in my first-ever setup of a RAID system, lessons I managed to learn without losing data!) Anyway, at the time I did that I'd forgotten all about this bug and so didn't set noapic... but it's been working fine with lots of network activity, as before.

*shrug*. go figure. :-)

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

From the last few sets of comments this bug appears to be resolved (or a hardware issue for ccc1). I'm going to close this report for now. Thanks.

Changed in linux:
status: Incomplete → Fix Released
Changed in linux-source-2.6.22:
status: Confirmed → Fix Released
Changed in linux-source-2.6.20:
status: Confirmed → Won't Fix
Revision history for this message
Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Revision history for this message
amspilot (amspilot01) wrote :

The problem still exists on my system with kernel 2.6.28-15, at speeds up to and over 500 Mbit/s, both on single links and with dual bonded 802.1q uplinks.

Revision history for this message
nentis (krisa-opensourcery) wrote :

Amspilot, interesting that after so much idle time on this ticket, you and I should encounter this bug within a week's time.

I too just experienced this bug while copying a large file (kvm disk image) via rsync; 2.6.28-13-server under Ubuntu 9.04 (Jaunty).

This is more of a "me too" post. The NIC was running at near maximum gigabit speeds (95MB/s per munin stats). Normal operation doesn't require this much bandwidth and I'll most likely test out Ubuntu 9.10 to see if KVM is stable enough to move to. Mainly because a normal update broke qcow2 support on Jaunty.

Revision history for this message
bamyasi (iadzhubey) wrote :

Interestingly enough, I started having this same ages-old problem after a recent upgrade to the 2.6.28-16-server kernel. The system performed without a single issue for three months before that, including under very heavy load. It runs Ubuntu Server 9.04 on a dual-CPU Supermicro motherboard with quad-core AMD Opteron 2389 processors installed.

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 9.04
Release: 9.04
Codename: jaunty

# uname -a
Linux hostname 2.6.28-16-server #55-Ubuntu SMP Tue Oct 20 20:37:10 UTC 2009 x86_64 GNU/Linux

# lspci|fgrep Ethernet
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
