MSI 785GT with Realtek RTL8111/8168B locks up only with heavy gigabit traffic -and- heavy PCI load (10.10 at least)

Bug #746914 reported by Grondr
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm submitting this bug report to encourage driver developers to try
this particular use case because a network-only test would not uncover
the bug. I know (now, alas :) that the Realtek NIC is the subject of
many prior bug reports, but someone who thinks those are fixed in any
given release might not try what's going on here.

I just got an MSI 785GT-E63 motherboard (MS-7551), which has an
onboard Realtek RTL8111/8168B NIC, running a 6-core AMD 64-bit CPU.
I've found that heavy network traffic -only when combined with heavy
PCI traffic- will cause a variety of crashes after around 10 minutes,
plus or minus 10. I can supply all kinds of logs (dmesg, lshw, lspci
-vvvv, whatever you want) and my use cases, if someone is interested
in working on the bug. I'm using the r8169 kernel module that comes
with the release and have -not- downloaded the one from Realtek.

This happens under 10.10 (Maverick) in both 32-bit and 64-bit kernels,
on both LiveCDs and real installed. (I tried 2.6.35-28-generic and
also the 2.6.35-22 kernel from the LiveCD.) I have -not- tainted
the kernel (certainly not in the case of the LiveCD tests).

If I use a "tar | nc" pipeline to push files from one machine to the
target machine at gigabit speeds, I see about 70 MB/sec, limited
mostly by I/O on the disks. The target disks are on an AOC-SAT2-MV8
PCI disk controller, so this push loads both the NIC and the PCI bus
simultaneously. (The machine actually has 2 of these and 1 empty PCI
slot at present, but eventually it will have 3 MV8's.)
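
For anyone trying to reproduce this without nc, the push can be
approximated in Python. This is a stand-in for the "tar | nc" pipeline,
not the commands used in the report; the function names, host, and port
are placeholders chosen for illustration. The streaming tarfile modes
("w|" and "r|") are what make it behave like a pipe.

```python
import socket
import tarfile


def receive_tree(listener, dest_dir):
    """Accept one connection and unpack the incoming tar stream into dest_dir.

    Roughly the receiving half of the pipeline (listen, then untar).
    Only use this with a sender you trust: extractall() writes whatever
    paths the archive contains.
    """
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as stream:
        # "r|" reads a non-seekable stream block by block.
        with tarfile.open(fileobj=stream, mode="r|") as tar:
            tar.extractall(dest_dir)


def send_tree(src_dir, host, port):
    """Stream src_dir as an uncompressed tar archive to host:port."""
    with socket.create_connection((host, port)) as sock, \
            sock.makefile("wb") as stream:
        # "w|" writes a stream without seeking, so it can go straight
        # into the socket.
        with tarfile.open(fileobj=stream, mode="w|") as tar:
            tar.add(src_dir, arcname=".")
```

On the target, create a listening socket and pass it to receive_tree()
with a destination directory on the PCI-attached disk; on the source,
call send_tree() with the target's address. To match the failing case,
the destination has to sit behind the PCI controller, not onboard SATA.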

Under these conditions, the machine simply freezes after a while.
It's not totally frozen---I can toggle caps-lock on the keyboard---but
it's good for little else. It invariably leaves the NIC -transmitting-,
according to the lights on the NIC and on my switch---ancient thicknet
Ethernet called this "jabbering". I often have to do a hard power
cycle to reset the machine state; simply pushing the reset button
seems to leave the USB controller in a messed-up state, at least.

If I turn off AMD Cool & Quiet and spread-spectrum, the machine stays
up long enough to write some logs into kern.log and for me to even
type into terminal windows, but anything that might cause disk access
(the system disk is a -USB- disk, not one of the ones on the PCI bus)
will wedge it up. The system monitor app will also continue running
and the menubar clock will tick, until I wedge it by typing too much,
etc. [Waving my mouse once wedged it; mouse on that test was serial,
keyboard was USB.]

In the cases where I can get logs, I see lots of "BUG: soft lockup"
entries for all 6 cores after that. Even if the machine is left
alone, they recur every 5-60 minutes at random when the machine is
sitting idle. (It's possible the later ones correspond to TCP
keepalives or NTP traffic or something hitting the NIC; I didn't
instrument the switch.) They do NOT happen if the machine never got
stressed in this way; a machine that never locked up before since boot
(even if stressed in other ways) doesn't show the soft lockups.
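
A quick way to summarize those entries is to count them per core. The
sketch below assumes the watchdog line has the usual form
"BUG: soft lockup - CPU#<n> stuck for <t>s!"; the log path in the usage
note and the function name are illustrative, not taken from the report.

```python
import re
from collections import Counter

# Assumed message shape: "BUG: soft lockup - CPU#<n> stuck for <t>s!"
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+)")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number to soft-lockup reports seen."""
    per_cpu = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            per_cpu[int(match.group(1))] += 1
    return per_cpu
```

Feeding it an open file such as /var/log/kern.log (opened with
errors="replace" in case the log contains stray bytes) gives a per-core
tally, which makes it easy to check whether all 6 cores are reporting.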

According to the logs, the NIC is using MSI interrupts, and according
to /proc/interrupts, nothing else is sitting on the IRQ for either of
the MV8 disk controllers or on the NIC---they're all disjoint from
each other and from everything else. (PCI-MSI-edge/IRQ 41 for the
NIC, IO-APIC-fasteoi sata_mv at 20 & 21 for the MV8's.)
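
The same check can be scripted. This is a minimal sketch that assumes
the classic /proc/interrupts layout (a header row of CPU columns, then
one row per IRQ with per-CPU counts, a chip type, and a comma-separated
device list); the function names are made up for illustration, and
rows with no registered device are not handled specially.

```python
def irq_users(interrupts_text):
    """Map each numbered IRQ to the device names registered on it."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())        # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, body = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue                    # skip NMI:, LOC:, ERR: and similar rows
        fields = body.split()
        rest = " ".join(fields[ncpu:])  # chip type followed by device names
        if not rest:
            continue
        pieces = rest.split(",")
        # The last word before the first comma is the first device name;
        # everything earlier in that piece is the chip/trigger type.
        first = pieces[0].split()[-1]
        table[int(label)] = [first] + [p.strip() for p in pieces[1:]]
    return table


def shares_irq(table, device):
    """True if any IRQ that lists `device` also lists another device."""
    return any(device in devs and len(devs) > 1 for devs in table.values())
```

Calling irq_users() on the contents of /proc/interrupts and then
shares_irq() for the NIC's interface name and for sata_mv should both
come back False on a machine configured like the one described here.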

Why do I think this is specific to heavy PCI loads? Because if I
write to a disk connected to one of the motherboard's native SATA
ports, it never crashes (I can push for 12 hours and see no signs of
distress, and that push runs about 20 MB/sec faster because the
onboard SATA is faster than PCI to the MV8---same exact disk, btw,
just plugged in elsewhere). I can also do "nc blahblah < /dev/zero"
from the source and "nc > /dev/null" on the target and see almost
120MB/s through the link for hours, with no problem. I can also -copy-
from a disk on a motherboard native SATA port to a disk on the PCI bus
and again, no problem if the network isn't in use---I copied 2TB of
ext4 filesystem that way with no problem. USB traffic doesn't affect
it; I tried booting with my system disk being on a native SATA port
and it changed nothing. (My normal configuration is instead a USB
disk; some of my tests used a USB flash drive for LiveCD images.)
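
The network-only control (zeros in, discard out) can be sketched the
same way. This is a Python stand-in for the /dev/zero-to-/dev/null nc
test, not the commands used in the report; the chunk size and function
names are arbitrary choices, and Python itself may cap the rate below
what nc achieves.

```python
import socket
import time

CHUNK = 1 << 20  # 1 MiB per send/recv call; an arbitrary choice


def discard_sink(listener):
    """Accept one connection, throw the data away, return bytes received."""
    conn, _ = listener.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:            # sender closed the connection
                return received
            received += len(data)


def send_zeros(host, port, total_bytes):
    """Send total_bytes of zeros to host:port; return the rate in MB/s."""
    zeros = bytes(CHUNK)
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(zeros[:n])
            sent += n
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid dividing by zero
    return sent / elapsed / 1e6
```

Because nothing touches a disk on either end, a run like this loads
only the NIC, which is exactly the case that did not crash.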

This machine is actually one of a pair of clones---same exact hardware
on both, bought at the same time. Both machines exhibit exactly the
same behavior, which rules out any one-off hardware issue. I have
-not- flashed a newer BIOS into the machine (my rev is from October
of last year, and there are two newer ones) because I consider the
risks of bricking a mobo larger than the chances that this is actually
a BIOS bug instead. These mobos have the capability of temporarily
booting a different BIOS via USB without flashing it; I -may- try that
if I trust it enough.

I happened to have a GA311 PCI (not PCIe) gigabit card sitting around
still in its shrinkwrap that someone gave to me. I stuck it in the
free PCI slot. I was able to push 2TB to the test machine (at about
50 MB/sec) that way via my tar|nc pipe, and it wrote to the PCI-based
MV8 disk for 12 hours without a hiccup. So even if the PCI bus is
totally saturated, no crash---as long as I'm not using the onboard
Realtek NIC.

I tried the 11.04 Natty daily build as of 3/30/2011 (desktop AMD64)
and -that- did not crash, BUT the test is incomplete---Natty loaded
the machine -terribly-, spinning up the CPU fan to maximum and soaking
almost all 6 cores! One core was entirely devoted to running kswapd
at 100%; most of two others were running the tar and the nc at more
than 70% each. Under 10.10, the machine sits at lowest speed (800 MHz)
and only a couple cores are doing anything, and they're at 10-20%.
And my transfer rates were abysmal---roughly 45 MB/sec---probably
because of the CPU load, which went away as soon as I aborted the
test when I couldn't stand the fan noise any more. That's not enough
time for the complete test---one of my crashing tests under 10.10 was
to use rsync via ssh instead of tar via nc, and the additional load
from all the crypto slowed things down enough (to about 50 MB/sec
again) that it took 3 hours to crash instead of 10 minutes. So
Natty may still have the same bug, even if it were otherwise usable.
(WTF is the deal with Natty? WTF is kswapd soaking an entire core
just because I was writing to disk? Swap was completely unused,
and the load vanished as soon as I stopped the test.)

If someone cares about this, I can attach all the logs you'd ever
want, but I'll leave that for now. But be advised---just testing
the net -alone-, without heavy PCI activity, will -not- find this
bug.

Given the huge number of confused reports about the RTL8111 (works,
doesn't work, worked in the last release but not this one, and vice
versa), I'm giving up. I have a free PCIe slot and will be buying two
Intel EXPI9301CTBLK's and putting one in each machine, and never
using the RTL8111 again. I hope that works.

Revision history for this message
Grondr (grondr) wrote :

To follow up: using the Intel PCIe NIC instantly solved all my problems. No more lockups under combined network and PCI load. I won't be going back to Realtek NICs ever. (If someone can point to a specific purported bugfix that's in some kernel I can simply download and install without building my own kernel, I can try this on my hardware to test the bugfix, but I'm not going to bother for the case of, "It's a random new kernel but nobody's specifically addressed this bug, but test it anyway.") I'm currently running Natty as of 4/8, which is using 2.6.38-8 under AMD64 desktop on a 6-core CPU.

David (g-launchpad-strategyplayer-net) wrote :

I (and some other users) experienced big problems which are probably related to yours (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/661294).

In my case, replacing the onboard Realtek NIC with an Intel NIC (Intel 82541PI chip) also solved all the problems, which had seemed mysterious before.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 746914

and then change the status of the bug back to 'New'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Grondr (grondr) wrote :

I have never been comfortable using apport
because I've never found a way to be 100%
sure I can see exactly what it submits before
it sends it, so it's a privacy problem. If a developer
actually wants to work on this and has difficulties
reproducing it, I would be happy to share configuration
information at that point, but until I see evidence that
this bug won't simply sit here forever anyway, I'll hold
off. Presumably that would include only hardware
configuration, since I tried 4 different software
configurations (32/64 bit, installed/LiveCD) and
the behavior was repeatable for all.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
nick (niek-art) wrote :

I had the same problem; it started after I updated my Ubuntu server (8.04) to the newest version.
It took me several months to find this posting.
I use Samba on my server (software RAID) and can confirm that the latest driver from Realtek works. It unloads the one in the kernel (size 84022) and replaces it with a newer one (size 203096) after compilation. I have not yet rebooted, since my RAID is being rebuilt because of another crash due to this faulty NIC driver.

It is about time Ubuntu puts this driver in the distributions...

new driver is out:
http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2 (2011-8-25)

penalvch (penalvch) wrote :

Grondr, thank you for reporting this and helping make Ubuntu better. This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please capture the oops following https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Capturing_OOPs ? As well, can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired