Bug #661294 “10ec:8168 System lock-up when receiving large files...” : Bugs : linux package : Ubuntu

Revision history for this message

Fabio Marconi (fabiomarconi) wrote on 2010-10-15:

#1

Hello David
can you please reproduce the bug then attach here
/var/log/syslog
/var/log/dmesg
/var/log/kern.log
Thanks
Fabio

Changed in ubuntu:
status:	New → Incomplete

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-10-16:

#2

LOGS1 Edit (201.0 KiB, text/plain)

Hello Fabio

I've attached the logs:

LOGS1 contains only kern.log an dmesg since I somehow 'lost' syslog, LOGS2 is a second reproducing of the bug which also cotains syslog.

LOGS 1

kern.log shows you the normal reboot I did (starting Oct 16 23:57:07) before reproducing the bug and then the reboot by reset button after the system locked up because I reproduced the bug (Oct 17 00:00:36). There don't seem to be any entries in the moment of the bug occuring (which was around 23:59)

dmesg seems the show the reboot after the lockup (?)

LOGS 2

Here I think kern.log shows only the boot after the lock up (as dmesg) but syslog shows a bit more this time. The moment of the crash must be around 00:28

Do I have to raise the log-level or something in order to receive more detailed logs? (if yes: how?)

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-10-16:

#3

LOGS2 Edit (214.8 KiB, text/plain)

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-10-31:

#4

Anyone still interested in solving that bug?

David (g-launchpad-strategyplayer-net) on 2010-11-10

Changed in ubuntu:
status:	Incomplete → New

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-21:

#5

I meanwhile tested the whole thing on a fresh installation of Kubuntu 10.10 and the problem exists exactly the same there.
(While on Kubuntu 8.04 everything works still without any problems.)
So the problem doesn't seem to come from a broken installation (since it even exists when trying it form the LiveCD) and it also doesn't seem to come from broken hardware (since there is no problem when using 8.04).

What can I do to provide you with additional information to solve the problem?

Revision history for this message

Bruce Edge (bruce-edge) wrote on 2010-11-21:

#6

I can collaborate the problem reported here. I have the same problem with a number of servers that were recently migrated to 10.04.
NFS is not stable. frequently, large transfers result in a "task blocked for 120 sec...." message and a dmesg stack trace.

It's utterly incomprehensible to me that an LTS rleease would have this issue and there appears to be zero interest on the part of Canonical to fix, or even acknowledge its existence.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-21:

#7

Thank you for your message! I knew that I'm not the only ohne with this problem form message boards but I feared that for the other affected the problem isn't important enough to let the Devs know about it...

By the way: I tried something new. I installed and booted the lucid kernel (2.6.32-21 generic) on my Kubuntu 10.10. Unfortunately this didn't solve the problem. So it doesn't seem to make a difference if I use 2.6.32 or 2.6.35.

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2010-11-21:

#8

Ok, sounds like this is Confirmed based on user reports. Also this is pretty much the kernel's domain, so I'm setting the package to linux.

Setting Importance to Medium as this seems to be a pretty big problem for users who it affects.

To all reporters, if you can provide explicit details on the *server* platform that would be very helpful. This includes the mount line for the server (just paste the output of the 'mount' command), and also the contents of /etc/exports on the server.

affects:	ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
importance:	Undecided → Medium
status:	New → Confirmed

Revision history for this message

Bruce Edge (bruce-edge) wrote on 2010-11-21:

#9

I've tried all of the 10.10 ppa kernel backports as they are made available. This has made no difference. Each one in turn.

I've also played with the mount lines. Here are some of the combinations I've tried:

rw,noatime,nfsvers=3,udp,rsize=32768,wsize=32768,actimeo=3,sloppy,addr=135.149.74.51

udp -> tcp
no nfsvers=3
no actimeo
no noatime

The addr= bit was added because I had a NIC with multiple IP addrs and thought that that might be confusing it as the NFS docs state that the "server IP resolution logic" is not entirely correct.

There are however some servers that have this problem and some that don't.

OK:
Intel(R) Xeon(R) CPU X5680 @ 3.33GHz

Not OK:
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
AMD Opteron(tm) Processor 280
Intel(R) Xeon(TM) CPU 3.00GHz

They are all using the same mount args as provided by LDAP autofs.

It may be just the use case that has prevented it from happening on that one server.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-21:

#10

My Server is:
Kubuntu 10.04 (Linux unimatrixserver 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:05:27 UTC 2010 x86_64 GNU/Linux)

On the server, the 'mount' command provides the following output:

/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/sdd1 on /media/b2 type ext3 (rw)
/dev/sdc1 on /media/d3 type ext3 (rw)
/dev/sde1 on /media/b1 type ext3 (rw)
/dev/sdb1 on /media/d5 type ext3 (rw)
/dev/sda6 on /media/d4 type ext3 (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

The output of cat /proc/mounts seems to be a bit more detailed:

rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=1019572k,nr_inodes=254893,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/cafc3ff0-a72d-4901-9d74-a608281b7a9c / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
none /var/lib/ureadahead/debugfs debugfs rw,relatime 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /var/run tmpfs rw,nosuid,relatime,mode=755 0 0
none /var/lock tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/dev/sdd1 /media/b2 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdc1 /media/d3 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sde1 /media/b1 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdb1 /media/d5 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sda6 /media/d4 ext3 rw,relatime,errors=continue,data=ordered 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0

NOTE: Only /media/d3 /media/d4 and /media/d5 are exported via NFS

The content of /etc/exports is the following:

/media/d5 192.168.1.0/255.255.255.0(rw,sync)
/media/d4 192.168.1.0/255.255.255.0(rw,sync)
/media/d3 192.168.1.0/255.255.255.0(rw,sync)

(As far as I can remember I also tried several different settings like 'async' instead of 'sync' but I can't nameexactly all the things I tried some weeks ago.)

If you need any additional information about the server or the client I'd be glad to provide!

My Server is:
 Kubuntu 10.04 (Linux unimatrixserver 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:05:27 UTC 2010 x86_64 GNU/Linux)

On the server, the 'mount' command provides the following output:

/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/sdd1 on /media/b2 type ext3 (rw)
/dev/sdc1 on /media/d3 type ext3 (rw)
/dev/sde1 on /media/b1 type ext3 (rw)
/dev/sdb1 on /media/d5 type ext3 (rw)
/dev/sda6 on /media/d4 type ext3 (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

The output of cat /proc/mounts seems to be a bit more detailed:

rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=1019572k,nr_inodes=254893,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/cafc3ff0-a72d-4901-9d74-a608281b7a9c / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
none /var/lib/ureadahead/debugfs debugfs rw,relatime 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /var/run tmpfs rw,nosuid,relatime,mode=755 0 0
none /var/lock tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/dev/sdd1 /media/b2 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdc1 /media/d3 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sde1 /media/b1 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdb1 /media/d5 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sda6 /media/d4 ext3 rw,relatime,errors=continue,data=ordered 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0

NOTE: Only /media/d3 /media/d4 and /media/d5 are exported via NFS

The content of /etc/exports is the following:

/media/d5	192.168.1.0/255.255.255.0(rw,sync)
/media/d4	192.168.1.0/255.255.255.0(rw,sync)
/media/d3	192.168.1.0/255.255.255.0(rw,sync)

(As far as I can remember I also tried several different settings like 'async' instead of 'sync' but I can't nameexactly all the things I tried some weeks ago.)

If you need any additional information about the server or the client I'd be glad to provide!

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2010-11-21:

#11

Thanks guys for providing some more details.

Whats really needed is the mount arguments from the clients (the machines that have mounted the shared exports and are locking up) and the software versions from the NFS servers (the machines that are exporting files via NFS).

Specifically on the server it would help to see this on the servers:

apt-cache policy nfs-kernel-server
uname -a
ifconfig -a

Anything that will help build a solid repeatable test case.

Revision history for this message

Bruce Edge (bruce-edge) wrote on 2010-11-21:

#12

Download full text (4.9 KiB)

Machine 1:

0 9:39:31 root@topaz ~
0 #> apt-cache policy nfs-kernel-server
nfs-kernel-server:
  Installed: 1:1.2.0-4ubuntu4
  Candidate: 1:1.2.0-4ubuntu4
  Version table:
*** 1:1.2.0-4ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
        500 http://wlvmirror.lsi.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
0 11:07:14 root@topaz ~
0 #> uname -a
Linux topaz 2.6.35-22-server #33~lucid1-Ubuntu SMP Sat Sep 18 13:29:53 UTC 2010 x86_64 GNU/Linux
0 11:07:26 root@topaz ~
0 #> ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          inet addr:135.149.75.126 Bcast:135.149.75.255 Mask:255.255.255.0
          inet6 addr: fe80::216:35ff:fe5b:a31e/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:71774991 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70132066 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23580470583 (23.5 GB) TX bytes:44692733013 (44.6 GB)

eth0 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:20473514 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35066033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1655710405 (1.6 GB) TX bytes:22332615100 (22.3 GB)
          Interrupt:25

eth1 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:51301477 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35066033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21924760178 (21.9 GB) TX bytes:22360117913 (22.3 GB)
          Interrupt:26

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:31618343 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31618343 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20323093919 (20.3 GB) TX bytes:20323093919 (20.3 GB)

virbr0 Link encap:Ethernet HWaddr f2:f4:de:5a:e7:79
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          inet6 addr: fe80::f0f4:deff:fe5a:e779/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1891 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:194877 (194.8 KB)

Machine 2:

0 #> apt-cache policy nfs-kernel-server
nfs-kernel-server:
  Installed: 1:1.2.0-4ubuntu4
  Candidate: 1:1.2.0-4ubuntu4
  Version table:
*** 1:1.2.0-4ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
        500 http://wlvmirror.lsi.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
0 11:08:10 root@tonic ~
0 #> uname -a
Linux tonic 2.6.35-22-server #34-Ubuntu SMP Thu Oct 7 15:36:13...

Yes, the server side shows no messages and doesn't seem to be affected. (At least it shows no messages where I looked but since I'm not an expert it's possible that I just dont know where to look at the server side)

And yes, the client can re-connect without any problems after the client has rebooted.

I did some tests in tty1 and got at least a little bit of information. Strange is, that not all tests runs showed the same behaviour.

First test: 
Copy a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer.

Result: 
For one or two minutes nothing happened. Then the conosle showed some error-messages. Unfortunately I only have the second part of the messages, because the first page was 'out of screen' before I could take a picture.
The error messages are the following:
[5351.183955] end_request: I/O error, dev sdb, sector 131925003
(... this error repeadted a lot of times with different sector numbers ...)
[5351.203631] Aborting journal on device sdb5
[5351.203691] end_request: I/O error, dev sdb, sector 126666835
[5351.305377] EXT3-fs (sdb5): ext3_journal_start_sb: Detected aborted journal
[5351.305508] EXT3-fs (sdb5): ext3_journal_start_sb: remounting filesystem read-only
cp: schreiben von "bigfile.mkv": Read-only file system
("schreiben von" = "writing of")
After this error messages (of which, as I said, I unfortunately don't have the first part) the console was still working and I also could switch back to tty7 (GUI). As it stated in the messages it had remounted sbd5 with 'ro' (read-only).

So I decided to repeat the test to catch also the first part of the error messages, but..:

Second test: 
Copy again a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer. 
 
Result:
I waited for about 10 minutes, but nothing happened this time. No error messages but also no finishing of the copy proccess.
Yet I could switch to other text-consoles (tty2.. tty3.. didn't try to go to tty7 this time).
After those 10 minutes I tried to abort the copy proccess by hitting Ctrl+C which then seemed to lock-up also the console (I couldn't switch to other text-consoles anymore). So I rebooted the machine.

Third test:
Copy again a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer.

Result:
I waited for about 40 minutes this time, but again: nothing happened. I still could switch to other text-consoles but when I tried to switch to the graphical 'tty7' I just had a black screen and couldn't do anything (also not switching back to tty1) then. So I had to reboot the machine again.

Because of the error messages from the first test which looked likae a harddisk problem, I wanted to do a fourth test which should copy not to the magnetic harddisk but to the SSD harddisk I also have in my computer (note that of course I also tested copying to both disks long time before, so it's not the first time I try to copy to the SSD)

Fourth test:
Copy  a 4 GB File from a NFS share, this time to the first (solid-state, sda1) harddisk on my computer.

Result:
I waited for about 20 minutes this time.. again: nothing happened. Same thing as in tests 2 and 3.

Something that I noticed during the testing:
While it 'tries' to perform the copy proeccess, the HDD-LED does not blink, so it does not indicate any activity on the Harddisk. (At least not after some minutes, I didn't pay attention in the first moments)
BUT during the whole 'pseudo' copy proccess the LED on the Network Switch indicates big activity on the port to which the client is connected. Strangely it does NOT indicate this activity on the port to which the server is connected... (and neither on the port to which the router/modem is connected). So it looks like "ghost traffic".. it comes from or goes to nowhere - at least according to the LEDs of the network switch.
(Another thing I noticed is, that most [or any?] of the affected people are using gigabit ethernet instead of 10/100)

I'm not sure what this test results tell us. On one hand we saw some odd behaviour in the first test... on the other hand, the other three tests (two of them were exactly the same setting as test 1) didn't show any error messages but just not finished the job.

Revision history for this message

Andrew Chambers (andrewrchambers) wrote on 2010-11-30:

#22

I replaced my switch with a 100/10mbit switch last night to see what effect this had - it seemed to remedy the problem (I no longer got client freezes). Of course this adds a new problem which is that I am no longer using gigabit ethernet which is essential.

It is probably worth noting that I also get the same "ghost traffic" that you mentioned David. LEDs on my switch blink like crazy, but only on the client port.

I am going to try new cat6 cables to see whether this could remedy the problem - perhaps my cables are just not good enough for 1000mbit.

I doubt this is the case, but since we are not having much luck debugging this it is worth checking everything.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-30:

#23

errors1.jpg Edit (110.1 KiB, image/jpeg)

That's interesting news. I'm not very suprised tough, since it was eye-catching how everyone writing about the problem in the message boards was talking about having gigabit-ethernet equipment. The cables also could be an interesting try but I also doubt that this is the case (since the problem did/does not occur in earlier Kubuntu versions... and we did not have better cables back then, right?)

I also have some probably interesting news. I did a fifth test which suprisingly showed completely new results!

The test setting was: Copy a 1.5 GB File from a NFS share to the first (solid-state, sda1) harddisk on my computer.

Result: After some time (don't know exactly how long) I got loads of new error messages. I'll attach some pictures of it I took form my screen so that I don't have to type for half an hour.

What you'll see on the pictures is the 'output' of some minutes, probably it would have continued like this for a long, long time.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-30:

#24

errors2.jpg Edit (129.2 KiB, image/jpeg)

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-30:

#25

errors3.jpg Edit (144.2 KiB, image/jpeg)

Revision history for this message

Bruce Edge (bruce-edge) wrote on 2010-11-30:

#26

David,
You'll have an easier time gathering data if you use something like 'screen' as a shell wrapper so you can scroll back indefinitely.

Also, regarding the 100Mbit/gigE issue, we did not experience this with 100mBit, but we had all gigE when we upgraded to 9.10, which is when these problems started.

Lastly, regarding the no traffic for 2 years, I stopped adding data because nothing had changed. There seemed to be zero interest from Canonical until very recently.

Revision history for this message

Stefan Bader (smb) wrote on 2010-11-30:

#27

Hm. I need to read into all the new info tomorrow. Just one thing. The two screenshots rather hint some harddrive problem. Just a probably crazy blind shot: is there by chance a via sata controller and wd disks involved?

Revision history for this message

Bruce Edge (bruce-edge) wrote on 2010-11-30: Re: [Bug 661294] Re: System lock-up when receiving large files (big data amount) from NFS server

#28

Agreed, those screen shots look like a different problem.

On Tue, Nov 30, 2010 at 10:24 AM, Stefan Bader
<email address hidden>wrote:

> Hm. I need to read into all the new info tomorrow. Just one thing. The
> two screenshots rather hint some harddrive problem. Just a probably
> crazy blind shot: is there by chance a via sata controller and wd disks
> involved?
>
> --
> System lock-up when receiving large files (big data amount) from NFS server
> https://bugs.launchpad.net/bugs/661294
> You received this bug notification because you are a direct subscriber
> of the bug.
>

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-30: Re: System lock-up when receiving large files (big data amount) from NFS server

#29

@ Stefan

I have two harddisk in my client machine. The NFS freeze happens with both hdds.
One hdd (sdb) is a Seagate Barracuda 7200.10 family (ST3250410AS), the other (sda) is a solid state disk (it's labeled as 'sagitta' but I guess that means more or less 'no name'). The operating system is installed on the solid state disk.
My Gigabyte Mainboard has (as far as I know) an Intel Chipset, so I guess it also should have an Intel SATA-Controller. But I'm not 100% sure in this point.

When considering that a faulty harddisk is the problem (I also considered this at one point) you should keep in mind the following things:
- The problem happens with two totally different harddisks
- The problem happens exactly since the day I installed 10.10 instead of 8.04
- Those suspicious error messages were only shown in one (or two) of five test runs.

So if it's a harddisk problem it was probably also 'introduced' with 10.10

I'm now gonna try two things and tell you afterward the results.

1. Copy big files between the local disks (to rule out that the problem is possibly totally unrelated to NFS)
2. Attach an external HDD via USB to the client machine and try to copy from a NFS share to this external disk (to see if it's related to the mainboard's SATA-Controller)

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-11-30:

#30

The results of the two above mentioned tests:

1. Copying big files between the clients local harddisks causes no problems.

2. Copying big files from a NFS share to an external harddisk connected via USB to the client also locks up the client.
(interestingly this way it even manages to lock up tty1 so completely that I can't even change to tty2-6.)

Revision history for this message

Stefan Bader (smb) wrote on 2010-12-01:

#31

@David. Thanks for testing. In that case it is not the problem I was reminded of (this only happens on certain via controllers and apparently WD disks). The observation about the switch LEDs is interesting. As I said, I did my tests using a Lucid server and Maverick client and used a gigabit switch. I ran the test again this morning and stopped after having copied 15GBs for the 20th time. In my case I definitely saw both LEDs flash in sync. I did not pay too much attention to the hd LEDs but those do not need to go as fast. Writes hit the cache and are then written off in batches.

So obviously I still do something wrong or got lucky to have the "right" hardware. I am still trying to figure out what all of you with the problem may have in common (feels a bit like CSI, just without those nice fancy tools ;)). The fact that David saw errors that point to the disk subsystem but has no problems when only using that could also mean that, whatever happens, causes severe memory corruption. Or maybe missing interrupts (the error message mentions timeout).

At the moment I am not sure which direction to go. First probably it would be good to have more information on the systems affected. If I could get the output of the following commands from at least two affected clients.

sudo lspci -vvnnn >lspci.txt
cat /proc/interrupts >interrupts.txt

Also, just to confirm, server is Lucid based and client on Maverick. At least this was the case in previous comments. And which was the last known good client? One test that comes to my mind: can you scp that big file from the server to the client? That would hint whether it is the network in general or specifically nfs.

Changed in linux (Ubuntu):
assignee:	nobody → Stefan Bader (stefan-bader-canonical)

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-12-01:

#32

lspci.txt Edit (22.6 KiB, text/plain)

Yes, the server is lucid, the client is maverick.
The last known good client in my case was hardy (8.04) but I never used 8.10/ 9.04 / 9.10 so I can't say anything about those releases. What I do know is that the problem also exists with the 10.04 Live-CD I tested recently.

I have attached the output of the commands.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-12-01:

#33

interrupts.txt Edit (1.5 KiB, text/plain)

Revision history for this message

Stefan Bader (smb) wrote on 2010-12-01:

#34

Thanks David. So from the pci info we use the exactly same ehternet chip in the clients. Hardy being the last known good case may have a larger impact on how the hardware is driven. One thing that looked a bit weird but maybe has no implication here is the very low count on timer interrupts (though problems there usually come as: system stops doing anything until I hit a keyboard key). Something that probably has changed a lot since Hardy and which I personally saw causing problems sometimes is the MSI support. If you got time, you could try to boot with "pci=nomsi" on the kernel command line? When that is in effect the interrupt assigned to eth0 should not say MSI anymore. Does this change anything?

Revision history for this message

Andrew Chambers (andrewrchambers) wrote on 2010-12-01:

#35

I have SCPed the file from server to client with no issues, so can confirm that it is not the physical network. I also booted a live CD of 9.04 and copied some files to client with no issues.

I don't have a 64bit 9.10 liveCD handy, though I could test with a 32bit one I have - this might be adding an extra variable though?

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2010-12-01:

#36

interrupts_nomsi.txt Edit (1.5 KiB, text/plain)

I booted with pci=nomsi (the Interrupts naturally changed a bit with this option, so I attached a new interrupts.txt to show that the command was in effect).
Unfortunately it did not solve the problem. The system did still lock up.

Revision history for this message

Stefan Bader (smb) wrote on 2010-12-02:

#39

Alright, so it is not related to MSI and Andrew proved that network in general is ok He seems to have a quite similar system as David, though the GA-EP45-DS5 string in lspci is likely a lie as it shows the same here and I got a EX58-UD3R. Both of you get 2 CPUs. From the logs David posted earlier this seems to be a dual-core Intel without hyper-threading enabled. Both got this quite low number on timer interrupts.

@Andrew, yes I guess mixing 32 and 64bit would probably only help if it does not work. In which case the potential breakage was between Jaunty and Karmic. If it works it may be only 64bit broke. Which is a hint but hm.

The last log I saw from David was using 2.6.35-22-generic. Has one of you updated to -23 since then? Due to the silent hang I don't think there is much to be gained by traces on the client. I wonder whether there is much to be gained from gathering a tcpdump on the server. Generally is that "ghost traffic" immediate or starts after a while?

Let me create some kernel packages which have a bunch of debugging code activated (especially lock checking and that stuff). Hopefully that could give any hint on the client side. I'll post here when I got something.

Revision history for this message

Stefan Bader (smb) wrote on 2010-12-02:

#40

Ok, I got a 64bit generic kernel image for Maverick at http://people.canonical.com/~smb/lp/661294 (the version number of 9923 is intentional to make it install in parallel and be easy to identify). For me there is one rcu warning on boot which does not seem to have any impact. If you could try that and run the test again on vt1 (maybe in screen if that helps to get some scrollback).

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-01-21:

#62

@ Andrew

I use the following command which I put in /etc/rc.local so I don't have to enter it again after every reboot:

wondershaper eth0 500000 500000

Probably there would be more suitable values but it seems that those worke quite fine for me.
As stated above, it limits my downspeed traffic to around 40 Megabytes per second. Still the upload speed seems to be affected in a 'stronger' way I don't fully understand yet...

(In my case, the home partition luckily is local)

Revision history for this message

Andrew Chambers (andrewrchambers) wrote on 2011-01-22:

#63

Well wondershaper continued to not work even with the same command, so for the time being I have put an old 100/10 mbit NIC in my computer to stop the crashing.

If I replace this with a different gigabit NIC, would the crashing stop?

Clearly having /home mounted over such a connection isn't great, but slow as it is at least my photos are being added to my photo library without complete lock ups.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-01-22:

#64

Strange - in my case it was the other tool 'trickle' which didn't work at all. Maybe you want to give that one a try...

I don't know about the different gigabit NIC, as far as I remember noone in this 'thread' tested this yet. I actually wanted to try it, but I was too lazy buying a gigabit NIC after I finally got the 'partly' workaround with wondershaper.

Revision history for this message

Sean Clarke (sean-clarke) wrote on 2011-02-16:

#65

Can't believe Linux can have such a critical issue - I commend you guys for persevering.

I too have this problem, I run virtual machines (KVM) over NFS and this has rendered the system unusable:

Client:
uname -a
Linux enterprise 2.6.35-27-server #47-Ubuntu SMP Fri Feb 11 23:09:19 UTC 2011 x86_64 GNU/Linux

The server is a Thecus NAS device

Revision history for this message

Sean Clarke (sean-clarke) wrote on 2011-02-16:

#66

Just to confirm - it is 100% repeatable, just by generating a load of traffic - start 2 KVM images together and it just hangs.

I get nfs server not responding in the logs, but i think that is because the system has hung (other clients can use the server and ping it at the same time etc.).

[ 2631.463817] br0: port 4(vnet2) entering forwarding state
[ 2641.716774] vnet2: no IPv6 routers present
[ 2753.179883] INFO: task kvm:3256 blocked for more than 120 seconds.
[ 2753.179897] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2753.179905] kvm D 00000001000391e5 0 3256 1 0x00000000
[ 2753.179909] ffff8802f6941c58 0000000000000086 ffff880200000000 0000000000015b40
[ 2753.179914] ffff8802f6941fd8 0000000000015b40 ffff8802f6941fd8 ffff880325680000
[ 2753.179919] 0000000000015b40 0000000000015b40 ffff8802f6941fd8 0000000000015b40
[ 2753.179921] Call Trace:
[2753.179927] [<ffffffff815a080f>] __mutex_lock_slowpath+0xff/0x190
[2753.179929] [<ffffffff815a0213>] mutex_lock+0x23/0x50
[ 2753.179932] [<ffffffff8117be8d>] vfs_fsync_range+0x6d/0xa0
[ 2753.179933] [<ffffffff8117bf07>] generic_write_sync+0x47/0x50
[2753.179936] [<ffffffff811040ce>] generic_file_aio_write+0xae/0xd0
[ 2753.179944] [<ffffffffa0500911>] nfs_file_write+0xb1/0x200 [nfs]
[ 2753.179946] [<ffffffff811544da>] do_sync_write+0xda/0x120
[ 2753.179949] [<ffffffff810752ef>] ? kill_pid_info+0x3f/0x60
[ 2753.179950] [<ffffffff81075490>] ? kill_something_info+0x40/0x150
[ 2753.179953] [<ffffffff81290d78>] ? apparmor_file_permission+0x18/0x20
[ 2753.179955] [<ffffffff81260316>] ? security_file_permission+0x16/0x20
[ 2753.179957] [<ffffffff811547b8>] vfs_write+0xb8/0x1a0
[ 2753.179959] [<ffffffff81155152>] sys_pwrite64+0x82/0xa0
[ 2753.179962] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[ 2753.179963] INFO: task kvm:3327 blocked for more than 120 seconds.

Revision history for this message

Stefan Bader (smb) wrote on 2011-02-17:

#67

Just recently there where two patches found that may be related. One was about starvation when receiving data over virtio-net (bug #579276) and the other was fixing the nfs filesystem that would return the wrong return code for flush/sync (bug #585657). From the stack trace in the previous comment it looks like this may be the latter issue. Both patches have been queued for one of the next updates. But for those interested, I have prepared debian packages that include both changes on top of all the changes that are currently staged for proposed.

The patch to bug #585657 may also be related to this bug, though this bug was about receiving large files which would not write to the nfs file system. Receiving could be impacted though if the client would be running inside of KVM.

Revision history for this message

Stefan Bader (smb) wrote on 2011-02-17:

#68

Oh, test packages are at http://people.canonical.com/~smb/lp661294/ again.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-03-01:

#69

I have annoying news.
Today I built a new computer which is also connected to the network and which should also be able to access the problematic NFS shares. When setting it up, I expected to possible results:
Either it has the same problems as my other computer (which is: crashing when receiving lagre files via over NFS) or it doesn't have the problem.
What I didn't expect was, that the new machine could have an even more annoying problem - but unfortunately exactly this is the case.

To be specific:
My new machine only reaches transfer speeds of between 100kbit/s and 1Mbit/s (we're talking here about Gigabit Ethernet!) which of course makes the whole thing totally useless. Aditionally it seems to cause some overload in NFS so that the file manager not only in the new machine but also in the old machine (because the old one of course is still connected to the NFS shares) gets blockated.
In other words: NFS with client A results in a crash of the client, NFS with client B results in terribly low speed and a blockade of all clients...

I really don't know what to do or say anymore... after wasting a huge amount of time with the old problem and finally elaborating an only halfway satisfying workaround I add a new machine to the network and now I'm facing even bigger problems than ever before.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-03-02:

#70

Ok, sorry, you can ignore my last message. Because of the problems I had with NFS, I assumed that the networking problems of my new machine are als the fault of NFS. But I checked it more detailes and found out, that the Ethernet-Adapter of my new machine was faulty. I replaced it and NFS works now fine on the new machine. Tough on the old machine, the problems still exist.

Revision history for this message

Jeff Taylor (shdwdrgn) wrote on 2011-03-03:

#71

Yesterday I swapped my 100Mb server NICs for new Gbit NICs (RTL-8169). I have been running lucid on 2 machines for about a month, and the last machine was upgraded about 3 months ago. The NIC upgrade was the only thing that changed yesterday. My switches are D-Link DGS-2208 (Gbit).

- When transferring files via NFS3, transfer rates run about 65Mb/s, the *receiving* computer will lock up hard within the first minute. There seems to be no problems with the sending server. This is consistent regardless of which of my three servers are doing the sending or receiving.

- In addition to the NFS problems everyone else is reporting here, I also run a DRBD share between two servers, formatted with OCFS2. After one of the machines is rebooted, a DRBD sync is started between servers, and again the *receiving* machine will lock up hard within the first minute. In drbd.conf, I have 'rate 1000M', and transfer speeds were again around 65Mb/s at the time of lockup. Lockup has occurred on the receiver regardless of which server was being brought up. I have changed rate to 500M and will see on the next reboot if their is still a lockup.

I think this may show that the problem is not limited to NFS transfers. I was able to get DRBD back in sync by disconnecting the ethernet cable until ubuntu finished booting up, then plugging in the network again. This would lead me to believe that the problem is more likely related to a combination of system load and network load?

Server 1:
- Linux Loki 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 2:
- Linux Eris 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 3:
- Linux Zeus 2.6.32-29-generic-pae #58-Ubuntu SMP Fri Feb 11 19:15:25 UTC 2011 i686 GNU/Linux
- 1GB RAM
- AMD Athlon(tm) XP 3000+
- RAID10 (software) / EXT3 via NFS3 share

Yesterday I swapped my 100Mb server NICs for new Gbit NICs (RTL-8169).  I have been running lucid on 2 machines for about a month, and the last machine was upgraded about 3 months ago.  The NIC upgrade was the only thing that changed yesterday.  My switches are D-Link DGS-2208 (Gbit).

- When transferring files via NFS3, transfer rates run about 65Mb/s, the *receiving* computer will lock up hard within the first minute.  There seems to be no problems with the sending server.  This is consistent regardless of which of my three servers are doing the sending or receiving.

- In addition to the NFS problems everyone else is reporting here, I also run a DRBD share between two servers, formatted with OCFS2.  After one of the machines is rebooted, a DRBD sync is started between servers, and again the *receiving* machine will lock up hard within the first minute.  In drbd.conf, I have 'rate 1000M', and transfer speeds were again around 65Mb/s at the time of lockup.  Lockup has occurred on the receiver regardless of which server was being brought up.  I have changed rate to 500M and will see on the next reboot if their is still a lockup.

I think this may show that the problem is not limited to NFS transfers.  I was able to get DRBD back in sync by disconnecting the ethernet cable until ubuntu finished booting up, then plugging in the network again.  This would lead me to believe that the problem is more likely related to a combination of system load and network load?

Server 1:
- Linux Loki 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 2:
- Linux Eris 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 3:
- Linux Zeus 2.6.32-29-generic-pae #58-Ubuntu SMP Fri Feb 11 19:15:25 UTC 2011 i686 GNU/Linux
- 1GB RAM
- AMD Athlon(tm) XP 3000+
- RAID10 (software) / EXT3 via NFS3 share

Revision history for this message

Justin Dossey (jbd) wrote on 2011-03-04:

#72

I run a set of media encoding servers in KVM VMs running lucid. They run a few FFMPEG processes and write to one of several Linux NFS servers. I get this kind of NFS hang every day or two, so I have been trying different strategies (different virtual NIC types, different kernel versions, etc).

Long story short, I have the kern.log from a server during the hang and the time leading up to it, and I had /proc/sys/sunrpc/rpc_debug set to 32767 all along. The excerpt of the two-minute interval surrounding the hang is 18M and compresses down to 831K. All this on Linux ftrans-03 2.6.38-4-generic-pae #31~lucid1-Ubuntu SMP Thu Feb 17 13:41:45 UTC 2011 i686 GNU/Linux. Single virtual CPU, 1G of memory.

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2011-03-04: Re: [Bug 661294] Re: System lock-up when receiving large files (big data amount) from NFS server

#73

To everyone reporting being affected, it seems like the most useful
thing we can see is probably your

lspci -v

output, in case there is a common family or type of hardware between
those affected.

On Fri, 2011-03-04 at 01:12 +0000, Justin Dossey wrote:
> I run a set of media encoding servers in KVM VMs running lucid. They
> run a few FFMPEG processes and write to one of several Linux NFS
> servers. I get this kind of NFS hang every day or two, so I have been
> trying different strategies (different virtual NIC types, different
> kernel versions, etc).
>
> Long story short, I have the kern.log from a server during the hang and
> the time leading up to it, and I had /proc/sys/sunrpc/rpc_debug set to
> 32767 all along. The excerpt of the two-minute interval surrounding
> the hang is 18M and compresses down to 831K. All this on Linux
> ftrans-03 2.6.38-4-generic-pae #31~lucid1-Ubuntu SMP Thu Feb 17 13:41:45
> UTC 2011 i686 GNU/Linux. Single virtual CPU, 1G of memory.
>
>
> ** Attachment added: "compressed RPC debug log"
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/661294/+attachment/1884260/+files/diary-of-an-nfs-crash.log.bz2
>

Revision history for this message

Jeff Taylor (shdwdrgn) wrote on 2011-03-04: Re: System lock-up when receiving large files (big data amount) from NFS server

#74

lspic for server 1

Revision history for this message

Jeff Taylor (shdwdrgn) wrote on 2011-03-04:

#75

lspic for server 2

Revision history for this message

Jeff Taylor (shdwdrgn) wrote on 2011-03-04:

#76

lspic for server 3

Revision history for this message

Grondr (grondr) wrote on 2011-04-06:

#77

I just happened to trip over this report, and at least
one of the lcpci entries shows a Realtek NIC, which
makes me suspicious in light of the report I just
filed at bug #746914. You might want to take a
look at that in case it seems relevant and/or try
the nc | tar pipelines I was using (I haven't yet
tried NFS; I will once my system is stable otherwise).
(Short story is that any amount of I/O was fine
-unless- there was also PCI-bus activity, whereupon
raising the I/O rate had an exponential effect on the
time to failure---again with a "jabbering" lockup in which
the receiving machine was continuously transmitting
on its Ethernet but was otherwise hung or nearly hung,
which also sounds somewhat like the "ghost traffic"
someone reported above, maybe.)

I have a replacement (non-Realtek) NIC arriving on
Friday and will update the report once I can test it.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-04-28:

#78

Today I made the mistake of upgrading to 11.04 - instead of solving this annoying old issue, the new version of the distribution made it even worse!

But well, first the good news: In 11.04, usually not the whole machine freezes completely (as it was before) but only 'plasma-desktop', 'dolphin' and the application which tries to read from the NFS share. Also they tend to 'unfreezes' themselves after some minutes, which is another advance.

Now the bad news: The issue is now triggered much more easily - not only the copying of large files freezes the applictaions now, but almost any reading operation form the NFS shares which cause more traffic than a simple directory listing.

In other words: After having upgraded to 11.04, I am now more cut off from my data on the NFS shares than I have been ever before!
Also my workaround that I described above doesn't work under these new conditions anymore.
I'm really stunned how they could make this whole catastrophe even more disastrous instead of finally fixing it after years...

Tomorrow I'll finally try to replace the NIC in my computer. I really hope this helps, because under those new conditions, the point is reached, where this issue really starts to make working with my computer impossible.

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2011-04-29:

#79

Good news (for me)

After replacing my OnBoard NIC with one for the PCI Slot (an Intel Card with Intel 82541PI Chip), the problem doesn't occur anymore.

So it seems likely that the issue has something to do with the Realtek Chip, as Grondr stated.

Revision history for this message

Grondr (grondr) wrote on 2011-04-29:

#80

...and I'd like to point out that I've done a ton of NFS (v3 only, since I'm talking to hosts running Hoary---yes, really) with the Intel NIC and had no problems whatsoever. To everyone who is blaming NFS here, TRY NETCAT. And DO NOT try scp or ssh because they're doing crypto which will slow down your transfer rates significantly---for every 2x slowdown, it might cause the bug to take 2^x longer to manifest (did for me, anyway). Instead, try just shoving data as fast as you can via nc---I was using tar on both ends of the pipeline because I was trying to actually get data moved, but if you're just debugging this, shove /dev/zero through nc at one end and dump it to /dev/null at the other. And then let it sit there for your typical time-to-failure (times as-much-patience-as-you-have). If that doesn't work, try disk activity as well---maybe you have my bug, where things only went south when there was lots of heavy activity on the PCI bus (see my previous comment on this thread) but it was perfectly fine if it was just an nc pipeline that wrote to non-PCI disks or to nowhere. [I haven't rigorously read the entire thread in this bug report here, but it sounds more and more like the problem isn't NFS, but your NIC, and that you're not seeing it in non-NFS applications either because you're using things that don't hit the net as hard, or because you are, but you're not also doing file I/O.]

Robbie Williamson (robbiew) on 2011-05-02

summary:	- System lock-up when receiving large files (big data amount) from NFS - server + System lock-up when receiving large files over a Realtek NIC (big data + amount) from NFS server
Changed in linux (Ubuntu Natty):
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

Vanessa Dannenberg (vanessadannenberg) wrote on 2011-05-08: Re: System lock-up when receiving large files over a Realtek NIC (big data amount) from NFS server

#81

Same problem here - any large receive over NFS tends to crash the receiving machine. When the lockup occurs, X freezes hard, but I can Alt-SysRq {REISUB} to reboot. Realtek gigabit NICs (onboard) at both ends, with a gigabit switch between them. As with others, the server seems unaffected when the client crashes.

The result of `lspci -v` is attached.

Revision history for this message

Adam Bolte (boltronics) wrote on 2011-05-10:

#82

FWIW, I get this on Debian Wheezy (Testing) running 2.6.38-2-amd64. When I download (but never when I upload) large files (eg. >200Mb movies) from my home file server over Gigabit on one of these Realtek NICs I get a crash - 100% reproducible.

I only just installed Debian because I had Gentoo and thought something was screwy with the video card (using the proprietary nvidia driver) - it crashed every time I played a movie directly from my file share (a Samba server) and I could never figure out why. Imagine my surprise when I still got this using nouveau under Debian! Then I thought it was my overclock. Then my memory. Took me ages to notice that the issue was the NIC!

Gigabyte X58A-UD9, BIOS F4 - has 2 NICs, both unfortunately the same chipset:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

As is the case with others here, it's only an issue during Gigabit file transfers. Other NICs seem fine too.

Revision history for this message

Maximilian Güntner (mguentner) wrote on 2011-05-10:

#83

I have the same bug with ArchLinux. Both the server (i686) and the client (x86_64) are running ArchLinux with kernel 2.6.38.5 - everything connected with gigabit ethernet.
The problem started after I've created a LUKS partition on the server and started accessing it with NFS.

Writing a file into the container using NFS will result in a total freeze of the client after some megabytes.

The Server has two NVIDIA NICs while the client has an e1000 NIC.

Mount line:
server:/backup /mnt/backup nfs4 bg,hard,intr,nolock,udp 0 0

It seems that these freezes only occur if (Average write speed on the Server) < (Average Disk-to-Network speed on the client). That is the case if LUKS is enabled on the target partition. (50mb/s write speed on the server vs 90 mb/s read speed on the client).

I'm not 100% sure whether this has anything to do with LUKS itself or just with the reduced speed.

In general the 8169 module had and has (?) some problems handling heavy traffic (many connections, high transfer rate). That's why I replaced mine with the e1000. But it's likely that we blame it on the wrong module since the 8169 is pretty mainstream and almost everybody has one in his/her setup.

Revision history for this message

Adam Bolte (boltronics) wrote on 2011-05-10:

#84

I just downloaded and installed the Realtek driver from here (r8168-8.023.00):
http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

Transferred a few GBs from my file server to test and had no crash - a record! It may be a bit premature, but I'm calling it fixed. So yes - I blame the r8168 module shipped in the distros.

Revision history for this message

Andrew Chambers (andrewrchambers) wrote on 2011-05-10:

#85

@Adam Bolte

Just did the same thing and can confirm that I no longer get crashes. Can't believe I struggled with this for months and didn't think to look at the realtek website for a new driver! You live and learn!

Thank you very much for the help.

Revision history for this message

khaldan (khaldan) wrote on 2011-05-11:

#86

Can confirm the bug (not present in Ubuntu 10.04, not tested in 10.10 but present in 11.04) with a Realtek chip (see attached output from lspci -v) and the workaround by Adam Bolte. But it seems like my system is removing the realtek driver during restart and loads the buggy driver again. At least the bug is occuring after restart and after reinstalling the realtek driver its gone again. Would be perfect if somebody could fix the ubuntu driver or find a way to permanently install the realtek driver (although the bug wouldnt be fixed by that ;-))

Revision history for this message

Adam Bolte (boltronics) wrote on 2011-05-12:

#87

@Andrew Chambers

Awesome. Actually, the latest driver release is just a few weeks old. These drivers might not have worked prior to that in 10.10, but I haven't investigated.

@khaldan

Before Gentoo I was using Ubuntu 10.04 on this system also and did not see the issue there. I guess I have been switching OSs a lot lately.

Anyway, if it's loading the buggy driver again I can think of two things that might cause it:
1. Loading a new kernel, or running an update.
2. You have the module saved in your initramfs image.

You can probably check for the second scenario with something like:
$ cat /boot/initrd.img-$(uname -r) | zcat | cpio --list 2>/dev/null | grep r8168

If the module is listed, and you have already installed the driver from the Realtek website, maybe the following command will help you (for the next time you boot):
$ sudo update-initramfs -k all -c

Agreed - I should have called it a work-around and not a fix.

Revision history for this message

Robbie Williamson (robbiew) wrote on 2011-05-13:

#88

The updated driver has been proven by a few users to resolve this issue:

http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

Revision history for this message

Tuukka Norri (tsnorri) wrote on 2011-08-23:

#89

I had the same problem with Ubuntu 11.04, Linux 2.6.38-11-generic and the r8169 driver that shipped with it. Installing a driver from Realtek resolved the problem.

Revision history for this message

lagerimsi (lagerimsi) wrote on 2011-08-28:

#90

can confirm this problem with Natty (11.04) and Linux 2.6.38-11-generic.

seems it not only affects nfs - also videochat and other programs making heavy use of the nic.

new driver is out:
http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2 (2011-8-25)

Revision history for this message

nick (niek-art) wrote on 2011-09-03:

#91

I had the same problem after updating my ubuntu server (8.04) to the newest. Then the problems started.
It took me several months to find this posting.
I use samba on my server (software raid) and can confirm that the latest river from realtek works. It unloads the one in the kernel (size 84022) and replaces it with a newer one (size 203096) after compilation. I have not yet rebooted since my raid is being rebuild because of another crash due to this faulty nic driver.

It is about time Ubuntu puts this driver in the distributions...

Revision history for this message

Christoph Gritschenberger (christoph.gritschenberger) wrote on 2011-10-03:

#92

I also had lock-ups with mine.
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

I tried with Ubuntu 11.04 and Fedora 15.
After 2-5GB had been transfered the system locked up. In Fedora it also did reboot.
This happend when copying via SMB or SFTP (on both sytems).

After installing the realtek-driver from the homepage everything worked fine again.
Just copied 50GB via SFTP without a problem.

Revision history for this message

UmitG (hazamatic) wrote on 2011-10-26:

#93

I also had the same problem while copying files into my raid using SMB.

The freeze would only happen while copying files INTO the raid, but copying OUT of the raid was fine. Meaning the machine was freeze if downloading, but uploading was fine.

After installing the realtek driver from the above, I have no more problems. Transfer speeds have also improved after install the new drivers.

I would be nice if the linux kernel had a working driver...

Revision history for this message

BeJay (bjdag79) wrote on 2012-07-02:

#94

How can this be medium?? It's been going on for 2 years now! I've still got the same issues since upgrading to 12.04 from 10.04. All 10.04 ubuntu clients are the same, but the server is now 12.04. I have gig clients as well as a couple of 54Mb wireless G bridge clients that also lock up and die. This is terrible since I use it to mount over wireless that had no trouble on 10.04 -> 10.04 machines. Is anyone bothering with this MAJOR issue these days?

Revision history for this message

bamyasi (iadzhubey) wrote on 2012-07-03:

#95

I can confirm the same NFS crashes on my Ubuntu Server 12.04 NAS with the hardware RAID5 (3ware 9750-24i4e). I would rate this as a catastrofic bug and surprised there were no activity on it for years. I personally certainly do not enjoy repairing a 40-TB filesystem after regular crashes.

# uname -a
Linux yuka 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# dpkg -l | grep nfs
ii libnfsidmap2 0.25-1ubuntu2 NFS idmapping library
ii nfs-common 1:1.2.5-3ubuntu3 NFS support files common to client and server
ii nfs-kernel-server 1:1.2.5-3ubuntu3 support for NFS kernel server

# lspci | grep Ethernet
02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

penalvch (penalvch) on 2012-07-25

summary:

- System lock-up when receiving large files over a Realtek NIC (big data
- amount) from NFS server
+ 10ec:8168 System lock-up when receiving large files over a Realtek NIC
+ (big data amount) from NFS server

penalvch (penalvch) on 2012-07-25

description:	updated
tags:	added: lucid maverick natty needs-upstream-testing regression-release

Revision history for this message

penalvch (penalvch) wrote on 2012-07-25:

#96

David, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

Helpful Bug Reporting Links:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug
https://help.ubuntu.com/community/ReportingBugs#Adding_Additional_Attachments_to_an_Existing_Launchpad_Bug

David, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

Helpful Bug Reporting Links:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug
https://help.ubuntu.com/community/ReportingBugs#Adding_Additional_Attachments_to_an_Existing_Launchpad_Bug

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Revision history for this message

David (g-launchpad-strategyplayer-net) wrote on 2012-07-25:

#97

Christopher, as I said above, I have replaced the Realtek OnBoard NIChip with an Intel PCI NICard to bypass the whole issue after it has become even worse and unbearable when I updated to 11.04.

Now of course I could remove the Intel Card and reactivate the Realtek Chip in order to test if the issues presists. But since I gave the concerned machine to my father some months ago (and therefore have no regular access to it) I don't really feel like warming up this annoying old issue.

But maybe someone else can make those tests. When I look at the comments, there still seem to be some people affected, not?

Revision history for this message

Adam Bolte (boltronics) wrote on 2012-07-26:

#98

I might be able to run some tests as I still have the same Gigabyte X58A-UD9 that was experiencing this issue. However, I run Debian Wheezy on it and don't use it all that often. I haven't seen this issue in months, which either means:

1. I downloaded the driver from Realtek ages ago, forgot about it and haven't upgraded kernels to anything incompatible with it.

2. Debian Wheezy doesn't have the problem any more (but I understand you'll want me to test Ubuntu development images anyway).

3. I just haven't copied sufficiently large files to observe this problem (most files I copy across my network are <400Mb).

Anyway, will try to look into this when I have time (hopefully this weekend) if nobody else beats me to it.

Revision history for this message

Stefan Bader (smb) wrote on 2012-07-26:

#99

There are a few reasons this bug report is not well looked after. Admitted not all very good ones but things come together. First of all this seems quite hardware specific (not only tied to the NIC but possibly also to other sources like mother board make, the network cables or switches). I myself have a Realtek 8111/8168B and could never reproduce the issue.
There is also the time. There are always other and more problem which may affect even more people. And as long as there is no one asking in the report about its status, it unfortunately can fall down the cracks. I probably should have unassigned myself but then I had forgotten about it.
There is also the problem that over time there have been additions to this report that are about completely different hardware (Intel NICs instead of Realtek). This unfortunately causes often more confusion than it is helpful. Just as a general rule, for something that looks hardware specific it is better to open its own bug. It is easy to make it a duplicate of some other bug for someone looking at them but it is really hard to work on one report that mixes comments of things that are actually not the same.
Just a word about the driver from Realtek: it is a valid option for someone affected but to have that bundled up in the distro kernel just is too much of a burden for maintenance. Someone needs to make sure that it does not break when the rest of the kernel changes, there would be different bugs than for the in-kernel driver and so on. And Realtek should really make sure the driver in the upstream kernel is good. That would help everyone.

But ok, so much for attempts of explanations from this side. What I would like to propose is that those who are still affected by this on either Precise (12.04) or Quantal (development release as of now) should open their own new bug report ("ubuntu-bug linux" will gather automatically some data which is usually asked for). Optionally reporting the bug number here, so people with the same hw can subscribe to the new report. This just to get things separated. There always was one other problem. Without any oops or panic message and the system only locking up it is near impossible to find anything. Being logged in or using netconsole is of little use when the NIC is the problem. Serial ports are quite rare now (maybe a usb-serial adapter could be used). So I was wondering about using crashdump when I saw newer posts on this report. Unfortunately this is in a bit of a broken state as I found out when looking. I plan to update the debugging wiki (https://wiki.ubuntu.com/Kernel/CrashdumpRecipe) as one of the next things to do. So when that is updated it may be an option to get some useful data for finding the issue.

There are a few reasons this bug report is not well looked after. Admitted not all very good ones but things come together. First of all this seems quite hardware specific (not only tied to the NIC but possibly also to other sources like mother board make, the network cables or switches). I myself have a Realtek 8111/8168B and could never reproduce the issue.
There is also the time. There are always other and more problem which may affect even more people. And as long as there is no one asking in the report about its status, it unfortunately can fall down the cracks. I probably should have unassigned myself but then I had forgotten about it.
There is also the  problem that over time there have been additions to this report that are about completely different hardware (Intel NICs instead of Realtek). This unfortunately causes often more confusion than it is helpful. Just as a general rule, for something that looks hardware specific it is better to open its own bug. It is easy to make it a duplicate of some other bug for someone looking at them but it is really hard to work on one report that mixes comments of things that are actually not the same.
Just a word about the driver from Realtek: it is a valid option for someone affected but to have that bundled up in the distro kernel just is too much of a burden for maintenance. Someone needs to make sure that it does not break when the rest of the kernel changes, there would be different bugs than for the in-kernel driver and so on. And Realtek should really make sure the driver in the upstream kernel is good. That would help everyone.

But ok, so much for attempts of explanations from this side. What I would like to propose is that those who are still affected by this on either Precise (12.04) or Quantal (development release as of now) should open their own new bug report ("ubuntu-bug linux" will gather automatically some data which is usually asked for).  Optionally reporting the bug number here, so people with the same hw can subscribe to the new report. This just to get things separated.  There always was one other problem. Without any oops or panic message and the system only locking up it is near impossible to find anything. Being logged in or using netconsole is of little use when the NIC is the problem. Serial ports are quite rare now (maybe a usb-serial adapter could be used). So I was wondering about using crashdump when I saw newer posts on this report. Unfortunately this is in a bit of a broken state as I found out when looking. I plan to update the debugging wiki (https://wiki.ubuntu.com/Kernel/CrashdumpRecipe) as one of the next things to do. So when that is updated it may be an option to get some useful data for finding the issue.

Changed in linux (Ubuntu Oneiric):
assignee:	Stefan Bader (stefan-bader-canonical) → nobody
Changed in linux (Ubuntu):
assignee:	Stefan Bader (stefan-bader-canonical) → nobody

Revision history for this message

penalvch (penalvch) wrote on 2012-07-26:

#100

David, this bug report is being closed due to your last comment regarding how this no longer affects you and you do not have the hardware. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

no longer affects:	linux (Ubuntu Oneiric)
no longer affects:	linux (Ubuntu Natty)
Changed in linux (Ubuntu):
status:	Incomplete → Invalid

Revision history for this message

Adam Bolte (boltronics) wrote on 2012-07-27:

#101

Wow. Just wow. This bug has been open for nearly 2 years, and 100 comments later the bug report is closed because one guy recently claimed to experience the same problem with an Intel card instead of a Realtek - who should have just opened a separate bug report. This bug has *always* been about specific Realtek chipsets, so opening up other bug reports would be pointless.

I myself have previously experienced this same issue on 3 different computers (all different motherboards) all with the same Realtek hardware. I did not mention the other two because one is now critical work machine I cannot readily test since they are now in production with the driver from the Realtek website, and the other I don't have any more. It's unbelievable that Canonical Q/A have been unable to reproduce. You even have some specific motherboard models with the problem mentioned.

> Someone needs to make sure that it does not break when the rest of the kernel changes
It was already broken. It doesn't get much worse than a complete system lock-up.

As previously mentioned, I don't even run Ubuntu any more. I don't get paid from Canonical. Here I was going to spend my time over the weekend to try and help you guys out, but now I don't think I'll bother.

Ubuntu
linux package

10ec:8168 System lock-up when receiving large files over a Realtek NIC (big data amount) from NFS server

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntulinux package

10ec:8168 System lock-up when receiving large files over a Realtek NIC (big data amount) from NFS server

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package