10ec:8168 System lock-up when receiving large files over a Realtek NIC (big data amount) from NFS server

Bug #661294 reported by David
This bug affects 20 people
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

1) I'm using Kubuntu 10.10 / 2.6.35-22-generic (at the client machine)

3) I expect to be able to copy files from my NFS server (which runs Kubuntu 9.04 / 2.6.32-25) without any problems, as I could before the upgrade when my client ran Kubuntu 8.04. Both are connected via Gigabit Ethernet.

4) I'm facing complete system lock-ups at the client machine (only the reset button helps) when receiving large amounts of data from the NFS server. The server itself doesn't seem to be affected by this problem.

I did the following tests:

- Copy files of various sizes (up to 2 GiB) from the client to the server -> WORKS
- Copy small files (like 100 MiB) from the server to the client -> WORKS
- Copy bigger files (like 1 GiB or more) from the server to the client -> CLIENT LOCKS UP

(I'm actually able to copy a 1 GiB file, which I previously copied from client to server, right back from server to client successfully. But since this doesn't seem to cause any network traffic, I assume it's just 'fake' and in reality the file is served from some cache on the client. With a 2 GiB file the same situation also causes a lock-up.)
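To check the cache theory, one can force the client to drop its page cache before the second copy; if the read then generates real network traffic (or a lock-up), the earlier success was indeed served from cache. A sketch (requires root; the file name is just an example):

```shell
# Flush pending writes, then drop the page cache plus dentries/inodes
# so the next read must come over the wire
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
# Repeat the copy and watch for real NFS traffic this time
cp /media/data5/bigfile.1g /tmp/
```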

Other situations that caused lock-ups:
- Letting the Strigi indexer of the client index files on the server's NFS share
- Extracting archives via the client which are on the server's NFS share
- Watching a movie on the client which is on the server's NFS share (though this only seldom locks up: movies with low bitrates haven't caused any problems yet, while movies with high bitrates cause problems only sometimes and after varying amounts of time)

By the way, it doesn't matter whether I copy the files using 'cp' or some graphical file manager.

Given these observations, it seems that the NFS client causes the system to lock up when receiving large amounts of data.

According to /proc/mounts my NFS shares are mounted as follows:

192.168.1.101:/media/data5 /media/data5 nfs rw,sync,nosuid,nodev,noexec,noatime,nodiratime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.101,mountvers=3,mountport=53457,mountproto=tcp,addr=192.168.1.101 0 0
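For reference, the same mount could be recreated by hand with the options shown above (IP and paths taken from this report; adjust for your own setup):

```shell
# Recreate the reported NFS v3/TCP mount with explicit rsize/wsize
sudo mount -t nfs \
  -o rw,sync,nosuid,nodev,noexec,noatime,nodiratime,vers=3,rsize=262144,wsize=262144,hard,proto=tcp,timeo=600,retrans=2 \
  192.168.1.101:/media/data5 /media/data5
```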

The Kubuntu 8.04 client I had before, on which those problems did not occur, had quite similar settings, I believe.

I'd gladly provide any further information / logs if you tell me how to enable them / where to gather them.

WORKAROUND: I put the following in /etc/rc.local so I don't have to enter it again after every reboot:

wondershaper eth0 500000 500000
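For context, the complete /etc/rc.local would look roughly like this (wondershaper takes rates in kbit/s, so 500000 caps the link at about 500 Mbit/s each way; a sketch, not the reporter's exact file):

```shell
#!/bin/sh -e
# /etc/rc.local -- executed at the end of each multiuser runlevel.
# Cap eth0 at ~500 Mbit/s down/up; this avoids the lock-up at the
# cost of half the gigabit link speed.
wondershaper eth0 500000 500000
exit 0
```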

Revision history for this message
Fabio Marconi (fabiomarconi) wrote :

Hello David
can you please reproduce the bug then attach here
/var/log/syslog
/var/log/dmesg
/var/log/kern.log
Thanks
Fabio

Changed in ubuntu:
status: New → Incomplete
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Hello Fabio

I've attached the logs:

LOGS1 contains only kern.log and dmesg since I somehow 'lost' syslog; LOGS2 is from a second reproduction of the bug and also contains syslog.

LOGS 1

kern.log shows the normal reboot I did (starting Oct 16 23:57:07) before reproducing the bug, and then the reboot via reset button after the system locked up when I reproduced the bug (Oct 17 00:00:36). There don't seem to be any entries from the moment the bug occurred (which was around 23:59).

dmesg seems to show the reboot after the lockup (?)

LOGS 2

Here I think kern.log shows only the boot after the lock-up (as does dmesg), but syslog shows a bit more this time. The moment of the crash must be around 00:28.

Do I have to raise the log level or something in order to get more detailed logs? (If yes: how?)
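Regarding the log-level question: the kernel's console log level can be raised so that messages are printed straight to the text console, where they may still be readable after a hang even if they never reach the on-disk logs. A sketch (requires root):

```shell
# Print all kernel messages (up to debug level) on the console
sudo dmesg -n 7
# The same setting via sysctl (first number = console log level)
sudo sysctl -w kernel.printk="7 4 1 7"
```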

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Anyone still interested in solving that bug?

Changed in ubuntu:
status: Incomplete → New
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

I have meanwhile tested the whole thing on a fresh installation of Kubuntu 10.10 and the problem is exactly the same there.
(While on Kubuntu 8.04 everything still works without any problems.)
So the problem doesn't seem to come from a broken installation (since it even exists when trying it from the LiveCD), and it also doesn't seem to come from broken hardware (since there is no problem when using 8.04).

What can I do to provide you with additional information to solve the problem?

Revision history for this message
Bruce Edge (bruce-edge) wrote :

I can corroborate the problem reported here. I have the same problem with a number of servers that were recently migrated to 10.04.
NFS is not stable: frequently, large transfers result in a "task blocked for 120 sec..." message and a dmesg stack trace.

It's utterly incomprehensible to me that an LTS release would have this issue, and there appears to be zero interest on the part of Canonical to fix it, or even acknowledge its existence.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Thank you for your message! I knew from message boards that I'm not the only one with this problem, but I feared that for the others affected the problem isn't important enough to let the devs know about it...

By the way: I tried something new. I installed and booted the Lucid kernel (2.6.32-21-generic) on my Kubuntu 10.10. Unfortunately this didn't solve the problem. So it doesn't seem to make a difference whether I use 2.6.32 or 2.6.35.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Ok, sounds like this is Confirmed based on user reports. Also this is pretty much the kernel's domain, so I'm setting the package to linux.

Setting Importance to Medium as this seems to be a pretty big problem for users who it affects.

To all reporters, if you can provide explicit details on the *server* platform that would be very helpful. This includes the mount line for the server (just paste the output of the 'mount' command), and also the contents of /etc/exports on the server.

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Bruce Edge (bruce-edge) wrote :

I've tried each of the 10.10 PPA kernel backports in turn as they were made available. It has made no difference.

I've also played with the mount lines. Here are some of the combinations I've tried:

rw,noatime,nfsvers=3,udp,rsize=32768,wsize=32768,actimeo=3,sloppy,addr=135.149.74.51

udp -> tcp
no nfsvers=3
no actimeo
no noatime

The addr= bit was added because I had a NIC with multiple IP addresses and thought that might be confusing it, as the NFS docs state that the "server IP resolution logic" is not entirely correct.

There are, however, some servers that have this problem and some that don't.

OK:
Intel(R) Xeon(R) CPU X5680 @ 3.33GHz

Not OK:
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
AMD Opteron(tm) Processor 280
 Intel(R) Xeon(TM) CPU 3.00GHz

They are all using the same mount args as provided by LDAP autofs.

It may be just the use case that has prevented it from happening on that one server.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

My Server is:
 Kubuntu 10.04 (Linux unimatrixserver 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:05:27 UTC 2010 x86_64 GNU/Linux)

On the server, the 'mount' command provides the following output:

/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/sdd1 on /media/b2 type ext3 (rw)
/dev/sdc1 on /media/d3 type ext3 (rw)
/dev/sde1 on /media/b1 type ext3 (rw)
/dev/sdb1 on /media/d5 type ext3 (rw)
/dev/sda6 on /media/d4 type ext3 (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

The output of cat /proc/mounts seems to be a bit more detailed:

rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=1019572k,nr_inodes=254893,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/disk/by-uuid/cafc3ff0-a72d-4901-9d74-a608281b7a9c / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
none /var/lib/ureadahead/debugfs debugfs rw,relatime 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /var/run tmpfs rw,nosuid,relatime,mode=755 0 0
none /var/lock tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/dev/sdd1 /media/b2 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdc1 /media/d3 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sde1 /media/b1 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sdb1 /media/d5 ext3 rw,relatime,errors=continue,data=ordered 0 0
/dev/sda6 /media/d4 ext3 rw,relatime,errors=continue,data=ordered 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0

NOTE: Only /media/d3 /media/d4 and /media/d5 are exported via NFS

The content of /etc/exports is the following:

/media/d5 192.168.1.0/255.255.255.0(rw,sync)
/media/d4 192.168.1.0/255.255.255.0(rw,sync)
/media/d3 192.168.1.0/255.255.255.0(rw,sync)

(As far as I can remember I also tried several different settings, like 'async' instead of 'sync', but I can't name exactly all the things I tried some weeks ago.)

If you need any additional information about the server or the client I'd be glad to provide!

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Thanks, guys, for providing some more details.

What's really needed is the mount arguments from the clients (the machines that have mounted the shared exports and are locking up) and the software versions from the NFS servers (the machines that are exporting files via NFS).

Specifically, it would help to see the following on the servers:

apt-cache policy nfs-kernel-server
uname -a
ifconfig -a

Anything that will help build a solid repeatable test case.
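The requested details could be collected in one go with a small script like the following (the output file name is just an example):

```shell
#!/bin/sh
# Gather the server-side information requested above into a single
# file that can be attached to the bug report
{
  echo "== nfs-kernel-server version =="
  apt-cache policy nfs-kernel-server
  echo "== kernel =="
  uname -a
  echo "== network interfaces =="
  ifconfig -a
  echo "== exports =="
  cat /etc/exports
} > nfs-server-info.txt 2>&1
```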

Revision history for this message
Bruce Edge (bruce-edge) wrote :

Machine 1:

0 9:39:31 root@topaz ~
0 #> apt-cache policy nfs-kernel-server
nfs-kernel-server:
  Installed: 1:1.2.0-4ubuntu4
  Candidate: 1:1.2.0-4ubuntu4
  Version table:
 *** 1:1.2.0-4ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
        500 http://wlvmirror.lsi.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
0 11:07:14 root@topaz ~
0 #> uname -a
Linux topaz 2.6.35-22-server #33~lucid1-Ubuntu SMP Sat Sep 18 13:29:53 UTC 2010 x86_64 GNU/Linux
0 11:07:26 root@topaz ~
0 #> ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          inet addr:135.149.75.126 Bcast:135.149.75.255 Mask:255.255.255.0
          inet6 addr: fe80::216:35ff:fe5b:a31e/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:71774991 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70132066 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23580470583 (23.5 GB) TX bytes:44692733013 (44.6 GB)

eth0 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:20473514 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35066033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1655710405 (1.6 GB) TX bytes:22332615100 (22.3 GB)
          Interrupt:25

eth1 Link encap:Ethernet HWaddr 00:16:35:5b:a3:1e
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:51301477 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35066033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21924760178 (21.9 GB) TX bytes:22360117913 (22.3 GB)
          Interrupt:26

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:31618343 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31618343 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20323093919 (20.3 GB) TX bytes:20323093919 (20.3 GB)

virbr0 Link encap:Ethernet HWaddr f2:f4:de:5a:e7:79
          inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
          inet6 addr: fe80::f0f4:deff:fe5a:e779/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1891 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:194877 (194.8 KB)

Machine 2:

0 #> apt-cache policy nfs-kernel-server
nfs-kernel-server:
  Installed: 1:1.2.0-4ubuntu4
  Candidate: 1:1.2.0-4ubuntu4
  Version table:
 *** 1:1.2.0-4ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages
        500 http://wlvmirror.lsi.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
0 11:08:10 root@tonic ~
0 #> uname -a
Linux tonic 2.6.35-22-server #34-Ubuntu SMP Thu Oct 7 15:36:13...


Revision history for this message
Bruce Edge (bruce-edge) wrote :

Note that this was happening long before we switched to bonded interfaces.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

The mount arguments on the CLIENT according to /proc/mounts

rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=2021240k,nr_inodes=505310,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
/dev/disk/by-uuid/e21a8bd2-83f6-43ed-8c68-08fcdc85631b / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /var/run tmpfs rw,nosuid,relatime,mode=755 0 0
none /var/lock tmpfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/sdb5 /media/d1 ext3 rw,relatime,errors=continue,commit=5,barrier=0,data=ordered 0 0
/dev/sdb6 /media/d2 ext3 rw,relatime,errors=continue,commit=5,barrier=0,data=ordered 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
192.168.1.103:/media/d5 /media/d5 nfs rw,sync,nosuid,nodev,noexec,noatime,nodiratime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.103,mountvers=3,mountport=36708,mountproto=tcp,addr=192.168.1.103 0 0
192.168.1.103:/media/d4 /media/d4 nfs rw,sync,nosuid,nodev,noexec,noatime,nodiratime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.103,mountvers=3,mountport=36708,mountproto=tcp,addr=192.168.1.103 0 0
192.168.1.103:/media/d3 /media/d3 nfs rw,sync,nosuid,nodev,noexec,noatime,nodiratime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.103,mountvers=3,mountport=36708,mountproto=tcp,addr=192.168.1.103 0 0

(Again, I tried several different settings here like UDP instead of TCP and such things.)

On the SERVER we have for "apt-cache policy nfs-kernel-server" the following:

nfs-kernel-server:
  Installed: 1:1.2.0-4ubuntu4
  Candidate: 1:1.2.0-4ubuntu4
  Version table:
 *** 1:1.2.0-4ubuntu4 0
        500 http://ch.archive.ubuntu.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status

And for "uname -a", as stated above:

Linux unimatrixserver 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:05:27 UTC 2010 x86_64 GNU/Linux

And for "ifconfig -a"

eth0 Link encap:Ethernet HWaddr 90:e6:ba:67:96:73
          inet addr:192.168.1.103 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::92e6:baff:fe67:9673/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:62655615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:69746827 errors:0 dropped:0 overruns:0 carrier:1
          collisions:0 txqueuelen:1000
          RX bytes:46398335710 (46.3 GB) TX bytes:84718869326 (84.7 GB)
          Interrupt:27

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK...


Stefan Bader (smb)
tags: added: kernel-server
Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

I can confirm this behaviour as well, after a recent overhaul of my home network. It's very frustrating: NFS usage will cause client lock-ups. This has happened several times, and can only be resolved by pressing the hard reset button. It happens in one of two ways: X locks up completely, or X stops responding but I can still move the mouse. In the second scenario, pressing Ctrl+Alt+Fn locks up X completely.

As a symptom of the problem is a lockup of X, I will mention that I am using the NVIDIA driver. I will test later without the proprietary driver to see if this affects anything, though lock-ups only occur during NFS activity, so I doubt it is responsible.

Locks occur all the time:
When populating my shotwell library (Photos stored on server)
Nautilus getting screenshots of HD videos
Transfer of large files.
Even just updating my .mozilla profile

Crashes are easily reproducible (basically any time I use NFS).

My desktop is using a gigabit card, as is the server, and they are connected with a netgear gigabit switch. I will post some stats about hardware on the machines later today, as well as info from nfsstat -m and mount.

Server is Ubuntu-server 10.10
Desktop is Ubuntu 10.10

I have another machine on the network which uses a 100mbit card and so far I have yet to see a crash when using NFS, but I haven't performed any serious tests yet. This machine is also using Ubuntu 10.10.

If there is anything specific I can provide please let me know.

Revision history for this message
Stefan Bader (smb) wrote :

I tried to reproduce this as simply as possible:

Server: Lucid (2.6.32-26) , nfs-kernel-server (1.2.0-4ubuntu4), nfs-common (1.2.0-4ubuntu4)
- AMD based
- exported fs ext4, exported like in comment #10 (rw,sync)
Client: Maverick (2.6.35-23), nfs-common (1.2.2-1ubuntu1)
- Intel based
- target fs ext4

Connection is 1000Mbit/full duplex according to ethtool. I did the mount without any special options, just a plain "mount <ip>:/...". One thing I noticed is that

192.168.2.5:/srv/share/nfs /mnt nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.2.5,mountvers=3,mountport=37472,mountproto=udp,addr=192.168.2.5 0 0

my wsize and rsize values are smaller. And I use ext4 instead of ext3. But hopefully that is not the issue. Anyway, something seems different, because I successfully copied a 4GB file without issues. So we likely need to find out exactly what the failing setup is.

Has anybody running into this tried to run the actual copy from vt1 (which usually gets the console errors and may show something that does not make it into the logs)?
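A reproducible test case could start from a known file generated on the server, so every tester copies identical data (paths and file name are examples):

```shell
# On the server: create a 4 GiB file of random data inside an export
dd if=/dev/urandom of=/media/d5/testfile.4g bs=1M count=4096
# On the client, from vt1: copy it back and note when/where it stalls
cp -v /media/d5/testfile.4g /tmp/
```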

Revision history for this message
Stefan Bader (smb) wrote :

Actually the values are bigger. I just could not read the number of digits...

Revision history for this message
Bruce Edge (bruce-edge) wrote :

As for more test cases and messages, here are some other bugs that AFAICT all refer to the same problem:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/561210
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/214041
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/585657
http://web.archiveorange.com/archive/v/PqQ0Rrbfh6Yp6PVD7AO8

I've reported my info in the above links so I won't repeat it here.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Notice that those bugs, as far as I can see, happen when SENDING big files to an NFS target, while the problem here happens when RECEIVING big files from an NFS target.
Of course it's quite possible that those problems are tightly connected, but I just wanted to point out the difference, because _sending_ big files to my NFS server does not seem to cause any problems in my case.

(By the way: someone on a message board told me that he had similar problems and solved/bypassed them by using the user-space NFS server 'unfs3' instead of the kernel NFS server. I haven't tested this myself yet, so it's not certain that it helps in my case; I just wanted to mention it.)

Revision history for this message
Stefan Bader (smb) wrote :

David is right here. One of the bugs mentioned by Bruce even seems to have gone away a while ago (the last comments I see are from 2008). The other bugs seem to relate to a server-side NFS lockup and produced stack traces on the server. Reading there, it seems that issue should be fixed now; the fix was part of 2.6.32-20.29 for Lucid and the initial release of Maverick.

So we should concentrate on receiving files in this bug report. If I understand correctly, the server side is unaffected and shows no messages. And the client can re-connect after the reset?
David, could you try to see whether doing the file copy on the text console (vt1) shows some messages when locking up? And just to mention: yes, surely using the user-space NFS server changes the problem (or maybe avoids it). It is completely different code (and presumably slower). But the problem may just as well be only on the client side, and it would be good to find it there. To understand it better we need to find out how to trigger it reliably (and hopefully get some data out of it).
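Since the client dies before anything reaches its on-disk logs, one way to "get some data out" is netconsole, which streams kernel messages over UDP to a second machine that keeps running. A sketch (IPs, port numbers, and the MAC address are placeholders):

```shell
# On the affected client: send kernel messages to 192.168.1.101:6666
# Parameter format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
sudo modprobe netconsole \
  netconsole=6665@192.168.1.102/eth0,6666@192.168.1.101/00:11:22:33:44:55
# On the receiving machine: capture whatever arrives until the crash
# (traditional netcat needs "-p" before the port: nc -u -l -p 6666)
nc -u -l 6666 | tee client-crash.log
```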

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Yes, the server side shows no messages and doesn't seem to be affected. (At least it shows no messages where I looked, but since I'm not an expert it's possible that I just don't know where to look on the server side.)

And yes, the client can re-connect without any problems after the client has rebooted.

I did some tests in tty1 and got at least a little bit of information. Strangely, not all test runs showed the same behaviour.

First test:
Copy a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer.

Result:
For one or two minutes nothing happened. Then the console showed some error messages. Unfortunately I only have the second part of the messages, because the first page had scrolled off screen before I could take a picture.
The error messages are the following:
[5351.183955] end_request: I/O error, dev sdb, sector 131925003
(... this error repeated many times with different sector numbers ...)
[5351.203631] Aborting journal on device sdb5
[5351.203691] end_request: I/O error, dev sdb, sector 126666835
[5351.305377] EXT3-fs (sdb5): ext3_journal_start_sb: Detected aborted journal
[5351.305508] EXT3-fs (sdb5): ext3_journal_start_sb: remounting filesystem read-only
cp: schreiben von "bigfile.mkv": Read-only file system
("schreiben von" = "writing of")
After these error messages (of which, as I said, I unfortunately don't have the first part) the console was still working and I could also switch back to tty7 (the GUI). As stated in the messages, it had remounted sdb5 read-only ('ro').

So I decided to repeat the test to also catch the first part of the error messages, but...:

Second test:
Copy again a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer.

Result:
I waited for about 10 minutes, but nothing happened this time. No error messages, but also no completion of the copy process.
I could still switch to other text consoles (tty2, tty3; I didn't try to go to tty7 this time).
After those 10 minutes I tried to abort the copy process by hitting Ctrl+C, which then seemed to lock up the console as well (I couldn't switch to other text consoles anymore). So I rebooted the machine.

Third test:
Copy again a 4 GB File from a NFS share to the second (magnetic, sdb5) harddisk on my computer.

Result:
I waited for about 40 minutes this time, but again nothing happened. I could still switch to other text consoles, but when I tried to switch to the graphical tty7 I just got a black screen and couldn't do anything (not even switch back to tty1). So I had to reboot the machine again.

Because the error messages from the first test looked like a hard disk problem, I wanted to do a fourth test copying not to the magnetic hard disk but to the SSD I also have in my computer (note that of course I had also tested copying to both disks long before, so it's not the first time I've tried to copy to the SSD).

Fourth test:
Copy a 4 GB File from a NFS share, this time to the first (solid-state, sda1) harddisk on my computer.

Result:
I waited for about 20 minutes this time.. again: nothing happened. Same thing as...


Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

I replaced my switch with a 10/100Mbit switch last night to see what effect this had; it seemed to remedy the problem (I no longer got client freezes). Of course this introduces a new problem: I am no longer using gigabit Ethernet, which is essential for me.

It is probably worth noting that I also get the same "ghost traffic" that you mentioned, David. The LEDs on my switch blink like crazy, but only on the client port.

I am going to try new Cat6 cables to see whether this remedies the problem; perhaps my cables are just not good enough for 1000Mbit.

I doubt this is the case, but since we are not having much luck debugging this it is worth checking everything.
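Before swapping cables, the link itself can be checked: a downshift from 1000/Full or growing low-level error counters would point at cabling, while clean counters would exonerate it (the interface name is an example):

```shell
# Verify the negotiated speed and duplex of the gigabit link
sudo ethtool eth0 | grep -E 'Speed|Duplex'
# CRC/frame/dropped counts rising during transfers suggest bad cabling
ifconfig eth0 | grep -E 'errors|dropped'
```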

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

That's interesting news. I'm not very surprised though, since it was striking how everyone writing about the problem on the message boards had gigabit Ethernet equipment. The cables would also be an interesting thing to try, but I also doubt that's the cause (since the problem did/does not occur in earlier Kubuntu versions... and we did not have better cables back then, right?)

I also have some probably interesting news. I did a fifth test which surprisingly showed completely new results!

The test setting was: Copy a 1.5 GB File from a NFS share to the first (solid-state, sda1) harddisk on my computer.

Result: After some time (I don't know exactly how long) I got loads of new error messages. I'll attach some pictures of the screen that I took, so that I don't have to type for half an hour.

What you'll see in the pictures is the 'output' of a few minutes; it would probably have continued like this for a long, long time.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :
Revision history for this message
Bruce Edge (bruce-edge) wrote :

David,
You'll have an easier time gathering data if you use something like 'screen' as a shell wrapper, so you can scroll back indefinitely.
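A possible way to do that with 'screen' (scrollback via Ctrl-a Esc, or logging to a file with -L; the test file name is hypothetical):

```shell
# Start a logged screen session, then run the copy inside it;
# all output ends up in ./screenlog.0 as well as on screen
screen -L -S nfs-test
cp -v /media/d5/bigfile.4g /tmp/
# After errors appear: Ctrl-a Esc enters scrollback mode, PgUp scrolls
```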

Also, regarding the 100Mbit/GigE issue: we did not experience this with 100Mbit, but we had all GigE when we upgraded to 9.10, which is when these problems started.

Lastly, regarding the lack of traffic for two years: I stopped adding data because nothing had changed. There seemed to be zero interest from Canonical until very recently.

Revision history for this message
Stefan Bader (smb) wrote :

Hm, I need to read through all the new info tomorrow. Just one thing: the two screenshots rather hint at a hard drive problem. Just a probably crazy blind shot: is there by chance a VIA SATA controller and WD disks involved?

Revision history for this message
Bruce Edge (bruce-edge) wrote : Re: [Bug 661294] Re: System lock-up when receiving large files (big data amount) from NFS server

Agreed, those screen shots look like a different problem.

On Tue, Nov 30, 2010 at 10:24 AM, Stefan Bader
<email address hidden>wrote:

> Hm. I need to read into all the new info tomorrow. Just one thing. The
> two screenshots rather hint some harddrive problem. Just a probably
> crazy blind shot: is there by chance a via sata controller and wd disks
> involved?
>
> --
> System lock-up when receiving large files (big data amount) from NFS server
> https://bugs.launchpad.net/bugs/661294
> You received this bug notification because you are a direct subscriber
> of the bug.
>

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote : Re: System lock-up when receiving large files (big data amount) from NFS server

@ Stefan

I have two hard disks in my client machine. The NFS freeze happens with both of them.
One (sdb) is from the Seagate Barracuda 7200.10 family (ST3250410AS); the other (sda) is a solid-state disk (it's labeled 'sagitta', but I guess that more or less means 'no name'). The operating system is installed on the solid-state disk.
My Gigabyte mainboard has (as far as I know) an Intel chipset, so I guess it should also have an Intel SATA controller. But I'm not 100% sure on this point.

When considering that a faulty hard disk might be the problem (I also considered this at one point), you should keep in mind the following:
- The problem happens with two totally different harddisks
- The problem happens exactly since the day I installed 10.10 instead of 8.04
- Those suspicious error messages were only shown in one (or two) of five test runs.

So if it's a hard disk problem, it was probably also 'introduced' with 10.10.

I'm now going to try two things and report the results afterwards.

1. Copy big files between the local disks (to rule out that the problem is possibly totally unrelated to NFS)
2. Attach an external HDD via USB to the client machine and try to copy from a NFS share to this external disk (to see if it's related to the mainboard's SATA-Controller)

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

The results of the two above mentioned tests:

1. Copying big files between the client's local hard disks causes no problems.

2. Copying big files from an NFS share to an external hard disk connected via USB to the client also locks up the client.
(Interestingly, this way it even manages to lock up tty1 so completely that I can't even change to tty2-6.)

Revision history for this message
Stefan Bader (smb) wrote :

@David: Thanks for testing. In that case it is not the problem I was reminded of (which only happens with certain VIA controllers and apparently WD disks). The observation about the switch LEDs is interesting. As I said, I did my tests using a Lucid server and a Maverick client with a gigabit switch. I ran the test again this morning and stopped after having copied 15GB for the 20th time. In my case I definitely saw both LEDs flash in sync. I did not pay too much attention to the HD LEDs, but those do not need to go as fast: writes hit the cache and are then written out in batches.

So obviously I still do something wrong, or I got lucky to have the "right" hardware. I am still trying to figure out what all of you with the problem may have in common (feels a bit like CSI, just without those nice fancy tools ;)). The fact that David saw errors that point to the disk subsystem, but has no problems when using only that subsystem, could also mean that whatever happens causes severe memory corruption. Or maybe missing interrupts (the error message mentions a timeout).

At the moment I am not sure which direction to go. First, it would probably be good to have more information on the affected systems. It would help if I could get the output of the following commands from at least two affected clients:

sudo lspci -vvnnn >lspci.txt
cat /proc/interrupts >interrupts.txt

Also, just to confirm, server is Lucid based and client on Maverick. At least this was the case in previous comments. And which was the last known good client? One test that comes to my mind: can you scp that big file from the server to the client? That would hint whether it is the network in general or specifically nfs.

Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Yes, the server is lucid, the client is maverick.
The last known good client in my case was hardy (8.04) but I never used 8.10/ 9.04 / 9.10 so I can't say anything about those releases. What I do know is that the problem also exists with the 10.04 Live-CD I tested recently.

I have attached the output of the commands.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :
Revision history for this message
Stefan Bader (smb) wrote :

Thanks David. So from the pci info, the clients use the exact same ethernet chip. Hardy being the last known good case may have a larger impact on how the hardware is driven. One thing that looked a bit weird, though maybe with no implication here, is the very low count of timer interrupts (though problems there usually show up as: the system stops doing anything until I hit a key on the keyboard). Something that has probably changed a lot since Hardy, and which I have personally seen cause problems, is the MSI support. If you have time, could you try booting with "pci=nomsi" on the kernel command line? When that is in effect, the interrupt assigned to eth0 should no longer say MSI. Does this change anything?
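A quick way to check whether the eth0 interrupt is MSI-based (and whether pci=nomsi took effect) is to look at /proc/interrupts. A sketch, assuming the interface is called eth0:

```shell
# Lines containing PCI-MSI indicate MSI interrupts; with pci=nomsi in
# effect, eth0 should be listed with IO-APIC instead.
grep -E 'eth0|PCI-MSI' /proc/interrupts || echo "no eth0/MSI entries found"
```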

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

I have scp'd the file from server to client with no issues, so I can confirm that it is not the physical network. I also booted a live CD of 9.04 and copied some files to the client with no issues.

I don't have a 64bit 9.10 liveCD handy, though I could test with a 32bit one I have - this might be adding an extra variable though?

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

I booted with pci=nomsi (the interrupts naturally changed a bit with this option, so I attached a new interrupts.txt to show that the command was in effect).
Unfortunately it did not solve the problem. The system did still lock up.

Revision history for this message
Stefan Bader (smb) wrote :

Alright, so it is not related to MSI, and Andrew proved that the network in general is ok. He seems to have quite a similar system to David's, though the GA-EP45-DS5 string in lspci is likely a lie, as it shows the same here and I have an EX58-UD3R. Both of you get 2 CPUs. From the logs David posted earlier, this seems to be a dual-core Intel without hyper-threading enabled. Both show this quite low number of timer interrupts.

@Andrew, yes, I guess mixing 32 and 64 bit would probably only help if it does not work, in which case the potential breakage was between Jaunty and Karmic. If it works, it may be that only 64 bit broke. Which is a hint, but hm.

The last log I saw from David was using 2.6.35-22-generic. Has either of you updated to -23 since then? Due to the silent hang, I don't think there is much to be gained from traces on the client. I wonder whether there is much to be gained from gathering a tcpdump on the server. Generally, does that "ghost traffic" start immediately or after a while?

Let me create some kernel packages which have a bunch of debugging code activated (especially lock checking and that stuff). Hopefully that could give any hint on the client side. I'll post here when I got something.

Revision history for this message
Stefan Bader (smb) wrote :

Ok, I got a 64bit generic kernel image for Maverick at http://people.canonical.com/~smb/lp/661294 (the version number of 9923 is intentional to make it install in parallel and be easy to identify). For me there is one rcu warning on boot which does not seem to have any impact. If you could try that and run the test again on vt1 (maybe in screen if that helps to get some scrollback).

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Yes, I updated to -23 since then.
I'll check your question regarding the ghost traffic.

The link to the kernel image gives me a 404.

Revision history for this message
Stefan Bader (smb) wrote :

Sorry, http://people.canonical.com/~smb/lp661294/ (slash in wrong position error)

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

I tested your kernel images.

Having them installed, the X server won't start, so it puts me at tty2 ('startx' results in an error message basically telling me that it can't load the 'nvidia' driver). Since we need the output from tty1, that's probably not an issue, I just wanted to mention it.

In tty1 I see the end of some 'boot log' which tells me that there was an error while mounting my nfs shares:

mountall: Keine Verbindung zu Plymouth ['Keine Verbindung' = 'No connection']
mount.nfs: Failed to resolve server 192.168.1.100: Name or service not known
mountall: mount /media/d3 [1222] brach mit dem Status 32 ab [= canceled with status 32]
(this error repeats for my two other NFS shares)

Regardless, after login my NFS shares seem to be connected and I can browse them.
So I tried to copy a 3 GB file from a share to the local client, but it was the same as always: the copy process started but never ended. It didn't give me any information on the screen. The HDD LEDs did not indicate any activity, and I guess the 'ghost traffic' was also there as always (but it wasn't that easy to tell this time, because the server had some other network activity from the internet at that time).

After rebooting into the normal kernel I found exactly 128.0 MB of the file I tried to copy in my home directory (where I wanted to copy it to). This is not unusual, but it's also not like this every time; sometimes there is no data fragment left behind at all.
Seeing this round number of 128.0 MB, I remember that it was the same amount a few times before (but not always... sometimes it's a 'random' amount of data).

Revision history for this message
Bill M (billmoritz) wrote :

I am having the same issues with nfs mounts. IOZone tests would fail on anything over 10 MB. I recompiled nfs-common and it seems to have fixed the issue for me. Maybe this will help isolate the issue for everyone else. This was tested on my Lucid server. Please don't laugh at my NFS server performance; I know it stinks.

Here's how I compiled nfs-utils:
root@dev-0:~# apt-get install build-essential fakeroot dpkg-dev -y
root@dev-0:~# mkdir build
root@dev-0:~# cd build
root@dev-0:~/build# apt-get source nfs-common
root@dev-0:~/build# apt-get build-dep nfs-common -y
root@dev-0:~/build# dpkg-source -x nfs-utils_1.2.0-4ubuntu4.dsc
root@dev-0:~/build# cd nfs-utils-1.2.0/
root@dev-0:~/build/nfs-utils-1.2.0# dpkg-buildpackage -rfakeroot -b
root@dev-0:~/build/nfs-utils-1.2.0# dpkg -i ../nfs-common_1.2.0-4ubuntu4_amd64.deb

My Kernel:
Linux mup-4 2.6.32-26-server #47-Ubuntu SMP Wed Nov 17 17:05:29 UTC 2010 x86_64 GNU/Linux

My IOZone test:
root@dev-0:~# iozone -i0 -r4k -s1G -e -t1 -F /mnt/slow/vol16/test
 Iozone: Performance Test of File I/O
         Version $Revision: 3.308 $
  Compiled for 64 bit mode.
  Build: linux

 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
              Al Slater, Scott Rhine, Mike Wisner, Ken Goss
              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
              Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
              Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
              Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

 Run began: Fri Dec 3 15:20:33 2010

 Record Size 4 KB
 File size set to 1048576 KB
 Include fsync in write timing
 Command line used: iozone -i0 -r4k -s1G -e -t1 -F /mnt/slow/vol16/test
 Output is in Kbytes/sec
 Time Resolution = 0.000001 seconds.
 Processor cache size set to 1024 Kbytes.
 Processor cache line size set to 32 bytes.
 File stride size set to 17 * record size.
 Throughput test with 1 process
 Each process writes a 1048576 Kbyte file in 4 Kbyte records

 Children see throughput for 1 initial writers = 2341.93 KB/sec
 Parent sees throughput for 1 initial writers = 2341.93 KB/sec
 Min throughput per process = 2341.93 KB/sec
 Max throughput per process = 2341.93 KB/sec
 Avg throughput per process = 2341.93 KB/sec
 Min xfer = 1048576.00 KB

 Children see throughput for 1 rewriters = 3055.24 KB/sec
 Parent sees throughput for 1 rewriters = 3055.22 KB/sec
 Min throughput per process = 3055.24 KB/sec
 Max throughput per process = 3055.24 KB/sec
 Avg throughput per process = 3055.24 KB/sec
 Min xfer = 1048576.00 KB

iozone test complete.

Revision history for this message
Bill M (billmoritz) wrote :

Sorry, I ran uname on the wrong machine.. Same kernel though:

Linux dev-0 2.6.32-26-server #47-Ubuntu SMP Wed Nov 17 17:05:29 UTC 2010 x86_64 GNU/Linux

Revision history for this message
Stefan Bader (smb) wrote :

@David, sorry for not replying sooner. Got distracted with other things. Nvidia not working sounds like you only installed the linux-image package and not the header packages (those are needed so dkms can compile a new nvidia module for it).
But sadly there does not seem to be anything that gets printed anyways.

Recompiling the client tools seemed to have solved Bill's case (thanks for the info, Bill), though that was on the server side, where I run stock packages without problems. Not sure if you want to try that.

Last resort really would be to have a tcpdump or a wireshark trace running on the server or another machine in the same segment and hope to find something strange there...

Revision history for this message
Bill M (billmoritz) wrote :

Server side for me is a nfs appliance running a BSD kernel with a proprietary nfs server daemon. Can't help there.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

No problem, you are quite fast, actually ;-)

I installed all three packages from the link you provided. Got no errors while installing, so I thought that should have worked out.

I could try to recompile nfs-common with the info that Bill thankfully provided.

I could also try the tcpdump / wireshark thing if you give me additional information on how to handle it. I've heard of it but never used it.

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

Sorry for my lack of input recently.

I can say that I have tried recompiling nfs-common from source as per Bill's instructions, but this didn't help and the client still crashes as it always has.

Revision history for this message
Bill M (billmoritz) wrote :

So in the end the recompile of nfs-common didn't fix my issue after all. It took a couple of days and then it happened again. I think my issue might be slightly different. I have multiple clients connecting to the same mount points. The Ubuntu Lucid client loses connectivity to the mount points on the server even when the other clients can still access them. Processes using those mounts go into the D state. Nothing but a reboot regains access to the mounts.

Right now I am trying out an Ubuntu Mainline Kernel to see if my issue is with Ubuntu patches to the kernel.

Specifically:
linux-image-2.6.37-999-generic_2.6.37-999.201012081134_amd64.deb
linux-headers-2.6.37-999-generic_2.6.37-999.201012081134_amd64.deb
linux-headers-2.6.37-999_2.6.37-999.201012081134_all.deb

https://wiki.ubuntu.com/Kernel/MainlineBuilds

Wondering if I should be testing with http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-lucid/ instead of http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/
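Installing those mainline packages (the filenames from the list above) is typically just a matter of:

```shell
# Install headers first, then the kernel image; dkms modules (e.g. nvidia)
# need the headers to rebuild against the new kernel.
sudo dpkg -i \
    linux-headers-2.6.37-999_2.6.37-999.201012081134_all.deb \
    linux-headers-2.6.37-999-generic_2.6.37-999.201012081134_amd64.deb \
    linux-image-2.6.37-999-generic_2.6.37-999.201012081134_amd64.deb
```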

Revision history for this message
Stefan Bader (smb) wrote :

Ok, several high level interruptions later...

@Bill, the current build is always the latest and greatest (if that happens to build). It can be a starting point for testing. If it consistently works with that, then you could go back to the latest release and see there. If it breaks between two releases then you could try the rc versions of the next. That way, if it has been fixed, we could narrow down roughly since when. If it still does hang on the current daily, then it is something still broken and at some point we probably need to ask upstream for help.

I recently saw a bunch of NFS bugfixes being sent to stable and upstream. These would hit the current daily first.

@David, so for debugging further: besides Wireshark, there may be the option of turning on debugging in rpc and nfs, too. All of the options produce a lot of output (and there is always a chance that things get slowed down enough for the problem to disappear). And it is probably a good idea to check with the latest mainline build to make sure the problem has not magically been fixed there. In that case it would be better to find out when.

1. rpc/nfs debugging

First make sure the nfs module is loaded on the client. Then run:
    sudo sh -c 'echo 32767 >/proc/sys/sunprc/nfs_debug'
for nfs and/or
    sudo sh -c 'echo 32767 >/proc/sys/sunprc/rpc_debug'
for rpc debugging. Both will result in a lot of output going into /var/log/syslog. In order to have as much backlog as possible, I would ssh into the client and run a 'tail -f /var/log/syslog' over the net. Maybe that loses some output when hanging, but on the other hand there might be some loss even when running locally. At least you retain some history and can cut and paste from there.

2. Wireshark

Just install the wireshark package on any machine in the same segment (connected to the same switch) as the client. Then run "sudo wireshark" (careful, it will complain about this being a security risk and so on in a little pop-up that, for me, often hides behind the main window). I would add "ip.addr == <clients ip address>" in the filter section and hit Apply. Then select the network interface and start the trace (in that case I would not do the ssh rpc debugging, since that would add a lot of ssh traffic).

Revision history for this message
Bill M (billmoritz) wrote :

I opened up a separate ticket (688437) because I feel my issue is different from this one. I've outlined my issue there with logs as well as how to recreate it. Nobody has responded to that ticket. The latest daily mainline kernel that I wrote about in this ticket didn't work for me, unfortunately. I will test with today's daily.

Revision history for this message
Robbie Williamson (robbiew) wrote :

This bug sounds like a tuning issue. Has anyone tried reducing the rsize to 32K or smaller? You could also try using NFSv2...it has 8K buffers, so the copy will take a LONG time, but eliminates a lot of overhead to help determine where to look. If the copy works in either scenario, then we can start tuning the mounts appropriately.
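A sketch of what such mounts might look like (the server address matches the one from earlier comments, but the export path /export is a placeholder):

```shell
# NFSv3 with a reduced 32K rsize/wsize:
sudo mount -t nfs -o nfsvers=3,rsize=32768,wsize=32768 192.168.1.100:/export /media/d3

# Or NFSv2, which is limited to 8K buffers (slow, but eliminates a lot
# of overhead as described above):
sudo mount -t nfs -o nfsvers=2 192.168.1.100:/export /media/d3
```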

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

@Stefan.

I tried "rpc/nfs debugging" as you said.

Strangely, when /proc/sys/sunrpc/nfs_debug and /proc/sys/sunrpc/rpc_debug are set to 32767, I no longer get a crash... Perhaps the bug knows I'm looking for it!

I thought this might have had some connection with me tailing the syslog in another ssh session, so I retested some transfers without a session checking the log - still no crash.

Setting these values back to 0 results in crashing once again.

I will try debugging just nfs and rpc separately in a moment, to see whether I can narrow the crashes down to one of them...

As for wireshark - my other Linux box has been rendered useless for the time being by me playing around with my LDAP server, so I won't be able to do this test for now.

@Robbie.

I have actually tried with 8k rsize and wsize, which didn't help, although this was the advice on the matter in some forums I came across and it helped other people - perhaps this is related.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

"First make sure the nfs module is loaded on the client." -> How / how's it called. Can't I assume that it must be loaded if I'm using nfs?

"Then run: sudo sh -c 'echo 32767 >/proc/sys/sunprc/nfs_debug'" -> This gives me a "sh: cannot create /proc/sys/sunprc/nfs_debug: Directory nonexistent" error.

I also made a capture with wireshark while reproducing a crash. The capture file now has a size of about 160 MB, even though I cropped it so that it only contains the data of a few seconds. I guess this is because it contains the transferred data of the file I tried to copy via NFS before it crashed?
Well, it doesn't seem to be the best idea to upload such a big file here to the bug tracker, so maybe you can give me some hints on what to search for in the capture file?

I noticed that wireshark marked a lot of lines in black and red, whose 'Info' column says things like:

"[TCP previous segment lost] [TCP segment of a reassembled PDU]" (very, very much of this type)
or
"[TCP out-of-order] [TCP segment of a reassembled PDU]" (also quite a lot of those)
or
"[TCP Previous segment lost] exp2 > nfs [ACK] Seq=217385 Ack=18456889 Win=17077 Len=0 TSV=77576 TSER=372796188" (maybe about 100 of this type)
or
"[TCP ACKed lost segment] exp2 > nfs [ACK] Seq=354493 Ack=248195985 Win=20672 Len=0 TSV=77917 TSER=372796529" (also quite a lot of those)
or
"[TCP Retransmission] [TCP segment of a reassembled PDU]" (only a handful of those I think)
or
"[TCP previous segment lost] Continuation" (maybe about 100 of this type)
or
"[TCP ACKed lost segment] V3 READ Call, FH:0x53092ff7 Offset:99811328 Len:262144" (only a handful)
or
"[TCP ACKed lost segment] [TCP Previous segment lost] V3 SETATTR Reply (Call In 115496)" (I think there's only one of those... maybe important?)

As I'm not an expert at all, this is of course just a shot in the dark, since I don't really know what to look for.

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

@David

There were a few typos in the commands Stefan suggested. (Replace sunprc with sunrpc)

So try:

sudo sh -c 'echo 32767 >/proc/sys/sunrpc/nfs_debug'

And

sudo sh -c 'echo 32767 >/proc/sys/sunrpc/rpc_debug'

I'd be interested to see whether this clears up any crashes at your end as well.

Apologies as I still haven't got round to debugging this further, but should be able to help soon.
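For completeness, and assuming the corrected sunrpc paths above, debugging can be switched back off the same way by writing 0 (this needs root and the sunrpc module loaded, otherwise the proc files won't exist):

```shell
sudo sh -c 'echo 0 >/proc/sys/sunrpc/nfs_debug'
sudo sh -c 'echo 0 >/proc/sys/sunrpc/rpc_debug'
```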

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :
Download full text (3.4 KiB)

Okay, I tried the one with NFS now.

At first I thought that the debugging output slowed down the process enough to avoid the problem, as Andrew experienced.

It copied quite a lot at a transfer speed between 20 and 45 MB/s, which seemed to be slow enough to avoid the crash. But then the transfer became faster, and between 50 and 60 MB/s I had the crash.

Now I've got a 90 MB syslog file (I didn't do the tailing via SSH because I haven't set up ssh on this machine at all) which ends quite unsuspiciously.

The last entries before the logging stopped because of the freeze are like:

NFS: read done (0:16/21807132 4096@1859477504)

There are thousands and thousands of those, of course, regularly interspersed with

NFS: nfs_readpage_result: 16772, (status 262144)
NFS: nfs_update_inode(0:16/21807132 ct=1 info=0x7e7f)

Sometimes they are also interspersed with a bunch of these:

NFS: read(path/bigfile.xyz, 32768@1844510720)

or those

 NFS: 0 initiated read call (req 0:16/21807132, 262144 bytes @ offset 1859579904)

or those

NFS: nfs3_forget_cached_acls(0:16/954383)
NFS: clear cookie (0xffff880113c51830/0x(null))

Probably all quite normal?

Here are some lines that seemed special to me (mostly because there are only very few of them and they seem kind of 'crippled' compared to the others, of which there are so many):

This line is a bit special because it only appears a few times in this form:

NFS: read done (0:16/21807132 4096@1806016 read done (0:16/21807132 4096@1806135296)

that one has a weird timestamp:

[2569325693.585308] NFS: read done (0:16/21807132 4096@1805770752)

(it's surrounded by timestamps like '[25693.578505] ')

that one is special too:

[25693.5132 4096@18047132 4096@180132 4096@1804734464)

or that:

NFS: read done (0@1797181440)

or that:

[25693.492344] N1789460480)

This one is also quite unique:

NFS: nfs3_forget_cache(null))

Here we have some weird brackets in front of the timestamp:

Dec 24 22:59:54 unimatrixzero kernel: <>[25691.765538] NFS: read done (0:16/21807132 4096@1699704832)

somewhere in between we have

NFS: dentry_delete(galerie/Allgemein, 0)
NFS: dentry_delete(converted-images/galerie, 0)
...

They don't seem to be related to the actual copy process, because they name a totally different path, which is part of a backup folder that gets transferred to the server every day via nfs using an rsync script. But the daily transfer was two hours earlier.

a long and special one:

NFS: read(path/bigfile.xyz, 32768@168: read(path/biS: read(path/FS: nfs_readpage_r NFS: nfs_update_inode628369] NFS: read do628371] NFS:628373] NFS628375] NFS: re628376] NFS: r628378] NFS: re628380] NFS:628381] NFS:628383] NFS: read done (0:16/21807132 4096@1694461952)

(this was all one line in the log)

or this one:

NFS: read dd done (0:16/21807132 4096@1694404608)

at one point we have a bunch of lines in which everything is quite crippled:

...
[2566@1666424832)
[2@1666428928)
[2 4096@16664096@1666437120)
132 4096@1666432 4096@1666445312)
1807132 4807132 4096@16666/21807132 4096@/21807132 4096@1666(0:16/2180:16/21807132 4096@1666469888)
...
(th...

Read more...

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

I finally tested the advice of someone in the forums who said that he could solve similar client problems by using 'unfs3' instead of 'nfs-kernel-server' on the server side. Unfortunately it didn't help; my client crashes no matter whether 'nfs-kernel-server' or 'unfs3' is running on the server.

I'm starting to get a bit desperate because I'm highly dependent on NFS (most of my files are on a file server and are accessed through the network from different machines). Today I even had multiple crashes while trying to watch a movie file with a high bitrate via NFS.

And actually I'm starting to lose hope that the problem will be solved via this bug report.

Can anyone give me any hint what else I could try?
Is there a serious chance that the problem could be solved by replacing the switch or by using a new ethernet-card in the client?
Are there any alternative nfs / kernel versions that I could try?
Is there a way to limit the speed with which data is sent or received trough the network? (Since it seems, that the problem only appears when a high transfer rate is reached. And no: Just using a 100MBit Switch is NOT a solution because that's too slow.)

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Now I'm bypassing the problem by limiting my network interface's bandwidth via 'wondershaper'. It ensures that the transfer speed from the server to the client stays around 40 megabytes per second, which seems to be a value that is not yet in the critical range (as we saw earlier, the problem only occurs at high transfer rates).

I'm not yet fully satisfied with the rather intransparent way in which 'wondershaper' limits the speed (for example, it seems to strongly reduce the upload speed, even though I thought I had configured it to only affect the download speed), but I haven't found a better alternative yet (because 'trickle', which seems to offer a more transparent limitation, in reality doesn't limit anything on my machine).

Of course this is not really a solution, but it's a workaround I'll probably be able to live with.

But if anyone comes up with further possibilities of testing which could lead to a real solution, I'd of course be willing to assist.

Revision history for this message
Bruce Edge (bruce-edge) wrote : Re: [Bug 661294] Re: System lock-up when receiving large files (big data amount) from NFS server

I commend your willingness to keep hacking at this. I gave up a long time
ago.
It appears that Ubuntu variants of 2.6.32+ kernels have an NFS client
problem that eventually takes the kernel down. There are numerous problem
reports on this. I've switched to debian for my clients that really matter
and no longer have the problem.

-Bruce

Revision history for this message
Andrew Chambers (andrewrchambers) wrote : Re: System lock-up when receiving large files (big data amount) from NFS server

The other day I upgraded to Ubuntu 11.04 alpha to see if it might fix the problem - it still occurs as expected.

I have also tried a fresh install of Fedora 14 and get the same issues.

@David - What command did you pass to wondershaper? I have tried with little success - any arguments I seem to give it limit my download speed to about 20kb/s, which has painful consequences given my home partition is mounted over NFS...

All I want to do is import my photos to Shotwell :'(

-Andrew

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

@ Andrew

I use the following command which I put in /etc/rc.local so I don't have to enter it again after every reboot:

wondershaper eth0 500000 500000

There would probably be more suitable values, but these seem to work quite well for me.
As stated above, it limits my download speed to around 40 megabytes per second. Still, the upload speed seems to be affected in a 'stronger' way I don't fully understand yet...

(In my case, the home partition luckily is local)
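For anyone who wants to avoid wondershaper's opaqueness: a similar effect can in principle be achieved with a plain tc ingress policer. A sketch, assuming the interface is eth0 and a rate of roughly 300 Mbit/s (about 37 MB/s; the values are guesses to tune, and this needs root):

```shell
# Attach an ingress qdisc and drop anything above the policed rate.
sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: protocol ip prio 50 \
    u32 match ip src 0.0.0.0/0 \
    police rate 300mbit burst 256k drop flowid :1

# To undo:
#   sudo tc qdisc del dev eth0 ingress
```

Unlike wondershaper, this only touches inbound traffic, so the upload speed should be unaffected.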

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

Well, wondershaper continued not to work even with the same command, so for the time being I have put an old 100/10 MBit NIC in my computer to stop the crashing.

If I replace this with a different gigabit NIC, would the crashing stop?

Clearly having /home mounted over such a connection isn't great, but slow as it is at least my photos are being added to my photo library without complete lock ups.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Strange - in my case it was the other tool, 'trickle', which didn't work at all. Maybe you want to give it a try...

I don't know about a different gigabit NIC; as far as I remember, no one in this 'thread' has tested that yet. I actually wanted to try it, but I was too lazy to buy a gigabit NIC after I finally got the 'partial' workaround with wondershaper.

Revision history for this message
Sean Clarke (sean-clarke) wrote :

Can't believe Linux can have such a critical issue - I commend you guys for persevering.

I too have this problem. I run virtual machines (KVM) over NFS, and this has rendered the system unusable:

Client:
 uname -a
Linux enterprise 2.6.35-27-server #47-Ubuntu SMP Fri Feb 11 23:09:19 UTC 2011 x86_64 GNU/Linux

The server is a Thecus NAS device

Revision history for this message
Sean Clarke (sean-clarke) wrote :

Just to confirm - it is 100% repeatable, just by generating a load of traffic - start 2 KVM images together and it just hangs.

I get 'nfs server not responding' in the logs, but I think that is because the system has hung (other clients can use the server and ping it at the same time, etc.).

[ 2631.463817] br0: port 4(vnet2) entering forwarding state
[ 2641.716774] vnet2: no IPv6 routers present
[ 2753.179883] INFO: task kvm:3256 blocked for more than 120 seconds.
[ 2753.179897] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2753.179905] kvm D 00000001000391e5 0 3256 1 0x00000000
[ 2753.179909] ffff8802f6941c58 0000000000000086 ffff880200000000 0000000000015b40
[ 2753.179914] ffff8802f6941fd8 0000000000015b40 ffff8802f6941fd8 ffff880325680000
[ 2753.179919] 0000000000015b40 0000000000015b40 ffff8802f6941fd8 0000000000015b40
[ 2753.179921] Call Trace:
[2753.179927] [<ffffffff815a080f>] __mutex_lock_slowpath+0xff/0x190
[2753.179929] [<ffffffff815a0213>] mutex_lock+0x23/0x50
[ 2753.179932] [<ffffffff8117be8d>] vfs_fsync_range+0x6d/0xa0
[ 2753.179933] [<ffffffff8117bf07>] generic_write_sync+0x47/0x50
[2753.179936] [<ffffffff811040ce>] generic_file_aio_write+0xae/0xd0
[ 2753.179944] [<ffffffffa0500911>] nfs_file_write+0xb1/0x200 [nfs]
[ 2753.179946] [<ffffffff811544da>] do_sync_write+0xda/0x120
[ 2753.179949] [<ffffffff810752ef>] ? kill_pid_info+0x3f/0x60
[ 2753.179950] [<ffffffff81075490>] ? kill_something_info+0x40/0x150
[ 2753.179953] [<ffffffff81290d78>] ? apparmor_file_permission+0x18/0x20
[ 2753.179955] [<ffffffff81260316>] ? security_file_permission+0x16/0x20
[ 2753.179957] [<ffffffff811547b8>] vfs_write+0xb8/0x1a0
[ 2753.179959] [<ffffffff81155152>] sys_pwrite64+0x82/0xa0
[ 2753.179962] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[ 2753.179963] INFO: task kvm:3327 blocked for more than 120 seconds.

Revision history for this message
Stefan Bader (smb) wrote :

Just recently there were two patches found that may be related. One was about starvation when receiving data over virtio-net (bug #579276) and the other fixed the nfs filesystem returning the wrong return code for flush/sync (bug #585657). From the stack trace in the previous comment it looks like this may be the latter issue. Both patches have been queued for one of the next updates. But for those interested, I have prepared debian packages that include both changes on top of all the changes currently staged for proposed.

The patch for bug #585657 may also be related to this bug, though this bug is about receiving large files, which would not write to the nfs file system. Receiving could be impacted, though, if the client were running inside of KVM.

Revision history for this message
Stefan Bader (smb) wrote :

Oh, test packages are at http://people.canonical.com/~smb/lp661294/ again.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

I have annoying news.
Today I built a new computer which is also connected to the network and which should also be able to access the problematic NFS shares. When setting it up, I expected two possible results:
Either it has the same problem as my other computer (which is: crashing when receiving large files over NFS), or it doesn't have the problem.
What I didn't expect was that the new machine could have an even more annoying problem - but unfortunately, exactly this is the case.

To be specific:
My new machine only reaches transfer speeds between 100 kbit/s and 1 Mbit/s (we're talking about Gigabit Ethernet here!), which of course makes the whole thing totally useless. Additionally, it seems to cause some overload in NFS, so that the file manager gets blocked not only on the new machine but also on the old one (since the old machine is, of course, still connected to the NFS shares).
In other words: NFS with client A results in a crash of the client; NFS with client B results in terribly low speed and a blockade of all clients...

I really don't know what to do or say anymore... after wasting a huge amount of time on the old problem and finally working out an only halfway satisfying workaround, I add a new machine to the network and now I'm facing even bigger problems than ever before.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Ok, sorry, you can ignore my last message. Because of the problems I had with NFS, I assumed that the networking problems of my new machine were also the fault of NFS. But I checked in more detail and found out that the Ethernet adapter of my new machine was faulty. I replaced it, and NFS now works fine on the new machine. Though on the old machine, the problems still exist.

Revision history for this message
Jeff Taylor (shdwdrgn) wrote :

Yesterday I swapped my 100Mb server NICs for new Gbit NICs (RTL-8169). I have been running lucid on 2 machines for about a month, and the last machine was upgraded about 3 months ago. The NIC upgrade was the only thing that changed yesterday. My switches are D-Link DGS-2208 (Gbit).

- When transferring files via NFS3, transfer rates run about 65Mb/s, and the *receiving* computer will lock up hard within the first minute. There seem to be no problems with the sending server. This is consistent regardless of which of my three servers is doing the sending or receiving.

- In addition to the NFS problems everyone else is reporting here, I also run a DRBD share between two servers, formatted with OCFS2. After one of the machines is rebooted, a DRBD sync starts between the servers, and again the *receiving* machine will lock up hard within the first minute. In drbd.conf I have 'rate 1000M', and transfer speeds were again around 65Mb/s at the time of lockup. The lockup has occurred on the receiver regardless of which server was being brought up. I have changed the rate to 500M and will see on the next reboot if there is still a lockup.

I think this may show that the problem is not limited to NFS transfers. I was able to get DRBD back in sync by disconnecting the ethernet cable until ubuntu finished booting up, then plugging in the network again. This would lead me to believe that the problem is more likely related to a combination of system load and network load?
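For reference, the syncer rate mentioned above is set per resource in /etc/drbd.conf; a minimal sketch using DRBD 8.3-era syntax (the resource name `r0` is hypothetical, and the rest of the resource section is omitted):

```
resource r0 {
  syncer {
    rate 500M;  # resync bandwidth cap; was 1000M when the lockups occurred
  }
}
```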

Server 1:
- Linux Loki 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 2:
- Linux Eris 2.6.31-16-generic-pae #53-Ubuntu SMP Tue Dec 8 05:20:21 UTC 2009 i686 GNU/Linux
- 768MB ram
- AMD Athlon(tm) XP 2000+
- DRBD / OCFS2 share

Server 3:
- Linux Zeus 2.6.32-29-generic-pae #58-Ubuntu SMP Fri Feb 11 19:15:25 UTC 2011 i686 GNU/Linux
- 1GB RAM
- AMD Athlon(tm) XP 3000+
- RAID10 (software) / EXT3 via NFS3 share

Revision history for this message
Justin Dossey (jbd) wrote :

I run a set of media encoding servers in KVM VMs running lucid. They run a few FFMPEG processes and write to one of several Linux NFS servers. I get this kind of NFS hang every day or two, so I have been trying different strategies (different virtual NIC types, different kernel versions, etc).

Long story short, I have the kern.log from a server during the hang and the time leading up to it, and I had /proc/sys/sunrpc/rpc_debug set to 32767 all along. The excerpt of the two-minute interval surrounding the hang is 18M and compresses down to 831K. All this on Linux ftrans-03 2.6.38-4-generic-pae #31~lucid1-Ubuntu SMP Thu Feb 17 13:41:45 UTC 2011 i686 GNU/Linux. Single virtual CPU, 1G of memory.

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 661294] Re: System lock-up when receiving large files (big data amount) from NFS server

To everyone reporting being affected, it seems like the most useful
thing we can see is probably your

lspci -v

output, in case there is a common family or type of hardware between
those affected.

On Fri, 2011-03-04 at 01:12 +0000, Justin Dossey wrote:
> I run a set of media encoding servers in KVM VMs running lucid. They
> run a few FFMPEG processes and write to one of several Linux NFS
> servers. I get this kind of NFS hang every day or two, so I have been
> trying different strategies (different virtual NIC types, different
> kernel versions, etc).
>
> Long story short, I have the kern.log from a server during the hang and
> the time leading up to it, and I had /proc/sys/sunrpc/rpc_debug set to
> 32767 all along. The excerpt of the two-minute interval surrounding
> the hang is 18M and compresses down to 831K. All this on Linux
> ftrans-03 2.6.38-4-generic-pae #31~lucid1-Ubuntu SMP Thu Feb 17 13:41:45
> UTC 2011 i686 GNU/Linux. Single virtual CPU, 1G of memory.
>
>
> ** Attachment added: "compressed RPC debug log"
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/661294/+attachment/1884260/+files/diary-of-an-nfs-crash.log.bz2
>

Revision history for this message
Jeff Taylor (shdwdrgn) wrote : Re: System lock-up when receiving large files (big data amount) from NFS server

lspci output for server 1

Revision history for this message
Jeff Taylor (shdwdrgn) wrote :

lspci output for server 2

Revision history for this message
Jeff Taylor (shdwdrgn) wrote :

lspci output for server 3

Revision history for this message
Grondr (grondr) wrote :

I just happened to trip over this report, and at least one of the lspci entries shows a Realtek NIC, which makes me suspicious in light of the report I just filed at bug #746914. You might want to take a look at that in case it seems relevant, and/or try the nc | tar pipelines I was using (I haven't yet tried NFS; I will once my system is otherwise stable). (Short story: any amount of I/O was fine -unless- there was also PCI-bus activity, whereupon raising the I/O rate had an exponential effect on the time to failure---again with a "jabbering" lockup in which the receiving machine was continuously transmitting on its Ethernet but was otherwise hung or nearly hung, which also sounds somewhat like the "ghost traffic" someone reported above, maybe.)

I have a replacement (non-Realtek) NIC arriving on Friday and will update the report once I can test it.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Today I made the mistake of upgrading to 11.04 - instead of solving this annoying old issue, the new version of the distribution made it even worse!

But well, first the good news: in 11.04, usually the whole machine no longer freezes completely (as it did before); only 'plasma-desktop', 'dolphin' and the application trying to read from the NFS share do. They also tend to 'unfreeze' themselves after some minutes, which is another improvement.

Now the bad news: the issue is now triggered much more easily - not only does copying large files freeze the applications now, but so does almost any read operation from the NFS shares that causes more traffic than a simple directory listing.

In other words: after having upgraded to 11.04, I am now more cut off from my data on the NFS shares than ever before!
Also, my workaround described above doesn't work under these new conditions anymore.
I'm really stunned that they could make this whole catastrophe even more disastrous instead of finally fixing it after years...

Tomorrow I'll finally try to replace the NIC in my computer. I really hope this helps, because under these new conditions the point is reached where this issue really starts to make working with my computer impossible.

Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Good news (for me)

After replacing my onboard NIC with a PCI card (an Intel card with the Intel 82541PI chip), the problem doesn't occur anymore.

So it seems likely that the issue has something to do with the Realtek Chip, as Grondr stated.

Revision history for this message
Grondr (grondr) wrote :

...and I'd like to point out that I've done a ton of NFS (v3 only, since I'm talking to hosts running Hoary---yes, really) with the Intel NIC and had no problems whatsoever.

To everyone who is blaming NFS here, TRY NETCAT. And DO NOT try scp or ssh, because they're doing crypto, which will slow down your transfer rates significantly---for every 2x slowdown, it might cause the bug to take 2^x longer to manifest (it did for me, anyway). Instead, try just shoving data as fast as you can via nc---I was using tar on both ends of the pipeline because I was trying to actually get data moved, but if you're just debugging this, shove /dev/zero through nc at one end and dump it to /dev/null at the other. And then let it sit there for your typical time-to-failure (times as much patience as you have).

If that doesn't work, try disk activity as well---maybe you have my bug, where things only went south when there was lots of heavy activity on the PCI bus (see my previous comment on this thread), but it was perfectly fine if it was just an nc pipeline that wrote to non-PCI disks or to nowhere.

[I haven't rigorously read the entire thread in this bug report, but it sounds more and more like the problem isn't NFS but your NIC, and that you're not seeing it in non-NFS applications either because you're using things that don't hit the net as hard, or because you are, but you're not also doing file I/O.]

summary: - System lock-up when receiving large files (big data amount) from NFS
- server
+ System lock-up when receiving large files over a Realtek NIC (big data
+ amount) from NFS server
Changed in linux (Ubuntu Natty):
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Vanessa Dannenberg (vanessadannenberg) wrote : Re: System lock-up when receiving large files over a Realtek NIC (big data amount) from NFS server

Same problem here - any large receive over NFS tends to crash the receiving machine. When the lockup occurs, X freezes hard, but I can Alt-SysRq {REISUB} to reboot. Realtek gigabit NICs (onboard) at both ends, with a gigabit switch between them. As with others, the server seems unaffected when the client crashes.

The result of `lspci -v` is attached.

Revision history for this message
Adam Bolte (boltronics) wrote :

FWIW, I get this on Debian Wheezy (Testing) running 2.6.38-2-amd64. When I download (but never when I upload) large files (e.g. >200MB movies) from my home file server over Gigabit on one of these Realtek NICs, I get a crash - 100% reproducible.

I only just installed Debian because I had Gentoo and thought something was screwy with the video card (using the proprietary nvidia driver) - it crashed every time I played a movie directly from my file share (a Samba server) and I could never figure out why. Imagine my surprise when I still got this using nouveau under Debian! Then I thought it was my overclock. Then my memory. Took me ages to notice that the issue was the NIC!

Gigabyte X58A-UD9, BIOS F4 - has 2 NICs, both unfortunately the same chipset:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

As is the case with others here, it's only an issue during Gigabit file transfers. Other NICs seem fine too.

Revision history for this message
Maximilian Güntner (mguentner) wrote :

I have the same bug on Arch Linux. Both the server (i686) and the client (x86_64) are running Arch Linux with kernel 2.6.38.5 - everything connected via Gigabit Ethernet.
The problem started after I created a LUKS partition on the server and started accessing it over NFS.

Writing a file into the container using NFS will result in a total freeze of the client after some megabytes.

The Server has two NVIDIA NICs while the client has an e1000 NIC.

Mount line:
   server:/backup /mnt/backup nfs4 bg,hard,intr,nolock,udp 0 0

It seems that these freezes only occur if (average write speed on the server) < (average disk-to-network speed on the client). That is the case when LUKS is enabled on the target partition (50 MB/s write speed on the server vs. 90 MB/s read speed on the client).

I'm not 100% sure whether this has anything to do with LUKS itself or just with the reduced speed.

In general, the 8169 module had (and still has?) some problems handling heavy traffic (many connections, high transfer rates). That's why I replaced mine with the e1000. But it's likely that we're blaming the wrong module, since the 8169 is pretty mainstream and almost everybody has one in his/her setup.

Revision history for this message
Adam Bolte (boltronics) wrote :

I just downloaded and installed the Realtek driver from here (r8168-8.023.00):
http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

Transferred a few GBs from my file server to test and had no crash - a record! It may be a bit premature, but I'm calling it fixed. So yes - I blame the r8168 module shipped in the distros.

Revision history for this message
Andrew Chambers (andrewrchambers) wrote :

@Adam Bolte

Just did the same thing and can confirm that I no longer get crashes. Can't believe I struggled with this for months and didn't think to look at the realtek website for a new driver! You live and learn!

Thank you very much for the help.

Revision history for this message
khaldan (khaldan) wrote :

Can confirm the bug (not present in Ubuntu 10.04, not tested in 10.10, but present in 11.04) with a Realtek chip (see the attached output from lspci -v), and also the workaround by Adam Bolte. But it seems like my system is removing the Realtek driver during restart and loading the buggy driver again - at least the bug reoccurs after a restart, and after reinstalling the Realtek driver it's gone again. It would be perfect if somebody could fix the Ubuntu driver or find a way to permanently install the Realtek driver (although the bug wouldn't be fixed by that ;-))

Revision history for this message
Adam Bolte (boltronics) wrote :

@Andrew Chambers

Awesome. Actually, the latest driver release is just a few weeks old. These drivers might not have worked prior to that in 10.10, but I haven't investigated.

@khaldan

Before Gentoo I was using Ubuntu 10.04 on this system also and did not see the issue there. I guess I have been switching OSs a lot lately.

Anyway, if it's loading the buggy driver again I can think of two things that might cause it:
1. Loading a new kernel, or running an update.
2. You have the module saved in your initramfs image.

You can probably check for the second scenario with something like:
$ cat /boot/initrd.img-$(uname -r) | zcat | cpio --list 2>/dev/null | grep r8168

If the module is listed, and you have already installed the driver from the Realtek website, maybe the following command will help you (for the next time you boot):
$ sudo update-initramfs -k all -c

Agreed - I should have called it a work-around and not a fix.

Revision history for this message
Robbie Williamson (robbiew) wrote :
Revision history for this message
Tuukka Norri (tsnorri) wrote :

I had the same problem with Ubuntu 11.04, Linux 2.6.38-11-generic and the r8169 driver that shipped with it. Installing a driver from Realtek resolved the problem.

Revision history for this message
lagerimsi (lagerimsi) wrote :

can confirm this problem with Natty (11.04) and Linux 2.6.38-11-generic.

It seems it not only affects NFS - also video chat and other programs making heavy use of the NIC.

A new driver is out:
http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2 (2011-8-25)

Revision history for this message
nick (niek-art) wrote :

I had the same problem after updating my Ubuntu server (8.04) to the newest release; that's when the problems started.
It took me several months to find this posting.
I use Samba on my server (software RAID) and can confirm that the latest driver from Realtek works. After compilation, it unloads the one in the kernel (size 84022) and replaces it with a newer one (size 203096). I have not yet rebooted, since my RAID is being rebuilt because of another crash due to this faulty NIC driver.

It is about time Ubuntu put this driver in the distributions...

Revision history for this message
Christoph Gritschenberger (christoph.gritschenberger) wrote :

I also had lock-ups with mine.
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

I tried with Ubuntu 11.04 and Fedora 15.
After 2-5 GB had been transferred, the system locked up. On Fedora it also rebooted.
This happened when copying via SMB or SFTP (on both systems).

After installing the Realtek driver from their homepage, everything worked fine again.
I just copied 50 GB via SFTP without a problem.

Revision history for this message
UmitG (hazamatic) wrote :

I also had the same problem while copying files into my RAID using SMB.

The freeze would only happen while copying files INTO the RAID; copying OUT of the RAID was fine. Meaning the machine would freeze when downloading, but uploading was fine.

After installing the Realtek driver from the link above, I have no more problems. Transfer speeds have also improved after installing the new driver.

It would be nice if the Linux kernel had a working driver...

Revision history for this message
BeJay (bjdag79) wrote :

How can this be medium?? It's been going on for 2 years now! I've still got the same issues since upgrading to 12.04 from 10.04. All 10.04 Ubuntu clients are the same, but the server is now 12.04. I have Gigabit clients as well as a couple of 54Mb wireless-G bridge clients that also lock up and die. This is terrible, since I mount over wireless, which caused no trouble on 10.04 -> 10.04 machines. Is anyone bothering with this MAJOR issue these days?

Revision history for this message
bamyasi (iadzhubey) wrote :

I can confirm the same NFS crashes on my Ubuntu Server 12.04 NAS with a hardware RAID5 (3ware 9750-24i4e). I would rate this as a catastrophic bug and am surprised there has been no activity on it for years. I personally certainly do not enjoy repairing a 40-TB filesystem after regular crashes.

# uname -a
Linux yuka 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# dpkg -l | grep nfs
ii libnfsidmap2 0.25-1ubuntu2 NFS idmapping library
ii nfs-common 1:1.2.5-3ubuntu3 NFS support files common to client and server
ii nfs-kernel-server 1:1.2.5-3ubuntu3 support for NFS kernel server

# lspci | grep Ethernet
02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
03:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

penalvch (penalvch)
summary: - System lock-up when receiving large files over a Realtek NIC (big data
- amount) from NFS server
+ 10ec:8168 System lock-up when receiving large files over a Realtek NIC
+ (big data amount) from NFS server
penalvch (penalvch)
description: updated
tags: added: lucid maverick natty needs-upstream-testing regression-release
Revision history for this message
penalvch (penalvch) wrote :

David, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal)? It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

Helpful Bug Reporting Links:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug
https://help.ubuntu.com/community/ReportingBugs#Adding_Additional_Attachments_to_an_Existing_Launchpad_Bug

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
David (g-launchpad-strategyplayer-net) wrote :

Christopher, as I said above, I have replaced the onboard Realtek NIC with an Intel PCI NIC to bypass the whole issue after it became even worse and unbearable when I updated to 11.04.

Now of course I could remove the Intel card and reactivate the Realtek chip in order to test whether the issue persists. But since I gave the machine concerned to my father some months ago (and therefore have no regular access to it), I don't really feel like warming up this annoying old issue.

But maybe someone else can run those tests. Looking at the comments, there still seem to be some people affected, no?

Revision history for this message
Adam Bolte (boltronics) wrote :

I might be able to run some tests as I still have the same Gigabyte X58A-UD9 that was experiencing this issue. However, I run Debian Wheezy on it and don't use it all that often. I haven't seen this issue in months, which either means:

1. I downloaded the driver from Realtek ages ago, forgot about it and haven't upgraded kernels to anything incompatible with it.

2. Debian Wheezy doesn't have the problem any more (but I understand you'll want me to test Ubuntu development images anyway).

3. I just haven't copied sufficiently large files to observe this problem (most files I copy across my network are <400MB).

Anyway, will try to look into this when I have time (hopefully this weekend) if nobody else beats me to it.

Revision history for this message
Stefan Bader (smb) wrote :

There are a few reasons this bug report is not well looked after - admittedly not all very good ones, but things come together. First of all, this seems quite hardware-specific (not only tied to the NIC, but possibly also to other factors like the motherboard make, the network cables, or the switches). I myself have a Realtek 8111/8168B and could never reproduce the issue.
There is also the matter of time. There are always other problems which may affect even more people, and as long as no one is asking in the report about its status, it can unfortunately fall through the cracks. I probably should have unassigned myself, but by then I had forgotten about it.
There is also the problem that, over time, there have been additions to this report about completely different hardware (Intel NICs instead of Realtek). This unfortunately often causes more confusion than it helps. Just as a general rule, for something that looks hardware-specific it is better to open a separate bug. It is easy for someone looking at the reports to mark one as a duplicate of another, but it is really hard to work on one report that mixes comments about things that are actually not the same.
Just a word about the driver from Realtek: it is a valid option for someone affected, but bundling it into the distro kernel is simply too much of a maintenance burden. Someone would need to make sure it does not break when the rest of the kernel changes, there would be different bugs than for the in-kernel driver, and so on. Realtek should really make sure the driver in the upstream kernel is good. That would help everyone.

But ok, so much for the attempts at explanation from this side. What I would like to propose is that those who are still affected by this on either Precise (12.04) or Quantal (the development release as of now) open their own new bug report ("ubuntu-bug linux" will automatically gather some of the data which is usually asked for), optionally posting the new bug number here so people with the same hardware can subscribe to the new report. This is just to get things separated.

There has always been one other problem: without any oops or panic message, and with the system only locking up, it is near impossible to find anything. Being logged in or using netconsole is of little use when the NIC is the problem. Serial ports are quite rare now (maybe a USB-serial adapter could be used). So I was wondering about using crashdump when I saw the newer posts on this report. Unfortunately, it is in a bit of a broken state, as I found out when looking. I plan to update the debugging wiki (https://wiki.ubuntu.com/Kernel/CrashdumpRecipe) as one of the next things to do. Once that is updated, it may be an option for getting some useful data to find the issue.

Changed in linux (Ubuntu Oneiric):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
Changed in linux (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
Revision history for this message
penalvch (penalvch) wrote :

David, this bug report is being closed due to your last comment regarding how this no longer affects you and you do not have the hardware. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

no longer affects: linux (Ubuntu Oneiric)
no longer affects: linux (Ubuntu Natty)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Adam Bolte (boltronics) wrote :

Wow. Just wow. This bug has been open for nearly 2 years, and 100 comments later the bug report is closed because one guy recently claimed to experience the same problem with an Intel card instead of a Realtek - someone who should have just opened a separate bug report. This bug has *always* been about specific Realtek chipsets, so opening other bug reports would be pointless.

I myself have previously experienced this same issue on 3 different computers (all different motherboards), all with the same Realtek hardware. I did not mention the other two because one is now a critical work machine I cannot readily test, since it is in production with the driver from the Realtek website, and the other I don't have any more. It's unbelievable that Canonical QA has been unable to reproduce this. You even have some specific motherboard models with the problem mentioned.

> Someone needs to make sure that it does not break when the rest of the kernel changes
It was already broken. It doesn't get much worse than a complete system lock-up.

As previously mentioned, I don't even run Ubuntu any more. I don't get paid by Canonical. I was going to spend my time over the weekend trying to help you guys out, but now I don't think I'll bother.
