hung system with half-working networking, unresponsive NFS mounts

Bug #1078345 reported by Iordan Iordanov
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  Medium      Unassigned

Bug Description

We have two systems with a lot of simultaneous KDE sessions which mount user data over NFS. We also have two shared servers that only provide ssh access (no KDE sessions running) and also mount user data over NFS. We've had three different machines hang the same way as I'm reporting now: two with graphical sessions, and one without.

The basic symptoms are that NFS mounts are unresponsive. Also, one can mount new (unmounted) NFS shares, but not unmount them.

When collecting the data for this bug-report, the ubuntu-bug process hung, and I had to kill this process to make it continue (probably not related, but relevant, as it could be affecting the data collected):

fuser -v /dev/snd/seq /dev/snd/timer

The process seemed hung indefinitely all three times when we were collecting data. We also have data from one of the other two times a system hung, which we can attach to this bug if you need it.

We are also investigating the possibility that this is a routing issue, and would like to know if this could be some sort of kernel limits issue (the systems have a lot of file-handles and network sockets open).
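One way to look into the kernel-limits question is to compare the number of allocated file handles against the system-wide maximum, and to look at per-protocol socket counts. The following is only a minimal sketch, assuming the standard Linux files /proc/sys/fs/file-nr and /proc/net/sockstat are present; the function name file_handle_usage is illustrative and is not something we ran on the affected hosts.

```shell
#!/bin/sh
# Print allocated vs. maximum file handles from a file-nr style file.
# /proc/sys/fs/file-nr holds three numbers: allocated, unused, maximum.
# The optional argument (a path) is an assumption added for testing.
file_handle_usage() {
    file_nr="${1:-/proc/sys/fs/file-nr}"
    read -r allocated unused maximum < "$file_nr"
    echo "allocated=$allocated unused=$unused max=$maximum"
}

# Only query the live system if the proc files are readable.
if [ -r /proc/sys/fs/file-nr ]; then
    file_handle_usage
fi
if [ -r /proc/net/sockstat ]; then
    # Socket counts per protocol (TCP, UDP, ...).
    cat /proc/net/sockstat
fi
```

If the allocated count is close to the maximum, a file-handle limit becomes a plausible suspect; if not, it can probably be ruled out.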

Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image (not installed)
ProcVersionSignature: Ubuntu 3.2.0-31.50-generic-pae 3.2.28
Uname: Linux 3.2.0-31-generic-pae i686
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Nov 1 16:29 seq
 crw-rw---T 1 root audio 116, 33 Nov 1 16:29 timer
AplayDevices: aplay: device_list:252: no soundcards found...
ApportVersion: 2.0.1-0ubuntu14
Architecture: i386
ArecordDevices: arecord: device_list:252: no soundcards found...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code -15:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Tue Nov 13 11:21:44 2012
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 002: ID 0b38:0003 Gear Head Keyboard
 Bus 002 Device 003: ID 1061:0101
MachineType: Sun Microsystems Sun Fire X2200 M2 with Quad Core Processor
PciMultimedia:

ProcEnviron:
 TERM=xterm
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz root=/dev/md0p2 ro
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-31-generic-pae N/A
 linux-backports-modules-3.2.0-31-generic-pae N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 03/17/2010
dmi.bios.vendor: Sun Microsystems
dmi.bios.version: S39_3D16
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: S39
dmi.board.vendor: Sun Microsystems
dmi.board.version: Rev 50
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 23
dmi.chassis.vendor: Sun Microsystems
dmi.chassis.version: Rev 50
dmi.modalias: dmi:bvnSunMicrosystems:bvrS39_3D16:bd03/17/2010:svnSunMicrosystems:pnSunFireX2200M2withQuadCoreProcessor:pvrRev50:rvnSunMicrosystems:rnS39:rvrRev50:cvnSunMicrosystems:ct23:cvrRev50:
dmi.product.name: Sun Fire X2200 M2 with Quad Core Processor
dmi.product.version: Rev 50
dmi.sys.vendor: Sun Microsystems

Revision history for this message
Iordan Iordanov (iiordanov) wrote :

The ubuntu-bug information collection itself also caused the following two dmesg lines to appear, which do not appear in CurrentDmesg.txt:

[1030231.494607] ACPI Warning: Incorrect checksum in table [OEMB] - 0xF2, should be 0xEF (20110623/tbutils-314)
[1030231.494860] ACPI Warning: Incorrect checksum in table [OEMB] - 0xF2, should be 0xEF (20110623/tbutils-314)

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced.

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Iordan Iordanov (iiordanov) wrote :

A further development. While we were trying to narrow down the problem, we ruled out routing problems and discovered that NFS over UDP was the issue.

mount -o rw,udp fileserver:/mount/path /mnt

hangs, whereas

mount -o rw,tcp fileserver:/mount/path /mnt

works. Then we tested pure UDP networking with /bin/nc, and it does not work correctly. When we transferred a 23MB tar file, most of it arrived, but not the entire file, and md5sum gives different results for the source and target files. Repeating the test from a machine without networking problems to the same target machine produces the same md5 checksum on both endpoints.

The commands used were the following. On the affected host:

cat /tmp/test.tar | /bin/nc -u fileserver 8080

and on the fileserver side:

/bin/nc -u -l -p 8080 > /tmp/test.udp

When transferring the file over TCP, the transfer was fine from the affected machine to the fileserver (the md5 checksum was the same on both sides). The test command on the affected host side:

cat /tmp/test.tar | /bin/nc fileserver 8080

and on the fileserver side:

/bin/nc -l -p 8080 > /tmp/test.tcp

So, from all of this, we have concluded that UDP networking stops working properly on affected hosts, and since we mount NFS over UDP at our institution, NFS is affected.
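The checksum comparison used in these tests can be sketched as a small helper. This is an illustration only: it assumes both files are reachable from one host (for example, the received copy fetched back over TCP), and the function name compare_transfer is hypothetical.

```shell
#!/bin/sh
# Compare a source file with the copy received on the other side.
# Prints "match" or "MISMATCH" followed by both sizes in bytes.
compare_transfer() {
    src="$1"
    dst="$2"
    # Read from stdin so md5sum does not print the file name.
    src_sum=$(md5sum < "$src" | cut -d ' ' -f 1)
    dst_sum=$(md5sum < "$dst" | cut -d ' ' -f 1)
    src_size=$(wc -c < "$src" | tr -d ' ')
    dst_size=$(wc -c < "$dst" | tr -d ' ')
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "match src=$src_size dst=$dst_size"
    else
        echo "MISMATCH src=$src_size dst=$dst_size"
    fi
}
```

For example, compare_transfer /tmp/test.tar /tmp/test.udp would report a mismatch for the UDP transfer described above, while the same call with /tmp/test.tcp should report a match.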

Revision history for this message
Iordan Iordanov (iiordanov) wrote :

I've also determined that the size of the file is inconsequential. I redid the testing with /bin/nc over UDP and over TCP with a 1.2GB file. The transfer was done from the affected machine to our fileserver. The result is as follows:

fileserver:/$ ls -l /tmp/test2*

-rw-r--r-- 1 root root 1183744000 Nov 13 13:49 test2.tcp
-rw-r--r-- 1 root root 1108691968 Nov 13 13:50 test2.udp

So most of the 1.2GB file gets transferred, except for the last 75MB. For the 23MB file I started testing with, the results are:

-rw-r--r-- 1 root root 23674880 Nov 13 13:38 test.tcp
-rw-r--r-- 1 root root 22415360 Nov 13 13:34 test.udp

So, only the last 1.2MB is not transferred. In both cases, some part of the end of the file is missing. I then tried to find the smallest file that reproduces this, and discovered that I can fully transfer 30KB and 60KB files, but when I tried to transfer a 90KB file, only 75KB arrived on the other side.

Then I piped /dev/urandom through the netcat UDP tunnel, and it seemed to transfer data indefinitely (I stopped it when it had transferred 2GB), so the conclusion is not that transfers over UDP hang suddenly after a certain amount of data, but that the last part of the data is not transmitted for some reason. The exception is very small transfers (below a threshold somewhere between 60KB and 90KB), in which case all of the data gets through.
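The shortfall figures quoted above follow directly from the ls -l sizes. A minimal sketch of the arithmetic, using the byte counts from the listings (the function name missing_bytes is illustrative):

```shell
#!/bin/sh
# Report how many bytes of a transfer never arrived.
# Arguments: bytes sent (TCP copy size), bytes received (UDP copy size).
missing_bytes() {
    echo $(( $1 - $2 ))
}

missing_bytes 1183744000 1108691968   # test2.tcp vs test2.udp: prints 75052032
missing_bytes 23674880 22415360       # test.tcp vs test.udp: prints 1259520
```

That is roughly 75MB missing from the 1.2GB file and roughly 1.2MB missing from the 23MB file.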

Bringing the ethernet interface down and back up on the target machine has no effect on this problem either, so whatever gets reset during a network device restart is not enough to clear it.

Revision history for this message
Iordan Iordanov (iiordanov) wrote :

This issue also occurred early last summer, with linux-image-3.2.0-2X-generic-pae. I am not 100% sure what X was. I believe the first kernel we went into production with was -23, and it happened soon afterward, so it was either with 23 or with 24. We have no real observations from before then.

Thanks!

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.7 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7-rc5-raring/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Iordan Iordanov (iiordanov) wrote :

My apologies, but kernel versions 3.5 and above do not work on the Sun X2200 that we use for shared desktop usage. There is a kernel oops at boot time. Also, we have switched to TCP mounts on NFS for production.

We also don't have a direct way to reproduce the problem, as it seems to require significant uptime and NFS traffic over UDP to recur. If we manage to reproduce the bug on different hardware by generating large amounts of NFS traffic over UDP, we will advise on how you can reproduce the bug, and we can try other kernels.

Sincerely,
Iordan Iordanov

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired