Activity log for bug #1006446

Date Who What changed Old value New value Message
2012-05-30 14:49:51 perpetualrabbit bug added bug
2012-05-30 15:00:06 Brad Figg affects linux-meta (Ubuntu) linux (Ubuntu)
2012-05-30 15:30:08 Brad Figg linux (Ubuntu): status New Incomplete
2012-05-30 15:30:10 Brad Figg tags Old value: high load nfs4; New value: high load nfs4 precise
2012-05-30 15:42:03 perpetualrabbit linux (Ubuntu): status Incomplete Confirmed
2012-05-30 15:49:59 perpetualrabbit tags Old value: high load nfs4 precise; New value: apport-collected high load nfs4 precise
2012-05-30 15:50:00 perpetualrabbit description

Old value:

Problem:
--------------------------------------------------------------
I just had to remove ubuntu server 12.04 to install redhat enterprise linux 6. The intermittent slowness was completely unacceptable for the users, who have workstations with /home mounted with nfs4 on this server. The mail server also accesses /home, because the /home/$USER/Maildir directories are there. Using nfs4, the kernel nfs threads caused enormous load. The users had frozen desktops (greyed out windows) and mail slowed or arrived days later as a result.

With RHEL6, all nfs4 problems are completely gone. I used the exact same /etc/exports file, the same settings and mount options on the workstations, and the same number of nfs threads. Both the redhat and ubuntu systems are KVM virtual guests on a redhat 6 virtual host (one of 3 actually). The storage backend is a very fast equallogic array, which exports iscsi targets to the virtual hosts.

I am sorry, but I have to conclude the current nfs4 implementation of ubuntu server 12.04 is NOT fit for use. A complete university department suffered for weeks while I tried to solve the problems with ubuntu, but in the end it was decided to install redhat instead, re-using the same iscsi targets for system, home and data. A missed chance for ubuntu...

Therefore I urge Canonical's people to classify this bug as critical. Also I think quality assurance should have caught this bug before shipping.

Analysis
------------------------------------------------------------
LOAD: The nfs threads cause the kernel to use enormous amounts of 'sy' time as measured in top. I will attach a sample of top's output, from a particularly _quiet_ time on the network. Load is 7.82. At busier moments, the load went through the roof, beyond 50 and further. It consumes actual CPU cycles. Each thread consumes up to 30% of a cpu core. I enabled 128 threads. rx and tx block sizes are 32768 on the clients. Both server and clients used async, both on redhat and ubuntu.

SYSTEM vs IO-WAIT: The replacement redhat system can surely be overloaded, but then it does not consume CPU cycles doing so. Top does report high load, but the time is spent in the 'wa' state. This indicates it is simply waiting for its backend iscsi devices to complete writes. I tested this by simultaneously letting all workstations write multi-gigabyte files with dd to /home. On ubuntu, the nfs threads spend their time in 'sy', doing who-knows-what.

LOGS: Nothing at all appears in the logs. But when I set bitwise debug options in the /proc/sys/sunrpc/*debug files, lots of log entries appear. Those seem like normal NFS protocol messages to me though. I also tried to discover what was happening with wireshark, but the traffic looks like normal nfs4 traffic to me.

SLOWNESS: That is the thing. The ubuntu nfs server is actually NOT slow at all. During my dd tests, it wrote half a gigabyte per second to its iscsi backends. Its _throughput_ is better than that of the redhat server. As far as I can tell, it falls down because it makes client side processes that want to do IO wait on other writes. A simple 'ls' has to wait until a write has been completed. And both server and client used async nfs. People's firefoxes freeze all the time because firefox needs to read and write a lot to its cache and other files in the .mozilla directory. The dovecot imap server almost grinds to a halt trying to write all those little files in people's /home/$USER/Maildir directories. The problems go on and on. Basically, a complete network of workstations is almost unusable because of this. Upfront tests were done of course, but they showed only the excellent throughput, not the appalling 'waiting' behaviour. With redhat 6 there is no such problem.

SITUATION: People use their own and each other's linux workstations for science, doing large calculations and writing a lot of big and small files to nfs. The nfs server serves /export/home, and also raw data storage from /export/data with nfs4. The clients mount those under /home and /data/misc respectively. Also there is a read-only software mount for certain scientific packages.

CONFIG:
client fstab lines:

    #### nfs entries ###
    sw.lorentz.leidenuniv.nl:/sw /sw nfs4 hard,intr,ro,tcp,rsize=32768,wsize=32768,bg,acl,async
    home.lorentz.leidenuniv.nl:/home /home nfs4 hard,intr,rw,tcp,rsize=32768,wsize=32768,bg,acl,async

server exports file:

    /export 132.229.227.0/24(ro,sync,insecure,root_squash,no_subtree_check,nohide,fsid=0)\
        132.229.216.128/26(ro,sync,insecure,no_root_squash,no_subtree_check,nohide,fsid=0)\
        132.229.226.3(ro,sync,insecure,no_root_squash,no_subtree_check,nohide,fsid=0)\
        132.229.226.4(ro,sync,insecure,no_root_squash,no_subtree_check,nohide,fsid=0)
    /export/home 132.229.227.0/24(rw,async,insecure,root_squash,no_subtree_check,nohide)\
        132.229.216.128/26(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.3(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.4(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.214.41(rw,async,insecure,no_root_squash,no_subtree_check,nohide)
    /export/data 132.229.227.0/24(rw,async,insecure,root_squash,no_subtree_check,nohide)\
        132.229.216.128/26(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.3(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.4(rw,async,insecure,no_root_squash,no_subtree_check,nohide)
    /export/sw 132.229.227.0/24(rw,async,insecure,root_squash,no_subtree_check,nohide)\
        132.229.216.128/26(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.3(rw,async,insecure,no_root_squash,no_subtree_check,nohide)\
        132.229.226.4(rw,async,insecure,no_root_squash,no_subtree_check,nohide)

    root@gaia:~# lsb_release -rd
    Description: Ubuntu 12.04 LTS
    Release: 12.04

    root@gaia2:~# dpkg -l | grep -E 'nfs|linux-image'
    ii libnfsidmap2 0.25-1ubuntu2 NFS idmapping library
    ii linux-image-3.2.0-18-generic 3.2.0-18.29 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-3.2.0-19-generic 3.2.0-19.31 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-3.2.0-20-generic 3.2.0-20.33 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-3.2.0-21-generic 3.2.0-21.34 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-3.2.0-23-generic 3.2.0-23.36 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-3.2.0-24-generic 3.2.0-24.39 Linux kernel image for version 3.2.0 on 64 bit x86 SMP
    ii linux-image-server 3.2.0.24.26 Linux kernel image on Server Equipment.
    ii nfs-common 1:1.2.5-3ubuntu3 NFS support files common to client and server
    ii nfs-kernel-server 1:1.2.5-3ubuntu3 support for NFS kernel server
    ii nfswatch 4.99.11-1 Program to monitor NFS traffic for the console

WHAT I EXPECTED TO HAPPEN
----------------------------------
A fast and responsive nfs service.

WHAT HAPPENED INSTEAD
-----------------------------
A service that is fast, but also intermittently totally unresponsive.

New value: the old value above, followed by the apport-collected data below.

---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.values: Error: [Errno 2] No such file or directory
CurrentDmesg: [ 17.280028] eth0: no IPv6 routers present
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=aafa9be4-19ac-4d74-a853-a5532ddedf5d
InstallationMedia: Ubuntu-Server 12.04 LTS "Precise Pangolin" - Alpha amd64 (20120313)
IwConfig:
 lo no wireless extensions.
 eth0 no wireless extensions.
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Red Hat KVM
NonfreeKernelModules: nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ext2 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd psmouse serio_raw lp parport i2c_piix4 soundcore snd_page_alloc virtio_balloon mac_hid floppy
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_US:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-24-generic root=/dev/mapper/gaia-root ro
ProcVersionSignature: Ubuntu 3.2.0-24.39-generic 3.2.16
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-24-generic N/A
 linux-backports-modules-3.2.0-24-generic N/A
 linux-firmware 1.79
RfKill: Error: [Errno 2] No such file or directory
Tags: precise
Uname: Linux 3.2.0-24-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
dmi.bios.date: 01/01/2007
dmi.bios.vendor: Seabios
dmi.bios.version: 0.5.1
dmi.chassis.type: 1
dmi.chassis.vendor: Red Hat
dmi.modalias: dmi:bvnSeabios:bvr0.5.1:bd01/01/2007:svnRedHat:pnKVM:pvrRHEL6.2.0PC:cvnRedHat:ct1:cvr:
dmi.product.name: KVM
dmi.product.version: RHEL 6.2.0 PC
dmi.sys.vendor: Red Hat
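
The description distinguishes load spent in 'sy' (kernel CPU time) from load spent in 'wa' (waiting on I/O), as read from top. A minimal sketch of how that split could be measured without top, assuming a Linux server whose /proc/stat begins with the aggregate "cpu" line in the usual field order. The function names here are illustrative only and are not part of the original report:

```python
# Field order of the aggregate "cpu" line in /proc/stat (Linux).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("fewer CPU fields than expected")
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of elapsed CPU time per state between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between samples")
    return {name: 100.0 * value / total for name, value in delta.items()}


def sample_shares(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return shares."""
    import time

    def read_first_line():
        with open(path) as handle:
            return handle.readline()

    first = parse_cpu_line(read_first_line())
    time.sleep(interval)
    second = parse_cpu_line(read_first_line())
    return cpu_shares(first, second)
```

Calling sample_shares() on the server during a dd test and comparing the "system" and "iowait" entries would give the same kind of 'sy' versus 'wa' comparison the report makes between the ubuntu and redhat guests. It would not explain what the nfs threads are doing during that 'sy' time.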
2012-05-30 15:50:01 perpetualrabbit attachment added AcpiTables.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169211/+files/AcpiTables.txt
2012-05-30 15:50:02 perpetualrabbit attachment added AlsaDevices.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169212/+files/AlsaDevices.txt
2012-05-30 15:50:04 perpetualrabbit attachment added BootDmesg.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169213/+files/BootDmesg.txt
2012-05-30 15:50:06 perpetualrabbit attachment added Card0.Codecs.codec.0.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169214/+files/Card0.Codecs.codec.0.txt
2012-05-30 15:50:07 perpetualrabbit attachment added Lspci.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169215/+files/Lspci.txt
2012-05-30 15:50:09 perpetualrabbit attachment added PciMultimedia.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169216/+files/PciMultimedia.txt
2012-05-30 15:50:10 perpetualrabbit attachment added ProcCpuinfo.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169217/+files/ProcCpuinfo.txt
2012-05-30 15:50:11 perpetualrabbit attachment added ProcInterrupts.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169218/+files/ProcInterrupts.txt
2012-05-30 15:50:14 perpetualrabbit attachment added ProcModules.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169219/+files/ProcModules.txt
2012-05-30 15:50:16 perpetualrabbit attachment added UdevDb.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169220/+files/UdevDb.txt
2012-05-30 15:50:17 perpetualrabbit attachment added UdevLog.txt https://bugs.launchpad.net/bugs/1006446/+attachment/3169221/+files/UdevLog.txt
2012-05-30 15:50:19 perpetualrabbit attachment added WifiSyslog.gz https://bugs.launchpad.net/bugs/1006446/+attachment/3169222/+files/WifiSyslog.gz
2012-05-30 16:06:54 Joseph Salisbury linux (Ubuntu): importance Undecided High
2012-05-30 16:08:33 Joseph Salisbury linux (Ubuntu): status Confirmed Incomplete
2012-05-30 16:08:45 Joseph Salisbury tags Old value: apport-collected high load nfs4 precise; New value: apport-collected high kernel-da-key load nfs4 precise
2012-06-03 11:58:45 perpetualrabbit linux (Ubuntu): status Incomplete Confirmed
2012-06-06 11:44:15 Stephan Held bug added subscriber Stephan Held
2012-06-06 19:53:56 Joseph Salisbury linux (Ubuntu): status Confirmed Triaged
2012-06-06 19:57:51 Joseph Salisbury bug task added nfs-utils (Ubuntu)
2012-06-06 19:57:58 Joseph Salisbury nfs-utils (Ubuntu): importance Undecided High
2012-06-18 06:53:02 Launchpad Janitor nfs-utils (Ubuntu): status New Confirmed
2012-06-18 06:53:59 Jeff Ebert bug added subscriber Jeff Ebert
2012-06-19 02:59:33 Jeff Taylor bug added subscriber Jeff Taylor
2012-06-20 04:56:58 Jeff Ebert marked as duplicate 879334
2012-09-05 16:42:32 juuso puuso bug added subscriber juuso puuso
2012-09-14 10:56:21 Sven Siemsen bug added subscriber Sven Siemsen
2012-09-25 16:01:17 Michael Walser bug added subscriber Michael Walser
2012-09-26 06:58:08 ECOM Development bug added subscriber ECOM Development
2012-10-23 00:40:20 Daniel Frei bug added subscriber Daniel Frei
2012-10-30 16:02:01 Lee Shakespeare bug added subscriber Lee Shakespeare
2012-11-09 17:55:32 Karsten Suehring bug added subscriber Karsten Suehring
2012-11-30 06:41:35 Fredrik Tuomas bug added subscriber Fredrik Tuomas
2013-01-29 04:35:22 amastbaum bug added subscriber amastbaum
2013-02-02 04:23:11 Doug Schaapveld bug added subscriber Doug Schaapveld
2013-06-17 06:39:12 Tim Landscheidt bug added subscriber Tim Landscheidt
2013-10-14 16:52:10 Rick White bug added subscriber Rick White
2014-07-08 09:54:48 masakre bug added subscriber masakre