nbd-proxy hangs the NBD connection to the server

Bug #589034 reported by Juha Erkkilä
This bug affects 13 people
Affects        Status        Importance  Assigned to  Milestone
LTSP5          Fix Released  Undecided   Unassigned
ltsp (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

I am running an LTSP server on Ubuntu 10.04 (Lucid Lynx), with a Primergy TX120 S2 as the LTSP server and an HP ProBook 4310s as a terminal connecting to it. The server installation uses the amd64 architecture, but the terminal image uses i386. The problem can also be reproduced with a KVM virtual machine acting as the server, with a similar installation, and has been observed with another type of terminal machine as well (a Shuttle XPC X27D).

The server contains ltsp-server and ltsp-server-standalone packages, in version 5.2.1-0ubuntu9. The terminal image contains matching versions (5.2.1-0ubuntu9) of ltsp-client and ltsp-client-core packages. Kernel version on the server side is 2.6.32-22-server, and on the terminal side it is 2.6.32-22-generic.

I am using dnsmasq as the DHCP server, with the following settings in /var/lib/tftpboot/ltsp/i386/lts.conf:

[default]
        LDM_DIRECTX = True
        LDM_LANGUAGE = "fi_FI.UTF-8"
        LOCAL_APPS = True
        LOCALDEV = True
        LTSP_FATCLIENT = False
        NBD_SWAP = True
        REMOTE_APPS = True
        SSH_FOLLOW_SYMLINKS = False
        SSH_OVERRIDE_PORT = 222

On the server side, Linux reports the following about the network interface connected to the terminal (dmesg snippets):

[ 1.862987] 0000:30:00.0: eth1: (PCI Express:2.5GB/s:Width x1) 00:15:17:cf:5e:de
[ 1.862989] 0000:30:00.0: eth1: Intel(R) PRO/1000 Network Connection
[ 1.863069] 0000:30:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50858-004
[ 20.324320] ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 22.891005] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 22.892038] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

On the terminal side, Linux reports the following about the network interface connected to the server (dmesg snippets):

[ 1.451527] sky2 eth0: addr 00:26:55:c4:06:95
[ 4.535708] sky2 eth0: enabling interface
[ 4.535949] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 7.029456] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
[ 7.029693] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

In this configuration, the NBD connection to the server works quite well, without any significant problems (it does hang, but only rarely). However, when a switch (ZyXEL Desktop Ethernet Switch 10/100 Mbps) is placed between these computers, the network interface state changes on the server:

[18989.100157] e1000e: eth1 NIC Link is Down
[18994.101017] e1000e: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[18994.101023] 0000:30:00.0: eth1: 10/100 speed: disabling TSO

And on the terminal side:

[ 248.484539] sky2 eth0: Link is down.
[ 254.785883] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both

On this slower network connection between the server and the terminal, the NBD connection frequently hangs. Loading the kernel and the initial ramdisk is always reliable, but the NBD connection may stop transferring data at some point; this point appears to change randomly, yet it often occurs before the login screen comes up. Note that the NBD connection does remain open: at least on the server side, a socket connection to the terminal remains established, but nothing is transferred between the machines.
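
For anyone trying to confirm this "established but idle" state, a quick check can be made on the server. This is only a sketch, assuming the NBD root export listens on port 2000 as in a default Lucid LTSP setup; adjust the port if yours differs:

    # Show NBD sockets; a hung session still shows up as ESTABLISHED
    netstat -tn | grep ':2000'
    # If the Send-Q column stays non-zero across repeated runs while the
    # terminal is stuck, the server has queued data that the connection
    # is no longer draining
    watch -n 5 "netstat -tn | grep ':2000'"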

With the previous configuration, the success rate of reaching the LDM login screen is about 30-40% per boot. Without the switch in between, using a direct gigabit link, the success rate is somewhere between 90% and 100%.

This problem seems to be caused by nbd-proxy, because the issue goes away when nbd-proxy is disabled in the initial ramdisk downloaded by the terminal. After switching to a direct connection from nbd-client to the server, the success rate of reaching the LDM login screen per boot appears to be pretty close to 100%.

I suspect there is a correlation between terminal CPU speed and network speed that affects this issue: perhaps if the terminal machine is comparatively slow and the network is fast, the problem occurs only very rarely.

This problem can be worked around by disabling nbd-proxy. This can be done by applying the attached patch to the terminal tree (under /opt/ltsp/i386 for the i386 architecture) and then rebuilding the terminal image with "sudo ltsp-update-image --arch i386".

Tags: patch
Revision history for this message
DUSSERT Nicolas (nicolas-dussert) wrote :

I've got the same problem! No issues with Ubuntu 9.04, but problems with 10.04.
With Lucid, for 16 workstations, it takes one hour to start, and only 20-40% of the workstations reach the ldm login. And I have the same configuration: the server on one switch and the workstations on other switches.

I've tested the patch from Juha Erkkilä and it works fine: all the workstations start in 2 minutes and 100% reach the ldm screen.

BUT: you have to do more than apply the patch and run "sudo ltsp-update-image --arch i386".
In fact, you have to apply the patch, THEN run "update-initramfs -u" in the chroot, THEN quit the chroot and run "ltsp-update-kernels" followed by "ltsp-update-image".
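
A minimal sketch of that full sequence, assuming the default i386 chroot under /opt/ltsp/i386:

    # 1) apply the patch inside the terminal tree, then:
    sudo chroot /opt/ltsp/i386 update-initramfs -u   # rebuild the client initramfs
    sudo ltsp-update-kernels                         # copy kernel and initrd to the TFTP tree
    sudo ltsp-update-image --arch i386               # rebuild the NBD image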

And it works!

Revision history for this message
Иван Димитров (zlobber) wrote :

Same problem here. I'm now testing without nbd-proxy, and so far 20 reboots without a problem. (Before, I managed to boot only 1 out of 3 times.)

Revision history for this message
LedHed (ledhed-jgh) wrote :

Before applying this patch my client machines would fail to boot approximately 30% of the time.

After applying this patch (and following DUSSERT Nicolas' instructions) my clients are now booting 100% of the time.

Server: Ubuntu 10.04 LTS (running as a PV domU on XenServer 5.5)
Clients: i386 Fat Clients (Intel Atom + 1GB RAM, w/ Realtek NIC)

Revision history for this message
robanon (robert-mcdonald) wrote :

Same problem, the patch appears to fix things though.

Revision history for this message
Nikolaus Rath (nikratio) wrote :

Could someone post the kernel messages when the client hangs? I suspect this may be the same bug as bug 604314.

Changed in ltsp:
status: New → Confirmed
Revision history for this message
Matthew Wyneken (mawyn) wrote :

I'm not sure if I have exactly the same problem, but there are certain similarities, the main one being that my client machines would boot sometimes but not other times. I was also unable to switch from LDM to GDM before I applied the patch.

Whatever my exact problem was, the patch that disables nbd-proxy made it possible for my clients to boot and for me to use GDM.

Revision history for this message
Steve Rippl (steverippl) wrote :

We're experiencing this too; here's a report from the tech who builds our LTSP servers, which includes the network equipment we're using. Applying the patch (disabling nbd-proxy) doesn't completely solve things for us. We had none of these problems on 9.04.

=======

When two or more switches are between client and server, a failure to boot
occurs before launching LDM about 50% of the time. The frequency of the
failure to boot is reduced by connecting to the server through only one
intermediate switch, but it does not go away completely. The following
bugs seem to reference the same problem and indicate that it is
nbd-proxy that is falling down during the boot process.

https://bugs.launchpad.net/ubuntu/+source/ltsp/+bug/604314
https://bugs.launchpad.net/ltsp/+bug/589034

The latter of these references a patch that, when applied, disables the
activation of nbd-proxy during the client setup process. My success
rate for booting clients after applying this patch rises to about 70%.
Load seems to affect this condition adversely, as do speed and duplex
transitions, such as many 10/100 clients connecting to a server
attached to a gigabit port. The clients in question are Atom-based and
include gigabit Ethernet ports, but are connected to 10/100 ports in the
classroom. The switches being used include HP ProCurve edge products as
well as a variety of unmanaged 10/100 5-8 port switches.

Revision history for this message
Nikolaus Rath (nikratio) wrote :

I had to disable nbd-proxy *and* add the -persist option to nbd-client (see bug 604314). Maybe you want to try that too.
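
For reference, a sketch of the manual invocation (the server address, port 2000, and /dev/nbd0 are placeholder assumptions for a default Lucid setup):

    # Connect the NBD root directly, and reconnect automatically if the
    # TCP session drops (-persist)
    nbd-client 192.168.0.254 2000 /dev/nbd0 -persist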

Revision history for this message
Ricardo Pérez López (ricardo) wrote :

I can confirm this bug with Ubuntu 10.04.1 LTS and LTSP fat clients. Sometimes the clients hang during boot, and sometimes (maybe on the next reboot) they boot OK. After applying the patch to disable nbd-proxy, everything works perfectly well.

BTW: for now, I haven't needed to add the -persist option as Nikolaus did. I'll try it and see if there's any difference.

tags: added: patch
Revision history for this message
Ricardo Pérez López (ricardo) wrote :

This post describes using a "sleep 5" workaround to fix the issue:

http://www.rickogden.com/2010/06/ltsp-part-2-configuration/

It could be worth trying.
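
Presumably the delay goes into the client's initramfs NBD script just before nbd-client starts, to give slow switch ports time to finish autonegotiation. A guess at the shape of it (the placement and variable names here are illustrative, not taken from the post):

    # wait for the link to settle before opening the NBD connection
    sleep 5
    nbd-client "$nbd_server" "$nbd_port" "$nbd_device"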

Revision history for this message
Tom Jampen (jampen) wrote :

I ran into the same issue on (K)ubuntu Maverick Meerkat.

I just had to disable nbd-proxy as follows:

In the file
/var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default
I added the boot option
nbd_proxy=false
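
For illustration, the relevant part of such a file might look like this (the kernel and initrd names are whatever your tree already uses; only the nbd_proxy=false addition matters):

    # /var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default
    default ltsp
    label ltsp
      kernel vmlinuz
      append ro initrd=initrd.img quiet splash nbd_proxy=false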

Thanks to the guys on irc.freenode.net!

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

As this is frequently asked on the #ltsp IRC channel, here are some detailed instructions on how to disable nbd_proxy.

Steps only for Lucid:
--------------------------
1) Get a newer version of ltsp_nbd, by clicking "Download file" from this link:
http://bazaar.launchpad.net/~ltsp-upstream/ltsp/ltsp-trunk/annotate/1777/client/initramfs/scripts/ltsp_nbd
Save it in /opt/ltsp/i386/usr/share/initramfs-tools/scripts/ltsp_nbd
(the patch attached to this bug report isn't needed; an option to disable nbd_proxy was added before this bug was reported)

2) Update your initramfs and kernels:
sudo chroot /opt/ltsp/i386 update-initramfs -u
sudo ltsp-update-kernels

Common steps for Lucid and Maverick:
-------------------------------------------------
3) To specify that you don't want to use nbd_proxy:
$ sudo gedit /var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default
and add "nbd_proxy=false" right next to "quiet splash" there.

4) To prevent the "nbd_proxy=false" option from being overwritten by future calls to ltsp-update-kernels:
$ sudo gedit /etc/ltsp/ltsp-update-image.conf
and add this line:
BOOTPROMPT_OPTIONS="quiet splash nbd_proxy=false"

Revision history for this message
kesou (guilhem-souque) wrote :

Thanks for the fix; it worked for me (LTSP server on Ubuntu 10.04).
Without the fix, the thin client failed to boot approximately 40% of the time.

Revision history for this message
Nuno Sucena Almeida (slug-debian) wrote :

Hi,
I made a small change to the patch above to allow for nbd_client_options and to force -persist when disabling nbd_proxy. See the attachment for the full file. I'm still in the process of confirming that this actually solves the issue, though.
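
For readers who can't see the attachment, a rough sketch of the idea inside the ltsp_nbd script (variable names are illustrative, not taken from the actual attachment):

    # when nbd_proxy is disabled, connect to the server directly, force
    # reconnection on drops, and pass through any extra client options
    if [ "$nbd_proxy" = "false" ]; then
        nbd-client "$nbd_server" "$nbd_port" "$nbd_device" \
            -persist $nbd_client_options
    fi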

Revision history for this message
Stéphane Graber (stgraber) wrote :

Marking as Fix Released, as nbd-proxy has been disabled for a while now both upstream and in more recent Ubuntu releases.
A rewrite of nbd-proxy has been done and should fix most of these issues, so we might turn nbd-proxy back on in a later release.

Changed in ltsp:
status: Confirmed → Fix Released
Changed in ltsp (Ubuntu):
status: New → Fix Released