nbd-proxy hangs the nbd-connection to server
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
LTSP5 | Fix Released | Undecided | Unassigned |
ltsp (Ubuntu) | Fix Released | Undecided | Unassigned |
Bug Description
I am running an ltsp server on Ubuntu 10.04 (Lucid Lynx). The server is a Primergy TX120 S2, and the terminal connecting to it is an HP Probook 4310s. The server installation uses the amd64 architecture, but the terminal image is i386. The problem can also be reproduced with a kvm virtual machine acting as the server, with a similar installation, and it has been observed with another type of terminal machine as well (XPC shuttle X27D).
The server contains ltsp-server and ltsp-server-
I am using dnsmasq as the dhcp-server, and the following settings in /var/lib/
[default]
LDM_DIRECTX = True
LOCAL_APPS = True
LOCALDEV = True
NBD_SWAP = True
REMOTE_APPS = True
On the server side, Linux reports the following about the network interface that is connected to the terminal (some dmesg snippets here):
[ 1.862987] 0000:30:00.0: eth1: (PCI Express:
[ 1.862989] 0000:30:00.0: eth1: Intel(R) PRO/1000 Network Connection
[ 1.863069] 0000:30:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50858-004
[ 20.324320] ADDRCONF(
[ 22.891005] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 22.892038] ADDRCONF(
On the terminal side, Linux reports the following about the network interface that is connected to the server (dmesg snippets):
[ 1.451527] sky2 eth0: addr 00:26:55:c4:06:95
[ 4.535708] sky2 eth0: enabling interface
[ 4.535949] ADDRCONF(
[ 7.029456] sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
[ 7.029693] ADDRCONF(
With this configuration, the nbd connection to the server works quite well, without any significant problems (it does appear to hang, but only rarely). However, when a switch (ZyXEL Desktop Ethernet Switch 10/100Mbps) is put between these computers, the network interface state changes on the server:
[18989.100157] e1000e: eth1 NIC Link is Down
[18994.101017] e1000e: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[18994.101023] 0000:30:00.0: eth1: 10/100 speed: disabling TSO
And on the terminal side:
[ 248.484539] sky2 eth0: Link is down.
[ 254.785883] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
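To spot this kind of renegotiation on other machines, the link events can be filtered out of the kernel log. This is only a minimal sketch: `link_events` is a hypothetical helper name, and the pattern is assumed from the e1000e and sky2 messages quoted above, so other drivers may word their messages differently.

```sh
# Print only the lines that report a link going up or down.
# Reads the files given as arguments, or stdin when none are given.
link_events() {
    grep -E 'Link is ([Uu]p|[Dd]own)' "$@"
}
```

Typical use would be `dmesg | link_events` on both the server and the terminal, then comparing the reported speeds.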
On this slower network connection between the server and the terminal, the nbd connection frequently hangs. Loading the kernel and initial ramdisk is always reliable, but the nbd connection may stop transferring data at some point. That point appears to vary randomly, yet it is often before the login screen comes up. Note that the nbd connection does remain open: at least on the server side, a socket connection to the terminal remains established, but nothing is transferred between the machines.
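One rough way to confirm that "nothing is transferred" is to sample the interface byte counters twice and compare them. The sketch below assumes the usual Linux `/proc/net/dev` layout (receive bytes in the first counter column, transmit bytes in the ninth); `iface_bytes` and `counters_unchanged` are hypothetical helper names, not part of LTSP or nbd.

```sh
# Print "rx_bytes tx_bytes" for interface $1.
# $2 is an optional /proc/net/dev-style file (defaults to /proc/net/dev).
iface_bytes() {
    # Replace the colon after the interface name with a space so that
    # "eth1:1234" and "eth1: 1234" are split into fields the same way.
    awk -v ifc="$1" '{ sub(/:/, " ") } $1 == ifc { print $2, $10 }' \
        "${2:-/proc/net/dev}"
}

# Succeed if two samples are non-empty and identical,
# meaning no bytes were counted on the interface in between.
counters_unchanged() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

For example, on the server: `a=$(iface_bytes eth1); sleep 10; b=$(iface_bytes eth1); counters_unchanged "$a" "$b" && echo "no traffic on eth1"`. Any unrelated traffic on the same interface will change the counters, so this only gives a clear answer on a link dedicated to the terminal.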
With the switch in place, roughly 30-40% of boots reach the ldm login screen. Without the switch in between, using a direct gigabit link, the success rate is somewhere between 90% and 100%.
It seems this problem is due to nbd-proxy, because the issue goes away when nbd-proxy is disabled in the initial ramdisk downloaded by the terminal. With nbd-client connecting directly to the server, the success rate of reaching the ldm login screen appears to be pretty close to 100%.
I suspect there is a correlation between the terminal CPU speed and the network speed that affects this issue. Perhaps if a terminal machine is comparatively slow and the network is fast, this problem occurs very rarely?
This problem can be worked around by disabling nbd-proxy. This can be done by applying the attached patch to the terminal tree (under /opt/ltsp/i386 for the i386 architecture), and then rebuilding the terminal image with
"sudo ltsp-update-image --arch i386".
tags: | added: patch |
I've got the same problem! No problems with Ubuntu 9.04, but problems with 10.04.
With Lucid, 16 workstations take about 1 hour to start, and only 20-40% of them reach the ldm login. I have the same configuration: the server on one switch and the workstations on other switches.
I've tested the patch from Juha Erkkilä and it works fine: all the workstations start in 2 minutes and 100% reach the ldm screen.
BUT: you have to do more than the patch + "sudo ltsp-update-image --arch i386".
In fact, you have to apply the patch, THEN run "update-initramfs -u" in the chroot, THEN quit the chroot and run "ltsp-update-kernels", then "ltsp-update-image".
And it works !!!!
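The rebuild order described in this comment, as best it can be read, could be scripted roughly as follows. This is a sketch under assumptions: the chroot path `/opt/ltsp/<arch>` is taken from the original report, the patch is assumed to have been applied already, and `run`, `rebuild_ltsp_image` and the `DRY_RUN` switch are hypothetical names added here so the steps can be previewed without touching the system.

```sh
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rebuild the client boot files and image for architecture $1 (default i386).
# Must be run as root on the LTSP server, after patching the chroot.
rebuild_ltsp_image() {
    arch="${1:-i386}"
    chroot_dir="/opt/ltsp/$arch"
    # 1. Regenerate the initramfs inside the client chroot so it includes the patch.
    run chroot "$chroot_dir" update-initramfs -u || return 1
    # 2. Refresh the kernel/initramfs copies that clients fetch at boot.
    run ltsp-update-kernels || return 1
    # 3. Rebuild the client image itself.
    run ltsp-update-image --arch "$arch" || return 1
}
```

Setting `DRY_RUN=1` before calling `rebuild_ltsp_image i386` prints the three commands in order instead of running them, which is a cheap way to check the sequence before a real rebuild.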