nbd+squashfs errors when rebooting ltsp thin clients

Bug #457702 reported by Veli-Matti Lintu on 2009-10-21
This bug affects 4 people
Affects: ldm (Ubuntu), assigned to Stéphane Graber
Affects: ltsp (Ubuntu)

Bug Description

Testing done on ltsp image built on Karmic from the newest packages available in the repos.

Server: Karmic amd64
Thin client: Karmic i386
Image built with command: ltsp-build-client --arch=i386

After building a fresh image with newest Karmic packages, the ltsp thin clients do not reboot properly, but throw nbd+squashfs related errors on the console. After the errors are shown, the thin client is frozen and it does not reboot automatically. Shutdown works properly with no error messages. The same error happens on every boot and this has happened on multiple thin clients.

Attached is the log file for the entire session, in which the thin client boots from the network and reboot is selected from the ldm menu. The log was captured from the serial console.

The shutdown part gives out these messages:

init: hal main process (856) terminated with status 1
init: cron main process (2247) killed by TERM signal
init: tty1 main process (3341) killed by TERM signal
init: Disconnected from system bus
init: rsyslog-kmsg main process (444) killed by TERM signal
 * Asking all remaining processes to terminate... init: hwclock-save main process (3406) killed by TERM signal
[ 57.931304] nbd0: Receive control failed (result -4)
[ 57.955127] nbd0: Attempted send on closed socket
[ 57.959994] end_request: I/O error, dev nbd0, sector 416250
[ 57.965709] SQUASHFS error: squashfs_read_data failed to read block 0xcb3f6c9
[ 57.972983] SQUASHFS error: Unable to read metadata cache entry [cb3f6c9]
[ 57.979941] SQUASHFS error: Unable to read directory block [cb3f6c9:193c]
[ 57.986973] SQUASHFS error: Unable to read metadata cache entry [cb3f6c9]
[ 57.993887] SQUASHFS error: Unable to read directory block [cb3f6c9:193c]
[ 58.000885] SQUASHFS error: Unable to read metadata cache entry [cb3f6c9]
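
For what it's worth, the sector in the end_request line and the block in the first SQUASHFS line appear to point at the same place on the device. Below is a minimal sketch of that arithmetic. It assumes the kernel reports 512-byte sectors, that the SQUASHFS "block" value is a byte offset into the image, and that squashfs reads the device in 1024-byte blocks (that last value is a guess for this kernel, not something the log states). The function name is made up for illustration.

```python
SECTOR_SIZE = 512          # assumption: end_request reports 512-byte sectors
DEVICE_BLOCK_SIZE = 1024   # assumption: read granularity used by squashfs


def first_sector_of_read(byte_offset, block_size=DEVICE_BLOCK_SIZE):
    """Return the first 512-byte sector of the device block holding byte_offset."""
    block_start = (byte_offset // block_size) * block_size
    return block_start // SECTOR_SIZE


squashfs_block = 0xCB3F6C9   # from "squashfs_read_data failed to read block"
failed_sector = 416250       # from "end_request: I/O error, dev nbd0"

# True under the assumptions above
print(first_sector_of_read(squashfs_block) == failed_sector)
```

If those assumptions hold, the SQUASHFS errors are just the first failed read after nbd0 lost its socket, rather than separate damage in the image.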

Veli-Matti Lintu (vmlintu) wrote :

Another log file with boot + shutdown. This works with no problems.

Veli-Matti Lintu (vmlintu) wrote :

The problem seems to be related to the "reboot -p" call in gtkgreet/greeter.c of ldm. Changing that to "reboot -fp" fixes the problem.

affects: ubuntu → ldm (Ubuntu)
Veli-Matti Lintu (vmlintu) wrote :

This patch fixed the problem. It adds -f also to poweroff just in case.

Andrew Rigney (ubuntultspadmin) wrote :

How do you apply this patch?

torabian (torabian02) wrote :

Yes please clarify, how do you apply this patch?

Jakob Unterwurzacher (jakobunt) wrote :

You can get the updated ldm package here (it includes the patch): https://launchpad.net/~opinsys/+archive/ppa/+sourcepub/836591/+listing-archive-extra

Jakob Unterwurzacher (jakobunt) wrote :

Conversation on #ltsp about this problem: http://www.nubae.com/logs/ltsp20091028_pg1.html (starting at 07:17)

Stéphane Graber (stgraber) wrote :

Fix has been applied upstream.
Lucid will contain the fix next time I upload, I'll also apply it to my PPA.
If anyone feels like preparing an SRU, please feel free to do so.

Changed in ltsp (Ubuntu):
status: New → Invalid
Changed in ldm (Ubuntu):
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Stéphane Graber (stgraber)
Changed in ldm (Ubuntu):
status: Fix Committed → Fix Released
müzso (bit2) wrote :

There's another call to "reboot": after logout from your Gnome session ldm checks whether there's a new version of the NBD image on the server. If there is, then a reboot is issued.
In the source of the "ldm" package this is in the file "rc.d/I01-nbd-checkupdate" at the very end of the script.

The package available at https://launchpad.net/~opinsys/+archive/ppa/+sourcepub/836591/+listing-archive-extra does not yet include the fix for the above problem (it contains the fix only for the regular reboots through the GUI), but the trunk version of the ldm package (available at http://bazaar.launchpad.net/~ltsp-upstream/ltsp/ldm-trunk) already has this fix, so probably the next release version of ldm will contain it too.

Nikolaus Rath (nikratio) wrote :

The above fix calls reboot with the -f option to force a hard reboot without properly shutting down the system.

This is not always a good idea. The problem still exists when some other application (or the user) happens to issue a normal shutdown command.

Moreover, doing a hard reboot is not always a good idea. I have a couple of fat clients that I'd really rather shut down properly (so that e.g. the NFS mounted home directories are correctly flushed and umounted).

The real problem that has to be corrected here is that sometime during the shutdown the network interface is deactivated. The NBD then becomes unavailable and the entire system freezes, since it cannot access the root file system anymore.

There is actually a check in /etc/init.d/networking that should prevent the network from being disabled when something is mounted from /dev/nbd, but either it does not work in this case or there is another script that shuts down the network.
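
To illustrate the kind of check meant here, this is a rough sketch and not the code from /etc/init.d/networking. It scans mount-table text in /proc/mounts format for devices under /dev/nbd. The function names and the sample mount table are made up for illustration.

```python
def nbd_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose device is an NBD block device."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mountpoint = fields[0], fields[1]
        if device.startswith("/dev/nbd"):
            found.append((device, mountpoint))
    return found


def network_must_stay_up(mounts_text):
    """True if anything is still mounted from an NBD device."""
    return bool(nbd_mounts(mounts_text))


# Hypothetical mount table for a thin client; real contents will differ.
sample = """\
rootfs / rootfs rw 0 0
/dev/nbd0 /rofs squashfs ro,relatime 0 0
tmpfs /cow tmpfs rw,relatime 0 0
"""

print(network_must_stay_up(sample))
```

On a running client the input would be the contents of /proc/mounts. A check like this only helps if every script that can take the interface down actually consults it.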

Changed in ldm (Ubuntu):
status: Fix Released → Confirmed
Changed in ltsp (Ubuntu):
status: Invalid → Confirmed
Changed in ldm (Ubuntu):
status: Confirmed → Invalid
Nikolaus Rath (nikratio) wrote :

And it turns out that the problem is that /etc/init.d/sendsigs kills nbd-client. Attached is a debdiff that should fix the problem once and for all. I would also suggest reverting the change in LDM, as it's no longer necessary.

Please don't be too harsh on the debdiff, it's my very first one.
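
The idea of sparing nbd-client can be sketched roughly as follows. This is not the debdiff itself, only an illustration under assumptions: killall5 accepts "-o PID" to omit a process, and the helper names and the process list below are invented for the example.

```python
PROTECTED_NAMES = ("nbd-client",)   # assumption: processes that must survive


def pids_to_omit(processes, protected=PROTECTED_NAMES):
    """processes: iterable of (pid, command_name). Return sorted PIDs to spare."""
    return sorted(pid for pid, name in processes if name in protected)


def killall5_args(signal_number, omit_pids):
    """Build a killall5 command line that skips omit_pids via repeated -o."""
    args = ["killall5", "-%d" % signal_number]
    for pid in omit_pids:
        args += ["-o", str(pid)]
    return args


# Hypothetical process table; PIDs are made up.
processes = [(1, "init"), (312, "nbd-client"), (856, "hald"), (2247, "cron")]

print(killall5_args(15, pids_to_omit(processes)))
# prints ['killall5', '-15', '-o', '312']
```

With the NBD client left alone, the root file system stays readable while the remaining processes are terminated.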

tags: added: patch
Changed in ltsp (Ubuntu):
assignee: nobody → Ubuntu Sponsors for main (ubuntu-main-sponsors)
Nikolaus Rath (nikratio) wrote :

...and here is a branch for merging in Lucid. This patch also protects nbd-proxy: http://bazaar.launchpad.net/~nikratio/ltsp/ubuntu.bug457702

Colin Watson (cjwatson) wrote :

Subscribed ubuntu-sponsors, unassigned ubuntu-main-sponsors (best not to use assignment for this).

Changed in ltsp (Ubuntu):
assignee: Ubuntu Sponsors for main (ubuntu-main-sponsors) → nobody
Stéphane Graber (stgraber) wrote :

Please note that in Lucid, an upstart script exists for both reboot and shutdown that triggers a forced shutdown/reboot and bypasses the regular shutdown sequence.

These two scripts issue a "sync", then reboot or shut down depending on the runlevel.
/etc/init.d/sendsigs will then never be called at shutdown or reboot, making updating the process blacklist useless in this case.

That new LTSP including updated upstart jobs and other upstream fixes will be uploaded later today.
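
As a rough illustration of the sequence described (not the actual upstart jobs, which are not shown in this report), the sketch below maps a runlevel to the commands that would run. It assumes the usual convention that runlevel 0 means halt and 6 means reboot, and that the forced variants are "poweroff -f" and "reboot -f". The function name is made up.

```python
def forced_commands(runlevel):
    """Return the commands to run, in order, for a forced halt or reboot."""
    if runlevel == "0":
        final = ["poweroff", "-f"]   # assumption: forced power-off
    elif runlevel == "6":
        final = ["reboot", "-f"]     # assumption: forced reboot
    else:
        raise ValueError("not a shutdown runlevel: %r" % runlevel)
    # Flush buffers first, then skip the regular shutdown scripts entirely.
    return [["sync"], final]


print(forced_commands("6"))
# prints [['sync'], ['reboot', '-f']]
```

Because nothing after "sync" touches the NBD-backed root file system, the order in which the network or nbd-client would otherwise be stopped no longer matters.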

Changed in ltsp (Ubuntu):
status: Confirmed → Fix Released

K. O. (h-admin-mi-fh-offenburg-de) wrote :

Hey guys,

We are currently facing the same problem, or at least something related.

We use a Xubuntu 10.04.1 server (amd64) with i386 clients, LTSP 5.2.1ubuntu9, and a separate (Windows) DHCP server.
If the LTSP server goes down, whether by shutdown, reboot, hard shutdown, etc., the clients wait for around 1 minute (usually less) and then start printing the errors mentioned above ...

First ...
[ 57.955127] nbd0: Attempted send on closed socket
[ 57.959994] end_request: I/O error, dev nbd0, sector 416250

and then ...

[ 57.965709] SQUASHFS error: squashfs_read_data failed to read block 0xcb3f6c9
[ 57.972983] SQUASHFS error: Unable to read metadata cache entry [cb3f6c9]
[ 57.979941] SQUASHFS error: Unable to read directory block [cb3f6c9:193c]
... to infinity ...

Slightly different IDs etc., but the rest is identical. The client continues like that until I shut it down manually (pulling the plug) or try to log in with SSH. The authentication/login works, but you never get a prompt, just sometimes an "Input/Output Error". The client then hangs completely and also stops printing the above errors. Sometimes it hangs even earlier and SSH is not reachable.

We first hit this with a hardware server that takes about 4 minutes to boot (lots of RAM to count). With a virtual server (VMware), reboots take just seconds and the clients reboot automatically when the server is back up. If I pause the boot, the clients behave the same way as with the hardware server.

Since the bug is marked as fixed above and we use a much newer version, I was really confused that the fix is not present in our .../initramfs/scripts/ltsp_nbd file. But even if I add the fix manually, the problem remains.

Any solution? We already have an older server with Ubuntu 8.04 (32-bit) and 15 clients running without such problems.

Stian Hill (stian-axachi) wrote :

Hi, this is strange. I have Ubuntu 12.04 Precise with a default apt-get install ltsp-server, followed by ltsp-build-client. This creates an amd64 folder. I use a Windows DHCP server. I installed Likewise5 to get AD integration. It all works. The only thing is that when the client has been on for a while, it gets the error listed here...

I would have thought that this fix was implemented in the newer versions?

K. O. (h-admin-mi-fh-offenburg-de), did you find out anything?
