Kernel deadlock when running programs from aufs over squashfs

Bug #604314 reported by Nikolaus Rath
This bug affects 5 people
Affects         Status    Importance  Assigned to
linux (Ubuntu)  Invalid   Undecided   Unassigned   (nominated for Lucid by Anton S. Ustyuzhanin)
ltsp (Ubuntu)   New       Undecided   Unassigned   (nominated for Lucid by Anton S. Ustyuzhanin)

Bug Description

I have set up a Lucid diskless fat client using ltsp. The root filesystem is aufs. Underlying the aufs is an rw tmpfs and a ro squashfs, the latter mounted from NBD.
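
For reference, the mount stack is built up roughly like this (a sketch rather than my exact commands; the server name, NBD port, device and mount points here are placeholders):

# Attach the read-only squashfs image exported over the network
# (server name, port and device are placeholders)
nbd-client ltsp-server 2000 /dev/nbd0
mount -t squashfs -o ro /dev/nbd0 /rofs

# Writable layer in RAM
mount -t tmpfs -o rw,mode=755 tmpfs /cow

# aufs root: tmpfs branch on top (rw), squashfs branch below (ro)
mount -t aufs -o br=/cow=rw:/rofs=ro none /root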

The problem is that the fat clients work fine for a little while, but then reproducibly freeze completely within a few hours of booting. The last syslog messages that the server receives are:

Jul 10 14:11:38 beta kernel: [25560.688091] INFO: task cron:2278 blocked for more than 120 seconds.
Jul 10 14:11:38 beta kernel: [25560.688100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 14:11:38 beta kernel: [25560.688107] cron D 00006323 0 2278 1 0x00000000
Jul 10 14:11:38 beta kernel: [25560.688118] d549fa0c 00000086 00000080 00006323 00000000 c0847760 d5fb9c2c c0847760
Jul 10 14:11:38 beta kernel: [25560.688135] b0918eaf 0000170c c0847760 c0847760 d5fb9c2c c0847760 c0847760 d6345400
Jul 10 14:11:38 beta kernel: [25560.688151] b08e6975 0000170c d5fb9980 c1d08760 d5fb9980 d549fa58 d549fa1c c058a5ca
Jul 10 14:11:38 beta kernel: [25560.688168] Call Trace:
Jul 10 14:11:38 beta kernel: [25560.688185] [<c058a5ca>] io_schedule+0x3a/0x60
Jul 10 14:11:38 beta kernel: [25560.688194] [<c022d1f8>] sync_buffer+0x38/0x40
Jul 10 14:11:38 beta kernel: [25560.688201] [<c058ad6d>] __wait_on_bit+0x4d/0x70
Jul 10 14:11:38 beta kernel: [25560.688207] [<c022d1c0>] ? sync_buffer+0x0/0x40
Jul 10 14:11:38 beta kernel: [25560.688214] [<c022d1c0>] ? sync_buffer+0x0/0x40
Jul 10 14:11:38 beta kernel: [25560.688220] [<c058ae3b>] out_of_line_wait_on_bit+0xab/0xc0
Jul 10 14:11:38 beta kernel: [25560.688230] [<c0167850>] ? wake_bit_function+0x0/0x50
Jul 10 14:11:38 beta kernel: [25560.688237] [<c022d1be>] __wait_on_buffer+0x2e/0x30
Jul 10 14:11:38 beta kernel: [25560.688266] [<f80ec30b>] squashfs_read_data+0x30b/0x720 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688277] [<c0144f39>] ? load_balance_newidle+0x99/0x300
Jul 10 14:11:38 beta kernel: [25560.688290] [<f80ecb06>] squashfs_cache_get+0x1c6/0x2f0 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688304] [<f80ecd18>] squashfs_read_metadata+0x68/0xe0 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688317] [<f80ee488>] squashfs_read_inode+0x78/0x5b0 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688330] [<f80ef0e7>] ? squashfs_alloc_inode+0x17/0x30 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688340] [<c021cf9e>] ? inode_init_always+0xfe/0x190
Jul 10 14:11:38 beta kernel: [25560.688347] [<c021e015>] ? get_new_inode_fast+0xe5/0x110
Jul 10 14:11:38 beta kernel: [25560.688359] [<f80eea11>] squashfs_iget+0x51/0x80 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688371] [<f80eee73>] squashfs_lookup+0x293/0x320 [squashfs]
Jul 10 14:11:38 beta kernel: [25560.688384] [<c0212cb5>] __lookup_hash+0xc5/0x110
Jul 10 14:11:38 beta kernel: [25560.688390] [<c0212e0c>] lookup_hash+0x2c/0x30
Jul 10 14:11:38 beta kernel: [25560.688411] [<f82038ac>] vfsub_lookup_hash+0x1c/0x40 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688429] [<f8209a1e>] au_lkup_one+0x9e/0xd0 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688437] [<c058b577>] ? do_nanosleep+0x97/0xc0
Jul 10 14:11:38 beta kernel: [25560.688455] [<f8209ce6>] au_do_lookup+0x96/0x1f0 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688476] [<f820a383>] au_lkup_dentry+0x193/0x270 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688495] [<f82093ad>] ? do_ii_read_lock+0x2d/0x30 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688541] [<f82102c5>] aufs_lookup+0xd5/0x1e0 [aufs]
Jul 10 14:11:38 beta kernel: [25560.688550] [<c058c32d>] ? _spin_lock+0xd/0x10
Jul 10 14:11:38 beta kernel: [25560.688563] [<c021b84b>] ? d_alloc+0x13b/0x190
Jul 10 14:11:38 beta kernel: [25560.688578] [<c0211177>] real_lookup+0xb7/0x110
Jul 10 14:11:38 beta kernel: [25560.688590] [<c0212bc5>] do_lookup+0x95/0xc0
Jul 10 14:11:38 beta kernel: [25560.688602] [<c02134b3>] __link_path_walk+0x603/0xca0
Jul 10 14:11:38 beta kernel: [25560.688616] [<c0101c1d>] ? __switch_to+0xcd/0x180
Jul 10 14:11:38 beta kernel: [25560.688628] [<c0213d64>] path_walk+0x54/0xc0
Jul 10 14:11:38 beta kernel: [25560.688640] [<c0213ee9>] do_path_lookup+0x59/0x90
Jul 10 14:11:38 beta kernel: [25560.688652] [<c0214a31>] user_path_at+0x41/0x80
Jul 10 14:11:38 beta kernel: [25560.688666] [<c016bd46>] ? hrtimer_try_to_cancel+0x36/0xb0
Jul 10 14:11:38 beta kernel: [25560.688679] [<c058b577>] ? do_nanosleep+0x97/0xc0
Jul 10 14:11:38 beta kernel: [25560.688692] [<c016be88>] ? hrtimer_nanosleep+0xa8/0x140
Jul 10 14:11:38 beta kernel: [25560.688705] [<c020c89a>] vfs_fstatat+0x3a/0x70
Jul 10 14:11:38 beta kernel: [25560.688717] [<c020c9f0>] vfs_stat+0x20/0x30
Jul 10 14:11:38 beta kernel: [25560.688729] [<c020ca19>] sys_stat64+0x19/0x30
Jul 10 14:11:38 beta kernel: [25560.688743] [<c016ad50>] ? hrtimer_wakeup+0x0/0x30
Jul 10 14:11:38 beta kernel: [25560.688755] [<c016bd06>] ? hrtimer_start_range_ns+0x26/0x30
Jul 10 14:11:38 beta kernel: [25560.688769] [<c015182e>] ? sys_time+0x1e/0x60
Jul 10 14:11:38 beta kernel: [25560.688781] [<c01033ec>] syscall_call+0x7/0xb

These messages appear for different tasks (not just cron), and to me the call traces look identical (I can also attach a full set of log messages).

The fat client image was generated by the karmic ltsp tools and then upgraded to Lucid in the chroot.

I am quite willing to help debug this further. Let me know what to do.
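
If it would help, the next time a client hangs I can try to dump the blocked tasks directly, assuming the console still responds to SysRq:

# Make sure the magic SysRq key is enabled
echo 1 > /proc/sys/kernel/sysrq
# Dump backtraces of all tasks in uninterruptible (D) sleep to the kernel log
echo w > /proc/sysrq-trigger
# Optionally raise the hung-task watchdog timeout while testing
echo 300 > /proc/sys/kernel/hung_task_timeout_secs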

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: ltsp-client (not installed)
ProcVersionSignature: Ubuntu 2.6.32-23.37-generic 2.6.32.15+drm33.5
Uname: Linux 2.6.32-23-generic i686
Architecture: i386
Date: Sun Jul 11 11:12:25 2010
EcryptfsInUse: Yes
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: ltsp

Revision history for this message
Nikolaus Rath (nikratio) wrote :

Update:

1) If I entirely disable nbd-proxy and add the --persist option to nbd-client, the problem goes away (a sketch of what I mean follows point 2 below).

2) The problem also seems to arise only if the client has to go over a couple of switches and routers to reach the server. Clients that are plugged into the same switch as the server do not seem affected.
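
Roughly, the working setup from point 1) amounts to this (a sketch only; the server name, port and device are placeholders, and where exactly nbd-proxy gets disabled depends on the initramfs scripts):

# Connect nbd-client directly to the server instead of through the
# local nbd-proxy, and let it reconnect by itself if the link drops
# (depending on the nbd-client version the flag is spelled -persist
# or --persist)
nbd-client ltsp-server 2000 /dev/nbd0 -persist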

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi Nikolaus,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.
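
For example, installing a mainline build usually comes down to something like this (a sketch; the exact .deb file names depend on the kernel version and your architecture):

# Download the linux-image and matching linux-headers .deb files for
# your architecture from the mainline builds archive, then:
sudo dpkg -i linux-image-*.deb linux-headers-*.deb
sudo reboot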

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Nikolaus Rath (nikratio) wrote :

Problem seems to be in userspace with nbd-proxy and nbd-client.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Anton S. Ustyuzhanin (kit-miras) wrote :

>Update:
>1) If I entirely disable nbd-proxy, and add the --persist option to nbd-client, the problem goes away.
>2) The problem also seems to arise only if the client has to go over a couple of switches and routers to reach the server. Clients that are plugged into the same switch as the server do not seem affected.

I didn't try disabling nbd-proxy and adding the --persist option to nbd-client, but I switched to nfs+aufs on the thin client. The situation is the same: if the thin client is plugged into the server's switch it works perfectly, but behind additional switches the thin clients don't boot. Nevertheless, thin clients in the same network topology (a couple of switches) boot perfectly with LTSP on Ubuntu 9.10 and earlier. The problem appears only with the LTSP of the 10.04 (ku|edu|xu)buntu versions.

Revision history for this message
Steve Rippl (steverippl) wrote :

We're experiencing this too; here's a report from the tech who builds our LTSP servers, which includes the network equipment we're using. Applying the patch (disabling nbd-proxy) doesn't completely solve things for us. We had none of these problems on 9.04.

=======

When two or more switches are between client and server, a failure to boot
occurs before launching LDM about 50% of the time. The frequency of the
failure to boot is reduced by being connected to the server with only
one intermediate switch, but it does not go away completely. The following
bugs seem to reference the same problem and indicate that it is
nbd-proxy that is falling down during the boot process.

https://bugs.launchpad.net/ubuntu/+source/ltsp/+bug/604314
https://bugs.launchpad.net/ltsp/+bug/589034

The latter of these references a patch that, when applied, disables the
activation of nbd-proxy during the client setup process. My success
rate for booting clients after applying this patch rises to about 70%.
Load seems to affect this condition adversely, as do speed and
duplex transitions, such as many 10/100 clients connecting to a server
attached to a gigabit port. The clients in question are Atom based and
include gigabit ethernet ports but are connected to 10/100 ports in the
classroom. Switches in use include HP ProCurve edge products as
well as a variety of unmanaged 10/100 5-8 port switches.

Revision history for this message
Nuno Sucena Almeida (slug-debian) wrote :

Hi, I'm running into the same problem almost daily on several SMP machines (24 CPUs) with high load, requiring a reboot, using Ubuntu 10.04 LTS. I was initially running the stock kernel (2.6.32-28-generic) but switched to 2.6.35 to see if it would go away, with the same result.

cat /proc/version
Linux version 2.6.35-23-generic (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #41~lucid1-Ubuntu SMP Thu Dec 2 22:27:43 UTC 2010

cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=16466640k,nr_inodes=4116660,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /cow tmpfs rw,relatime,mode=755 0 0
/dev/nbd0 /rofs squashfs ro,relatime 0 0
aufs / aufs rw,relatime,si=f440d5a54f83f64d 0 0
tmpfs /cow tmpfs rw,relatime,mode=755 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /var/run tmpfs rw,nosuid,relatime,mode=755 0 0
none /var/lock tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime 0 0
server:/opt /opt nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.10.0.10,mountvers=3,mountproto=tcp,addr=10.10.0.10 0 0
server:/storage /storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.10.0.10,mountvers=3,mountproto=tcp,addr=10.10.0.10 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0

See the attachment with the kern.log for one of the machines.
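
Next time one of the machines wedges I can also check whether the NBD connection itself is still alive (assuming this version of nbd-client supports the -c/-check option):

# Prints the PID of the attached nbd-client and exits 0 if /dev/nbd0
# is connected; exits non-zero otherwise
nbd-client -c /dev/nbd0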

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

Marking as a duplicate of https://bugs.launchpad.net/ltsp/+bug/589034, since the reporter stated that disabling nbd-proxy solves the problem for him.
If disabling nbd-proxy doesn't fix the problem for you, please file a separate bug.

