Kernel deadlock when running programs from aufs over squashfs
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | Invalid | Undecided | Unassigned | |
| ltsp (Ubuntu) | New | Undecided | Unassigned | |
Bug Description
I have set up a Lucid diskless fat client using ltsp. The root filesystem is aufs; underlying it are a read-write tmpfs branch and a read-only squashfs branch, the latter mounted from NBD.
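For reference, the stack described above can be assembled by hand roughly as follows. This is a sketch only: the server address, port, device, and mount points are placeholders, and in a real LTSP setup the initramfs boot scripts do the equivalent automatically.

```shell
#!/bin/sh
# Sketch of the fat client's root filesystem stack (placeholder paths/addresses).

# Attach the squashfs image exported by the LTSP server over NBD.
nbd-client 192.168.0.1 10809 /dev/nbd0

mkdir -p /mnt/ro /mnt/rw /mnt/root

# Read-only lower branch: the squashfs image on the NBD device.
mount -t squashfs -o ro /dev/nbd0 /mnt/ro

# Writable upper branch: a tmpfs held in client RAM.
mount -t tmpfs tmpfs /mnt/rw

# Union the two with aufs: writes land in the tmpfs branch,
# reads fall through to the squashfs branch.
mount -t aufs -o br=/mnt/rw=rw:/mnt/ro=ro none /mnt/root
```

With this layout, any read that misses the page cache ultimately becomes an NBD request to the server, which is where the blocked squashfs reads in the trace below originate.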
The problem is that the fat clients work fine for a while, but then reproducibly freeze completely within a few hours of booting. The last syslog messages that the server receives are:
Jul 10 14:11:38 beta kernel: [25560.688091] INFO: task cron:2278 blocked for more than 120 seconds.
Jul 10 14:11:38 beta kernel: [25560.688100] "echo 0 > /proc/sys/
Jul 10 14:11:38 beta kernel: [25560.688107] cron D 00006323 0 2278 1 0x00000000
Jul 10 14:11:38 beta kernel: [25560.688118] d549fa0c 00000086 00000080 00006323 00000000 c0847760 d5fb9c2c c0847760
Jul 10 14:11:38 beta kernel: [25560.688135] b0918eaf 0000170c c0847760 c0847760 d5fb9c2c c0847760 c0847760 d6345400
Jul 10 14:11:38 beta kernel: [25560.688151] b08e6975 0000170c d5fb9980 c1d08760 d5fb9980 d549fa58 d549fa1c c058a5ca
Jul 10 14:11:38 beta kernel: [25560.688168] Call Trace:
Jul 10 14:11:38 beta kernel: [25560.688185] [<c058a5ca>] io_schedule+
Jul 10 14:11:38 beta kernel: [25560.688194] [<c022d1f8>] sync_buffer+
Jul 10 14:11:38 beta kernel: [25560.688201] [<c058ad6d>] __wait_
Jul 10 14:11:38 beta kernel: [25560.688207] [<c022d1c0>] ? sync_buffer+
Jul 10 14:11:38 beta kernel: [25560.688214] [<c022d1c0>] ? sync_buffer+
Jul 10 14:11:38 beta kernel: [25560.688220] [<c058ae3b>] out_of_
Jul 10 14:11:38 beta kernel: [25560.688230] [<c0167850>] ? wake_bit_
Jul 10 14:11:38 beta kernel: [25560.688237] [<c022d1be>] __wait_
Jul 10 14:11:38 beta kernel: [25560.688266] [<f80ec30b>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688277] [<c0144f39>] ? load_balance_
Jul 10 14:11:38 beta kernel: [25560.688290] [<f80ecb06>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688304] [<f80ecd18>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688317] [<f80ee488>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688330] [<f80ef0e7>] ? squashfs_
Jul 10 14:11:38 beta kernel: [25560.688340] [<c021cf9e>] ? inode_init_
Jul 10 14:11:38 beta kernel: [25560.688347] [<c021e015>] ? get_new_
Jul 10 14:11:38 beta kernel: [25560.688359] [<f80eea11>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688371] [<f80eee73>] squashfs_
Jul 10 14:11:38 beta kernel: [25560.688384] [<c0212cb5>] __lookup_
Jul 10 14:11:38 beta kernel: [25560.688390] [<c0212e0c>] lookup_
Jul 10 14:11:38 beta kernel: [25560.688411] [<f82038ac>] vfsub_lookup_
Jul 10 14:11:38 beta kernel: [25560.688429] [<f8209a1e>] au_lkup_
Jul 10 14:11:38 beta kernel: [25560.688437] [<c058b577>] ? do_nanosleep+
Jul 10 14:11:38 beta kernel: [25560.688455] [<f8209ce6>] au_do_lookup+
Jul 10 14:11:38 beta kernel: [25560.688476] [<f820a383>] au_lkup_
Jul 10 14:11:38 beta kernel: [25560.688495] [<f82093ad>] ? do_ii_read_
Jul 10 14:11:38 beta kernel: [25560.688541] [<f82102c5>] aufs_lookup+
Jul 10 14:11:38 beta kernel: [25560.688550] [<c058c32d>] ? _spin_lock+0xd/0x10
Jul 10 14:11:38 beta kernel: [25560.688563] [<c021b84b>] ? d_alloc+0x13b/0x190
Jul 10 14:11:38 beta kernel: [25560.688578] [<c0211177>] real_lookup+
Jul 10 14:11:38 beta kernel: [25560.688590] [<c0212bc5>] do_lookup+0x95/0xc0
Jul 10 14:11:38 beta kernel: [25560.688602] [<c02134b3>] __link_
Jul 10 14:11:38 beta kernel: [25560.688616] [<c0101c1d>] ? __switch_
Jul 10 14:11:38 beta kernel: [25560.688628] [<c0213d64>] path_walk+0x54/0xc0
Jul 10 14:11:38 beta kernel: [25560.688640] [<c0213ee9>] do_path_
Jul 10 14:11:38 beta kernel: [25560.688652] [<c0214a31>] user_path_
Jul 10 14:11:38 beta kernel: [25560.688666] [<c016bd46>] ? hrtimer_
Jul 10 14:11:38 beta kernel: [25560.688679] [<c058b577>] ? do_nanosleep+
Jul 10 14:11:38 beta kernel: [25560.688692] [<c016be88>] ? hrtimer_
Jul 10 14:11:38 beta kernel: [25560.688705] [<c020c89a>] vfs_fstatat+
Jul 10 14:11:38 beta kernel: [25560.688717] [<c020c9f0>] vfs_stat+0x20/0x30
Jul 10 14:11:38 beta kernel: [25560.688729] [<c020ca19>] sys_stat64+
Jul 10 14:11:38 beta kernel: [25560.688743] [<c016ad50>] ? hrtimer_
Jul 10 14:11:38 beta kernel: [25560.688755] [<c016bd06>] ? hrtimer_
Jul 10 14:11:38 beta kernel: [25560.688769] [<c015182e>] ? sys_time+0x1e/0x60
Jul 10 14:11:38 beta kernel: [25560.688781] [<c01033ec>] syscall_
These messages appear for different tasks (not just cron), and to me the call traces look identical (I can attach a full set of log messages if needed).
The fat client image was generated by the karmic ltsp tools and then upgraded to Lucid in the chroot.
I am quite willing to help debug this further. Let me know what to do.
ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: ltsp-client (not installed)
ProcVersionSign
Uname: Linux 2.6.32-23-generic i686
Architecture: i386
Date: Sun Jul 11 11:12:25 2010
EcryptfsInUse: Yes
ProcEnviron:
PATH=(custom, user)
LANG=en_US.utf8
SHELL=/bin/bash
SourcePackage: ltsp
Update:
1) If I entirely disable nbd-proxy and add the --persist option to nbd-client, the problem goes away.
2) The problem also seems to arise only if the client has to go through a couple of switches and routers to reach the server; clients plugged into the same switch as the server do not seem affected.
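For anyone wanting to try workaround 1), the direct connection looks roughly like this. This is a sketch: the address, port, and device are placeholders, and in a real LTSP setup the change belongs in the boot scripts that normally start nbd-proxy, not a hand-run command.

```shell
# Connect nbd-client straight to the server (bypassing nbd-proxy) and ask it
# to reconnect on its own if the TCP connection drops. The option is spelled
# --persist here as in this report; some nbd-client versions document it as
# -persist. Address/port/device below are placeholders.
nbd-client --persist 192.168.0.1 10809 /dev/nbd0
```

The effect is that a dropped connection is retried by nbd-client itself instead of leaving outstanding block requests stuck, which matches the observation that the hangs only occur on routed paths where a drop is more likely.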