New nbd-client hangs when connecting a second time to a server

Bug #711951 reported by Stéphane Graber on 2011-02-02

This bug report is a duplicate of: Bug #700165: qemu-nbd kthread becomes defunct on disconnect.

This bug affects 5 people

Affects          Status      Importance  Assigned to     Milestone
linux (Ubuntu)   Incomplete  High        Andy Whitcroft
nbd (Ubuntu)     Confirmed   High        Unassigned

Bug Description

At boot time LTSP connects using NBD to get its root device.

Then, when the login prompt is due to be shown, it checks for a new version of its root device by opening a second connection and checking the first few bytes. That part is now broken in Natty, possibly since the latest nbd update (2.9.16).

nbd-client now hangs indefinitely in I/O wait (D state), which blocks the login prompt from appearing.
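
For illustration only, the "check the first few bytes" step described above might look roughly like the sketch below. The function name, the 512-byte window, and the device paths are assumptions made for this example; the actual LTSP check is not shown in this report and may work differently.

```python
def image_changed(current_dev, fresh_dev, nbytes=512):
    """Return True if the first nbytes of the two devices differ.

    current_dev: the device the client booted from (assumed /dev/nbd0)
    fresh_dev:   a second connection to the same export (assumed /dev/nbd1)
    nbytes:      size of the compared header; 512 is an arbitrary choice
    """
    with open(current_dev, "rb") as cur, open(fresh_dev, "rb") as new:
        return cur.read(nbytes) != new.read(nbytes)


# Hypothetical usage on a thin client:
#   if image_changed("/dev/nbd0", "/dev/nbd1"):
#       print("root image was updated on the server")
```

On an affected system this comparison is never reached, because setting up the second device is where nbd-client blocks.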

Jonathan Carter (jonathan) on 2011-02-02

Changed in nbd (Ubuntu):
importance:	Undecided → High

Andy Whitcroft (apw) wrote on 2011-02-02:

@Jonathan -- as you can see that nbd-client hangs, I assume you have shell access when it's broken. If so, can you strace the nbd-client to see what it was trying to do when it got stuck, i.e. what its last system call is? Also, is anything emitted into dmesg related to the hang? Please wait long enough for the 120-second 'it's stuck' timer to fire and report issues.

As this is hanging with the client in D state, there may be a kernel component here. I have had a look and little has changed since Maverick (where I am assuming it worked). There has been some locking work for BKL removal which may be related; difficult to say.

tags:	added: kernel-key natty
Changed in linux (Ubuntu):
importance:	Undecided → High
status:	New → Incomplete

Stéphane Graber (stgraber) wrote on 2011-02-02:

root@ltsp-natty-i386:/home# nbd-client localhost 2000 /dev/nbd0
Negotiation: ..size = 483060KB
bs=1024, sz=483060
root@ltsp-natty-i386:/home# nbd-client localhost 2000 /dev/nbd1
Negotiation: ..size = 483060KB

So first connect works, second hangs.

root@ltsp-natty-i386:/home# ps aux | grep nbd-client
root 10936 0.0 0.0 1884 84 ? S 17:36 0:00 nbd-client localhost 2000 /dev/nbd0
root 10938 0.0 0.0 1884 648 pts/4 D+ 17:36 0:00 nbd-client localhost 2000 /dev/nbd1
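
A small helper can pick the stuck process out of `ps aux` output like the above. This is a minimal sketch that assumes the standard 11-column `ps aux` layout; the function name is made up for this example.

```python
from typing import List, Tuple


def uninterruptible(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    hits = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        stat = fields[7]
        if stat.startswith("D"):  # D = uninterruptible sleep, usually I/O
            hits.append((int(fields[1]), fields[10]))
    return hits
```

Fed the two lines above, it reports only PID 10938, the client attached to /dev/nbd1.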

stracing a third nbd-client shows that it hangs at:
ioctl(3, NBD_SET_BLKSIZE

Full strace: http://paste.ubuntu.com/561493/

I also get the following in dmesg:
[ 4891.262725] nbd: registered device at major 43
[ 4916.372069] nbd0: unknown partition table
[ 5040.140230] INFO: task nbd-client:10938 blocked for more than 120 seconds.
[ 5040.140244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5040.140247] nbd-client D c06068a0 0 10938 4549 0x00000004
[ 5040.140262] f6fd7e80 00000086 c100d860 c06068a0 f58068c0 c06068a0 c100daec c09108c0
[ 5040.140268] f35e5e84 00000478 c100dae8 c09108c0 c09108c0 f58068c0 c100d860 c100cbc0
[ 5040.140273] 00000000 00000000 00000000 00000086 f6fd7f04 f6fbe910 00000000 f6fd7e54
[ 5040.140279] Call Trace:
[ 5040.140357] [<c0149bf0>] ? default_wake_function+0x10/0x20
[ 5040.140372] [<c0135fe8>] ? __wake_up_common+0x48/0x70
[ 5040.140411] [<c05f6116>] __mutex_lock_slowpath+0xd6/0x140
[ 5040.140449] [<c0343af1>] ? apparmor_capable+0x21/0x70
[ 5040.140453] [<c05f5c85>] mutex_lock+0x25/0x40
[ 5040.140458] [<f8073080>] nbd_ioctl+0x50/0x190 [nbd]
[ 5040.140462] [<c05f729f>] ? _raw_spin_lock_irqsave+0x2f/0x50
[ 5040.140466] [<f8073030>] ? nbd_ioctl+0x0/0x190 [nbd]
[ 5040.140476] [<c035bf44>] blkdev_ioctl+0x244/0x820
[ 5040.140507] [<c0252fc9>] ? fsnotify+0x199/0x290
[ 5040.140539] [<c03f651d>] ? tty_ldisc_deref+0xd/0x10
[ 5040.140564] [<c024d2ff>] block_ioctl+0x3f/0x50
[ 5040.140567] [<c024d2c0>] ? block_ioctl+0x0/0x50
[ 5040.140578] [<c023037b>] do_vfs_ioctl+0x7b/0x2e0
[ 5040.140582] [<c03efcd0>] ? tty_write+0x0/0x200
[ 5040.140586] [<c0230667>] sys_ioctl+0x87/0x90
[ 5040.140589] [<c05f7524>] syscall_call+0x7/0xb
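
When collecting this kind of report from several machines, the hung-task lines can be extracted from saved dmesg text with a short script. This is a sketch only: the regular expression is written against the message format shown above, and the function name is made up for this example.

```python
import re

# Matches e.g. "INFO: task nbd-client:10938 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(dmesg_text: str):
    """Return a list of (command, pid, seconds) for each hung-task report."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(dmesg_text)
    ]
```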

Stéphane Graber (stgraber) wrote on 2011-02-02:

As discussed in #ubuntu-release, I tried with maverick's client and had the same issue.
I'll now test with both nbd-client and nbd-server from maverick but will probably get the same result.

Andy Whitcroft (apw) on 2011-02-02

Changed in linux (Ubuntu):
status:	Incomplete → In Progress
assignee:	nobody → Andy Whitcroft (apw)

Andy Whitcroft (apw) wrote on 2011-02-02:

That does indeed look pretty much like a kernel locking issue if it's fired that 120s timer. I've spun some test kernels with some lock debugging inserted to give us some more info. Could you test the kernels at the URL below and report back here:

http://people.canonical.com/~apw/lp711951-natty/