New nbd-client hangs when connecting a second time to a server

Bug #711951 reported by Stéphane Graber
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Andy Whitcroft
nbd (Ubuntu)

Bug Description

At boot time LTSP connects using NBD to get its root device.

Then, when the login prompt is shown, it checks for a new version of its root device by opening a second connection and checking the first few bytes. That part is now broken in Natty, possibly since the latest nbd update (2.9.16).

nbd-client now hangs in I/O wait (D state) indefinitely, blocking the login prompt from appearing.
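
For illustration only, here is a minimal sketch of that kind of check, assuming it amounts to comparing the leading bytes read through the two devices. The function name `leading_bytes_match` and the 512-byte default are hypothetical and are not taken from LTSP itself.

```python
def leading_bytes_match(path_a, path_b, count=512):
    """Return True if the first `count` bytes of both files or devices are identical.

    Hypothetical helper: the real LTSP check may read a different number
    of bytes or compare them in another way.
    """
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return a.read(count) == b.read(count)
```

With the devices from this report, the call would look like `leading_bytes_match("/dev/nbd0", "/dev/nbd1")`. With this bug present, the second connection never completes, so a check like this would not get as far as reading.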

Changed in nbd (Ubuntu):
importance: Undecided → High
Andy Whitcroft (apw) wrote :

@Jonathan -- as you can see that nbd-client hangs, I assume you have shell access when it's broken. If so, can you strace the nbd-client to see what it was trying to do when it got stuck, i.e. what its last system call is? Also, is anything emitted into dmesg related to the hang? Please wait long enough for the 120 second 'it's stuck' timer to fire and report issues.

As this is hanging with the client in D state, there may be a kernel component here. I have had a look and little has changed since Maverick (where I am assuming it worked). There has been some locking work for BKL removal which may be related; difficult to say.

tags: added: kernel-key natty
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Stéphane Graber (stgraber) wrote :

root@ltsp-natty-i386:/home# nbd-client localhost 2000 /dev/nbd0
Negotiation: ..size = 483060KB
bs=1024, sz=483060
root@ltsp-natty-i386:/home# nbd-client localhost 2000 /dev/nbd1
Negotiation: ..size = 483060KB

So first connect works, second hangs.

root@ltsp-natty-i386:/home# ps aux | grep nbd-client
root 10936 0.0 0.0 1884 84 ? S 17:36 0:00 nbd-client localhost 2000 /dev/nbd0
root 10938 0.0 0.0 1884 648 pts/4 D+ 17:36 0:00 nbd-client localhost 2000 /dev/nbd1
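
To spot such processes without reading the listing by eye, the `ps aux` output can be filtered on the STAT column. This is a small sketch, assuming the standard `ps aux` column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the function name `uninterruptible_processes` is hypothetical.

```python
def uninterruptible_processes(ps_output):
    """Return (pid, command) pairs for processes whose STAT starts with 'D'.

    Assumes `ps aux` formatting: STAT is the 8th column and COMMAND is
    everything from the 11th column onward.
    """
    blocked = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        # Skip the header row and any line that is too short to parse.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        if fields[7].startswith("D"):
            blocked.append((int(fields[1]), fields[10]))
    return blocked
```

Fed the two lines above, it reports only PID 10938, the second nbd-client, since the first one is in state S.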

stracing a third nbd-client shows that it hangs at:

Full strace:

I also get the following in dmesg:
[ 4891.262725] nbd: registered device at major 43
[ 4916.372069] nbd0: unknown partition table
[ 5040.140230] INFO: task nbd-client:10938 blocked for more than 120 seconds.
[ 5040.140244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5040.140247] nbd-client D c06068a0 0 10938 4549 0x00000004
[ 5040.140262] f6fd7e80 00000086 c100d860 c06068a0 f58068c0 c06068a0 c100daec c09108c0
[ 5040.140268] f35e5e84 00000478 c100dae8 c09108c0 c09108c0 f58068c0 c100d860 c100cbc0
[ 5040.140273] 00000000 00000000 00000000 00000086 f6fd7f04 f6fbe910 00000000 f6fd7e54
[ 5040.140279] Call Trace:
[ 5040.140357] [<c0149bf0>] ? default_wake_function+0x10/0x20
[ 5040.140372] [<c0135fe8>] ? __wake_up_common+0x48/0x70
[ 5040.140411] [<c05f6116>] __mutex_lock_slowpath+0xd6/0x140
[ 5040.140449] [<c0343af1>] ? apparmor_capable+0x21/0x70
[ 5040.140453] [<c05f5c85>] mutex_lock+0x25/0x40
[ 5040.140458] [<f8073080>] nbd_ioctl+0x50/0x190 [nbd]
[ 5040.140462] [<c05f729f>] ? _raw_spin_lock_irqsave+0x2f/0x50
[ 5040.140466] [<f8073030>] ? nbd_ioctl+0x0/0x190 [nbd]
[ 5040.140476] [<c035bf44>] blkdev_ioctl+0x244/0x820
[ 5040.140507] [<c0252fc9>] ? fsnotify+0x199/0x290
[ 5040.140539] [<c03f651d>] ? tty_ldisc_deref+0xd/0x10
[ 5040.140564] [<c024d2ff>] block_ioctl+0x3f/0x50
[ 5040.140567] [<c024d2c0>] ? block_ioctl+0x0/0x50
[ 5040.140578] [<c023037b>] do_vfs_ioctl+0x7b/0x2e0
[ 5040.140582] [<c03efcd0>] ? tty_write+0x0/0x200
[ 5040.140586] [<c0230667>] sys_ioctl+0x87/0x90
[ 5040.140589] [<c05f7524>] syscall_call+0x7/0xb
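
When reading a trace like this one, entries marked with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so the unmarked entries give the more reliable picture. The sketch below pulls out just those; the name `reliable_frames` and the regular expression are illustrative assumptions about this dmesg format, not part of any kernel tooling.

```python
import re

# Matches "[<c05f5c85>] mutex_lock+0x25/0x40" and the "? symbol+0x..." variant.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x")

def reliable_frames(dmesg_text):
    """Return symbol names of call-trace entries that are not marked with '?'."""
    frames = []
    for line in dmesg_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames
```

Applied to the trace above and read from the bottom up, the unmarked entries are syscall_call, sys_ioctl, do_vfs_ioctl, block_ioctl, blkdev_ioctl, nbd_ioctl, mutex_lock and __mutex_lock_slowpath, i.e. the task is waiting on a mutex taken inside nbd_ioctl.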

Stéphane Graber (stgraber) wrote :

As discussed in #ubuntu-release, I tried with maverick's client and had the same issue.
I'll now test with both nbd-client and nbd-server from maverick but will probably get the same result.

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Andy Whitcroft (apw)
Andy Whitcroft (apw) wrote :

That does indeed look pretty much like a kernel locking issue if it's fired that 120s timer. I've spun some test kernels with some lock debugging inserted to give us some more info. Could you test the kernels at the URL below and report back here:

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Changed in nbd (Ubuntu):
status: New → Confirmed