qemu-nbd causes the cluster to hang
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
sheepdog | New | Undecided | Unassigned |
Bug Description
Writing data to an nbd0 device causes sheepdog to hang.
'dog vdi list' doesn't work on any of the nodes.
'dog node list' hangs only on the node using nbd.
If no guests are running, nothing is shown in sheep.log.
The node running nbd doesn't get disconnected from the cluster.
If guests are running (on any node), they are unable to issue I/O requests, and errors appear in sheep.log [1] and /var/log/messages [2].
The only way to fix the situation is to reboot the node running nbd.
If the destination vdi uses erasure coding, the hang occurs when writing a file larger than 1 GB.
Note: the same procedure does not cause problems when using qemu-nbd on a regular file instead of a sheepdog vdi.
QEMU emulator version 1.7.50, Copyright (c) 2003-2008 Fabrice Bellard
Sheepdog daemon version 0.8.0_31_gf2a00e1
Linux pbds 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux (wheezy)
How to reproduce:
dog vdi list | grep test
test 0 25 GB 512 MB 0.0 MB 2014-02-12 14:22 7c2b25 2:1
qemu-nbd -c /dev/nbd0 sheepdog:test
dd if=bigfile of=/dev/nbd0 bs=1M
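As noted above, the same write pattern does not hang when the target is a regular file instead of a sheepdog vdi. A minimal sketch of that control case, runnable without a cluster (the file names `bigfile` and `target.img` here are local throwaway names, not the reporter's originals, and the small size only exercises the I/O path, not the >1 GB hang):

```shell
# Control case: identical dd pattern against a plain file, no qemu-nbd involved.
# Create a small local test file in place of the reporter's 'bigfile'.
dd if=/dev/zero of=bigfile bs=1M count=4 2>/dev/null
# Same copy pattern as the repro step, but writing to a regular file.
dd if=bigfile of=target.img bs=1M 2>/dev/null
# Confirm the copy completed at the expected size (4 MiB).
stat -c %s target.img
```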
[1] sheep.log (/var/sheep/)
[266904.388934] nbd: registered device at major 43
[266947.188736] block nbd0: NBD_DISCONNECT
[266947.189041] block nbd0: Unexpected reply (ffff8800084a3c58)
[266947.189309] block nbd0: queue cleared
[267212.302371] block nbd0: NBD_DISCONNECT
[267212.302655] block nbd0: Unexpected reply (ffff880004df9c58)
[267212.368741] block nbd0: queue cleared
[267741.369185] block nbd0: NBD_DISCONNECT
[267741.369451] block nbd0: Unexpected reply (ffff880004d4fc58)
[267741.618210] block nbd0: queue cleared
[268030.252831] block nbd0: NBD_DISCONNECT
[268030.253102] block nbd0: Unexpected reply (ffff880006b01c58)
[268030.506146] block nbd0: queue cleared
[605094.364246] block nbd0: NBD_DISCONNECT
[605094.364655] block nbd0: Unexpected reply (ffff880005707c58)
[605094.364884] block nbd0: queue cleared
[605151.748821] block nbd0: NBD_DISCONNECT
[605151.749114] block nbd0: Unexpected reply (ffff880074081c58)
[605151.750838] block nbd0: queue cleared
[605366.036589] block nbd0: NBD_DISCONNECT
[605366.036882] block nbd0: Unexpected reply (ffff880074171c58)
[605366.038608] block nbd0: queue cleared
[608189.537841] block nbd0: NBD_DISCONNECT
[608189.946757] block nbd0: Unexpected reply (ffff880006bddc58)
[608190.198368] block nbd0: queue cleared
[608246.316551] block nbd0: NBD_DISCONNECT
[608246.316849] block nbd0: Unexpected reply (ffff880020db3c58)
[608246.577224] block nbd0: queue cleared
[608318.444758] block nbd0: NBD_DISCONNECT
[608318.445048] block nbd0: Unexpected reply (ffff88004d5b9c58)
[608318.701296] block nbd0: queue cleared
[608395.089149] block nbd0: NBD_DISCONNECT
[608395.089434] block nbd0: Unexpected reply (ffff8800134efc58)
[608395.346413] block nbd0: queue cleared
[608427.549061] block nbd0: NBD_DISCONNECT
[608427.549333] block nbd0: Unexpected reply (ffff880005091c58)
[608427.808279] block nbd0: queue cleared
[608514.375923] block nbd0: NBD_DISCONNECT
[608514.376223] block nbd0: Unexpected reply (ffff8800440afc58)
[608514.378241] block nbd0: queue cleared
[608792.236539] block nbd0: NBD_DISCONNECT
[608792.236832] block nbd0: Unexpected reply (ffff8800440afc58)
[608792.238694] block nbd0: queue cleared
[608870.784907] block nbd0: NBD_DISCONNECT
[608870.785199] block nbd0: Unexpected reply (ffff880050cf7c58)
[608870.786924] block nbd0: queue cleared
[608966.753806] block nbd0: NBD_DISCONNECT
[608968.083637] block nbd0: Unexpected reply (ffff880045dbfc58)
[608968.337105] block nbd0: queue cleared
[611601.621622] block nbd0: NBD_DISCONNECT
[611601.621901] block nbd0: Unexpected reply (ffff8800295f1c58)
[611601.879094] block nbd0: queue cleared
[612807.021413] block nbd0: NBD_DISCONNECT
[612807.021707] block nbd0: Unexpected reply (ffff8800247b9c58)
[612807.280124] block nbd0: queue cleared
[612859.680467] block nbd0: NBD_DISCONNECT
[612859.680757] block nbd0: Unexpected reply (ffff880045c31c58)
[612859.936098] block nbd0: queue cleared
[612911.461513] block nbd0: NBD_DISCONNECT
[612911.461816] block nbd0: Unexpected reply (ffff88000f5adc58)
[612911.719748] block nbd0: queue cleared
[614041.648259] INFO: task sheep:25482 blocked for more than 120 seconds.
[614041.648399] "echo 0 > /proc/sys/
[614041.648551] sheep D ffff880077c13780 0 25482 1 0x00000000
[614041.648563] ffff880072b1a080 0000000000000082 ffffffff810b938f ffffffff8160d020
[614041.648576] 0000000000013780 ffff880044151fd8 ffff880044151fd8 ffff880072b1a080
[614041.648586] ffff880044151778 ffff880044151778 ffff88003740b8a0 ffff880077c13fd0
[614041.648595] Call Trace:
[614041.648611] [<ffffffff810b9
[614041.648623] [<ffffffff810b4
[614041.648632] [<ffffffff8134d
[614041.648641] [<ffffffff810b4
[614041.648649] [<ffffffff8134d
[614041.648658] [<ffffffff810b4
[614041.648668] [<ffffffff8105f
[614041.648678] [<ffffffff810c2
[614041.648689] [<ffffffff810c9
[614041.648698] [<ffffffff810ca
[614041.648706] [<ffffffff810c0
[614041.648715] [<ffffffff810c3
[614041.648724] [<ffffffff810c3
[614041.648734] [<ffffffff810c3
[614041.648743] [<ffffffff810c4
[614041.648752] [<ffffffff81036
[614041.648761] [<ffffffff810bb
[614041.648770] [<ffffffff8134d
[614041.648779] [<ffffffff8103e
[614041.648789] [<ffffffff810e5
[614041.648798] [<ffffffff8103e
[614041.648807] [<ffffffff810f1
[614041.648817] [<ffffffff81351
[614041.648827] [<ffffffff810d6
[614041.648837] [<ffffffff8134f
[614041.648854] INFO: task flush-43:0:2797 blocked for more than 120 seconds.
Feb 03 17:09:12 WARN [gway 10579] wait_forward_
Feb 03 17:09:12 WARN [gway 11263] wait_forward_
Feb 03 17:09:12 WARN [gway 11262] wait_forward_
Feb 03 17:09:12 WARN [gway 11261] wait_forward_
...
Feb 03 17:15:17 ERROR [gway 11686] do_write(281) failed to write to socket: Resource temporarily unavailable
Feb 03 17:15:17 ERROR [gway 11686] send_req(319) failed to send request a5, 262144: Resource temporarily unavailable
...
Feb 03 17:15:17 WARN [gway 11261] wait_forward_
Feb 03 17:15:17 WARN [gway 11684] wait_forward_
Feb 03 17:15:17 WARN [gway 11262] wait_forward_
...
Feb 03 17:29:43 ERROR [gway 11684] do_write(281) failed to write to socket: Resource temporarily unavailable
Feb 03 17:29:43 ERROR [gway 11684] send_req(319) failed to send request a5, 32768: Resource temporarily unavailable
Feb 03 17:29:46 ERROR [gway 11686] do_write(281) failed to write to socket: Resource temporarily unavailable
Feb 03 17:29:46 ERROR [gway 11686] send_req(319) failed to send request a5, 32768: Resource temporarily unavailable
Feb 03 17:30:02 ERROR [gway 11683] do_write(281) failed to write to socket: Resource temporarily unavailable
Feb 03 17:30:02 ERROR [gway 11683] send_req(319) failed to send request a5, 262144: Resource temporarily unavailable
[2] /var/log/messages
Feb 12 14:56:55 test004 kernel: [610801.624515] sheep D ffff880077c13780 0 3151 1 0x00000000
Feb 12 14:56:55 test004 kernel: [610801.624528] ffff88003773a280 0000000000000082 ffffffff00000000 ffffffff8160d020
Feb 12 14:56:55 test004 kernel: [610801.624539] 0000000000013780 ffff880073d9bfd8 ffff880073d9bfd8 ffff88003773a280
Feb 12 14:56:55 test004 kernel: [610801.624549] ffff880073d9b778 0000000173d9b778 ffff880074489980 ffff880077c13fd0
Feb 12 14:56:55 test004 kernel: [610801.624559] Call Trace:
Feb 12 14:56:55 test004 kernel: [610801.624575] [<ffffffff810b4
Feb 12 14:56:55 test004 kernel: [610801.624586] [<ffffffff8134d
Feb 12 14:56:55 test004 kernel: [610801.624596] [<ffffffff810b4
Feb 12 14:56:55 test004 kernel: [610801.624603] [<ffffffff8134d
Feb 12 14:56:55 test004 kernel: [610801.624612] [<ffffffff810b4
Feb 12 14:56:55 test004 kernel: [610801.624623] [<ffffffff8105f
Feb 12 14:56:55 test004 kernel: [610801.624633] [<ffffffff810c2
Feb 12 14:56:55 test004 kernel: [610801.624644] [<ffffffff810c9
Feb 12 14:56:55 test004 kernel: [610801.624653] [<ffffffff810ca
Feb 12 14:56:55 test004 kernel: [610801.624661] [<ffffffff810c0
Feb 12 14:56:55 test004 kernel: [610801.624670] [<ffffffff810c3
Feb 12 14:56:55 test004 kernel: [610801.624680] [<ffffffff810bd
Feb 12 14:56:55 test004 kernel: [610801.624689] [<ffffffff810c3
Feb 12 14:56:55 test004 kernel: [610801.624699] [<ffffffff810c3
Feb 12 14:56:55 test004 kernel: [610801.624708] [<ffffffff810c4
Feb 12 14:56:55 test004 kernel: [610801.624716] [<ffffffff81036
Feb 12 14:56:55 test004 kernel: [610801.624726] [<ffffffff810bb
Feb 12 14:56:55 test004 kernel: [610801.624735] [<ffffffff811af
Feb 12 14:56:55 test004 kernel: [610801.624745] [<ffffffff8103e
Feb 12 14:56:55 test004 kernel: [610801.624754] [<ffffffff810e5
Feb 12 14:56:55 test004 kernel: [610801.624764] [<ffffffff8103e
Feb 12 14:56:55 test004 kernel: [610801.624773] [<ffffffff810f1
Feb 12 14:56:55 test004 kernel: [610801.624783] [<ffffffff81351
Feb 12 14:56:55 test004 kernel: [610801.624793] [<ffffffff810d6
Feb 12 14:56:55 test004 kernel: [610801.624803] [<ffffffff8134f
I have been repeating the test on a new vdi with --copies 2 and it never hung.
So this issue is related to erasure-coded vdis.
dog vdi list
testc2 0 25 GB 25 GB 0.0 MB 2014-02-12 16:05 cdab40 2
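For reference, the two vdi flavours compared above (the erasure-coded `test` with redundancy 2:1 and the replicated `testc2` with 2 copies) would be created along these lines; the exact `-c` values and sizes are taken from the listings above, but this is an illustrative sketch requiring a running sheepdog cluster:

```shell
# Illustrative only: assumes a running sheepdog cluster.
# Replicated vdi (2 full copies) -- the case that never hung:
dog vdi create -c 2 testc2 25G
# Erasure-coded vdi (2 data strips, 1 parity strip) -- the case that hangs:
dog vdi create -c 2:1 test 25G
```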