Vdi corruption with erasure code if n nodes < x

Bug #1367612 reported by sirio81
This bug affects 1 person
Affects: sheepdog
Importance: Critical
Assigned to: Hitoshi Mitake

Bug Description

In short, I imported a vdi on a 3-node cluster with --copies 2:1.
I removed one node, so there was no parity left.
I removed another node, so the vdi was missing half its data.
Note that no guest was running, so there were no I/O operations on that vdi.
I rejoined the second node to the cluster, and the vdi was corrupted.
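
To make the arithmetic behind these steps concrete, here is a minimal sketch of a 2:1 layout: two data strips plus one parity strip, one strip per node. It uses plain XOR parity as a stand-in, which is an assumption for illustration and not necessarily the codec sheepdog uses. The function names are hypothetical. Any single-parity scheme has the same limit: one missing strip can be rebuilt, two cannot.

```python
def encode_2_1(data: bytes) -> list:
    """Split data into 2 data strips and 1 XOR parity strip (illustrative only)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length so both strips match
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]


def decode_2_1(strips: list) -> bytes:
    """Rebuild the data from [d0, d1, parity], where a lost strip is None."""
    missing = [i for i, s in enumerate(strips) if s is None]
    if len(missing) > 1:
        raise ValueError(f"{len(missing)} strips lost, more than 1")
    d0, d1, parity = strips
    if d0 is None:
        d0 = bytes(a ^ b for a, b in zip(d1, parity))
    elif d1 is None:
        d1 = bytes(a ^ b for a, b in zip(d0, parity))
    return d0 + d1


strips = encode_2_1(b"sheepdog")
# One node gone: parity or one data strip is missing, data is still readable.
assert decode_2_1([strips[0], strips[1], None]) == b"sheepdog"
assert decode_2_1([None, strips[1], strips[2]]) == b"sheepdog"
# Two nodes gone: only one strip is left, nothing can be rebuilt.
try:
    decode_2_1([strips[0], None, None])
except ValueError as err:
    print(err)
```

In this model, losing a second node should make the data unreadable, but it should not destroy the strips that remain on disk. The report is about the vdi still being unrecoverable after a node rejoins.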

Sheepdog daemon version 0.8.0_331_gecc533e

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 127 67807424
   1 192.168.10.5:7000 129 84584640
   2 192.168.10.6:7000 129 101361856

dog cluster format -c 2:1
using backend plain store

dog cluster info
Cluster status: running, auto-recovery enabled
Cluster created at Fri Sep 5 11:34:20 2014
Epoch Time Version
2014-09-05 11:34:20 1 [192.168.10.4:7000, 192.168.10.5:7000, 192.168.10.6:7000]

qemu-img convert -f qcow2 test.qcow2 sheepdog:test

dog vdi list
  Name Id Size Used Shared Creation time VDI id Copies Tag
  test 0 5.0 GB 864 MB 0.0 MB 2014-09-05 11:34 7c2b25 2:1

dog vdi check test
finish check&repair test

dog vdi read test | md5sum
8886bddd205a7698a8194594c76e61b5 -

dog node kill 2

In sheep.log
Sep 05 11:38:29 ERROR [rw 6110] read_erasure_object(228) can not read 7c2b250000023d idx 0
Sep 05 11:38:29 INFO [main] recover_object_main(906) object recovery progress 66%
Sep 05 11:38:29 ERROR [rw 6109] read_erasure_object(228) can not read 7c2b2500000060 idx 0
Sep 05 11:38:29 INFO [main] recover_object_main(906) object recovery progress 68%
Sep 05 11:38:30 ERROR [rw 6111] read_erasure_object(228) can not read 7c2b2500000040 idx 1
Sep 05 11:38:30 ERROR [rw 6110] read_erasure_object(228) can not read 7c2b2500000268 idx 1
Sep 05 11:38:30 INFO [main] recover_object_main(906) object recovery progress 69%
Sep 05 11:38:30 ERROR [rw 6110] read_erasure_object(228) can not read 7c2b2500000086 idx 0
Sep 05 11:38:30 ERROR [rw 6109] read_erasure_object(228) can not read 7c2b250000002b idx 0
Sep 05 11:38:30 INFO [main] recover_object_main(906) object recovery progress 71%
Sep 05 11:38:30 INFO [main] recover_object_main(906) object recovery progress 72%
Sep 05 11:38:30 ERROR [rw 6110] read_erasure_object(228) can not read 7c2b2500000080 idx 0
Sep 05 11:38:30 ERROR [rw 6012] read_erasure_object(228) can not read 7c2b2500000245 idx 0
Sep 05 11:38:30 INFO [main] recover_object_main(906) object recovery progress 73%
Sep 05 11:38:30 INFO [main] recover_object_main(906) object recovery progress 75%
Sep 05 11:38:30 ERROR [rw 6109] read_erasure_object(228) can not read 7c2b2500000232 idx 0
Sep 05 11:38:30 ERROR [rw 6111] read_erasure_object(228) can not read 7c2b2500000020 idx 0
Sep 05 11:38:30 ERROR [rw 6012] read_erasure_object(228) can not read 7c2b2500000200 idx 0
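
For anyone triaging a similar log, the following is a rough sketch of how the failed reads could be summarized. The regular expression is written against the line format shown above only; other sheep versions may log differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the ERROR lines shown above, e.g.
# "... read_erasure_object(228) can not read 7c2b250000023d idx 0"
READ_FAIL = re.compile(
    r"read_erasure_object\(\d+\) can not read ([0-9a-f]+) idx (\d+)"
)


def summarize_read_failures(log_text: str):
    """Return (set of object ids, Counter keyed by strip index) for failed reads."""
    objects = set()
    by_idx = Counter()
    for line in log_text.splitlines():
        m = READ_FAIL.search(line)
        if m:
            objects.add(m.group(1))
            by_idx[int(m.group(2))] += 1
    return objects, by_idx


# Three lines copied from the log excerpt in this report.
sample = """\
Sep 05 11:38:29 ERROR [rw 6110] read_erasure_object(228) can not read 7c2b250000023d idx 0
Sep 05 11:38:29 INFO [main] recover_object_main(906) object recovery progress 66%
Sep 05 11:38:30 ERROR [rw 6111] read_erasure_object(228) can not read 7c2b2500000040 idx 1
"""
objects, by_idx = summarize_read_failures(sample)
print(sorted(objects), dict(by_idx))
```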

dog vdi check test
ABORT: Not enough active zones for consistency-checking erasure coded VDI

dog vdi read test | md5sum
8886bddd205a7698a8194594c76e61b5 -

dog node kill 1

re-run sheep on node id 1

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 127 67807424
   1 192.168.10.5:7000 129 84584640

dog vdi list
  Name Id Size Used Shared Creation time VDI id Copies Tag
  test 0 5.0 GB 864 MB 0.0 MB 2014-09-05 11:34 7c2b25 2:1

dog vdi read test | md5sum
Failed to read object 7c2b2500000021 No object found
Failed to read VDI
1511e25cedc8ed4e540649744e8809ec -

re-run sheep on node id 2
(I had to empty /var/lib/sheepdog/; otherwise it couldn't join the cluster.)

dog node list
  Id Host:Port V-Nodes Zone
   0 192.168.10.4:7000 127 67807424
   1 192.168.10.5:7000 129 84584640
   2 192.168.10.6:7000 129 101361856

dog vdi check test > /tmp/check.log 2>&1
less /tmp/check.log

...
failed to rebuild object 7c2b250000004e. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000052. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000054. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000055. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000056. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000058. 2 copies get lost, more than 1
failed to rebuild object 7c2b250000005c. 2 copies get lost, more than 1
failed to rebuild object 7c2b250000005e. 2 copies get lost, more than 1
failed to rebuild object 7c2b250000005f. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000066. 2 copies get lost, more than 1
failed to rebuild object 7c2b250000006c. 2 copies get lost, more than 1
failed to rebuild object 7c2b250000006e. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000074. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000075. 2 copies get lost, more than 1
...
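
The check output can be turned into a list of affected objects with a short script. This is a sketch under two assumptions: the line format is exactly the one shown above, and a data object id is the VDI id followed by a 32-bit object index, which the ids in this report suggest (they all start with the VDI id 7c2b25). The helper names are hypothetical.

```python
import re

# Matches lines such as:
# "failed to rebuild object 7c2b250000004e. 2 copies get lost, more than 1"
REBUILD_FAIL = re.compile(
    r"failed to rebuild object ([0-9a-f]+)\. (\d+) copies get lost, more than (\d+)"
)


def split_object_id(oid_hex: str):
    """Split an object id into (vdi id, object index).

    Assumption: the low 32 bits are the object index and the bits above
    them are the VDI id, as the ids in this report suggest.
    """
    oid = int(oid_hex, 16)
    return format(oid >> 32, "x"), oid & 0xFFFFFFFF


def unrecoverable_objects(check_log: str):
    """Yield (vdi id, object index, lost, tolerated) for each failed rebuild."""
    for m in REBUILD_FAIL.finditer(check_log):
        vdi, index = split_object_id(m.group(1))
        yield vdi, index, int(m.group(2)), int(m.group(3))


# Two lines copied from the check output in this report.
sample = """\
failed to rebuild object 7c2b250000004e. 2 copies get lost, more than 1
failed to rebuild object 7c2b2500000052. 2 copies get lost, more than 1
"""
for row in unrecoverable_objects(sample):
    print(row)
```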

Changed in sheepdog-project:
importance: Undecided → Critical
Changed in sheepdog-project:
assignee: nobody → Hitoshi Mitake (mitake-hitoshi)
Changed in sheepdog-project:
status: New → Fix Released
status: Fix Released → Fix Committed