Recovery with EC fails

Bug #1474766 reported by sirio81 on 2015-07-15
Affects: sheepdog
Importance: Critical
Assigned to: Unassigned

Bug Description

The cluster consists of four nodes:

   0 192.168.2.44:7000 3 738371776
   1 192.168.2.45:7000 3 755148992
   2 192.168.2.46:7000 3 771926208
   3 192.168.2.47:7000 3 788703424
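
The report does not say how the cluster was formatted. Given the EC in the title, it was presumably created with an erasure-coded redundancy scheme, roughly along these lines (a 2:2 data:parity layout is one plausible choice for a four-node cluster and is assumed here, not taken from the report):

dog cluster format -c 2:2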

1) I create a 10 GB test VDI, preallocated:

dog vdi create -P test 10G
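
The -P option preallocates the data objects, so the new VDI and its allocated size can be checked right after creation with:

dog vdi list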

2) I write to it constantly, also during recovery.
    Note: because my CPU is slow and I'm reading from /dev/urandom, the write speed doesn't exceed 3 MB/s.

dd if=/dev/urandom bs=1M count=2048 | dog vdi write test
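
Piping /dev/urandom straight into the VDI leaves nothing to compare against afterwards. A variant of this step that keeps a reference copy for checking the data after recovery might look like this (the /tmp/test.img path is only an illustration):

dd if=/dev/urandom of=/tmp/test.img bs=1M count=2048
dog vdi write test < /tmp/test.img
# after recovery, compare the first 2 GiB of the VDI against the reference copy
dog vdi read test 0 $((2048*1024*1024)) | cmp - /tmp/test.img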

3) I kill the 4th node with 'dog node kill 3'. I get many warning messages, but the recovery completes successfully:

Jul 15 10:16:26 WARN [rw 6866] read_erasure_object(233) can not read 7c2b250000082b idx 0
Jul 15 10:16:26 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000082d idx 0
Jul 15 10:16:26 WARN [rw 6865] read_erasure_object(233) can not read 7c2b250000082e idx 1

4) I kill the 3rd node with 'dog node kill 2'. Again I get warning messages, and the recovery completes successfully:

Jul 15 10:37:37 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000179 idx 0
Jul 15 10:37:37 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000017b idx 1
Jul 15 10:37:37 WARN [rw 6865] read_erasure_object(233) can not read 7c2b2500000180 idx 1

5) I bring node id 2 back into the cluster (note that I "clean" the node first with 'rm -r /var/lib/sheepdog; rm -r /mnt/sheep/0'; a sketch of the restart command follows the log excerpt below).
During recovery, I get 'No object found' messages:

Jul 15 10:39:05 WARN [rw 6867] read_erasure_object(233) can not read 7c2b2500000008 idx 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
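
The report does not show the exact command used to restart the cleaned node. Assuming /mnt/sheep/0 is the store directory and the default cluster driver is used, the rejoin would be something along the lines of (any options beyond the port are left out and would depend on the local setup):

sheep -p 7000 /mnt/sheep/0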

6) I bring node id 3 back into the cluster as well. Recovery starts, but then fails with 'diskfull' errors:

Jul 15 10:38:28 NOTICE [main] cluster_recovery_completion(781) all nodes are recovered, epoch 3
Jul 15 10:39:05 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000006 idx 2
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
...
Jul 15 10:40:26 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 3
Jul 15 10:40:58 INFO [main] recover_object_main(948) object recovery progress 1%
Jul 15 10:40:59 INFO [main] recover_object_main(948) object recovery progress 2%
...
Jul 15 10:41:25 INFO [main] recover_object_main(948) object recovery progress 54%
Jul 15 10:41:26 INFO [main] recover_object_main(948) object recovery progress 55%
Jul 15 10:41:26 ERROR [rw 6832] err_to_sderr(74) diskfull, oid=7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6832] recover_object_work(584) failed to recover object 7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6986] err_to_sderr(74) diskfull, oid=7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6986] recover_object_work(584) failed to recover object 7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6865] err_to_sderr(74) diskfull, oid=7c2b250000057f
Jul 15 10:41:26 ERROR [rw 6865] recover_object_work(584) failed to recover object 7c2b250000057f

Changed in sheepdog-project:
importance: Undecided → Critical