Recovery with EC fails

Bug #1474766 reported by sirio81
Affects: sheepdog
Status: New
Importance: Critical
Assigned to: Unassigned

Bug Description

The cluster has four nodes ('dog node list' output; the header row was lost in the original, and the columns appear to be Id, Host:Port, V-Nodes and Zone):

   Id   Host:Port           V-Nodes   Zone
    0   192.168.2.44:7000      3      738371776
    1   192.168.2.45:7000      3      755148992
    2   192.168.2.46:7000      3      771926208
    3   192.168.2.47:7000      3      788703424

1) I create a preallocated test vdi:

dog vdi create -P test 10G
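
Note that the erasure-coding scheme itself is chosen when the cluster is formatted, not at vdi creation, and the report does not show the format command. On a four-node cluster a plausible setup (an assumption on my part, not confirmed by the report) is two data strips plus two parity strips:

# Hypothetical: format the cluster with 2 data + 2 parity erasure coding.
# The actual scheme used by the reporter is not shown in the report.
dog cluster format --copies=2:2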

2) I write to it continuously, including during the recoveries below.
    Note: because my CPU is slow and the data comes from /dev/urandom, the write rate never exceeds about 3 MB/s.

dd if=/dev/urandom bs=1M count=2048 | dog vdi write test
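
If the reproduction is too slow, a faster source of incompressible data than /dev/urandom is an AES-CTR keystream over /dev/zero (my suggestion, not part of the original report; the passphrase 'seed' is arbitrary):

# Hypothetical faster replacement for /dev/urandom: a CPU-bound
# AES-CTR stream is usually far faster than the kernel RNG.
openssl enc -aes-256-ctr -pass pass:seed -nosalt < /dev/zero 2>/dev/null \
    | dd bs=1M count=2048 iflag=fullblock | dog vdi write test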

3) I kill the fourth node with 'dog node kill 3'. I get many warning messages, but the recovery completes successfully:

Jul 15 10:16:26 WARN [rw 6866] read_erasure_object(233) can not read 7c2b250000082b idx 0
Jul 15 10:16:26 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000082d idx 0
Jul 15 10:16:26 WARN [rw 6865] read_erasure_object(233) can not read 7c2b250000082e idx 1
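
These warnings are expected: with a 2:2 scheme (assumed above) each object is stored as four strips, one per node, so the strip at the killed node's index is unreadable until recovery rebuilds it from the surviving strips. A quick way to confirm the data survived the rebuild (my addition, not something the reporter ran) would be:

# Check the consistency of the test vdi after recovery completes:
dog vdi check test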

4) I kill the third node with 'dog node kill 2'. I again get the warning messages, and the recovery completes successfully:

Jul 15 10:37:37 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000179 idx 0
Jul 15 10:37:37 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000017b idx 1
Jul 15 10:37:37 WARN [rw 6865] read_erasure_object(233) can not read 7c2b2500000180 idx 1

5) I re-add node id 2 (note that I "clean" the node first with 'rm -r /var/lib/sheepdog; rm -r /mnt/sheep/0').
During recovery, I get 'No object found' messages:

Jul 15 10:39:05 WARN [rw 6867] read_erasure_object(233) can not read 7c2b2500000008 idx 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
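
At this point it would help to see how far the recovery has actually progressed while these warnings are being logged; sheepdog's recovery status subcommand (my suggestion, not in the report) shows which nodes are still recovering:

# Show cluster-wide recovery status while the warnings are being logged:
dog node recovery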

6) I re-add node id 3. The 'No object found' warnings continue, 'old node version' epoch errors follow, and the recovery finally fails with 'diskfull' errors:

Jul 15 10:38:28 NOTICE [main] cluster_recovery_completion(781) all nodes are recovered, epoch 3
Jul 15 10:39:05 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000006 idx 2
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
...
Jul 15 10:40:26 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 3
Jul 15 10:40:58 INFO [main] recover_object_main(948) object recovery progress 1%
Jul 15 10:40:59 INFO [main] recover_object_main(948) object recovery progress 2%
...
Jul 15 10:41:25 INFO [main] recover_object_main(948) object recovery progress 54%
Jul 15 10:41:26 INFO [main] recover_object_main(948) object recovery progress 55%
Jul 15 10:41:26 ERROR [rw 6832] err_to_sderr(74) diskfull, oid=7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6832] recover_object_work(584) failed to recover object 7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6986] err_to_sderr(74) diskfull, oid=7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6986] recover_object_work(584) failed to recover object 7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6865] err_to_sderr(74) diskfull, oid=7c2b250000057f
Jul 15 10:41:26 ERROR [rw 6865] recover_object_work(584) failed to recover object 7c2b250000057f
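
The final 'diskfull' errors suggest that, after two node losses and re-adds, recovery is trying to place more strips on some nodes than they have space for. Per-node usage would confirm whether the nodes are genuinely full (a diagnostic step I am suggesting, not one from the report):

# Show per-node capacity and usage:
dog node info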

Changed in sheepdog-project:
importance: Undecided → Critical