Recovery with EC fails

Bug #1474766 reported by sirio81 on 2015-07-15
Affects: sheepdog
Importance: Critical
Assigned to: Unassigned

Bug Description

The cluster consists of four nodes:

   0 192.168.2.44:7000 3 738371776
   1 192.168.2.45:7000 3 755148992
   2 192.168.2.46:7000 3 771926208
   3 192.168.2.47:7000 3 788703424
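
The report does not say how the cluster was formatted. Given the EC in the title, it was presumably created with an erasure-coded redundancy scheme, roughly along these lines (a 2:2 data:parity layout is one plausible choice for a four-node cluster and is assumed here, not taken from the report):

dog cluster format -c 2:2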

1) I create a 10 GB test VDI, preallocated:

dog vdi create -P test 10G
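
The -P option preallocates the data objects, so the new VDI and its allocated size can be checked right after creation with:

dog vdi list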

2) I write to it constantly, also during recovery.
    Note: because my CPU is slow and I'm reading from /dev/urandom, the write speed doesn't exceed 3 MB/s.

dd if=/dev/urandom bs=1M count=2048 | dog vdi write test
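
Piping /dev/urandom straight into the VDI leaves nothing to compare against afterwards. A variant of this step that keeps a reference copy for checking the data after recovery might look like this (the /tmp/test.img path is only an illustration):

dd if=/dev/urandom of=/tmp/test.img bs=1M count=2048
dog vdi write test < /tmp/test.img
# after recovery, compare the first 2 GiB of the VDI against the reference copy
dog vdi read test 0 $((2048*1024*1024)) | cmp - /tmp/test.img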

3) I kill the 4th node with 'dog node kill 3'. I get many warning messages, but the recovery completes successfully:

Jul 15 10:16:26 WARN [rw 6866] read_erasure_object(233) can not read 7c2b250000082b idx 0
Jul 15 10:16:26 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000082d idx 0
Jul 15 10:16:26 WARN [rw 6865] read_erasure_object(233) can not read 7c2b250000082e idx 1

4) I kill the 3rd node with 'dog node kill 2'. Again I get warning messages, and the recovery completes successfully:

Jul 15 10:37:37 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000179 idx 0
Jul 15 10:37:37 WARN [rw 6867] read_erasure_object(233) can not read 7c2b250000017b idx 1
Jul 15 10:37:37 WARN [rw 6865] read_erasure_object(233) can not read 7c2b2500000180 idx 1

5) I bring node id 2 back into the cluster (note that I "clean" the node first with 'rm -r /var/lib/sheepdog; rm -r /mnt/sheep/0'; a sketch of the restart command follows the log excerpt below).
During recovery, I get 'No object found' messages:

Jul 15 10:39:05 WARN [rw 6867] read_erasure_object(233) can not read 7c2b2500000008 idx 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6865] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
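
The report does not show the exact command used to restart the cleaned node. Assuming /mnt/sheep/0 is the store directory and the default cluster driver is used, the rejoin would be something along the lines of (any options beyond the port are left out and would depend on the local setup):

sheep -p 7000 /mnt/sheep/0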

6) I bring node id 3 back into the cluster as well. Recovery starts, but then fails with 'diskfull' errors:

Jul 15 10:38:28 NOTICE [main] cluster_recovery_completion(781) all nodes are recovered, epoch 3
Jul 15 10:39:05 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 2
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6832] read_erasure_object(233) can not read 7c2b2500000006 idx 2
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
Jul 15 10:39:05 WARN [rw 6867] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.45:7000, op name: READ_PEER
...
Jul 15 10:40:26 WARN [rw 6832] sheep_exec_req(1188) failed No object found, remote address: 192.168.2.44:7000, op name: READ_PEER
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 ERROR [main] check_request_epoch(172) old node version 5, 4 (READ_PEER)
Jul 15 10:40:57 INFO [main] local_vdi_state_checkpoint_ctl(1467) freeing vdi state checkpoint at epoch 3
Jul 15 10:40:58 INFO [main] recover_object_main(948) object recovery progress 1%
Jul 15 10:40:59 INFO [main] recover_object_main(948) object recovery progress 2%
...
Jul 15 10:41:25 INFO [main] recover_object_main(948) object recovery progress 54%
Jul 15 10:41:26 INFO [main] recover_object_main(948) object recovery progress 55%
Jul 15 10:41:26 ERROR [rw 6832] err_to_sderr(74) diskfull, oid=7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6832] recover_object_work(584) failed to recover object 7c2b250000057d
Jul 15 10:41:26 ERROR [rw 6986] err_to_sderr(74) diskfull, oid=7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6986] recover_object_work(584) failed to recover object 7c2b250000057e
Jul 15 10:41:26 ERROR [rw 6865] err_to_sderr(74) diskfull, oid=7c2b250000057f
Jul 15 10:41:26 ERROR [rw 6865] recover_object_work(584) failed to recover object 7c2b250000057f

Changed in sheepdog-project:
importance: Undecided → Critical