Using journal corrupts VDI in case of node loss

Bug #1377865 reported by MKersloot
Affects: sheepdog
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Cluster with 3 nodes (A, B, C), 1 VDI with copies 2:1 (erasure coded), journaling enabled on the sheep daemons. Guest running on node A; the setup is sketched below.
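Roughly, the setup corresponds to the following commands (a sketch only; the journal size and the exact -j syntax are assumptions, and the store path is taken from node B's log below; check sheep --help for your version):

# on every node: start sheep with the journal enabled
sheep -j size=256M /mnt/sheep/22

# once, from any node: format for 2:1 erasure coding and create the 15 GB VDI
dog cluster format -c 2:1
dog vdi create micha_test2 15G

# reproduce: start the guest on node A, then kill and restart sheep on node B
pkill -9 sheep
sheep -j size=256M /mnt/sheep/22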
I killed and restarted node B and got the following errors in sheep.log on that node:

Oct 03 16:18:00 NOTICE [main] cluster_recovery_completion(703) all nodes are recovered, epoch 16
Oct 06 10:54:02 INFO [main] md_add_disk(338) /mnt/sheep/22, vdisk nr 467, total disk 1
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for f19e4f not found, set 0
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
....
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 06 10:54:02 ALERT [main] get_vdi_copy_policy(117) copy policy for 2e7549 not found, set 0
Oct 06 10:54:02 INFO [main] send_join_request(787) IPv4 ip:10.10.0.22 port:7000 going to join the cluster
Oct 06 10:54:02 INFO [main] replay_journal_entry(159) /mnt/sheep/22/00f19e4f00000893, size 1251328, off 845824, 0
Oct 06 10:54:02 ERROR [main] replay_journal_entry(166) open No such file or directory
Oct 06 10:54:02 EMERG [main] check_recover_journal_file(262) PANIC: recoverying from journal file (new) failed
Oct 06 10:54:02 EMERG [main] crash_handler(267) sheep exits unexpectedly (Aborted).
Oct 06 10:54:03 EMERG [main] sd_backtrace(833) :
Oct 06 10:54:03 EMERG [main] sd_backtrace(833) :
Oct 06 10:54:03 EMERG [main] sd_backtrace(833) :
Oct 06 10:54:03 EMERG [main] sd_backtrace(833) :

On node A, the guest stopped responding and sheep.log shows the following errors:
Oct 03 16:18:00 NOTICE [main] cluster_recovery_completion(703) all nodes are recovered, epoch 16
Oct 06 10:53:53 ERROR [gway 6564] wait_forward_request(415) remote node might have gone away
Oct 06 10:53:53 ERROR [gway 6567] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:53:53 ERROR [gway 6568] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:53:53 ERROR [gway 6567] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:53:53 ERROR [gway 6568] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
.......
Oct 06 10:54:00 ERROR [gway 6569] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:00 ERROR [gway 6651] wait_forward_request(437) fail f19e4f0000083d, Request has an old epoch
Oct 06 10:54:00 ERROR [rw] read_erasure_object(206) can not read f19e4f0000003d idx 0
Oct 06 10:54:00 ERROR [rw] read_erasure_object(206) can not read f19e4f0000085d idx 1
Oct 06 10:54:00 ERROR [gway 6676] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:00 ERROR [rw] read_erasure_object(206) can not read 2e75490000009e idx 1
.....
Oct 06 10:54:02 ERROR [rw] read_erasure_object(206) can not read f19e4f00000155 idx 1
Oct 06 10:54:02 ERROR [rw] read_erasure_object(206) can not read f19e4f00000895 idx 1
Oct 06 10:54:03 ERROR [block] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:03 ERROR [block] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:03 ALERT [block] do_get_vdis(499) failed to get vdi bitmap from IPv4 ip:10.10.0.22 port:7000
Oct 06 10:54:03 ERROR [main] check_request_epoch(151) old node version 18, 17 (READ_PEER)
Oct 06 10:54:03 ERROR [main] check_request_epoch(151) old node version 18, 17 (READ_PEER)
Oct 06 10:54:03 ERROR [rw] sheep_exec_req(1131) failed Request has an old epoch, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 06 10:54:03 ERROR [rw] recover_object_work(550) failed to recover object f19e4f00000895
Oct 06 10:54:03 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:03 ERROR [rw] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:54:03 ALERT [rw] fetch_object_list(931) cannot get object list from 10.10.0.22:7000
Oct 06 10:54:03 ALERT [rw] fetch_object_list(933) some objects may be not recovered at epoch 18
Oct 06 10:54:03 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 06 10:54:03 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
......
Oct 06 10:54:05 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 06 10:54:05 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 06 10:54:05 INFO [main] recover_object_main(863) object recovery progress 1%
Oct 06 10:54:05 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 06 10:54:05 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
......
Oct 06 10:55:07 ERROR [gway 6694] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:07 ERROR [gway 6766] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:07 INFO [main] recover_object_main(863) object recovery progress 99%
Oct 06 10:55:07 ERROR [gway 6766] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:07 ERROR [gway 6653] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
.....
Oct 06 10:55:08 ERROR [gway 6764] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:08 ERROR [gway 6653] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:08 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.21:7000, op name: READ_PEER
Oct 06 10:55:08 ERROR [rw] sheep_exec_req(1131) failed No object found, remote address: 10.10.0.30:7000, op name: READ_PEER
Oct 06 10:55:08 ERROR [gway 6694] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 10:55:08 ERROR [gway 6698] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused

After that, the log continues with:
Oct 06 11:21:38 ERROR [gway 6766] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused
Oct 06 11:21:38 ERROR [gway 6766] connect_to(193) failed to connect to 10.10.0.22:7000: Connection refused

and the guest remains unresponsive. It looks like object recovery never gets past 99%.
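The stuck recovery can be watched from one of the other nodes with the usual dog commands (a sketch; the recovery subcommand name may differ between sheepdog versions):

dog cluster info      # cluster status and epoch history
dog node list         # current cluster members
dog node recovery     # nodes that are still recovering objects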

This process corrupted the VDI, and it cannot be repaired with:
dog vdi check micha_test2
object f19e4f0000085d is inconsistent
97.1 % [============================================================================================> ] 15 GB / 15 GB object f19e4f0000085f is inconsistent
97.2 % [============================================================================================> ] 15 GB / 15 GB object f19e4f00000864 is inconsistent
97.5 % [============================================================================================> ] 15 GB / 15 GB object f19e4f0000086d is inconsistent
98.9 % [=============================================================================================> ] 15 GB / 15 GB object f19e4f00000a46 is inconsistent
99.4 % [==============================================================================================>] 15 GB / 15 GB object f19e4f00000a6f is inconsistent
100.0 % [===============================================================================================] 15 GB / 15 GB
finish check&repair micha_test2
