The node does not leave when one of the nodes is stopped

Bug #1368503 reported by masahiro tsuji
This bug affects 1 person
Affects: sheepdog
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

The node does not leave when one of the nodes is stopped

I found a problem: when one of the nodes is stopped while recovery is running, it does not leave the cluster.

I reproduced the problem and captured a sheepdog debug log.
It was reproduced by continuous writes from 2 nodes.

Run this on one node:
[root@13EV0104 ~]# dd if=/dev/zero |collie vdi write test1-100G

Run this on the other node:
[root@13EV0105 ~]# dd if=/dev/zero |collie vdi write test2-100G

Version
Corosync 2.3
sheepdog 0.7.6

When one node was stopped, all of the other nodes received the corosync callback.
The log shows 'cdrv_cpg_confchg' with left:1:

Sep 11 21:20:43 DEBUG [main] client_handler(788) 1, rx 2, tx 3
Sep 11 21:20:43 DEBUG [main] client_handler(788) 1, rx 2, tx 3
Sep 11 21:20:43 DEBUG [main] finish_rx(590) 31, 10.0.0.14:41962
Sep 11 21:20:43 DEBUG [main] queue_request(347) WRITE_PEER, 1
Sep 11 21:20:43 DEBUG [io 1660] do_process_work(1377) a5, 3071b900000211, 12
Sep 11 21:20:43 DEBUG [io 1660] md_get_object_path(343) 0, /home/sheepdog/obj2
Sep 11 21:20:43 DEBUG [main] client_handler(788) 4, rx 0, tx 3
Sep 11 21:20:43 DEBUG [main] finish_tx(677) connection from: 31, 10.0.0.14:41962
Sep 11 21:20:43 DEBUG [main] cdrv_cpg_confchg(553) mem:8, joined:0, left:1
Sep 11 21:20:43 DEBUG [main] __corosync_dispatch(371) wait for a next dispatch event
Sep 11 21:20:44 DEBUG [main] client_handler(788) 19, rx 0, tx 3
Sep 11 21:20:44 DEBUG [main] clear_client_info(716) connection seems to be dead
Sep 11 21:20:44 DEBUG [main] clear_client_info(736) refcnt:0, fd:23, 10.0.0.10:44854
Sep 11 21:20:44 DEBUG [main] destroy_client(707) connection from: 10.0.0.10:44854
Sep 11 21:20:47 DEBUG [main] listen_handler(847) accepted a new connection: 23
Sep 11 21:20:47 DEBUG [main] client_handler(788) 1, rx 0, tx 0
Sep 11 21:20:47 DEBUG [main] finish_rx(590) 23, 127.0.0.1:53491
Sep 11 21:20:47 DEBUG [main] queue_request(347) GET_NODE_LIST, 1

But the epoch was not updated.

Then I tried restarting the stopped node.
sd_leave_handler was called only after the cdrv_cpg_confchg callback for the join.
It seems that a COROSYNC_EVENT_TYPE_LEAVE event is in the queue, but it is not dispatched until the next event happens.
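
To make the suspected behaviour concrete, here is a toy Python model. It is not sheepdog code: the class ToyClusterDriver, its "blocked" flag and the method names are hypothetical, and the real reason the dispatch stops may differ. It only shows how an event can sit in a queue when the queue is drained solely from inside callbacks.

```python
from collections import deque


class ToyClusterDriver:
    """Toy model: the event queue is drained only from inside callbacks."""

    def __init__(self):
        self.queue = deque()   # pending (kind, node) events
        self.blocked = False   # hypothetical "cannot dispatch now" condition
        self.handled = []      # events that reached their handler, in order

    def on_confchg(self, kind, node):
        # Membership callback: enqueue the event, then try to dispatch.
        self.queue.append((kind, node))
        self.dispatch()

    def dispatch(self):
        # Drain the queue unless dispatching is currently blocked.
        while self.queue and not self.blocked:
            self.handled.append(self.queue.popleft())

    def unblock(self):
        # The blocking condition clears, but nothing calls dispatch() here.
        # This is the suspected gap.
        self.blocked = False


def simulate():
    node = "10.0.0.10:7000"
    driver = ToyClusterDriver()
    driver.blocked = True               # e.g. other work is still in progress
    driver.on_confchg("leave", node)    # leave is queued, not handled
    driver.unblock()                    # still not handled: no dispatch runs
    stalled = list(driver.queue)
    driver.on_confchg("join", node)     # next event finally drains the queue
    return stalled, driver.handled


if __name__ == "__main__":
    stalled, handled = simulate()
    print("stalled:", stalled)
    print("handled:", handled)
```

In this model the leave is handled only when the join callback arrives, and both are then handled back to back. That matches the order seen in the log, where sd_leave_handler runs right after the joined:1 callback.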

Sep 11 21:27:40 DEBUG [main] destroy_client(707) connection from: 127.0.0.1:53502
Sep 11 21:28:15 DEBUG [main] cdrv_cpg_confchg(553) mem:9, joined:1, left:0
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(907) leave IPv4 ip:10.0.0.10 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [0] IPv4 ip:10.0.0.7 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [1] IPv4 ip:10.0.0.8 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [2] IPv4 ip:10.0.0.9 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [3] IPv4 ip:10.0.0.11 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [4] IPv4 ip:10.0.0.12 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [5] IPv4 ip:10.0.0.13 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [6] IPv4 ip:10.0.0.14 port:7000
Sep 11 21:28:15 DEBUG [main] sd_leave_handler(909) [7] IPv4 ip:10.0.0.15 port:7000
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 467043536896
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 467043536896
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 467043536896
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 465969774592
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 465969774592
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 465969774592
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 464896028672
Sep 11 21:28:15 DEBUG [main] recalculate_vnodes(625) node 7000 has 64 vnodes, free space 465969774592
Sep 11 21:28:15 DEBUG [main] update_epoch_log(26) update epoch: 13, 8
Sep 11 21:28:15 DEBUG [rw] prepare_object_list(759) 13
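
The delay between the leave notification and its handling can be measured from the log lines. The following Python sketch assumes the debug log format shown in this report; the helper names parse_line and leave_delay_seconds are made up for illustration, and the year is passed in because the log lines do not carry one.

```python
import re
from datetime import datetime

# Matches lines like:
# "Sep 11 21:20:43 DEBUG [main] cdrv_cpg_confchg(553) mem:8, joined:0, left:1"
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) +DEBUG \[[^\]]+\] (\w+)\(\d+\) (.*)$"
)


def parse_line(line, year):
    """Return (timestamp, function name, message), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(2), m.group(3)


def leave_delay_seconds(lines, year=2014):
    """Seconds from the first confchg with left > 0 to the next sd_leave_handler."""
    left_at = None
    for line in lines:
        rec = parse_line(line, year)
        if rec is None:
            continue
        ts, func, msg = rec
        if left_at is None:
            if func == "cdrv_cpg_confchg":
                m = re.search(r"left:(\d+)", msg)
                if m and int(m.group(1)) > 0:
                    left_at = ts
        elif func == "sd_leave_handler":
            return (ts - left_at).total_seconds()
    return None


sample = [
    "Sep 11 21:20:43 DEBUG [main] cdrv_cpg_confchg(553) mem:8, joined:0, left:1",
    "Sep 11 21:28:15 DEBUG [main] cdrv_cpg_confchg(553) mem:9, joined:1, left:0",
    "Sep 11 21:28:15 DEBUG [main] sd_leave_handler(907) leave IPv4 ip:10.0.0.10 port:7000",
]
print(leave_delay_seconds(sample))  # 452.0
```

For the lines quoted in this report, the leave was handled 452 seconds (7 minutes 32 seconds) after the confchg callback that reported it.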

[root@13EV0097 ~]# collie cluster info
Cluster status: running, auto-recovery enabled

Cluster created at Fri Sep 5 18:13:13 2014

Epoch Time Version
2014-09-11 21:28:15 14 [10.0.0.7:7000, 10.0.0.8:7000, 10.0.0.9:7000, 10.0.0.10:7000, 10.0.0.11:7000, 10.0.0.12:7000, 10.0.0.13:7000, 10.0.0.14:7000, 10.0.0.15:7000]
2014-09-11 21:28:15 13 [10.0.0.7:7000, 10.0.0.8:7000, 10.0.0.9:7000, 10.0.0.11:7000, 10.0.0.12:7000, 10.0.0.13:7000, 10.0.0.14:7000, 10.0.0.15:7000]
2014-09-10 13:34:52 12 [10.0.0.7:7000, 10.0.0.8:7000, 10.0.0.9:7000, 10.0.0.10:7000, 10.0.0.11:7000, 10.0.0.12:7000, 10.0.0.13:7000, 10.0.0.14:7000, 10.0.0.15:7000]

Changed in sheepdog-project:
status: New → Fix Committed