node list incorrect in sheep start immediately after shutdown

Bug #1433452 reported by Saeki Masaki
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
sheepdog
Undecided
Unassigned

Bug Description

In the case of using the zookeeper,
When sheep start immediately after cluster shutdown, each sheep is started.
But, result of 'dog node list' is not consistent within a cluster.

Cause is related that the remains ephemeral node.
It is possible to avoid the problem by starting sheep after ephemeral node deletion.

This problem has occurred since been added znode of QUEUE_POS.

How to reproduce

***** Case to sheep started before ephemeral node deletion

# sheep -v
Sheepdog daemon version 0.9.0_130_g7e39299_dirty

# zkServer.sh start

# mkdir -p /tmp/sheepdog/{0,1,2,3}
# for i in {0..3}; do sheep -c zookeeper:172.16.3.217:2181 -z ${i} -p 700${i} -l level=debug /tmp/sheepdog/${i}; sleep 1; done
# dog node list
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7002 128 2
   3 172.16.3.217:7003 128 3

# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTING) 0] ls /sheepdog/master
[0000000000, 0000000001, 0000000002, 0000000003]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]

# dog cluster shutdown
# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /sheepdog/master
[0000000002, 0000000003]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]

# ps -fe | grep sheep | grep -v grep

# for i in {0..3}; do sheep -c zookeeper:172.16.3.217:2181 -z ${i} -p 700${i} -l level=debug /tmp/sheepdog/${i}; sleep 1; done
# ps -fe | grep sheep | grep -v grep
root 16196 1 9 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 0 -p 7000 -l level=debug /tmp/sheepdog/0
root 16198 16196 0 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 0 -p 7000 -l level=debug /tmp/sheepdog/0
root 16205 1 7 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 1 -p 7001 -l level=debug /tmp/sheepdog/1
root 16207 16205 0 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 1 -p 7001 -l level=debug /tmp/sheepdog/1
root 16214 1 1 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 2 -p 7002 -l level=debug /tmp/sheepdog/2
root 16216 16214 0 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 2 -p 7002 -l level=debug /tmp/sheepdog/2
root 16233 1 1 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 3 -p 7003 -l level=debug /tmp/sheepdog/3
root 16235 16233 0 13:55 ? 00:00:00 sheep -c zookeeper:172.16.3.217:2181 -z 3 -p 7003 -l level=debug /tmp/sheepdog/3

# for i in {0..3}; do dog node list -p 700${i};done
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7001 128 1
   1 172.16.3.217:7003 128 3
There are no active sheep daemons
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7003 128 3
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1

# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /sheepdog/master
[0000000004, 0000000005, 0000000006, 0000000007]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[]

# grep ERROR /tmp/sheepdog/*/sheep.log
/tmp/sheepdog/0/sheep.log:Mar 18 13:55:37 ERROR [main] zk_handle_accept(1140) failed, , node exists
/tmp/sheepdog/1/sheep.log:Mar 18 13:55:37 ERROR [main] zk_handle_accept(1140) failed, , node exists
/tmp/sheepdog/2/sheep.log:Mar 18 13:55:36 ERROR [main] zk_handle_accept(1140) failed, , node exists
/tmp/sheepdog/3/sheep.log:Mar 18 13:55:37 ERROR [main] zk_handle_accept(1140) failed, , node exists

# grep -15 ERROR /tmp/sheepdog/2/sheep.log
Mar 18 13:55:36 DEBUG [main] zk_handle_join(1065) sender: IPv4 ip:172.16.3.217 port:7002
Mar 18 13:55:36 DEBUG [main] sd_join_handler(1017) check IPv4 ip:172.16.3.217 port:7002, 2
Mar 18 13:55:36 DEBUG [main] sd_join_handler(1030) 172.16.3.217:7002: cluster_status = 0x2
Mar 18 13:55:36 DEBUG [main] zk_watcher(732) path:/queue/0000000025, type:3, state:3
Mar 18 13:55:36 DEBUG [main] push_join_response(577) update path:/queue/0000000025, queue_pos:0000000025, len:983256
Mar 18 13:55:36 DEBUG [main] zk_handle_join(1075) I'm the master now
Mar 18 13:55:36 DEBUG [main] zk_event_handler(1282) 1, 25
Mar 18 13:55:36 DEBUG [main] zk_queue_pop_advance(592) /queue/0000000025, type:2, len:983256, pos:25
Mar 18 13:55:36 DEBUG [main] zk_handle_accept(1114) ACCEPT
Mar 18 13:55:36 DEBUG [main] init_node_list(1096) 0
Mar 18 13:55:36 DEBUG [main] zk_handle_accept(1119) IPv4 ip:172.16.3.217 port:7002
Mar 18 13:55:36 DEBUG [main] zk_handle_accept(1127) create path:/member/IPv4 ip:172.16.3.217 port:7002
Mar 18 13:55:36 DEBUG [main] zk_watcher(732) path:/member/IPv4 ip:172.16.3.217 port:7002, type:1, state:3
Mar 18 13:55:36 DEBUG [main] zk_handle_accept(1134) create path:/queue_pos/IPv4 ip:172.16.3.217 port:7002
Mar 18 13:55:36 DEBUG [main] zk_watcher(732) path:/member, type:4, state:3
Mar 18 13:55:36 ERROR [main] zk_handle_accept(1140) failed, , node exists
Mar 18 13:55:36 DEBUG [main] zk_event_handler(1282) 1, 26
Mar 18 13:55:37 DEBUG [main] zk_watcher(732) path:/master, type:4, state:3
Mar 18 13:55:37 DEBUG [main] zk_watcher(732) path:/queue/0000000026, type:1, state:3
Mar 18 13:55:37 DEBUG [main] zk_event_handler(1282) 1, 26
Mar 18 13:55:37 DEBUG [main] zk_queue_pop_advance(592) /queue/0000000026, type:1, len:983256, pos:26
Mar 18 13:55:37 DEBUG [main] zk_handle_join(1065) sender: IPv4 ip:172.16.3.217 port:7003
Mar 18 13:55:37 DEBUG [main] sd_join_handler(1017) check IPv4 ip:172.16.3.217 port:7003, 2
Mar 18 13:55:37 DEBUG [main] sd_join_handler(1030) 172.16.3.217:7003: cluster_status = 0x2
Mar 18 13:55:37 DEBUG [main] zk_watcher(732) path:/queue/0000000026, type:3, state:3
Mar 18 13:55:37 DEBUG [main] push_join_response(577) update path:/queue/0000000026, queue_pos:0000000026, len:983256
Mar 18 13:55:37 DEBUG [main] zk_handle_join(1075) I'm the master now
Mar 18 13:55:37 DEBUG [main] zk_event_handler(1282) 1, 26
Mar 18 13:55:37 DEBUG [main] zk_queue_pop_advance(592) /queue/0000000026, type:2, len:983256, pos:26
Mar 18 13:55:37 DEBUG [main] zk_handle_accept(1114) ACCEPT
Mar 18 13:55:37 DEBUG [main] zk_handle_accept(1119) IPv4 ip:172.16.3.217 port:7003

=========================================
***** Case to sheep start waiting for the ephemeral node deletion

 (same as above)
# dog cluster shutdown

# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /sheepdog/master
[0000000002, 0000000003]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]

(Wait a few seconds....)
# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTING) 0] ls /sheepdog/master
[]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[]
[zk: localhost:2181(CONNECTED) 3]

# for i in {0..3}; do sheep -c zookeeper:172.16.3.217:2181 -z ${i} -p 700${i} -l level=debug /tmp/sheepdog/${i}; sleep 1; done

# for i in {0..3}; do dog node list -p 700${i};done
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7002 128 2
   3 172.16.3.217:7003 128 3
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7002 128 2
   3 172.16.3.217:7003 128 3
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7002 128 2
   3 172.16.3.217:7003 128 3
  Id Host:Port V-Nodes Zone
   0 172.16.3.217:7000 128 0
   1 172.16.3.217:7001 128 1
   2 172.16.3.217:7002 128 2
   3 172.16.3.217:7003 128 3

# echo -e 'ls /sheepdog/master\nls /sheepdog/member\nls /sheepdog/queue_pos' | zkCli.sh
[zk: localhost:2181(CONNECTING) 0] ls /sheepdog/master
[0000000004, 0000000005, 0000000006, 0000000007]
[zk: localhost:2181(CONNECTED) 1] ls /sheepdog/member
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]
[zk: localhost:2181(CONNECTED) 2] ls /sheepdog/queue_pos
[IPv4 ip:172.16.3.217 port:7002, IPv4 ip:172.16.3.217 port:7003, IPv4 ip:172.16.3.217 port:7000, IPv4 ip:172.16.3.217 port:7001]

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers