Some nodes do not join the Sheepdog cluster

Bug #1322427 reported by Saeki Masaki
This bug affects 1 person
Affects    Status  Importance  Assigned to  Milestone
sheepdog   New     Undecided   Unassigned

Bug Description

While building a 12-node Sheepdog cluster on Corosync, we found some strange behavior.
After we launched a sheep process on each server, some nodes did not join the cluster.
Which nodes failed to join was not consistent (more information is given below).

I'm not familiar with Corosync, but Corosync logged "enabling flow control",
apparently because the send message buffer was full.
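To check whether the flow-control message appears on every node or only on the ones that fail to join, the Corosync log can be searched for the string quoted above. This is a minimal sketch: `count_flow_control` is a hypothetical helper name, and the log file location depends on the installation, so it is passed as an argument.

```shell
#!/bin/sh
# Count the lines in a corosync log file that contain the message quoted
# in this report. $1: path to the log file (location varies by setup).
count_flow_control() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status zero when there are no matches (grep returns 1 then).
    grep -c 'enabling flow control' "$1" || true
}
```

Running it against the log of each node gives a quick per-node comparison.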

In our environment, the problem is unlikely to occur when the number of nodes is small.
We could not reproduce it with sheepdog v0.7.8,
and we could not reproduce it with corosync v2.3.3.

I have some questions:
1. Did the size of the message sent when joining the cluster increase between sheepdog v0.7.8 and v0.8.1?
2. Which Corosync version do you mainly use?

---
The environment in which the problem occurred:
  CentOS6.5 ( 2.6.32-431.el6.x86_64)
  sheepdog v0.8.1
  corosync-1.4.1-17.el6_5.1

---
Preparation
 Completely delete Sheepdog's cluster files and data.

---
Steps to Reproduce
[root@sds01 ~]# ssh sds01 "sheep -p 7000 -b 192.168.2.11 -i host=192.168.2.11,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"
[root@sds01 ~]# ssh sds02 "sheep -p 7000 -b 192.168.2.12 -i host=192.168.2.12,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"
(Snip 9 nodes)
[root@sds01 ~]# ssh sds12 "sheep -p 7000 -b 192.168.2.22 -i host=192.168.2.22,port=7001 -l dir=${LOG_DIR},level=${LOG_LEVEL} ${DS_BASE}"
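The launch commands above follow a regular pattern, so they can be generated in a loop. The sketch below only prints the commands (a dry run). It ASSUMES that host sdsNN binds to 192.168.2.(10+NN); the report shows this only for sds01, sds02 and sds12, and the commands for the snipped nodes are inferred, not quoted. `sheep_cmd` and `launch_all_dry_run` are hypothetical helper names, and `LOG_DIR`, `LOG_LEVEL` and `DS_BASE` must be set as in the original session.

```shell
#!/bin/sh
# Print the sheep command line for node number $1 (1..12).
# ASSUMPTION: sdsNN uses 192.168.2.(10+NN); confirmed in the report only
# for sds01, sds02 and sds12.
sheep_cmd() {
    n=$1
    ip="192.168.2.$((10 + n))"
    printf 'sheep -p 7000 -b %s -i host=%s,port=7001 -l dir=%s,level=%s %s\n' \
        "$ip" "$ip" "$LOG_DIR" "$LOG_LEVEL" "$DS_BASE"
}

# Dry run: print one ssh command per node instead of executing it.
launch_all_dry_run() {
    n=1
    while [ "$n" -le 12 ]; do
        printf 'ssh sds%02d "%s"\n' "$n" "$(sheep_cmd "$n")"
        n=$((n + 1))
    done
}
```

Reviewing the printed commands before running them avoids starting sheep with an unintended address.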

---
Confirmation of the results
 Some nodes reported a different cluster status and different log output.

[root@sds01 ~]# dog node list -a 192.168.2.11
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 80 201500864
   2 192.168.2.13:7000 140 218278080
   3 192.168.2.14:7000 145 235055296
   4 192.168.2.15:7000 147 251832512
   5 192.168.2.16:7000 145 268609728
   6 192.168.2.17:7000 147 285386944
   7 192.168.2.18:7000 145 302164160
   8 192.168.2.19:7000 146 318941376
   9 192.168.2.20:7000 144 335718592
  10 192.168.2.21:7000 146 352495808
  11 192.168.2.22:7000 119 369273024
[root@sds01 ~]# dog node list -a 192.168.2.16
  Id Host:Port V-Nodes Zone
   0 192.168.2.11:7000 33 184723648
   1 192.168.2.12:7000 82 201500864
   2 192.168.2.13:7000 143 218278080
   3 192.168.2.14:7000 148 235055296
   4 192.168.2.15:7000 150 251832512
   5 192.168.2.16:7000 148 268609728
   6 192.168.2.17:7000 150 285386944
   7 192.168.2.18:7000 149 302164160
   8 192.168.2.19:7000 149 318941376
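The two listings above disagree: the view from 192.168.2.11 has 12 members, while the view from 192.168.2.16 has 9. A comparison like this can be automated. The sketch below assumes that `dog node list` prints the columns exactly as shown above (Id, Host:Port, V-Nodes, Zone); `hosts_in_view` and `missing_from` are hypothetical helper names.

```shell
#!/bin/sh
# Read `dog node list` output on stdin and print the Host:Port column,
# sorted. Rows are recognised by a numeric Id in the first field, which
# skips the header line.
hosts_in_view() {
    awk '$1 ~ /^[0-9]+$/ { print $2 }' | LC_ALL=C sort
}

# Print members present in the view stored in file $1 but absent from
# the view stored in file $2. Both files must come from hosts_in_view.
missing_from() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Example usage (requires a running cluster):
#   dog node list -a 192.168.2.11 | hosts_in_view > view_a
#   dog node list -a 192.168.2.16 | hosts_in_view > view_b
#   missing_from view_a view_b
```

The V-Nodes column also differs between the two views, so comparing only Host:Port finds missing members but not the vnode drift.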

---
sds01 sheepdog log
May 12 15:33:25 DEBUG [main] tx_main(832) 37, 192.168.2.21:58765
May 12 15:33:25 DEBUG [block] sockfd_cache_put_long(372) 192.168.2.21:7001 idx 0
May 12 15:33:28 DEBUG [main] cdrv_cpg_confchg(555) mem:12, joined:1, left:0
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 0
May 12 15:33:28 DEBUG [main] sd_join_handler(765) check IPv4 ip:192.168.2.22 port:7000, 2
May 12 15:33:28 DEBUG [main] sd_join_handler(778) 192.168.2.22:7000: cluster_status = 0x2
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] sd_accept_handler(907) join IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.11 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.12 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.13 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.14 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.15 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.16 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.17 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.18 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.19 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.20 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.21 port:7000
May 12 15:33:28 DEBUG [main] sd_accept_handler(909) IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [main] update_cluster_info(646) status = 2, epoch = 0
May 12 15:33:28 DEBUG [main] sockfd_cache_add(239) 192.168.2.22:7000, count 12
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.11 port:7000 has 33 vnodes, free space 41482960896
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.12 port:7000 has 80 vnodes, free space 101949628416
(Snip)
May 12 15:33:28 DEBUG [main] recalculate_vnodes(126) node IPv4 ip:192.168.2.22 port:7000 has 119 vnodes, free space 150564032512
May 12 15:33:28 DEBUG [block] do_get_vdis(495) try to get vdi bitmap from IPv4 ip:192.168.2.22 port:7000
May 12 15:33:28 DEBUG [block] sockfd_cache_get_long(344) create cache connection 192.168.2.22:7001 idx 0
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] __corosync_dispatch(373) wait for a next dispatch event
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [block] connect_to(209) 38, 192.168.2.22:7001
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [main] cdrv_cpg_deliver(450) 1
May 12 15:33:28 DEBUG [block] sockfd_cache_put_long(372) 192.168.2.22:7001 idx 0
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 39
May 12 15:33:28 DEBUG [main] client_handler(916) 1, 0
May 12 15:33:28 DEBUG [main] rx_main(780) 39, 192.168.2.22:43156
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
May 12 15:33:28 DEBUG [io 12749] do_process_work(1428) ab, 0, 0
May 12 15:33:28 DEBUG [main] client_handler(916) 4, 0
May 12 15:33:28 DEBUG [main] tx_main(832) 39, 192.168.2.22:43156

sds06 sheepdog log
May 12 15:33:25 DEBUG [main] tx_main(832) 35, 192.168.2.21:60702
May 12 15:33:28 DEBUG [main] listen_handler(996) accepted a new connection: 36
May 12 15:33:28 DEBUG [main] client_handler(916) 1, 0
May 12 15:33:28 DEBUG [main] rx_main(780) 36, 192.168.2.22:38787
May 12 15:33:28 DEBUG [main] queue_request(454) GET_VDI_COPIES, 2
May 12 15:33:28 DEBUG [io 15732] do_process_work(1428) ab, 0, 0
May 12 15:33:28 DEBUG [main] client_handler(916) 4, 0
May 12 15:33:28 DEBUG [main] tx_main(832) 36, 192.168.2.22:38787
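In the excerpts above, sds01 logs the join of 192.168.2.22 through `sd_accept_handler`, while the sds06 excerpt contains no such line. To compare which joins each node actually processed, the join lines can be extracted from every sheep log. This sketch assumes the log line format shown above and handles only the IPv4 form that appears in this report; `accepted_joins` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the address:port of every node whose join was accepted according
# to the sheep log file $1. Matches lines of the form
#   ... sd_accept_handler(NNN) join IPv4 ip:A.B.C.D port:P
accepted_joins() {
    sed -n 's/.*sd_accept_handler([0-9]*) join IPv4 ip:\([^ ]*\) port:\([0-9]*\).*/\1:\2/p' "$1"
}
```

A node whose output lacks an address that other nodes report did not process that join.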
