3-node native rabbitmq cluster race
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Landscape Server | Fix Released | High | Andreas Hasenack | |
15.07 | Fix Released | High | Andreas Hasenack | |
Cisco-odl | Fix Released | High | Andreas Hasenack | |
rabbitmq-server (Juju Charms Collection) | Fix Released | Critical | David Ames | |
Bug Description
With a 3-node native cluster in Vivid-Kilo, Trusty-Juno, and Precise-Icehouse, one of the rabbitmq-server units fails to cluster in more than 50% of attempts. When this happens, we end up with a 2-node cluster and a 1-node cluster, while juju status indicates happiness. In Trusty-Icehouse, the race is much less frequent.
The min-cluster-size and max-cluster-tries code does not appear to be hit. The above is observed with juju 1.24.5 with leader election (LE).
When I try with juju 1.22.1 (fallback cluster approach), I get no clustered units (i.e. 3 separate single-node clusters).
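For readers unfamiliar with the two options named above, the following is a minimal sketch of the kind of gating they suggest. It is not the charm's code: it assumes min-cluster-size means "do not attempt to cluster until this many units exist" and max-cluster-tries means "give up after this many rounds of join attempts". The function `try_cluster` and the `join` callable are hypothetical names introduced only for illustration.

```python
import time


def try_cluster(peers, join, min_cluster_size=3, max_tries=3,
                delay=0.0, sleep=time.sleep):
    """Try to join one of `peers`; return the peer joined, or None.

    `join` is a hypothetical callable taking a peer name and returning
    True on success. `peers` excludes the local unit.
    """
    # Count the local unit as well as its peers (assumed semantics).
    if len(peers) + 1 < min_cluster_size:
        return None  # not enough units yet, do not attempt to cluster

    for _ in range(max_tries):
        for peer in peers:
            if join(peer):
                return peer
        # Every peer refused in this round; pause before the next one.
        sleep(delay)
    return None
```

If logic of this shape is never reached, as the description reports, a unit that loses the race on its first attempt has no second chance to join.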
Test scenario: a basic 3-node rabbitmq-server native cluster, with nrpe as a subordinate to exercise nrpe-external-
DNS does not appear to play a role here, as all machines can resolve all other machines, forward and reverse, when this cluster failure is observed.
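The forward and reverse resolution check described above could be scripted roughly as follows. This is a sketch, not the procedure actually used for this report; `check_dns` is a hypothetical helper, and the comparison of short host names is an assumption about what "resolves correctly" means here.

```python
import socket


def check_dns(hostnames, forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Return {hostname: [problems]}; an empty list means both lookups agree."""
    results = {}
    for name in hostnames:
        issues = []
        try:
            addr = forward(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[name] = ["forward lookup failed: %s" % exc]
            continue
        try:
            reverse_name = reverse(addr)[0]
        except OSError as exc:  # socket.herror is a subclass of OSError
            issues.append("reverse lookup of %s failed: %s" % (addr, exc))
        else:
            # Compare short names only, so "host" matches "host.domain".
            if reverse_name.split(".")[0] != name.split(".")[0]:
                issues.append("reverse lookup of %s returned %s"
                              % (addr, reverse_name))
        results[name] = issues
    return results
```

Running this on every unit against the host names of all three machines would confirm the "all machines can resolve all other machines" observation in one pass.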
FYI, when the cluster does succeed on V-K, a separate, seemingly-unrelated bug is consistently hit (bug 1485722).
# VK amulet results
2015-08-18 17:49:03,637 test_300_rmq_config INFO: OK
2015-08-18 17:49:03,637 test_400_
2015-08-18 17:49:08,219 get_unit_hostnames DEBUG: Unit host names: {'rabbitmq-
2015-08-18 17:49:09,932 run_cmd_unit DEBUG: rabbitmq-server/0 `rabbitmqctl cluster_status` command returned 0 (OK)
2015-08-18 17:49:09,932 get_rmq_
Cluster status of node 'rabbit@
[{nodes,
{running_
{cluster_
{partitions,[]}]
2015-08-18 17:49:11,578 run_cmd_unit DEBUG: rabbitmq-server/1 `rabbitmqctl cluster_status` command returned 0 (OK)
2015-08-18 17:49:11,578 get_rmq_
Cluster status of node 'rabbit@
[{nodes,
{running_
{cluster_
{partitions,[]}]
2015-08-18 17:49:13,224 run_cmd_unit DEBUG: rabbitmq-server/2 `rabbitmqctl cluster_status` command returned 0 (OK)
2015-08-18 17:49:13,226 get_rmq_
Cluster status of node 'rabbit@
[{nodes,
{running_
{cluster_
{partitions,[]}]
Cluster member check failed on rabbitmq-server/0: rabbit@
Cluster member check failed on rabbitmq-server/0: rabbit@
Cluster member check failed on rabbitmq-server/1: rabbit@
Cluster member check failed on rabbitmq-server/2: rabbit@
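A cluster member check of the kind that produced the failures above can be sketched as below. This is an illustration, not the amulet helper itself: `running_nodes` and `missing_members` are hypothetical names, and the parser assumes the Erlang-term output format that `rabbitmqctl cluster_status` prints in the 3.x series.

```python
import re


def running_nodes(cluster_status_output):
    """Extract the set of node names listed under running_nodes."""
    match = re.search(r"\{running_nodes,\s*\[(.*?)\]\}",
                      cluster_status_output, re.DOTALL)
    if not match:
        return set()
    # Node names are Erlang atoms, usually single-quoted.
    return {atom.strip().strip("'")
            for atom in match.group(1).split(",") if atom.strip()}


def missing_members(cluster_status_output, expected_hostnames):
    """Return the expected rabbit@host nodes that are not running."""
    running = running_nodes(cluster_status_output)
    expected = ["rabbit@%s" % host for host in expected_hostnames]
    return sorted(node for node in expected if node not in running)
```

A healthy 3-node cluster yields an empty list from every unit; in the split described in this bug, each unit reports at least one missing node.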
# VK rabbitmq-server/2 unit failed to cluster:
2015-08-18 17:44:27 INFO juju-log cluster:1: Clustering with remote rabbit host (juju-beis0-
2015-08-18 17:44:27 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO cluster-
2015-08-18 17:44:28 INFO juju-log cluster:1: Failed to cluster with juju-beis0-
# rabbitmq-server/2 (juju-beis0-
Name resolution is fine. Attempted to cluster with rabbitmq-server/0 (juju-beis0-
Full unit log: http://
root@juju-
juju-beis0-
root@juju-
inet 172.18.99.100/24 brd 172.18.99.255 scope global eth0
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
98.99.18.
root@juju-
99.99.18.
root@juju-
100.99.
# rabbitmq-server/0 (juju-beis0-
Name resolution is fine. cluster-relation-* hooks never fired.
Full unit log: http://
root@juju-
juju-beis0-
root@juju-
inet 172.18.99.98/24 brd 172.18.99.255 scope global eth0
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
98.99.18.
root@juju-
99.99.18.
root@juju-
100.99.
# rabbitmq-server/1 (juju-beis0-
Name resolution is fine. Clustered ok with rabbitmq-server/2 (juju-beis0-
Full unit log: http://
root@juju-
juju-beis0-
root@juju-
inet 172.18.99.99/24 brd 172.18.99.255 scope global eth0
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
juju-beis0-
root@juju-
98.99.18.
root@juju-
99.99.18.
root@juju-
100.99.
# VK juju stat
http://
# rmq versions
ubuntu@
- MachineId: "2"
Stdout: |
rabbitmq-
Installed: 3.4.3-2
Candidate: 3.4.3-2
Version table:
*** 3.4.3-2 0
500 http://
100 /var/lib/
UnitId: rabbitmq-server/0
- MachineId: "3"
Stdout: |
rabbitmq-
Installed: 3.4.3-2
Candidate: 3.4.3-2
Version table:
*** 3.4.3-2 0
500 http://
100 /var/lib/
UnitId: rabbitmq-server/1
- MachineId: "4"
Stdout: |
rabbitmq-
Installed: 3.4.3-2
Candidate: 3.4.3-2
Version table:
*** 3.4.3-2 0
500 http://
100 /var/lib/
UnitId: rabbitmq-server/2
Related branches
- Liam Young (community): Approve
- Billy Olsen: Needs Resubmitting
- Ryan Beisner (community): Needs Resubmitting
- Diff: 107 lines (+36/-11), 2 files modified
hooks/rabbit_utils.py (+6/-1)
hooks/rabbitmq_server_relations.py (+30/-10)
- Liam Young (community): Approve
- Diff: 7025 lines (+2723/-3919), 46 files modified
Makefile (+5/-4)
charm-helpers-tests.yaml (+2/-3)
hooks/rabbit_utils.py (+6/-1)
hooks/rabbitmq_server_relations.py (+30/-10)
metadata.yaml (+4/-1)
tests/00-setup (+16/-0)
tests/014-basic-precise-icehouse (+11/-0)
tests/015-basic-trusty-icehouse (+9/-0)
tests/016-basic-trusty-juno (+11/-0)
tests/017-basic-trusty-kilo (+11/-0)
tests/019-basic-vivid-kilo (+9/-0)
tests/020-basic-trusty-liberty (+11/-0)
tests/021-basic-wily-liberty (+9/-0)
tests/basic_deployment.py (+492/-0)
tests/charmhelpers/__init__.py (+0/-38)
tests/charmhelpers/cli/__init__.py (+0/-191)
tests/charmhelpers/cli/benchmark.py (+0/-36)
tests/charmhelpers/cli/commands.py (+0/-32)
tests/charmhelpers/cli/hookenv.py (+0/-23)
tests/charmhelpers/cli/host.py (+0/-31)
tests/charmhelpers/cli/unitdata.py (+0/-39)
tests/charmhelpers/contrib/__init__.py (+0/-15)
tests/charmhelpers/contrib/amulet/__init__.py (+15/-0)
tests/charmhelpers/contrib/amulet/deployment.py (+93/-0)
tests/charmhelpers/contrib/amulet/utils.py (+778/-0)
tests/charmhelpers/contrib/openstack/__init__.py (+15/-0)
tests/charmhelpers/contrib/openstack/amulet/__init__.py (+15/-0)
tests/charmhelpers/contrib/openstack/amulet/deployment.py (+198/-0)
tests/charmhelpers/contrib/openstack/amulet/utils.py (+963/-0)
tests/charmhelpers/contrib/ssl/__init__.py (+0/-94)
tests/charmhelpers/contrib/ssl/service.py (+0/-279)
tests/charmhelpers/core/__init__.py (+0/-15)
tests/charmhelpers/core/decorators.py (+0/-57)
tests/charmhelpers/core/files.py (+0/-45)
tests/charmhelpers/core/fstab.py (+0/-134)
tests/charmhelpers/core/hookenv.py (+0/-898)
tests/charmhelpers/core/host.py (+0/-570)
tests/charmhelpers/core/hugepage.py (+0/-62)
tests/charmhelpers/core/services/__init__.py (+0/-18)
tests/charmhelpers/core/services/base.py (+0/-353)
tests/charmhelpers/core/services/helpers.py (+0/-283)
tests/charmhelpers/core/strutils.py (+0/-42)
tests/charmhelpers/core/sysctl.py (+0/-56)
tests/charmhelpers/core/templating.py (+0/-68)
tests/charmhelpers/core/unitdata.py (+0/-521)
tests/tests.yaml (+20/-0)
- Liam Young (community): Approve
- Diff: 107 lines (+36/-11), 2 files modified
hooks/rabbit_utils.py (+6/-1)
hooks/rabbitmq_server_relations.py (+30/-10)
summary: |
- cluster-relation-changed Error: unable to connect to nodes ['rabbit@juju-beis0-machine-2']: nodedown
+ vivid-kilo 3-node native cluster race: cluster-relation-changed Error: unable to connect to nodes ['rabbit@juju-X-machine-N']: nodedown
summary: |
- vivid-kilo 3-node native cluster race: cluster-relation-changed Error: unable to connect to nodes ['rabbit@juju-X-machine-N']: nodedown
+ 3-node native cluster doesn't always cluster race: cluster-relation-changed Error: unable to connect to nodes ['rabbit@juju-X-machine-N']: nodedown
description: | updated |
description: | updated |
description: | updated |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | New → Confirmed |
importance: | Undecided → High |
Changed in rabbitmq-server (Juju Charms Collection): | |
importance: | High → Critical |
description: | updated |
summary: |
- 3-node native cluster doesn't always cluster race: cluster-relation-changed Error: unable to connect to nodes ['rabbit@juju-X-machine-N']: nodedown
+ 3-node native rabbitmq cluster race
Changed in rabbitmq-server (Juju Charms Collection): | |
assignee: | nobody → David Ames (thedac) |
tags: | added: backport-potential |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Confirmed → Fix Committed |
milestone: | none → 15.10 |
tags: | added: sts |
tags: | added: cisco landscape |
tags: | added: landscape-release-29 |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Confirmed → Fix Committed |
Changed in landscape: | |
importance: | Undecided → High |
Changed in landscape: | |
status: | New → In Progress |
assignee: | nobody → Andreas Hasenack (ahasenack) |
Changed in landscape: | |
status: | In Progress → Fix Committed |
milestone: | none → 15.08 |
tags: | removed: landscape-release-29 |
Changed in landscape: | |
status: | Fix Committed → Fix Released |
milestone: | 15.08 → 15.07 |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Fix Committed → Fix Released |
Hello Ryan,
FWIW, I can reproduce cluster setup failure with Juju 1.22 more reliably than 1.24, LP: #1483949.