Malformed 3 unit cluster (rabbitmq)

Bug #1657245 reported by Andreas Hasenack
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Landscape Server | Invalid | Undecided | Unassigned |
OpenStack RabbitMQ Server Charm | Fix Released | Critical | David Ames |
rabbitmq-server (Juju Charms Collection) | Invalid | Critical | David Ames |

Bug Description

juju 2.1b4
cs:xenial/rabbitmq-server-57
maas 2.1.3

I have an HA openstack deployment done by the autopilot where the 3 rabbit units didn't cluster together. In fact, it looks like units 0 and 1 clustered, but unit 2 went ahead on its own (a split brain, I suppose). Also of note is that unit 2 is the leader according to juju.

This was noticed when neutron services couldn't connect to rabbit, getting a 403 error back:
2017-01-17 12:30:22.929 32573 ERROR oslo_service.service AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.

This attempt can be confirmed in the rabbit/1 unit logs:
=ERROR REPORT==== 17-Jan-2017::12:30:22 ===
closing AMQP connection <0.16879.0> (10.96.22.27:60030 -> 10.96.22.56:5672):
{handshake_error,starting,0,
                 {amqp_error,access_refused,
                             "AMQPLAIN login refused: user 'neutron' - invalid credentials",
                             'connection.start_ok'}}

In fact, rabbit/0 and /1 show all sorts of refused logins because of invalid credentials.

Meanwhile, logs for rabbit/2 show that it is happily creating those users, like neutron:
=INFO REPORT==== 17-Jan-2017::12:12:10 ===
Creating user 'neutron'

Note that there is a suspicion that leader election in juju 2.1b4 broke or changed; see the details in https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1654116/comments/11, which was also about rabbit.

Attached are the logs for all 3 rabbit units, as well as the neutron "victim". I have logs of all nodes participating in this deployment if something else is needed.
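
As a rough illustration of the split described above, the sketch below groups units by the set of cluster members each one reports. The `views` mapping is hypothetical data modeled on this report (units 0 and 1 together, unit 2 alone); it is not taken from the attached logs, and the function names are made up for this example.

```python
from collections import defaultdict


def group_by_cluster_view(views):
    """Group units by the set of cluster members each unit reports.

    `views` maps a unit name to the set of members that unit believes
    are in its cluster. Units sharing the same view land in one group.
    """
    groups = defaultdict(list)
    for unit, members in views.items():
        groups[frozenset(members)].append(unit)
    return [sorted(units) for units in groups.values()]


def is_split(views):
    """More than one distinct membership view means the units disagree."""
    return len(group_by_cluster_view(views)) > 1


# Hypothetical views, modeled on the situation in this report.
views = {
    "rabbit/0": {"rabbit/0", "rabbit/1"},
    "rabbit/1": {"rabbit/0", "rabbit/1"},
    "rabbit/2": {"rabbit/2"},
}

partitions = sorted(group_by_cluster_view(views))
print(partitions, is_split(views))
```

In a healthy 3 unit cluster, all three units would report the same membership set and only one group would come back.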

Tags: landscape
Andreas Hasenack (ahasenack) wrote :

/etc/* and /var/log/* from all 3 rabbit units

description: updated
Andreas Hasenack (ahasenack) wrote :

/etc/* and /var/log/* from neutron-gateway/0, where the rabbit logins are attempted and refused because of incorrect credentials.

David Ames (thedac)
Changed in rabbitmq-server (Juju Charms Collection):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → David Ames (thedac)
milestone: none → 17.01
David Ames (thedac) wrote :

The leader election issue in bug https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1654116/ is a prime suspect for this bug. We'll follow the juju-core response there.

In the meantime, testing the leadership bug revealed that this is also a charm bug. The rabbitmq-server charm does not wait until the cluster is completely formed before running amqp-relation-* hooks. In theory, at least, this could lead to the split-brain problems described here.

I am working on a change that:
1) Uses min-cluster-size to try to guarantee waiting for the full cluster to form
2) When min-cluster-size is not set, tries to determine the number of cluster relations and waits on them
3) Once the cluster is formed, has only the leader node run amqp-relation-*
   The existing charm already does this to some degree, hence the suspicion of bug #1654116
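
The three steps above can be sketched roughly as follows. This is only an illustration of the gating logic, not the charm's actual code: all function and parameter names are hypothetical, and the inputs (configured min-cluster-size, number of related peer units, list of nodes currently clustered, leadership flag) are assumed to be gathered elsewhere.

```python
def expected_cluster_size(min_cluster_size, related_peer_count):
    """Return how many nodes must be clustered before clients are served.

    Step 1: prefer the operator-supplied min-cluster-size when it is set.
    Step 2: otherwise fall back to the number of peer relations plus
            this unit itself.
    """
    if min_cluster_size:
        return min_cluster_size
    return related_peer_count + 1


def cluster_ready(min_cluster_size, related_peer_count, clustered_nodes):
    """True once at least the expected number of nodes have clustered."""
    expected = expected_cluster_size(min_cluster_size, related_peer_count)
    return len(clustered_nodes) >= expected


def should_run_client_hooks(is_leader, min_cluster_size,
                            related_peer_count, clustered_nodes):
    """Step 3: only the leader acts, and only on a fully formed cluster."""
    return is_leader and cluster_ready(
        min_cluster_size, related_peer_count, clustered_nodes)


# A leader that has only clustered with itself must keep waiting when
# three nodes are expected.
print(should_run_client_hooks(True, 3, 2, ["rabbit-2"]))
```

With min-cluster-size unset and no peers related, the expected size is 1, so a single node deployment is not blocked by this check.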

Changed in landscape:
milestone: none → 17.01
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/422318

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/422318
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=2472e1ca9fbd68cdd50640eadd071177ddc5968e
Submitter: Jenkins
Branch: master

commit 2472e1ca9fbd68cdd50640eadd071177ddc5968e
Author: David Ames <email address hidden>
Date: Fri Jan 13 20:36:12 2017 -0800

    Wait until clustered before running client hooks

    RabbitMQ takes some time to fully cluster. The charm was previously
    running amqp-relation-changed hooks whenever they were queued even
    if the cluster was not yet complete. This led to split brain
    scenarios. Client authentication to one or more nodes could fail.

    This change confirms the entire cluster is ready before running
    client amqp-relation-changed hooks.

    min-cluster-size can now be used to attempt to guarantee the cluster
    is ready with the expected number of nodes. If min-cluster-size is
    not set the charm will still determine based on the information
    available if all the cluster nodes are ready. Single node
    deployments are still possible.

    Partial-Bug: #1657245
    Closes-Bug: #1657176
    Change-Id: I870df71869c979e65a3a8764efdf35a746278507

David Ames (thedac)
Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → Fix Committed
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
James Page (james-page)
Changed in charm-rabbitmq-server:
assignee: nobody → David Ames (thedac)
importance: Undecided → Critical
status: New → Fix Committed
Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Committed → Invalid
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: none → 17.02
James Page (james-page)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Chad Smith (chad.smith) wrote : Re: Malformed 3 unit cluster

Updated worker multiplier to 1.0 and haven't seen this issue since.

Changed in landscape:
status: New → Invalid
summary: - Malformed 3 unit cluster
+ Malformed 3 unit cluster (rabbitmq)