[18.08, Xenial-Queens] RMQ malfunctions although cluster status looks good and was expected to self-recover from crash

Bug #1802315 reported by Alvaro Uria
This bug affects 2 people
Affects: OpenStack RabbitMQ Server Charm
Status: In Progress
Importance: High
Assigned to: Liam Young

Bug Description

Some info:
* rabbitmq-server 3.6.10-1~cloud0
* cs:rabbitmq-server-79
* source=xenial-queens

Initial status:
* rmq/1, rmq/5, rmq/6
* containers on top of 3 different compute nodes
* the compute node hosting rmq/5 goes down (kernel panic) and needs to be rebooted

Next status (1):
* rmq is partitioned: https://pastebin.canonical.com/p/R8pMSRcTQx/
* rmq/1 is stopped
* rmq/5 is stopped
* rmq/5 is started
* rmq/1 is started

Next status (2):
* rmq looks good: https://pastebin.canonical.com/p/Crxz3XPtNn/
* However, client errors continued and were very similar: https://pastebin.canonical.com/p/BHgQSX5W5K/

In the end, we had to:
* stop all 3 units: /1, then /5, then /6
* start them again: /6, then /5, then /1
* after this, all clients were able to register with RMQ and work as expected

Application configuration is:
"""
  rabbitmq-server:
    bindings:
      ? ''
      : oam-space
      amqp: internal-space
      cluster: internal-space
    charm: cs:rabbitmq-server-79
    num_units: 3
    options:
      min-cluster-size: 3
      queue_thresholds: '[[''\*'', ''\*'', 500, 600]]'
      source: cloud:xenial-queens
    to:
    - lxd:20
    - lxd:19
    - lxd:16
"""

cluster_partition_handling is set to the default value (ignore). Should it be changed to autoheal? Messages would be lost, but that is effectively the same outcome as when the cluster has to be fully stopped and restarted to make it work as expected.
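
For reference, a minimal sketch of how that setting is expressed in rabbitmq.config on 3.6.x (illustrative only, not the charm's actual template code):
"""
# Illustrative sketch: the rabbitmq.config stanza that
# cluster_partition_handling maps to on RabbitMQ 3.6.x.
VALID = {"ignore", "autoheal", "pause_minority"}  # modes discussed here; RabbitMQ supports others too

def render_partition_stanza(strategy: str) -> str:
    if strategy not in VALID:
        raise ValueError("unknown cluster_partition_handling: %s" % strategy)
    return "[{rabbit, [{cluster_partition_handling, %s}]}]." % strategy

print(render_partition_stanza("autoheal"))
# -> [{rabbit, [{cluster_partition_handling, autoheal}]}].
"""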

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Flagging as High, pending validation against master or the 18.11 charm revisions. If we can confirm this there, keep the priority high. If it is resolved, advise a charm upgrade.

Changed in charm-rabbitmq-server:
importance: Undecided → High
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

FWIW, in my tests the "pause_minority" partition handling strategy improved split-brain recovery dramatically versus "ignore".

David Ames (thedac)
Changed in charm-rabbitmq-server:
status: New → Triaged
Revision history for this message
David Ames (thedac) wrote :

I have seen this in the wild and it seems to be a common problem with RabbitMQ itself (google "rabbitmq queue not found 404").

Peter's suggestion is correct. This is already implemented in the charm [0]. There are also other options, such as autoheal [1].

[0] https://github.com/openstack/charm-rabbitmq-server/blob/master/config.yaml#L39
[1] https://www.rabbitmq.com/partitions.html

Changed in charm-rabbitmq-server:
status: Triaged → Invalid
Revision history for this message
Shane Peters (shaner) wrote :

Any thoughts on setting the default cluster-partition-handling option to 'autoheal'? Reading through the above docs, it seems the most logical choice for most use cases.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

AFAICT from the docs [0], pause_minority seems like the safer option with respect to data integrity (autoheal prioritizes continuity of service over data integrity, while pause_minority prioritizes partition tolerance).

FWIW, autoheal works by restarting the nodes on the losing side when a split brain is detected, which implies wiping them as well.

[0] https://www.rabbitmq.com/partitions.html#cp-mode

Revision history for this message
Frode Nordahl (fnordahl) wrote :

While I agree with the benefits of the ``pause_minority`` mode, it does impose requirements on the reliability of the end user's infrastructure and a deeper understanding of the inner workings of RabbitMQ.

The ``autoheal`` mode has an "it just works" appeal to it, which makes it a safer default until the end user is ready to make a conscious choice.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/620949
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=b74a50d30f5d257a1061263426052ca830528b55
Submitter: Zuul
Branch: master

commit b74a50d30f5d257a1061263426052ca830528b55
Author: Shane Peters <email address hidden>
Date: Thu Nov 29 10:47:30 2018 -0500

    Default to autoheal for cluster-partition-handling

    By setting the default to 'autoheal', we can better ensure
    service continuity in most use-cases. With autoheal, the
    'winning' partition will be the one with the most clients
    connected to it and nodes in the losing partition(s) will be
    restarted.

    Change-Id: I0988e1d22e7c97819552b3bf325801632b099a32
    Closes-Bug: 1802315
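
As a rough mental model of that winner selection (a toy sketch with made-up client counts, not RabbitMQ's actual implementation):
"""
# Toy model of autoheal: the partition with the most connected clients
# wins; nodes in the losing partition(s) are restarted.
partitions = {
    ("rmq-1",): 3,             # nodes in partition -> connected clients
    ("rmq-5", "rmq-6"): 11,
}
winner = max(partitions, key=partitions.get)
to_restart = [n for part in partitions if part != winner for n in part]
print(winner, to_restart)      # ('rmq-5', 'rmq-6') ['rmq-1']
"""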

Changed in charm-rabbitmq-server:
status: Invalid → Fix Committed
Changed in charm-rabbitmq-server:
milestone: none → 19.04
assignee: nobody → Shane Peters (shaner)
David Ames (thedac)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Revision history for this message
Liam Young (gnuoy) wrote :

This change was reverted: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/699862
I believe it was due to deploy-time failures, particularly when the target machines were under heavy load.

I have proposed a change which sets cluster-partition-handling to 'ignore' during charm installation and, once clustering is complete, sets it to whatever has been requested via charm config (https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819086). If this change is approved, I will follow it up with another change to switch the default to autoheal.
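
In rough pseudocode, that two-phase behaviour looks like this (hypothetical names, not the actual charm-rabbitmq-server implementation):
"""
# Sketch of the proposed behaviour: force 'ignore' while the cluster is
# still forming, then honour the operator's configured strategy.
def partition_handling_to_render(clustered_with_peers: bool,
                                 configured: str) -> str:
    if not clustered_with_peers:
        # Install/bootstrap phase: 'ignore' avoids nodes pausing or
        # restarting themselves while units are still joining, e.g.
        # under heavy load at deploy time.
        return "ignore"
    return configured          # e.g. 'autoheal' or 'pause_minority'
"""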

Changed in charm-rabbitmq-server:
status: Fix Released → In Progress
assignee: Shane Peters (shaner) → Liam Young (gnuoy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819086
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/ab813a982d2d091e727a0d05d0305c7cfc5681c9
Submitter: "Zuul (22348)"
Branch: master

commit ab813a982d2d091e727a0d05d0305c7cfc5681c9
Author: Liam Young <email address hidden>
Date: Wed Nov 24 10:08:03 2021 +0000

    Use cluster strategy 'ignore' for install

    Use cluster-partition-handling strategy 'ignore' during charm
    installation regardless of the charm config setting. Once the
    leader has checked it is clustered with peers then it sets the
    cluster-partition-handling strategy to be whatever the user set
    in charm config.

    Partial-Bug: 1802315
    Change-Id: Ic03bbe55ea8aab8b285977a5c0f9410b5bbf35c8
