[18.08, Xenial-Queens] RMQ malfunctions although cluster status looks good and was expected to self-recover from crash

Bug #1802315 reported by Alvaro Uria
This bug affects 2 people
Affects: OpenStack RabbitMQ Server Charm
Status: In Progress
Importance: High
Assigned to: Liam Young

Bug Description

Some info:
* rabbitmq-server 3.6.10-1~cloud0
* cs:rabbitmq-server-79
* source=xenial-queens

Initial status:
* rmq/1, rmq/5, rmq/6
* containers on top of 3 different compute nodes
* the compute node hosting rmq/5 goes down (kernel panic) and needs to be rebooted

Next status (1):
* rmq is partitioned: https://pastebin.canonical.com/p/R8pMSRcTQx/
* rmq/1 is stopped
* rmq/5 is stopped
* rmq/5 is started
* rmq/1 is started

Next status (2):
* rmq looks good: https://pastebin.canonical.com/p/Crxz3XPtNn/
* However, client errors continued and were very similar: https://pastebin.canonical.com/p/BHgQSX5W5K/

In the end, we had to:
* stop all 3 units: /1, then /5, then /6
* start them again: /6, then /5, then /1
* after this, all clients were able to register with RMQ and work as expected

Application configuration is:
"""
  rabbitmq-server:
    bindings:
      ? ''
      : oam-space
      amqp: internal-space
      cluster: internal-space
    charm: cs:rabbitmq-server-79
    num_units: 3
    options:
      min-cluster-size: 3
      queue_thresholds: '[[''\*'', ''\*'', 500, 600]]'
      source: cloud:xenial-queens
    to:
    - lxd:20
    - lxd:19
    - lxd:16
"""

cluster_partition_handling is set to the default value (ignore). Should it be changed to autoheal? Messages would be lost, but that is effectively the same outcome as when the cluster has to be fully stopped and restarted to make it work as expected.
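
For reference, a minimal sketch of how that setting is expressed in rabbitmq.config on 3.6.x (illustrative only, not the charm's actual template code):
"""
# Illustrative sketch: the rabbitmq.config stanza that
# cluster_partition_handling maps to on RabbitMQ 3.6.x.
VALID = {"ignore", "autoheal", "pause_minority"}  # modes discussed here; RabbitMQ supports others too

def render_partition_stanza(strategy: str) -> str:
    if strategy not in VALID:
        raise ValueError("unknown cluster_partition_handling: %s" % strategy)
    return "[{rabbit, [{cluster_partition_handling, %s}]}]." % strategy

print(render_partition_stanza("autoheal"))
# -> [{rabbit, [{cluster_partition_handling, autoheal}]}].
"""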

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Flagging as High, pending validation against master or the 18.11 charm revisions. If we can confirm this there, keep the priority high. If it is resolved, advise a charm upgrade.

Changed in charm-rabbitmq-server:
importance: Undecided → High
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

FWIW, in my tests the "pause_minority" partition handling strategy improved split-brain recovery dramatically versus "ignore".

David Ames (thedac)
Changed in charm-rabbitmq-server:
status: New → Triaged
Revision history for this message
David Ames (thedac) wrote :

I have seen this in the wild and it seems to be a common problem with RabbitMQ itself (google "rabbitmq queue not found 404").

Peter's suggestion is correct. This is already implemented in the charm [0]. There are also other options, such as autoheal [1].

[0] https://github.com/openstack/charm-rabbitmq-server/blob/master/config.yaml#L39
[1] https://www.rabbitmq.com/partitions.html

Changed in charm-rabbitmq-server:
status: Triaged → Invalid
Revision history for this message
Shane Peters (shaner) wrote :

Any thoughts on setting the default cluster-partition-handling option to 'autoheal'? Reading through the above docs, it seems the most logical choice for most use cases.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

AFAICT from the docs [0], pause_minority seems like the safer option with respect to data integrity (autoheal prioritizes continuity of service over data integrity, while pause_minority prioritizes partition tolerance).

FWIW, autoheal works by restarting the nodes on the losing side when a split brain is detected, which implies wiping them as well.

[0] https://www.rabbitmq.com/partitions.html#cp-mode

Revision history for this message
Frode Nordahl (fnordahl) wrote :

While I agree with the benefits of the ``pause_minority`` mode, it does impose requirements on the reliability of the end user's infrastructure and a deeper understanding of the inner workings of RabbitMQ.

The ``autoheal`` mode has an "it just works" appeal to it, which makes it a safer default until the end user is ready to make a conscious choice.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/620949
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=b74a50d30f5d257a1061263426052ca830528b55
Submitter: Zuul
Branch: master

commit b74a50d30f5d257a1061263426052ca830528b55
Author: Shane Peters <email address hidden>
Date: Thu Nov 29 10:47:30 2018 -0500

    Default to autoheal for cluster-partition-handling

    By setting the default to 'autoheal', we can better ensure
    service continuity in most use-cases. With autoheal, the
    'winning' partition will be the one with the most clients
    connected to it and nodes in the losing partition(s) will be
    restarted.

    Change-Id: I0988e1d22e7c97819552b3bf325801632b099a32
    Closes-Bug: 1802315
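
As a rough mental model of that winner selection (a toy sketch with made-up client counts, not RabbitMQ's actual implementation):
"""
# Toy model of autoheal: the partition with the most connected clients
# wins; nodes in the losing partition(s) are restarted.
partitions = {
    ("rmq-1",): 3,             # nodes in partition -> connected clients
    ("rmq-5", "rmq-6"): 11,
}
winner = max(partitions, key=partitions.get)
to_restart = [n for part in partitions if part != winner for n in part]
print(winner, to_restart)      # ('rmq-5', 'rmq-6') ['rmq-1']
"""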

Changed in charm-rabbitmq-server:
status: Invalid → Fix Committed
Changed in charm-rabbitmq-server:
milestone: none → 19.04
assignee: nobody → Shane Peters (shaner)
David Ames (thedac)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Revision history for this message
Liam Young (gnuoy) wrote :

This change was reverted: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/699862
I believe it was due to deploy-time failures, particularly when the target machines were under heavy load.

I have proposed a change which sets cluster-partition-handling to 'ignore' during charm installation and, once clustering is complete, sets it to whatever has been requested via charm config (https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819086). If this change is approved, I will follow it up with another change to switch the default to autoheal.
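
In rough pseudocode, that two-phase behaviour looks like this (hypothetical names, not the actual charm-rabbitmq-server implementation):
"""
# Sketch of the proposed behaviour: force 'ignore' while the cluster is
# still forming, then honour the operator's configured strategy.
def partition_handling_to_render(clustered_with_peers: bool,
                                 configured: str) -> str:
    if not clustered_with_peers:
        # Install/bootstrap phase: 'ignore' avoids nodes pausing or
        # restarting themselves while units are still joining, e.g.
        # under heavy load at deploy time.
        return "ignore"
    return configured          # e.g. 'autoheal' or 'pause_minority'
"""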

Changed in charm-rabbitmq-server:
status: Fix Released → In Progress
assignee: Shane Peters (shaner) → Liam Young (gnuoy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819086
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/ab813a982d2d091e727a0d05d0305c7cfc5681c9
Submitter: "Zuul (22348)"
Branch: master

commit ab813a982d2d091e727a0d05d0305c7cfc5681c9
Author: Liam Young <email address hidden>
Date: Wed Nov 24 10:08:03 2021 +0000

    Use cluster strategy 'ignore' for install

    Use cluster-partition-handling strategy 'ignore' during charm
    installation regardless of the charm config setting. Once the
    leader has checked it is clustered with peers then it sets the
    cluster-partition-handling strategy to be whatever the user set
    in charm config.

    Partial-Bug: 1802315
    Change-Id: Ic03bbe55ea8aab8b285977a5c0f9410b5bbf35c8
