Hundreds of RabbitMQ processes started after SM audit detects rabbit service is disabled

Bug #1896697 reported by Frank Miller
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Don Penney

Bug Description

Brief Description
-----------------
On an AIO-SX system running as a subcloud, SM reported that the rabbitmq service was disabled (reason unknown) and attempted to recover it. The recovery appeared to succeed and the rabbit logs show rabbit starting up again, but SM did not consider the service running and kept trying to recover it. The rabbit startup_log indicates that each subsequent start attempt failed because the address (likely port 5672) was already in use. This eventually led to over 800 rabbitmq instances running, at which point the OOM killer kicked in and ultimately caused a host reboot.
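
For reference, a quick way to spot this condition on the node is to count rabbitmq-related processes and check whether the AMQP port is already bound. This is an illustrative check only; the process name pattern and port 5672 are assumptions based on the description above:

    # Count rabbitmq-related processes; a healthy node should show a small, stable number
    ps -ef | grep -c '[r]abbitmq'

    # Check whether the assumed AMQP port (5672) is already bound by an existing instance
    ss -ltnp | grep 5672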

Severity
--------
Major

Steps to Reproduce
------------------
Unknown. System was just running.

Expected Behavior
------------------
On a rabbit failure, SM should tear down any running rabbitmq instances and then start a single new instance.

Actual Behavior
----------------
SM started 800+ instances of rabbitmq.

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIO-SX subcloud

Branch/Pull Time/Commit
-----------------------
stx4.0

Last Pass
---------
n/a

Timestamp/Logs
--------------

Test Activity
-------------
Normal use

Workaround
----------
lock/unlock controller
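
The workaround can be applied with the standard StarlingX system CLI (illustrative commands; the controller hostname is an assumption for an AIO-SX node):

    # Lock and then unlock the controller so services are restarted under SM control
    system host-lock controller-0
    system host-unlock controller-0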

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/753500

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/753500
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=0d30fad51343267df4441b89c7289fd73e8f247e
Submitter: Zuul
Branch: master

commit 0d30fad51343267df4441b89c7289fd73e8f247e
Author: Don Penney <email address hidden>
Date: Wed Sep 23 00:39:03 2020 -0400

    Update rabbitmq OCF script to protect against failed status

    In rare cases, the rabbitmq status check may return a status of '2',
    which is an indication to the OCF script that the rabbitmq-server is
    not running, but it may be at least partially up. In such a case, SM
    will see this as a failure of the service when it calls the OCF status
    check, but will subsequently fail to relaunch the service due to the
    partial status of the already-running service.

    To help avoid this, this commit updates the OCF script to:
    1. Update the "stop" function to attempt "rabbitmqctl stop" regardless
    of the "status" result. If the service is partially running, this
    should tear it down.
    2. Update the "start" script to call the "stop" function prior to
    attempting to launch the service, in case it is partially running.

    Change-Id: I19842d382dd1ab60b1caade6608f8dbb9257ebbe
    Closes-Bug: 1896697
    Signed-off-by: Don Penney <email address hidden>
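
As a rough illustration of the approach the commit describes (this is not the merged script; function names, launch command, and return handling are simplified assumptions), the stop path always issues "rabbitmqctl stop" and the start path tears down any partially running instance before launching:

    # Hedged sketch of the described OCF behavior, not the actual merged script.
    # Assumes rabbitmqctl and rabbitmq-server are on PATH; OCF return codes are
    # left to the caller.

    rabbit_stop() {
        # Attempt a stop regardless of what the status check reported,
        # so a partially running server is torn down as well.
        rabbitmqctl stop >/dev/null 2>&1
        return 0
    }

    rabbit_start() {
        # Clean up any partially running instance before launching,
        # to avoid "address already in use" failures on port 5672.
        rabbit_stop
        rabbitmq-server -detached
    }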

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - this issue is very rare; only seen once in many years of using rabbitmq. The fix focuses on recovery and limiting the system impact if this is encountered again.

No need to put in stx.4.0 given how rare this is.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0 stx.config