Hundreds of RabbitMQ processes started after SM audit detects rabbit service is disabled
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Don Penney | |
Bug Description
Brief Description
-----------------
On an AIO-SX system running as a subcloud, SM reported that the rabbitmq service was disabled (reason unknown) and attempted to recover it. The recovery succeeded, and the rabbit logs show rabbit starting up again. However, SM did not consider the rabbit service to be running and kept trying to recover it. According to the rabbit startup_log, each of these later start attempts failed because the address (probably port 5672) was already in use. This eventually led to over 800 instances of rabbitmq running, at which point the OOM killer kicked in and ultimately caused a host reboot.
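As an illustration of the "address already in use" condition described above, the following is a minimal sketch of how one might check whether something is already listening on the AMQP port before starting another instance. The helper name `port_in_use` and the default host are assumptions made for this example; this is not how SM or the rabbitmq startup scripts perform the check.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener already accepts connections on host:port.

    Hypothetical helper for illustration only; not part of SM or RabbitMQ.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on a successful connection, an errno otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 5672 is the port the report suspects was already bound.
    if port_in_use(5672):
        print("port 5672 is already in use; a new instance would fail to bind")
    else:
        print("port 5672 is free")
```

A check like this only shows that some process holds the port; it does not identify which process, so it would need to be combined with a process lookup before deciding what to stop.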
Severity
--------
Major
Steps to Reproduce
------------------
Unknown. System was just running.
Expected Behavior
------------------
On a rabbit failure, SM should stop any running instance of rabbit and then start exactly one new instance of rabbitmq.
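The expected sequence can be sketched as follows. This is only an illustration of the "clean up, then start one instance" ordering, not SM's actual recovery code and not the fix that was merged. The function `recover_service` and its three callback parameters are hypothetical names introduced for this example.

```python
from typing import Callable, Iterable


def recover_service(
    find_pids: Callable[[], Iterable[int]],
    kill_pid: Callable[[int], None],
    start_service: Callable[[], None],
) -> int:
    """Stop every stale instance, then start exactly one new instance.

    All three callbacks are hypothetical hooks supplied by the caller.
    Returns the number of stale instances that were stopped.
    """
    stale = list(find_pids())
    for pid in stale:
        kill_pid(pid)

    # Refuse to start a new instance while old ones are still alive;
    # otherwise repeated recovery attempts could pile up processes.
    remaining = list(find_pids())
    if remaining:
        raise RuntimeError(f"stale instances still running: {remaining}")

    start_service()
    return len(stale)
```

Passing the process lookup, kill, and start actions in as callbacks keeps the ordering logic separate from platform-specific details, and makes the "never start while stale instances remain" rule easy to exercise with a fake process table.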
Actual Behavior
----------------
SM started 800+ instances of rabbitmq.
Reproducibility
---------------
Seen once
System Configuration
-------
AIO-SX subcloud
Branch/Pull Time/Commit
-------
stx4.0
Last Pass
---------
n/a
Timestamp/Logs
--------------
Test Activity
-------------
Normal use
Workaround
----------
lock/unlock controller
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
Fix proposed to branch: master
Review: https://review.opendev.org/753500