Rabbitmq cluster member may fail to start after a cloud reboot
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack RabbitMQ Server Charm |
Triaged
|
Medium
|
Unassigned |
Bug Description
Running Bionic/Ussuri OpenStack cloud with hardware offloading enabled. During reboot test, after all nodes were rebooted, one of the rabbitmq cluster members were failing to start reporting the following error:
BOOT FAILED
===========
Error description:
{could_
Log files (may contain more information):
/<email address hidden>
/<email address hidden>
Error: {could_
The cluster itself was operational. Looking deeper into RabbitMQ entities it turned out that the binding nagios-
I tried to rebuild the failing member mnesia db and readd the member back to the cluster, but it didn't help, most likely the problem was in the db itself. What helped was re-running Nagios NRPE check for the broken unit, from the good unit - it re-recreated the queue and the binding and the rabbimq-server-0 member started succesfully.
tags: | added: cold-start |
Changed in charm-rabbitmq-server: | |
status: | New → Triaged |
importance: | Undecided → Medium |