[pcs][rabbitmq] The RabbitMQ pid file can be removed without killing the RabbitMQ processes; as a result, RabbitMQ fails to start

Bug #1529808 reported by Timur Nurlygayanov
This bug affects 1 person
Affects             Status     Importance  Assigned to          Milestone
Fuel for OpenStack  Confirmed  High        Dmitry Mescheryakov
8.0.x               Confirmed  High        Dmitry Mescheryakov
Future              Confirmed  High        Dmitry Mescheryakov
Mitaka              Confirmed  High        Dmitry Mescheryakov

Bug Description

Short Story:
During a RabbitMQ shutdown through Pacemaker, the RabbitMQ pid file can be removed while the RabbitMQ processes keep running. As a result, Pacemaker fails to start or restart the RabbitMQ nodes, because the beam processes are still alive but have no pid files.

The root cause is the TERM signal that we use to kill the RabbitMQ processes. We need to use -9 (SIGKILL) instead here:
https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L631

Steps To Reproduce:
1. Deploy an OpenStack cluster with 3 controllers and 2 compute nodes
2. Create a snapshot of the environment
3. Revert the snapshot
4. Stop all RabbitMQ services with Pacemaker:
pcs resource disable p_rabbitmq-server
5. Wait until Pacemaker shows that all RabbitMQ services are down
6. Run on all controller nodes:
ps axu | grep beam

Observed Result:
Several beam processes are still running. Pacemaker fails to start the RabbitMQ cluster because there are no pid files for the processes that are already running.
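
A minimal sketch of how this state could be detected on a controller: given the PIDs of running beam processes, report those that are alive but not recorded in the pid file. The function name `orphaned_pids` and the pid file path in the usage comment are assumptions for illustration; neither comes from the OCF script.

```shell
# orphaned_pids PIDFILE PID...
# Prints each PID that is alive but is not the one recorded in PIDFILE.
# If PIDFILE is missing, every live PID is reported.
# Note: kill -0 only probes for existence and needs permission to signal
# the target, so run this as root to probe processes of the rabbitmq user.
orphaned_pids() {
    op_pidfile=$1
    shift
    op_recorded=""
    if [ -f "$op_pidfile" ]; then
        op_recorded=$(cat "$op_pidfile")
    fi
    for op_pid in "$@"; do
        if kill -0 "$op_pid" 2>/dev/null && [ "$op_pid" != "$op_recorded" ]; then
            echo "$op_pid"
        fi
    done
}

# Example (hypothetical pid file path):
#   orphaned_pids /path/to/rabbitmq.pid $(pgrep -f beam)
```

Any PID printed after the resource has been disabled points to a leftover process that would have to be stopped before Pacemaker can start RabbitMQ again.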

Timur Nurlygayanov (tnurlygayanov) wrote :

Priority is High because this blocks development of new automated destructive tests for the MOS QA team.
We applied a dirty workaround, but the issue is also easy to reproduce in production environments, and customers will spend a lot of time trying to recover RabbitMQ, which fails to start repeatedly because of it. It took 1-1.5 hours to restore the RabbitMQ cluster with this issue present; it would take about 5 minutes if we prevented such situations.

The bug applies to both MOS 8.0 and MOS 9.0.

Changed in fuel:
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
milestone: none → 8.0
status: New → Confirmed
tags: added: pacemaker rabbitmq
tags: added: blocker-for-qa
Timur Nurlygayanov (tnurlygayanov) wrote :

Logs from pacemaker/rabbitmq on one controller ^^^

no longer affects: fuel/future
Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Davanum Srinivas (DIMS) (dims-v) wrote :

Dima, Timur, any updates on this one?

Dmitry Mescheryakov (dmitrymex) wrote :

The problem appears because the OCF script removes the pid file without making sure that the rabbit process has died. This change from Bogdan takes care of that: https://review.openstack.org/#/c/262519/1 . With it, if the rabbit process persists, the script kills it with SIGKILL.
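
The ordering this comment describes (confirm the process is gone, escalate to SIGKILL if it persists, and only then remove the pid file) can be sketched as follows. This is an illustrative sketch, not the code from the referenced review; the function name `stop_and_cleanup`, its arguments, and the 30-second default timeout are assumptions.

```shell
# stop_and_cleanup PID PIDFILE [TIMEOUT_SECONDS]
# Sends TERM, waits up to TIMEOUT_SECONDS (assumed default: 30), escalates
# to KILL if the process is still alive, and removes PIDFILE only once the
# process is confirmed gone. Returns 1 (pid file kept) otherwise.
stop_and_cleanup() {
    sc_pid=$1
    sc_pidfile=$2
    sc_timeout=${3:-30}

    # Ask the process to terminate gracefully first.
    kill -TERM "$sc_pid" 2>/dev/null || true

    # Poll once per second until the process exits or the timeout expires.
    sc_waited=0
    while kill -0 "$sc_pid" 2>/dev/null && [ "$sc_waited" -lt "$sc_timeout" ]; do
        sleep 1
        sc_waited=$((sc_waited + 1))
    done

    # Escalate: SIGKILL cannot be caught or ignored by the process.
    if kill -0 "$sc_pid" 2>/dev/null; then
        kill -KILL "$sc_pid" 2>/dev/null || true
        sleep 1
    fi

    # Keep the pid file if the process somehow survived.
    if kill -0 "$sc_pid" 2>/dev/null; then
        return 1
    fi
    rm -f "$sc_pidfile"
    return 0
}
```

The point of the design is that the pid file outlives the process rather than the other way around, so a later start attempt never finds running beam processes without a matching pid file.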
