Fuel for OpenStack

[pcs][rabbitmq] The rabbitmq pid file can be removed without killing the rabbitmq processes; as a result, RabbitMQ fails to start

Bug #1529808 reported by Timur Nurlygayanov on 2015-12-29

This bug report is a duplicate of: Bug #1529897: Rabbit OCF stop action shall not fail.

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Confirmed	High	Dmitry Mescheryakov	Fuel for OpenStack 9.0
8.0.x	Confirmed	High	Dmitry Mescheryakov	Fuel for OpenStack 8.0
Future	Confirmed	High	Dmitry Mescheryakov	Fuel for OpenStack next
Mitaka	Confirmed	High	Dmitry Mescheryakov	Fuel for OpenStack 9.0

Bug Description

Short Story:
When RabbitMQ is shut down through pacemaker, the pid files of the rabbitmq services can be removed while the services themselves keep running. As a result, pacemaker fails to start or restart the RabbitMQ nodes, because the beam processes still exist but their pid files do not.

The root of the issue is the TERM signal that we use to kill the RabbitMQ processes. We need to use -9 (SIGKILL) here instead:
https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L631

Steps To Reproduce:
1. Deploy OpenStack cluster with 3 controllers and 2 compute nodes
2. Create snapshot of environment
3. Revert snapshot
4. Stop all RabbitMQ services with pacemaker:
pcs resource disable p_rabbitmq-server
5. Wait until pacemaker shows that all rabbitmq services are down
6. Run on all controller nodes:
ps axu | grep beam

Observed Result:
Several beam processes are still running. Pacemaker fails to start the RabbitMQ cluster because there are no pid files for the processes that are already running.
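
The condition described above (a matching process is running but is not recorded in a pid file) can be checked with a small script. This is only a sketch: the function name orphaned_pids is hypothetical, it assumes pgrep from procps is installed, and the pid file location is passed in as an argument because the report does not state it.

```shell
#!/bin/sh
# Hypothetical helper, not part of fuel-library.
# Usage: orphaned_pids PATTERN PIDFILE
# Prints every PID whose command line matches PATTERN and which is not
# the PID recorded in PIDFILE (or every match if PIDFILE is missing).
orphaned_pids() {
    pattern=$1
    pidfile=$2
    recorded=""

    # Read the recorded PID only if the pid file exists.
    if [ -f "$pidfile" ]; then
        recorded=$(cat "$pidfile")
    fi

    # pgrep -f matches against the full command line.
    for pid in $(pgrep -f "$pattern"); do
        [ "$pid" = "$recorded" ] || echo "$pid"
    done
}
```

For example, `orphaned_pids beam /path/to/rabbitmq.pid` (the path is a placeholder) run on a controller after step 5 would be expected to print the PIDs of the leftover beam processes.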

Timur Nurlygayanov (tnurlygayanov) wrote on 2015-12-29:

Priority is High because this bug blocks development of new automated destructive tests for the MOS QA team.
We applied a dirty workaround, but the issue is easy to reproduce on production environments as well, and customers will spend a lot of time trying to recover RabbitMQ, which will fail many times during start because of it. It took 1-1.5 hours to restore the RabbitMQ cluster with this issue; it would take about 5 minutes if we prevented such situations.

The bug applies to both MOS 8.0 and MOS 9.0.

Changed in fuel:
assignee:	nobody → Dmitry Mescheryakov (dmitrymex)
milestone:	none → 8.0
status:	New → Confirmed
tags:	added: pacemaker rabbitmq
tags:	added: blocker-for-qa

Timur Nurlygayanov (tnurlygayanov) wrote on 2015-12-29:

lrmd.log (935.6 KiB, text/plain)

Logs from pacemaker/rabbitmq on one controller ^^^

Fuel Devops McRobotson (fuel-devops-robot) on 2015-12-30

no longer affects:	fuel/future
Changed in fuel:
milestone:	8.0 → 9.0
status:	Confirmed → New

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-01-05:

Dima, Timur, any updates on this one?

Dmitry Mescheryakov (dmitrymex) wrote on 2016-01-07:

The problem appears because the OCF script removes the pid file without making sure that the rabbit process has died. This change from Bogdan takes care of that: https://review.openstack.org/#/c/262519/1 . With it, if the rabbit process persists, the script kills it with SIGKILL.
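
To illustrate the idea (this is not the code from the review above, only a minimal sketch under my own assumptions): send TERM first, wait for the process to exit, escalate to KILL if it persists, and remove the pid file only once the process is confirmed gone. The function name stop_and_cleanup, its arguments, and the default timeout are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not the actual fuel-library OCF script.
# Usage: stop_and_cleanup PIDFILE [TIMEOUT_SECONDS]
# Returns 0 if the process is gone (pid file removed), 1 otherwise.
stop_and_cleanup() {
    pidfile=$1
    timeout=${2:-10}

    # Without a pid file there is nothing to stop or clean up.
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")

    # Ask the process to terminate gracefully.
    kill -TERM "$pid" 2>/dev/null || true

    # Poll once per second until it exits or the timeout expires.
    waited=0
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Escalate to SIGKILL if the process ignored or outlived TERM.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
        sleep 1
    fi

    # Keep the pid file if the process is somehow still alive,
    # so that a later start or stop can still find it.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi

    rm -f "$pidfile"
    return 0
}
```

The key point is the ordering: the pid file is deleted last, and only after `kill -0` reports that the process no longer exists, so a surviving process never ends up without a pid file.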