[pcs][rabbitmq] The rabbitmq pid file can be removed without killing the rabbitmq processes; as a result, RabbitMQ fails to start
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Confirmed | High | Dmitry Mescheryakov |
8.0.x | Confirmed | High | Dmitry Mescheryakov |
Future | Confirmed | High | Dmitry Mescheryakov |
Mitaka | Confirmed | High | Dmitry Mescheryakov |
Bug Description
Short Story:
During RabbitMQ shutdown via pacemaker, the pid file of the RabbitMQ service can be removed while the RabbitMQ processes themselves are not stopped. As a result, pacemaker fails to start or restart the RabbitMQ nodes, because beam processes are still running but their pid files no longer exist.
The root cause is the TERM signal that we use to kill RabbitMQ processes. We need to use -9 (KILL) here instead:
https:/
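As a rough illustration of the proposed change (this is not the actual resource agent code), the sketch below sends signal 9 to the process recorded in a pid file and removes the pid file only after the process is confirmed gone. The function name kill_and_cleanup, the 10-second wait and the pid file argument are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the RabbitMQ OCF script.
# Kill the process recorded in a pid file with SIGKILL and remove the
# pid file only once the process no longer exists.
kill_and_cleanup() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 0            # no pid file: nothing to do
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then      # process is still alive
        kill -9 "$pid" 2>/dev/null
        # Poll for up to ~10 seconds (assumed limit) until it disappears.
        i=0
        while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 10 ]; do
            sleep 1
            i=$((i + 1))
        done
    fi
    if kill -0 "$pid" 2>/dev/null; then
        return 1                             # still alive: keep the pid file
    fi
    rm -f "$pidfile"
}
```

The point of the ordering is that the pid file outlives the process, never the other way round, so a failed stop stays visible to whatever manages the service.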
Steps To Reproduce:
1. Deploy OpenStack cluster with 3 controllers and 2 compute nodes
2. Create snapshot of environment
3. Revert snapshot
4. Stop all RabbitMQ services with pacemaker:
pcs resource disable p_rabbitmq-server
5. Wait until pacemaker shows that all rabbitmq services are down
6. Run on all controller nodes:
ps axu | grep beam
Observed Result:
Several beam processes are still running. Pacemaker fails to start the RabbitMQ cluster because there are no pid files for the processes that are still running.
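The check in step 6 could be automated along these lines. This is a minimal, Linux-only sketch that reads /proc; the function name find_untracked and the pid file path in the usage line are hypothetical and should be replaced with the values used on the controller.

```shell
#!/bin/sh
# Hypothetical diagnostic, Linux-only (reads /proc/<pid>/comm).
# Print the PID of every process whose name contains $1 and whose PID is
# not the one recorded in pid file $2. A missing pid file tracks nothing,
# so every matching process is reported.
find_untracked() {
    name="$1"
    pidfile="$2"
    tracked=""
    if [ -f "$pidfile" ]; then
        tracked=$(cat "$pidfile")
    fi
    for comm in /proc/[0-9]*/comm; do
        [ -r "$comm" ] || continue
        pid=${comm#/proc/}
        pid=${pid%/comm}
        # The process may exit between the glob and the read; skip it then.
        read -r pname 2>/dev/null < "$comm" || continue
        case "$pname" in
            *"$name"*) [ "$pid" = "$tracked" ] || echo "$pid" ;;
        esac
    done
}

# Example (placeholder path): find_untracked beam /path/to/rabbitmq.pid
```

Any PID printed after pacemaker reports the resource as stopped indicates the state described in this report.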
no longer affects: | fuel/future |
Changed in fuel: | |
milestone: | 8.0 → 9.0 |
status: | Confirmed → New |
Priority is High because this blocks development of new automated destructive tests by the MOS QA team.
We applied a dirty workaround for the issue, but it is easy to reproduce on production environments as well, and customers will spend a lot of time trying to recover RabbitMQ, which fails many times during start because of the issue. It took 1-1.5 hours to restore the RabbitMQ cluster with the issue present; it would take about 5 minutes if we prevented such situations.
The bug applies to both MOS 8.0 and MOS 9.0.