rabbitmq node can't be recovered automatically, if epmd service has been hanged
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Released | High | Olga Gusarenko |
Bug Description
Ran Rally light from 28 22:25:50 to 29 00:22:35
and Rally full from 29 00:22:35 to 29 09:13:16.
After that, the RabbitMQ cluster was broken.
RabbitMQ can't be started on node-1 after failover (this is the issue tracked by this bug). From Pacemaker:
ERROR: node with name "rabbit" already running on "node-1
The same message can be seen in /var/log/
Found that the old epmd process is still in the process list:
root@node-1:~# ps -ef | grep epmd
root 14583 9429 0 13:58 pts/26 00:00:00 grep --color=auto epmd
root 23660 1 0 Jul28 ? 00:00:05 /usr/lib/
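To repeat this check on other controllers, the listing above can be narrowed to the epmd process itself. The following is a minimal sketch, not part of the original report: `epmd_from_ps` is a hypothetical helper name, and it assumes the standard `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```shell
#!/bin/sh
# epmd_from_ps: hypothetical helper, not from the original report.
# Reads `ps -ef` output on stdin and prints "PID STIME" for each epmd process.
# It compares the basename of the command (field 8) with "epmd", so the
# `grep ... epmd` line seen in the listing above is not reported.
epmd_from_ps() {
    awk '{
        n = split($8, parts, "/")      # break the command path into components
        if (parts[n] == "epmd")        # last component is the binary name
            print $2, $5               # PID and start time
    }'
}
```

On a live node this would be called as `ps -ef | epmd_from_ps`. A start time well before the last RabbitMQ restart, as with the `Jul28` entry above, suggests the epmd process survived the failover.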
To recover the RabbitMQ cluster, the following steps were performed:
1. stop murano rabbitmq
2. kill epmd
3. start murano rabbitmq
After that, Pacemaker successfully starts OpenStack's RabbitMQ.
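The three steps above can be sketched as a small shell helper. This is an illustration under stated assumptions, not the exact commands used on the environment. The report does not give the Murano RabbitMQ service name, so it is passed in through a placeholder variable `MURANO_RABBIT_SERVICE`. The use of `service` and of `pkill -x epmd` (which sends SIGTERM by default) is also an assumption, as are the helper names `run` and `recover_epmd`.

```shell
#!/bin/sh
# Sketch of the manual recovery steps; names below are illustrative only.
# MURANO_RABBIT_SERVICE: placeholder, the report does not name the service.
# DRY_RUN=1 prints each command instead of executing it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_epmd() {
    # Abort with a message if the service name was not provided.
    svc="${MURANO_RABBIT_SERVICE:?set MURANO_RABBIT_SERVICE to the Murano RabbitMQ service name}"
    run service "$svc" stop     # 1. stop murano rabbitmq
    run pkill -x epmd           # 2. kill epmd (assumed: SIGTERM to processes named exactly "epmd")
    run service "$svc" start    # 3. start murano rabbitmq
}
```

Running it first with `DRY_RUN=1` shows the commands without touching the node. Note that killing epmd affects every Erlang node registered with it on that host, which is why the Murano RabbitMQ instance is stopped first and started again afterwards.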
Cluster configuration:
Baremetal, Controllers: 3, Computes+Ceph: 21
api: '1.0'
astute_sha: 34e0493afa22999
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb
fuel-library_sha: 39c3162ee2e2ff6
fuel-ostf_sha: 94a483c8aba639b
fuelmain_sha: 921918a3bd3d278
nailgun_sha: d5c19f6afc66b5e
openstack_version: 2015.1.0-7.0
production: docker
python-
release: '7.0'
Diagnostic Snapshot: http://
Changed in fuel:
milestone: none → 7.0
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → High
tags: added: docs
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Documentation Team (fuel-docs)
Changed in fuel:
status: Triaged → Confirmed
tags: added: rabbitmq
summary:
- rabbitmq can't be started after failing due old epmd service which has been hanged
+ rabbitmq node can't be recovered if epmd service has been hanged
summary:
- rabbitmq node can't be recovered if epmd service has been hanged
+ rabbitmq node can't be recovered automatically, if epmd service has been hanged
tags: added: release-notes-done rn7-0; removed: release-notes
description: updated
Changed in fuel:
assignee: Fuel Documentation Team (fuel-docs) → Olga Gusarenko (ogusarenko)
status: Confirmed → Fix Released
From Dmitry Mescheryakov, after investigating the issue on the environment:
To sum up: we found that the root cause of the RabbitMQ failure on two nodes was the epmd process being stuck in a somewhat incorrect state. Killing it helped: the next time Pacemaker attempted to start RabbitMQ, it succeeded. Adding that procedure to the OCF script before starting RabbitMQ would probably help, but I am hesitant to do so, as the OCF scripts are already rather clumsy. So I suggest not doing so unless the issue reoccurs.