rabbitmq node can't be recovered automatically, if epmd service has been hanged

Bug #1479422 reported by Leontii Istomin
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Olga Gusarenko
Milestone: 7.0

Bug Description

Ran Rally light from July 28 22:25:50 to July 29 00:22:35,
and Rally full from July 29 00:22:35 to July 29 09:13:16.

After that, the RabbitMQ cluster was broken.

RabbitMQ can't be started on node-1 after failover (this is the issue tracked by this bug). From Pacemaker:
ERROR: node with name "rabbit" already running on "node-1"
The same message could be seen in /var/log/rabbitmq/startup_log on the node where RabbitMQ fails to start.

Found that the old epmd process never leaves the process list:
root@node-1:~# ps -ef | grep epmd
root 14583 9429 0 13:58 pts/26 00:00:00 grep --color=auto epmd
root 23660 1 0 Jul28 ? 00:00:05 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
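
The listing above also matches the grep command itself. A minimal sketch of a filter that keeps only real epmd processes is shown below. The function name epmd_pids is hypothetical, and the sketch assumes the usual `ps -ef` column layout, where the PID is field 2 and the executable path is field 8.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print the PID of
# every process whose executable (field 8) has the base name "epmd".
# The `grep ... epmd` line is skipped because its executable is "grep".
epmd_pids() {
    awk '{ n = split($8, parts, "/"); if (parts[n] == "epmd") print $2 }'
}

# Usage on a live node (not executed here):
#   ps -ef | epmd_pids
```

An epmd PID whose start time predates the last RabbitMQ failover is a candidate for the stale daemon described in this report.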

To recover the RabbitMQ cluster, the following steps were performed:
1. Stop Murano's RabbitMQ.
2. Kill epmd.
3. Start Murano's RabbitMQ.

After that, Pacemaker successfully starts OpenStack’s RabbitMQ.
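
The three recovery steps could be scripted roughly as follows. This is only a sketch: the report does not name the service that runs Murano's RabbitMQ, so MURANO_RABBIT_SERVICE and its default value are assumed placeholders, and the run and recover_epmd helpers are hypothetical. By default the script only prints what it would do.

```shell
#!/bin/sh
# Assumed placeholder: set this to whatever service manages Murano's
# RabbitMQ instance on the controller. The default is NOT taken from the report.
MURANO_RABBIT_SERVICE="${MURANO_RABBIT_SERVICE:-murano-rabbitmq}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_epmd() {
    run service "$MURANO_RABBIT_SERVICE" stop    # 1. stop Murano's RabbitMQ
    run pkill -x epmd                            # 2. kill the hung epmd
    run service "$MURANO_RABBIT_SERVICE" start   # 3. start Murano's RabbitMQ
}

# Usage (as root, on the affected controller):
#   recover_epmd              # print the commands only
#   DRY_RUN=0 recover_epmd    # actually run them
```

The dry-run default is deliberate: killing epmd affects every Erlang node on the host, so it is worth reviewing the commands before running them.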

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,DVR,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-98
Controllers:3 Computes+Ceph:21

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-29_10-08-18.tar.xz

Leontii Istomin (listomin) wrote :

From Dmitry Mescheryakov, after investigating the issue on the environment:
To sum up: we found that the root cause of the RabbitMQ failure on two nodes was the epmd process being stuck in an incorrect state. Killing it helped: the next time Pacemaker attempted to start RabbitMQ, it succeeded. Adding that procedure to the OCF script before starting RabbitMQ would probably help, but I am hesitant to do so, as the OCF scripts are already rather clumsy. So I suggest not doing so unless the issue reoccurs.

Changed in fuel:
status: New → Incomplete
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

Let's wait for another occurrence. If the issue is reproducible, then let's move forward with fixing it.

Leontii Istomin (listomin) wrote :

Reproduced with the same configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,DVR,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-98
Controllers:3 Computes+Ceph:21

root@node-57:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-57' ...
Error: unable to connect to node 'rabbit@node-57': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-57']

rabbit@node-57:
  * connected to epmd (port 4369) on node-57
  * epmd reports node 'rabbit' running on port 41055
  * can't establish TCP connection, reason: econnrefused (connection refused)
  * suggestion: blocked by firewall?

current node details:
- node name: 'rabbitmqctl6093@node-57'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

You have new mail in /var/mail/root
root@node-57:~# ps aux | grep epmd
root 6294 0.0 0.0 10460 936 pts/25 S+ 10:05 0:00 grep --color=auto epmd
root 17663 0.0 0.0 9540 2500 ? S Jul30 0:07 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
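
The diagnostics above show the symptom of a stale registration: epmd still reports node 'rabbit' on a port that refuses connections. A rough sketch of how that check could be automated follows. The function names epmd_reported_port and port_is_listening are hypothetical, and the second one assumes the `ss` utility from iproute2 is installed.

```shell
#!/bin/sh
# Hypothetical helper: read rabbitmqctl diagnostics on stdin and print the
# port that epmd reports for the node, e.g. 41055 for the output above.
epmd_reported_port() {
    sed -n "s/.*epmd reports node '[^']*' running on port \([0-9][0-9]*\).*/\1/p"
}

# Hypothetical helper: succeed if some TCP socket is listening on port $1.
# Assumes `ss -ltn` prints the local address:port in column 4.
port_is_listening() {
    ss -ltn | awk -v p="$1" '$4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

# Usage on a live node (not executed here):
#   port=$(rabbitmqctl cluster_status 2>&1 | epmd_reported_port)
#   if [ -n "$port" ] && ! port_is_listening "$port"; then
#       echo "epmd reports port $port but nothing listens on it"
#   fi
```

If the reported port has no listener, epmd is holding a registration for a node that is no longer running, which matches the "already running" error seen at startup.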

Interestingly, the issue has not been reproduced without the DVR feature.

Diagnostic snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-31_10-06-31.tar.xz

Changed in fuel:
status: Incomplete → Confirmed
Changed in fuel:
milestone: none → 7.0
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → High
Bogdan Dobrelya (bogdando) wrote :

We must not add "that procedure to OCF script before starting RabbitMQ", as Murano's or any other instance of RabbitMQ is outside the Pacemaker resource control plane. If the shared epmd hangs, it should be healed manually. The OCF script cannot and should not fix this.

Changed in fuel:
status: Confirmed → Triaged
Bogdan Dobrelya (bogdando) wrote :

Triaged, as a workaround is described. It should be added to the documentation guide.

tags: added: release-notes
tags: added: docs
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Documentation Team (fuel-docs)
Changed in fuel:
status: Triaged → Confirmed
tags: added: rabbitmq
summary: - rabbitmq can't be started after failing due old epmd service which has
- been hanged
+ rabbitmq node can't be recovered if epmd service has been hanged
summary: - rabbitmq node can't be recovered if epmd service has been hanged
+ rabbitmq node can't be recovered automatically, if epmd service has been
+ hanged
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/223728

tags: added: release-notes-done rn7-0
removed: release-notes
description: updated
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/223728
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=ae3691833e9e4e6fbce660275804117701dd74c6
Submitter: Jenkins
Branch: master

commit ae3691833e9e4e6fbce660275804117701dd74c6
Author: OlgaGusarenko <email address hidden>
Date: Tue Sep 15 20:55:52 2015 +0300

    [RN 7.0] rabbitmq node can't be recovered automatically

    Adds the workaround for LP1479422

    Change-Id: I6f85c2c13595ce42dc8ffa27e003609686383da7
    Related-Bug: #1479422

Changed in fuel:
assignee: Fuel Documentation Team (fuel-docs) → Olga Gusarenko (ogusarenko)
status: Confirmed → Fix Released