Fuel for OpenStack

rabbitmq node can't be recovered automatically, if epmd service has been hanged

Bug #1479422 reported by Leontii Istomin on 2015-07-29

This bug affects 2 people

Affects             Status        Importance  Assigned to     Milestone
Fuel for OpenStack  Fix Released  High        Olga Gusarenko  Fuel for OpenStack 7.0

Bug Description

Performed a Rally light run from July 28, 22:25:50 to July 29, 00:22:35,
and a Rally full run from July 29, 00:22:35 to July 29, 09:13:16.

After that, the RabbitMQ cluster was broken.

RabbitMQ cannot be started on node-1 after failover (this is the issue this bug tracks). From Pacemaker:
ERROR: node with name "rabbit" already running on "node-1"
The same message appears in /var/log/rabbitmq/startup_log on the node where RabbitMQ fails to start.

We found that the old epmd process is still in the process list:
root@node-1:~# ps -ef | grep epmd
root 14583 9429 0 13:58 pts/26 00:00:00 grep --color=auto epmd
root 23660 1 0 Jul28 ? 00:00:05 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon

To recover the RabbitMQ cluster, we performed the following steps:
1. Stop Murano's RabbitMQ.
2. Kill epmd.
3. Start Murano's RabbitMQ.

After that, Pacemaker successfully starts OpenStack's RabbitMQ.
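
The three steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the service name `rabbitmq-server-murano` is hypothetical (the report does not name the Murano RabbitMQ service), and `run`, `recover_epmd`, and `DRY_RUN` are names introduced here. By default the helper only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround; not taken from the Fuel OCF scripts.

# Hypothetical service name for Murano's RabbitMQ instance; check the real
# name on the controller before use.
MURANO_RABBIT_SERVICE="${MURANO_RABBIT_SERVICE:-rabbitmq-server-murano}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_epmd() {
    run service "$MURANO_RABBIT_SERVICE" stop     # 1. stop Murano's RabbitMQ
    run pkill -x epmd                             # 2. kill the hung epmd
    run service "$MURANO_RABBIT_SERVICE" start    # 3. start Murano's RabbitMQ
}
```

Calling `recover_epmd` prints the three commands; `DRY_RUN=0 recover_epmd` would execute them as root on the affected controller. Pacemaker is then expected to start OpenStack's RabbitMQ on its next attempt, as described above.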

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,DVR,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-98
Controllers:3 Computes+Ceph:21

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-29_10-08-18.tar.xz


Leontii Istomin (listomin) wrote on 2015-07-29:

From Dmitry Mescheryakov, after investigating the issue on the environment:
To sum up: we found that the root cause of the RabbitMQ failure on two nodes was the epmd process being stuck in some incorrect state. Killing it helped: the next time Pacemaker attempted to start RabbitMQ, it succeeded. Adding that procedure to the OCF script before starting RabbitMQ would probably help, but I am hesitant to do so because the OCF scripts are already rather clumsy. So I suggest not doing so unless the issue reoccurs.

Changed in fuel:
status:	New → Incomplete

Georgy Okrokvertskhov (gokrokvertskhov) wrote on 2015-07-30:

Let's wait for another occurrence. If the issue is reproducible, let's move forward with fixing it.

Leontii Istomin (listomin) wrote on 2015-07-31:

Reproduced with the same configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,DVR,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-98
Controllers:3 Computes+Ceph:21

root@node-57:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-57' ...
Error: unable to connect to node 'rabbit@node-57': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-57']

rabbit@node-57:
  * connected to epmd (port 4369) on node-57
  * epmd reports node 'rabbit' running on port 41055
  * can't establish TCP connection, reason: econnrefused (connection refused)
  * suggestion: blocked by firewall?

current node details:
- node name: 'rabbitmqctl6093@node-57'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

You have new mail in /var/mail/root
root@node-57:~# ps aux | grep epmd
root 6294 0.0 0.0 10460 936 pts/25 S+ 10:05 0:00 grep --color=auto epmd
root 17663 0.0 0.0 9540 2500 ? S Jul30 0:07 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon

Interestingly, the issue has not been reproduced without the DVR feature.

Diagnostic snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-31_10-06-31.tar.xz
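
One way to look for the symptom shown in the diagnostics above (epmd reports node 'rabbit' on a port that refuses connections) is sketched below. It is an assumption-laden illustration, not part of any fix: `registered_port`, `port_open`, `check_stale`, and `EPMD_BIN` are names introduced here, and the parsing assumes the `name <node> at port <port>` lines that `epmd -names` prints.

```shell
#!/usr/bin/env bash
# Sketch: detect an epmd registration that no listener backs.

# Path to epmd; on the affected nodes it lived under /usr/lib/erlang/erts-*/bin.
EPMD_BIN="${EPMD_BIN:-epmd}"

# Read `epmd -names`-style output on stdin and print the port registered
# for the node name given as $1 (prints nothing if it is not registered).
registered_port() {
    awk -v name="$1" \
        '$1 == "name" && $2 == name && $3 == "at" && $4 == "port" { print $5 }'
}

# Succeed if something accepts a TCP connection on host $1, port $2.
# Uses bash's /dev/tcp and gives up after 3 seconds.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print a one-line verdict for node name $1 on host $2 (default 127.0.0.1).
check_stale() {
    local node="$1" host="${2:-127.0.0.1}" port
    port="$(timeout 5 "$EPMD_BIN" -names 2>/dev/null | registered_port "$node")"
    if [ -z "$port" ]; then
        echo "not registered"
    elif port_open "$host" "$port"; then
        echo "alive on port $port"
    else
        echo "stale registration on port $port"
    fi
}
```

On a node in the state shown above, `check_stale rabbit` would be expected to report a stale registration, which is the cue to apply the manual workaround from the bug description.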

Changed in fuel:
status:	Incomplete → Confirmed

Oleksiy Molchanov (omolchanov) on 2015-08-03

Changed in fuel:
milestone:	none → 7.0
assignee:	nobody → Fuel Library Team (fuel-library)
importance:	Undecided → High

Bogdan Dobrelya (bogdando) wrote on 2015-08-03:

We must not add "that procedure to OCF script before starting RabbitMQ", because Murano's or any other RabbitMQ instance is outside the Pacemaker resource control plane. If the shared epmd hangs, it should be healed manually. The OCF script cannot and should not fix this.

Changed in fuel:
status:	Confirmed → Triaged

Bogdan Dobrelya (bogdando) wrote on 2015-08-03:

Triaged, as a workaround is described. It should be added to the docs guide.

tags:	added: release-notes

Vladimir Kuklin (vkuklin) on 2015-08-03

tags:	added: docs
Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Fuel Documentation Team (fuel-docs)

Alexander Adamov (aadamov) on 2015-08-12

Changed in fuel:
status:	Triaged → Confirmed

Bogdan Dobrelya (bogdando) on 2015-08-20

tags:	added: rabbitmq
summary:
- rabbitmq can't be started after failing due old epmd service which has
- been hanged
+ rabbitmq node can't be recovered if epmd service has been hanged
summary:
- rabbitmq node can't be recovered if epmd service has been hanged
+ rabbitmq node can't be recovered automatically, if epmd service has been
+ hanged

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/223728

Olga Gusarenko (ogusarenko) on 2015-09-15

tags:	added: release-notes-done rn7-0
	removed: release-notes

Dmitry Mescheryakov (dmitrymex) on 2015-09-16

description:	updated

OpenStack Infra (hudson-openstack) wrote on 2015-09-16: Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/223728
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=ae3691833e9e4e6fbce660275804117701dd74c6
Submitter: Jenkins
Branch: master

commit ae3691833e9e4e6fbce660275804117701dd74c6
Author: OlgaGusarenko <email address hidden>
Date: Tue Sep 15 20:55:52 2015 +0300

[RN 7.0] rabbitmq node can't be recovered automatically

Adds the workaround for LP1479422

Change-Id: I6f85c2c13595ce42dc8ffa27e003609686383da7
Related-Bug: #1479422

Evgeny Konstantinov (evkonstantinov) on 2015-09-16

Changed in fuel:
assignee:	Fuel Documentation Team (fuel-docs) → Olga Gusarenko (ogusarenko)
status:	Confirmed → Fix Released
