Rabbit service does not start after rebooting the controller (HA mode)

Bug #1318936 reported by Egor Kotko
Affects                    Status        Importance  Assigned to         Milestone
Fuel for OpenStack         Fix Released  High        Sergey Vasilenko
Fuel for OpenStack 5.0.x   Won't Fix     High        Dmitry Borodaenko

Bug Description

{"build_id": "2014-05-12_11-37-35", "mirantis": "yes", "build_number": "194", "ostf_sha": "cdb075090b752246a9c43db3e918c42f645b5873", "nailgun_sha": "4477ba3a6efc4379a6509386e7a9e2e6ae832041", "production": "docker", "api": "1.0", "fuelmain_sha": "97d7f6d5461db3afc27f58160cf9f6985230d255", "astute_sha": "5813d9b537ba6ac95f668321c682f339aac57e05", "release": "5.0", "fuellib_sha": "ff4e0182a94f9b17e5a02bcc65faaf4452a0ad35"}

Steps to reproduce:

1. Create env: CentOS, Multi-node with HA, Neutron with VLAN, 3 controllers
2. Shut down, then start the primary controller

Actual result:
RabbitMQ server cannot start. Node status is offline.

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: 5.1 → 5.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks like a dup of https://bugs.launchpad.net/fuel/+bug/1288831, which was considered fixed...

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

This is a big RabbitMQ clustering issue.
We can't fix it without moving the RabbitMQ server under Pacemaker control.

Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: New → Confirmed
importance: High → Critical
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This problem belongs to the RabbitMQ clustering mechanism.

Here is the explanation from the RabbitMQ documentation:

When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this doesn't happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards. If the last node to go offline cannot be brought back up, it can be removed from the cluster using the forget_cluster_node command - consult the rabbitmqctl manpage for more information.

As the primary controller is the first node deployed, it takes on the responsibility of being the "primary" node. Additionally, when you have a cluster with three controllers and bring down the primary controller, under some circumstances the two remaining nodes cannot elect a new "primary". In the end we have a classical split-brain.

[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-2']},
 {partitions,[]}]
...done.

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3']},
 {partitions,[]}]
...done.

This situation requires manual steps to bring down one RabbitMQ server and re-join it to the cluster. To avoid such situations there should be an additional layer of logic that controls the cluster. Pacemaker is the best choice here. Alternatively, there could be another MQ that can be controlled either by the client or by clustering software (Pacemaker).
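
For reference, the manual recovery is roughly the following (a sketch, assuming node-3 should re-join node-2; node names are illustrative, and depending on the node's state reset may have to be force_reset):

[root@node-3 ~]# rabbitmqctl stop_app
[root@node-3 ~]# rabbitmqctl reset
[root@node-3 ~]# rabbitmqctl join_cluster rabbit@node-2
[root@node-3 ~]# rabbitmqctl start_app

If the last node to go down (node-1 here) cannot be brought back at all, it can be dropped from the cluster as the documentation suggests:

[root@node-2 ~]# rabbitmqctl forget_cluster_node rabbit@node-1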

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
tags: added: ha
Changed in fuel:
milestone: 5.1 → 5.0
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93956

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Ilyin (idv1985)
Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Sergey Vasilenko (xenolog)
Revision history for this message
Mike Scherbakov (mihgen) wrote :

The fix for this is a large one, so we are deferring this bug to 5.1. However, for 5.0 we would like to get https://review.openstack.org/#/c/94820/ merged.

Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/94820
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8ec879ddf30b0d47fd1117a7ee5d6a19fa5a62c7
Submitter: Jenkins
Branch: master

commit 8ec879ddf30b0d47fd1117a7ee5d6a19fa5a62c7
Author: Dmitry Ilyin <email address hidden>
Date: Thu May 22 14:54:10 2014 +0400

    Revert: Rewrite RabbitMQ init scripts

    Return to the previous version of init scripts that
    can recreate Mnesia if cluster gets broken.

    Change-Id: Id4801acd53f04743759f83544ce2db373db964ab
    Partial-Bug: 1318936
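
For context, "recreating Mnesia" here means discarding the node's local RabbitMQ database so that the node starts fresh and can be re-clustered. A minimal sketch of the idea, assuming the default Mnesia directory (the init scripts automate this, and paths may differ):

[root@node-1 ~]# service rabbitmq-server stop
[root@node-1 ~]# mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.broken
[root@node-1 ~]# service rabbitmq-server start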

Changed in fuel:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/96861

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/96861
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=e3153d1a4cbf6cec552be9d0bc2b4cd542e322f0
Submitter: Jenkins
Branch: stable/4.1

commit e3153d1a4cbf6cec552be9d0bc2b4cd542e322f0
Author: Dmitry Ilyin <email address hidden>
Date: Thu May 22 14:54:10 2014 +0400

    Revert: Rewrite RabbitMQ init scripts

    Return to the previous version of init scripts that
    can recreate Mnesia if cluster gets broken.

    Change-Id: Id4801acd53f04743759f83544ce2db373db964ab
    Partial-Bug: 1318936

Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Not documented in the 4.1.1 Release Notes -- it looks like this is just a regression of something we previously documented. The main issue is already covered in http://docs.mirantis.com/fuel/fuel-4.1/frequently-asked-questions.html#rabbitmq-cluster-restart-issues-following-a-systemwide-power-failure

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/93956
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=aeac878fae643dba18c278df2d336633eff26f39
Submitter: Jenkins
Branch: master

commit aeac878fae643dba18c278df2d336633eff26f39
Author: Sergey Vasilenko <email address hidden>
Date: Fri Feb 28 22:04:15 2014 +0400

    Rabbitmq ocf master/slave (WORK IN PROGRESS)

    Blueprint: rabbitmq-cluster-controlled-by-pacemaker
    Closes-bug: #1318936
    Change-Id: Ieab7156fee2b70b32dbf5a2852627495cf1b650e
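
For context, moving RabbitMQ under Pacemaker control means wrapping the OCF resource agent in a master/slave set so that Pacemaker, not the init scripts, decides which node assembles the cluster first. A rough sketch in crm shell syntax (resource names, provider, and timeouts are illustrative, not the exact fuel-library configuration):

primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
  op monitor interval="30" timeout="60" \
  op monitor interval="27" role="Master" timeout="60" \
  op start timeout="120" op stop timeout="120"
ms ms_rabbitmq-server p_rabbitmq-server \
  meta notify="true" master-max="1" master-node-max="1" ordered="false" interleave="false"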

Changed in fuel:
status: In Progress → Fix Committed
tags: added: to-be-covered-by-tests
Mike Scherbakov (mihgen)
tags: added: release-notes
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

I've reproduced this bug on my vCenter machine with the configuration described below:

api: '1.0'
astute_sha: bc60b7d027ab244039f48c505ac52ab8eb0a990c
auth_required: true
build_id: 2014-09-01_00-01-17
build_number: '491'
feature_groups:
- mirantis
fuellib_sha: 2cfa83119ae90b13a5bac6a844bdadfaf5aeb13f
fuelmain_sha: 109812be3425408dd7be192b5debf109cb1edd4c
nailgun_sha: d25ed02948a8be773e2bd87cfe583ef7be866bb2
ostf_sha: 4dcd99cc4bfa19f52d4b87ed321eb84ff03844da
production: docker
release: '5.1'

on the vCenter machine 172.18.170.88:
1. Create a new environment (CentOS, simple mode)
     Network: Nova Network Flat DHCP

     Settings for vCenter creation:
                    VCENTER_IP='172.16.0.254'
                    VCENTER_USERNAME='<email address hidden>'
                    VCENTER_PASSWORD='Qwer!1234'
                    VCENTER_CLUSTERS='Cluster1'

2. Add 2 nodes with roles: 1 controller and 1 cinder (as storage type - VMDK)
3. Deploy the environment
4. Verify network connectivity
5. Run OSTF

Error and traceback:
<179>Sep 2 12:46:01 node-2 nova-api 2014-09-02 12:46:01.130 19317 ERROR oslo.messaging._drivers.impl_rabbit [req-684f1371-1d98-47a4-a437-4a860f7c60d2 ] Failed to publish message to topic 'notifications.info': [Errno 32] Broken pipe
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 648, in ensure
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 753, in _publish
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 420, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit super(NotifyPublisher, self).__init__(conf, channel, topic, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 396, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 423, in reconnect
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit super(NotifyPublisher, self).reconnect(channel)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._dri...


Tatyana Dubyk (tdubyk)
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Tatyana, I do not see information on whether you are deploying a 5.0.2 or 5.1 environment. Also, there is no info on whether the RabbitMQ cluster is healthy or not. Please provide `rabbitmqctl cluster_status` output from each controller. Also, this bug is related to HA mode, and you are posting info about simple mode. Please create a separate bug with all the corresponding information.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

release: '5.1'
OK, I'll create a new bug. Thanks for your remark.

Changed in fuel:
status: Fix Committed → Fix Released