Rabbit service does not start after rebooting the controller (HA mode)

Bug #1318936 reported by Egor Kotko
Affects                    Status        Importance  Assigned to         Milestone
Fuel for OpenStack         Fix Released  High        Sergey Vasilenko
Fuel for OpenStack 5.0.x   Won't Fix     High        Dmitry Borodaenko

Bug Description

{"build_id": "2014-05-12_11-37-35", "mirantis": "yes", "build_number": "194", "ostf_sha": "cdb075090b752246a9c43db3e918c42f645b5873", "nailgun_sha": "4477ba3a6efc4379a6509386e7a9e2e6ae832041", "production": "docker", "api": "1.0", "fuelmain_sha": "97d7f6d5461db3afc27f58160cf9f6985230d255", "astute_sha": "5813d9b537ba6ac95f668321c682f339aac57e05", "release": "5.0", "fuellib_sha": "ff4e0182a94f9b17e5a02bcc65faaf4452a0ad35"}

Steps to reproduce:

1. Create env: CentOS, Multi-node with HA, Neutron with VLAN, 3 controllers
2. Shut down, then start the primary controller

Actual result:
RabbitMQ server cannot start. Node status is offline.

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: 5.1 → 5.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks like a dup of https://bugs.launchpad.net/fuel/+bug/1288831, which was considered fixed...

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

This is a big RabbitMQ clustering issue.
We can't fix it without moving the RabbitMQ server under Pacemaker control.

Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: New → Confirmed
importance: High → Critical
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This problem belongs to the RabbitMQ clustering mechanism.

Here is the explanation from the RabbitMQ documentation:

When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this doesn't happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards. If the last node to go offline cannot be brought back up, it can be removed from the cluster using the forget_cluster_node command - consult the rabbitmqctl manpage for more information.

As the primary controller is the first node deployed, it takes on the responsibility of being the "primary" node. Additionally, when you have a cluster with three controllers and bring down the primary controller, under some circumstances the two remaining nodes cannot elect a new "primary". In the end we have a classical split-brain.

[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-2']},
 {partitions,[]}]
...done.

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3']},
 {partitions,[]}]
...done.

This situation requires manual steps to bring down one RabbitMQ server and re-join it to the cluster. To avoid such situations there should be an additional layer of logic that controls the cluster. Pacemaker is the best choice here. Alternatively, there could be another MQ that can be controlled either by the client or by clustering software (Pacemaker).
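
For reference, the manual recovery is roughly the following (a sketch, assuming node-3 should re-join node-2; node names are illustrative, and depending on the node's state reset may have to be force_reset):

[root@node-3 ~]# rabbitmqctl stop_app
[root@node-3 ~]# rabbitmqctl reset
[root@node-3 ~]# rabbitmqctl join_cluster rabbit@node-2
[root@node-3 ~]# rabbitmqctl start_app

If the last node to go down (node-1 here) cannot be brought back at all, it can be dropped from the cluster as the documentation suggests:

[root@node-2 ~]# rabbitmqctl forget_cluster_node rabbit@node-1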

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
tags: added: ha
Changed in fuel:
milestone: 5.1 → 5.0
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93956

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Ilyin (idv1985)
Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Sergey Vasilenko (xenolog)
Revision history for this message
Mike Scherbakov (mihgen) wrote :

The fix for this is a large one, so we are deferring this bug to 5.1. However, for 5.0 we would like to get https://review.openstack.org/#/c/94820/ merged.

Changed in fuel:
milestone: 5.0 → 5.1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/94820
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8ec879ddf30b0d47fd1117a7ee5d6a19fa5a62c7
Submitter: Jenkins
Branch: master

commit 8ec879ddf30b0d47fd1117a7ee5d6a19fa5a62c7
Author: Dmitry Ilyin <email address hidden>
Date: Thu May 22 14:54:10 2014 +0400

    Revert: Rewrite RabbitMQ init scripts

    Return to the previous version of init scripts that
    can recreate Mnesia if cluster gets broken.

    Change-Id: Id4801acd53f04743759f83544ce2db373db964ab
    Partial-Bug: 1318936
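
For context, "recreating Mnesia" here means discarding the node's local RabbitMQ database so that the node starts fresh and can be re-clustered. A minimal sketch of the idea, assuming the default Mnesia directory (the init scripts automate this, and paths may differ):

[root@node-1 ~]# service rabbitmq-server stop
[root@node-1 ~]# mv /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.broken
[root@node-1 ~]# service rabbitmq-server start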

Changed in fuel:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/96861

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/96861
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=e3153d1a4cbf6cec552be9d0bc2b4cd542e322f0
Submitter: Jenkins
Branch: stable/4.1

commit e3153d1a4cbf6cec552be9d0bc2b4cd542e322f0
Author: Dmitry Ilyin <email address hidden>
Date: Thu May 22 14:54:10 2014 +0400

    Revert: Rewrite RabbitMQ init scripts

    Return to the previous version of init scripts that
    can recreate Mnesia if cluster gets broken.

    Change-Id: Id4801acd53f04743759f83544ce2db373db964ab
    Partial-Bug: 1318936

Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Not documented in the 4.1.1 Release Notes -- it looks like this is just a regression of something we previously documented. The main issue is already covered in http://docs.mirantis.com/fuel/fuel-4.1/frequently-asked-questions.html#rabbitmq-cluster-restart-issues-following-a-systemwide-power-failure

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/93956
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=aeac878fae643dba18c278df2d336633eff26f39
Submitter: Jenkins
Branch: master

commit aeac878fae643dba18c278df2d336633eff26f39
Author: Sergey Vasilenko <email address hidden>
Date: Fri Feb 28 22:04:15 2014 +0400

    Rabbitmq ocf master/slave (WORK IN PROGRESS)

    Blueprint: rabbitmq-cluster-controlled-by-pacemaker
    Closes-bug: #1318936
    Change-Id: Ieab7156fee2b70b32dbf5a2852627495cf1b650e
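
For context, moving RabbitMQ under Pacemaker control means wrapping the OCF resource agent in a master/slave set so that Pacemaker, not the init scripts, decides which node assembles the cluster first. A rough sketch in crm shell syntax (resource names, provider, and timeouts are illustrative, not the exact fuel-library configuration):

primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
  op monitor interval="30" timeout="60" \
  op monitor interval="27" role="Master" timeout="60" \
  op start timeout="120" op stop timeout="120"
ms ms_rabbitmq-server p_rabbitmq-server \
  meta notify="true" master-max="1" master-node-max="1" ordered="false" interleave="false"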

Changed in fuel:
status: In Progress → Fix Committed
tags: added: to-be-covered-by-tests
Mike Scherbakov (mihgen)
tags: added: release-notes
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

I've reproduced this bug on my vCenter machine with the configuration described below:

api: '1.0'
astute_sha: bc60b7d027ab244039f48c505ac52ab8eb0a990c
auth_required: true
build_id: 2014-09-01_00-01-17
build_number: '491'
feature_groups:
- mirantis
fuellib_sha: 2cfa83119ae90b13a5bac6a844bdadfaf5aeb13f
fuelmain_sha: 109812be3425408dd7be192b5debf109cb1edd4c
nailgun_sha: d25ed02948a8be773e2bd87cfe583ef7be866bb2
ostf_sha: 4dcd99cc4bfa19f52d4b87ed321eb84ff03844da
production: docker
release: '5.1'

on the vCenter machine 172.18.170.88:
1. Create a new environment (CentOS, simple mode)
     Network: Nova Network Flat DHCP

     Settings for vCenter creation:
                    VCENTER_IP='172.16.0.254'
                    VCENTER_USERNAME='<email address hidden>'
                    VCENTER_PASSWORD='Qwer!1234'
                    VCENTER_CLUSTERS='Cluster1'

2. Add 2 nodes with roles: 1 controller and 1 cinder (as storage type - VMDK)
3. Deploy the environment
4. Verify network connectivity
5. Run OSTF

Error and traceback:
<179>Sep 2 12:46:01 node-2 nova-api 2014-09-02 12:46:01.130 19317 ERROR oslo.messaging._drivers.impl_rabbit [req-684f1371-1d98-47a4-a437-4a860f7c60d2 ] Failed to publish message to topic 'notifications.info': [Errno 32] Broken pipe
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 648, in ensure
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 753, in _publish
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 420, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit super(NotifyPublisher, self).__init__(conf, channel, topic, **kwargs)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 396, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in __init__
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 423, in reconnect
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._drivers.impl_rabbit super(NotifyPublisher, self).reconnect(channel)
2014-09-02 12:46:01.130 19317 TRACE oslo.messaging._dri...


Tatyana Dubyk (tdubyk)
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Tatyana, I do not see information on whether you are deploying a 5.0.2 or 5.1 environment. Also, there is no info on whether the RabbitMQ cluster is healthy or not. Please provide `rabbitmqctl cluster_status` output from each controller. Also, this bug is related to HA mode, and you are posting info about simple mode. Please create a separate bug with all the corresponding information.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Tatyana Dubyk (tdubyk) wrote :

release: '5.1'
OK, I'll create a new bug. Thanks for your remark.

Changed in fuel:
status: Fix Committed → Fix Released