Bug #1285449 “Pacemaker migration of management vip causes Rabbi...” : Bugs : Fuel for OpenStack

Mike Scherbakov (mihgen) on 2014-02-27

Changed in fuel:
milestone:	5.0 → 4.1
importance:	High → Critical

Dmitry Borodaenko (angdraug) on 2014-02-27

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
status:	New → Confirmed

Dmitry Borodaenko (angdraug) wrote on 2014-02-28: Re: Moving management vip breaks rabbitmq sessions

#1

You don't even need to shut down a node to reproduce this problem. All you have to do is move the management VIP to a different node with the following command (replace node-1 with the hostname of a controller node that doesn't currently hold the VIP):

crm_resource -r vip__management_old --move --node node-1

After that, most OpenStack services become unable to put messages on RabbitMQ queues, take messages off the queues, or acknowledge messages.

summary:

- Unable to create cinder volume after controller failover
+ Moving management vip breaks rabbitmq sessions

Dmitry Borodaenko (angdraug) wrote on 2014-02-28:

#2

One of the typical scenarios is nova-compute keeping its RabbitMQ connection and sending service_update messages to nova-conductor, but nova-conductor not receiving them and marking the nova-compute services as down. When that happens, restarting nova-conductor fixes it. The same happens with cinder-volume and cinder-scheduler, which is likely what causes the symptom in the original bug description.

Reconfiguring a service to bypass HAProxy and talk directly to the management IP of one of the controller nodes makes that service unaffected by this problem. Other services that are still tied to HAProxy (e.g. keystone or neutron) remain affected, so some functionality still breaks.

Dmitry Borodaenko (angdraug) wrote on 2014-02-28:

#3

One way to see what's going on with RabbitMQ sessions is to use "ls /proc/<pid>/fd" and "lsof|grep <socket id>" to identify the file descriptors of the RabbitMQ connections, and then "strace -s 2048 -p <pid> |egrep '\((<fd1>|<fd2>)' " to see the send and recv calls to the socket.
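
A minimal sketch of the first and last steps of that procedure, assuming a Linux /proc filesystem. The helper names (parse_socket_inode, socket_fds, strace_filter) are hypothetical and not part of any OpenStack tool. Matching a socket inode to a RabbitMQ connection still has to be done with lsof as described above.

```python
import os
import re

# /proc/<pid>/fd entries for sockets are symlinks that read "socket:[<inode>]".
SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def parse_socket_inode(link_target):
    """Return the socket inode from a /proc/<pid>/fd link target, or None."""
    match = SOCKET_LINK.match(link_target)
    return int(match.group(1)) if match else None


def socket_fds(pid):
    """Map fd number -> socket inode for every socket the process holds open."""
    fd_dir = "/proc/%d/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were listing
        inode = parse_socket_inode(target)
        if inode is not None:
            result[int(name)] = inode
    return result


def strace_filter(fds):
    """Build the egrep pattern that keeps only syscalls on the given fds."""
    return r"\((%s)" % "|".join(str(fd) for fd in sorted(fds))
```

The pattern returned by strace_filter can be passed to egrep in place of the hand-written '\((<fd1>|<fd2>)' expression.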

Dmitry Borodaenko (angdraug) wrote on 2014-02-28:

#4

The theories that we have eliminated so far:

1) It's not MySQL connection management: you can see affected processes successfully running MySQL queries over HAProxy-managed connections after moving the VIP.

2) It's not RabbitMQ itself: RabbitMQ connections that bypass HAProxy and go directly to port 5673 on a controller node are not affected by VIP move.

Matthew Mosesohn (raytrac3r) wrote on 2014-02-28:

#5

In order to speed up diagnosis, could you get a snapshot posted with an env in debug mode attached to the bug?

Dmitry Borodaenko (angdraug) wrote on 2014-02-28:

#6

You don't really need a diagnostic bundle; this is 100% reproducible: deploy any HA cluster and move the management VIP. That's it.

Still, I'm attaching a snapshot from my environment that has nodes 1-4 in the configuration described in the bug, and nodes 5-8 with CentOS and nova-network with the same problem. Neither had debug enabled during deploy, but the Ubuntu environment is the one where we've done most of our investigations so it will have debug logs for most services.

Dmitry Borodaenko (angdraug) wrote on 2014-02-28:

#7

fuel-snapshot-2014-02-28_05-30-43.tgz (22.5 MiB, application/x-tar)

Bogdan Dobrelya (bogdando) wrote on 2014-02-28:

#8

> Reconfiguring services to bypass HAProxy and talk directly to the management IP of one of the controller nodes makes that
> service unaffected by this problem

From fuel-dev list https://lists.launchpad.net/fuel-dev/msg00558.html
It will not help if you shut down the controller. The problem is that you
have hung AMQP sessions, which the kombu driver does not seem to handle
correctly.

So, does it mean we should 1) submit an OpenStack bug about Kombu session expiration, and 2) restart all OpenStack services (only those residing on the controllers) whenever the VIP moves?
Would that be enough to work around the RabbitMQ sessions issue until it is fixed upstream (I mean the Kombu part)?

Matthew Mosesohn (raytrac3r) wrote on 2014-02-28:

#9

reproducing snapshot with debug mode (2.2 MiB, application/x-tar)

Sergey Vasilenko (xenolog) wrote on 2014-02-28:

#10

I think we can do the following:
* apply a dedicated VIP for AMQP
* make a master/slave Pacemaker OCF script for the AMQP service
* create a co-location constraint for the AMQP-VIP and AMQP-master resources

Then either remove AMQP handling from HAProxy, or redirect AMQP requests from HAProxy to the AMQP-VIP.

Matthew Mosesohn (raytrac3r) wrote on 2014-02-28:

#11

Output of rabbitmqctl report from the new master after switching mgmt vip: http://paste.openstack.org/show/LAjPWCzlHKbG4XYFo6Ii/

Matthew Mosesohn (raytrac3r) wrote on 2014-02-28:

#12

I believe un-haproxying RabbitMQ will simplify things. If RabbitMQ already does this task, we can avoid dead end scenarios.

OpenStack Infra (hudson-openstack) wrote on 2014-02-28: Fix proposed to fuel-library (master)

#13

Fix proposed to branch: master
Review: https://review.openstack.org/77265

Changed in fuel:
assignee:	Ryan Moe (rmoe) → Sergey Vasilenko (xenolog)
status:	Confirmed → In Progress

Dmitry Borodaenko (angdraug) wrote on 2014-03-01: Re: Moving management vip breaks rabbitmq sessions

#14

Configuring all OpenStack services to connect directly to rabbitmq on all controller hosts as recommended here:

http://docs.openstack.org/high-availability-guide/content/_configure_openstack_services_to_use_rabbitmq.html

does solve the broken RabbitMQ connections problem, but still leaves the services on the node that has lost the management VIP with hung MySQL connections. Enabling the flush_routes option for the management and public VIPs as described here:

http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/ref.ocfagent.IPaddr2.html

makes hung MySQL connections less likely to occur, but does not eliminate them altogether.

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)

Dmitry Borodaenko (angdraug) wrote on 2014-03-01:

#15

This patch should help reduce the 10-minute kernel default TCP connection read timeout to something meaningful:
http://sourceforge.net/p/mysql-python/feature-requests/19/

Dmitry Borodaenko (angdraug) wrote on 2014-03-01:

#16

Recently released python-mysqldb 1.2.5 has finally introduced tcp read timeout support.

OpenStack Infra (hudson-openstack) on 2014-03-01

Changed in fuel:
assignee:	Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)

Dmitry Borodaenko (angdraug) wrote on 2014-03-02:

#17

My current plan for the solution is outlined here:
https://lists.launchpad.net/fuel-dev/msg00566.html

Dmitry Borodaenko (angdraug) wrote on 2014-03-02:

#18

fuel-library-rabbitmq.jpg (485.1 KiB, image/jpeg)

I'm attaching a high-level overview of the puppet classes that are involved in configuring the rabbit_hosts configuration parameter for OpenStack components in fuel-library, as a helper for the shotgun surgery I have to perform to implement the RabbitMQ configuration change.

OpenStack Infra (hudson-openstack) wrote on 2014-03-02: Fix proposed to fuel-library (master)

#19

Fix proposed to branch: master
Review: https://review.openstack.org/77409

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)

OpenStack Infra (hudson-openstack) wrote on 2014-03-02:

#20

Fix proposed to branch: master
Review: https://review.openstack.org/77439

Changed in fuel:
assignee:	Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)

OpenStack Infra (hudson-openstack) on 2014-03-02

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)

OpenStack Infra (hudson-openstack) wrote on 2014-03-03:

#21

Fix proposed to branch: master
Review: https://review.openstack.org/77521

Dmitry Borodaenko (angdraug) on 2014-03-03

Changed in fuel:
assignee:	Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)

OpenStack Infra (hudson-openstack) on 2014-03-03

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)

Matthew Mosesohn (raytrac3r) wrote on 2014-03-03: Re: Moving management vip breaks rabbitmq sessions

#22

I've confirmed the patch successfully works around a failure of any rabbitmq server or changing virtual IP for both controllers and compute nodes. Waiting on OSCI for the python-mysql fixes. Glance and keystone refuse to reconnect even after 15 minutes, so this patch will help on that end.

OpenStack Infra (hudson-openstack) on 2014-03-03

Changed in fuel:
assignee:	Matthew Mosesohn (raytrac3r) → Sergey Vasilenko (xenolog)

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Related fix proposed to fuel-main (master)

#23

Related fix proposed to branch: master
Review: https://review.openstack.org/77633

OpenStack Infra (hudson-openstack) wrote on 2014-03-03:

#24

Related fix proposed to branch: master
Review: https://review.openstack.org/77635

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Related fix proposed to fuel-main (stable/4.1)

#25

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77636

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Related fix merged to fuel-main (stable/4.1)

#26

Reviewed: https://review.openstack.org/77582
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=46df79e683ec695dce89e4b833b31443a5817f05
Submitter: Jenkins
Branch: stable/4.1

commit 46df79e683ec695dce89e4b833b31443a5817f05
Author: Dmitry Burmistrov <email address hidden>
Date: Mon Mar 3 17:05:00 2014 +0400

Remove version of MySQL-python from requirements-rpm.txt

Related-Bug: #1285449

Change-Id: I26d2a2089229240fed93cbdadc7023ac980bd958

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Related fix merged to fuel-main (master)

#27

Reviewed: https://review.openstack.org/77635
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=d4b3f9d7b4608a58651a1f5744643f21d868346c
Submitter: Jenkins
Branch: master

commit d4b3f9d7b4608a58651a1f5744643f21d868346c
Author: Roman Vyalov <email address hidden>
Date: Mon Mar 3 20:27:43 2014 +0400

Remove version of python-sqlalchemy from requirements-rpm.txt

Related-Bug: #1285449

Change-Id: Ic6145e3d47326ba5feab3717cc5ca6becbc62a17

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Related fix merged to fuel-main (stable/4.1)

#28

Reviewed: https://review.openstack.org/77636
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=1385b7e212d30f60d97db84ae3cc3f3c587fc811
Submitter: Jenkins
Branch: stable/4.1

commit 1385b7e212d30f60d97db84ae3cc3f3c587fc811
Author: Roman Vyalov <email address hidden>
Date: Mon Mar 3 20:27:43 2014 +0400

Remove version of python-sqlalchemy from requirements-rpm.txt

Related-Bug: #1285449

Change-Id: Ic6145e3d47326ba5feab3717cc5ca6becbc62a17

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Fix proposed to fuel-library (master)

#29

Fix proposed to branch: master
Review: https://review.openstack.org/77643

OpenStack Infra (hudson-openstack) on 2014-03-03

Changed in fuel:
assignee:	Matthew Mosesohn (raytrac3r) → Sergey Vasilenko (xenolog)

OpenStack Infra (hudson-openstack) on 2014-03-03

Changed in fuel:
assignee:	Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)

OpenStack Infra (hudson-openstack) wrote on 2014-03-03: Fix proposed to fuel-library (stable/4.1)

#30

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77657

OpenStack Infra (hudson-openstack) wrote on 2014-03-03:

#31

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77658

OpenStack Infra (hudson-openstack) on 2014-03-03

Changed in fuel:
assignee:	Matthew Mosesohn (raytrac3r) → Dmitry Borodaenko (dborodaenko)

OpenStack Infra (hudson-openstack) wrote on 2014-03-04: Related fix proposed to fuel-main (master)

#32

Related fix proposed to branch: master
Review: https://review.openstack.org/77825

OpenStack Infra (hudson-openstack) wrote on 2014-03-04: Related fix merged to fuel-main (master)

#33

Reviewed: https://review.openstack.org/77825
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=f55f7edd468378578b597992546113b58748065d
Submitter: Jenkins
Branch: master

commit f55f7edd468378578b597992546113b58748065d
Author: Dmitry Burmistrov <email address hidden>
Date: Mon Mar 3 17:05:00 2014 +0400

Remove version of MySQL-python from requirements-rpm.txt

Related-Bug: #1285449

Change-Id: I26d2a2089229240fed93cbdadc7023ac980bd958

OpenStack Infra (hudson-openstack) wrote on 2014-03-04: Fix merged to fuel-library (stable/4.1)

#34

Reviewed: https://review.openstack.org/77658
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=136d342a9d8df888ae44e981e9dff22827206cd5
Submitter: Jenkins
Branch: stable/4.1

commit 136d342a9d8df888ae44e981e9dff22827206cd5
Author: Dmitry Borodaenko <email address hidden>
Date: Sun Mar 2 00:00:49 2014 -0800

Set rabbit_hosts to a list of controllers

    Connecting to RabbitMQ via HAProxy results in hung AMQP sessions after
    controller failover or VIP move. To work around this problem, OpenStack
    services have to be configured to connect directly to RabbitMQ services
    on controllers nodes via rabbit_hosts variable, which makes impl_kombu
    cycle through the listed hosts when attempting to reconnect. On
    controller nodes, preference is given to the local RabbitMQ service on
    the same controller node.

    AMQP client configuration for all OpenStack components is centralized in
    osnailyfacter and propagated through Puppet classes in a consistent
    manner. The exception is Neutron that uses its own
    sanitize_neutron_config() function to parse configuration from
    osnailyfacter. That function was extended to generate AMQP hosts list
    consistently with the above, and fixed to correctly sanitize instances
    of Array subclasses.

    No changes were made to Murano manifests since it has its own RPC
    implementation that is inconsistent with the rest of the OpenStack.

    In addition to AMQP client configuration, crm configuration is modified
    to set flush_routes flag and resource stickiness for all virtual IP
    addresses. The flush_routes flag reduces the probability of connections
    becoming hung on nodes that a VIP moves away from, and resource
    stickiness prevent unnecessary movement of VIP resources between
    controllers.

Change-Id: Ib839032b2f1aa820b4afc64b0a9badf13414d488
Partial-bug: #1285449
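
The host ordering described in this commit message can be sketched as follows. This is an illustration only, not the Puppet code from the change: the function name is hypothetical, and the default port 5673 is taken from comment #4 and may not match the port used after the fix.

```python
def rabbit_hosts(controllers, local_node=None, port=5673):
    """Return a rabbit_hosts value listing every controller, local one first.

    controllers: hostnames or management IPs of all controller nodes.
    local_node:  this node's own entry if it is a controller, else None.
    port:        assumed RabbitMQ port (5673 per comment #4 in this thread).
    """
    ordered = list(controllers)
    if local_node in ordered:
        # On a controller, prefer the RabbitMQ instance on the same node.
        ordered.remove(local_node)
        ordered.insert(0, local_node)
    return ",".join("%s:%d" % (host, port) for host in ordered)


# A compute node lists the controllers in their given order.
print(rabbit_hosts(["node-1", "node-2", "node-3"]))
# A controller lists itself first.
print(rabbit_hosts(["node-1", "node-2", "node-3"], local_node="node-2"))
```

With several entries in the list, impl_kombu can move on to the next host when a reconnect attempt fails, which is the behaviour the commit message relies on.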

OpenStack Infra (hudson-openstack) wrote on 2014-03-04:

#35

Reviewed: https://review.openstack.org/77657
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b8f8476345c74798515d72cd3e05bb6d1df5d8d5
Submitter: Jenkins
Branch: stable/4.1

commit b8f8476345c74798515d72cd3e05bb6d1df5d8d5
Author: Matthew Mosesohn <email address hidden>
Date: Mon Mar 3 20:45:55 2014 +0400

Add read_timeout and infinite retries to MySQL conns

    read_timeout=60 is an explicit parameter added
    for mysqldb to bail connections if no data is
    received for 60s. Depends on MySQLdb 1.2.5
    max_retries=-1 for all connections so that
    APIs don't give up and die

Change-Id: Ib4a2cdcc287cbc53c18f7500d96f82d8099e0f35
Partial-Bug: #1285449
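
A rough sketch of what adding read_timeout to a MySQL connection string could look like, assuming the timeout is carried as a query parameter of an SQLAlchemy-style connection URL. The thread does not show the actual configuration, so the URL layout, the helper name and the example credentials are all assumptions. max_retries is a separate service option and is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_read_timeout(connection_url, seconds=60):
    """Return connection_url with read_timeout=<seconds> in its query string.

    Existing query parameters are kept; an existing read_timeout is replaced.
    """
    parts = urlsplit(connection_url)
    query = dict(parse_qsl(parts.query))
    query["read_timeout"] = str(seconds)
    # Sort the parameters so the resulting URL is deterministic.
    new_query = urlencode(sorted(query.items()))
    return urlunsplit(parts._replace(query=new_query))


# Hypothetical connection URL, for illustration only.
print(with_read_timeout("mysql://nova:secret@192.168.0.2/nova"))
```

As the commit message notes, the timeout only takes effect with MySQLdb 1.2.5 or later.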

OpenStack Infra (hudson-openstack) wrote on 2014-03-04: Fix merged to fuel-library (master)

#36

Reviewed: https://review.openstack.org/77409
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2301fc0e637c8739d12dbb582db306bfa44de817
Submitter: Jenkins
Branch: master

commit 2301fc0e637c8739d12dbb582db306bfa44de817
Author: Dmitry Borodaenko <email address hidden>
Date: Sun Mar 2 00:00:49 2014 -0800

Set rabbit_hosts to a list of controllers

    Connecting to RabbitMQ via HAProxy results in hung AMQP sessions after
    controller failover or VIP move. To work around this problem, OpenStack
    services have to be configured to connect directly to RabbitMQ services
    on controllers nodes via rabbit_hosts variable, which makes impl_kombu
    cycle through the listed hosts when attempting to reconnect. On
    controller nodes, preference is given to the local RabbitMQ service on
    the same controller node.

    AMQP client configuration for all OpenStack components is centralized in
    osnailyfacter and propagated through Puppet classes in a consistent
    manner. The exception is Neutron that uses its own
    sanitize_neutron_config() function to parse configuration from
    osnailyfacter. That function was extended to generate AMQP hosts list
    consistently with the above, and fixed to correctly sanitize instances
    of Array subclasses.

    No changes were made to Murano manifests since it has its own RPC
    implementation that is inconsistent with the rest of the OpenStack.

    In addition to AMQP client configuration, crm configuration is modified
    to set flush_routes flag and resource stickiness for all virtual IP
    addresses. The flush_routes flag reduces the probability of connections
    becoming hung on nodes that a VIP moves away from, and resource
    stickiness prevent unnecessary movement of VIP resources between
    controllers.

Change-Id: Ib839032b2f1aa820b4afc64b0a9badf13414d488
Partial-bug: #1285449

OpenStack Infra (hudson-openstack) wrote on 2014-03-04:

#37

Reviewed: https://review.openstack.org/77643
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0a24882dbdc32c8605c7f712d1b6f2780615a15f
Submitter: Jenkins
Branch: master

commit 0a24882dbdc32c8605c7f712d1b6f2780615a15f
Author: Matthew Mosesohn <email address hidden>
Date: Mon Mar 3 20:45:55 2014 +0400

Add read_timeout and infinite retries to MySQL conns

    read_timeout=60 is an explicit parameter added
    for mysqldb to bail connections if no data is
    received for 60s. Depends on MySQLdb 1.2.5
    max_retries=-1 for all connections so that
    APIs don't give up and die

Change-Id: Ib4a2cdcc287cbc53c18f7500d96f82d8099e0f35
Partial-Bug: #1285449

Vladimir Kuklin (vkuklin) on 2014-03-05

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) wrote on 2014-03-06: Re: Moving management vip breaks rabbitmq sessions

#38

Note: the patch makes RabbitMQ listen on the interface connected to the public network as well. With nova-network this is a security issue. (Neutron namespaces provide isolation, so the patch should be OK in that case?)

Bogdan Dobrelya (bogdando) wrote on 2014-03-06:

#39

TODO: make sure the unused /etc/haproxy/conf.d/100-rabbitmq.cfg is removed later.

Anastasia Palkina (apalkina) on 2014-03-14

tags:

added: in progress

Dmitry Borodaenko (angdraug) on 2014-03-14

tags:

removed: in progress

Anastasia Palkina (apalkina) on 2014-03-17

tags:

added: in progress

Anastasia Palkina (apalkina) on 2014-03-17

tags:

removed: in progress

Anastasia Palkina (apalkina) wrote on 2014-03-17:

#40

I reproduced this bug on ISO #235
"build_id": "2014-03-05_07-31-01", 
"mirantis": "yes", 
"build_number": "235", 
"nailgun_sha": "f58aad317829112913f364347b14f1f0518ad371", 
"ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", 
"fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", 
"astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", 
"release": "4.1", 
"fuellib_sha": "73313007c0914e602246ea41fa5e8ca2dfead9f8"

1. Create new environment (Ubuntu, HA mode)
2. Choose GRE segmentation
3. Choose to install both Ceph options
4. Add 3 controllers+ceph, 1 compute
5. Start deployment. It was successful
6. Power off primary controller.
7. Try to create cinder volume. It is stuck in "creating" state
8. Power on primary controller

root@node-3:~# rabbitmqctl  list_queues pid
Listing queues ...
<'rabbit@node-2'.1.251.0>
<'rabbit@node-2'.1.647.0>
<'rabbit@node-3'.3.1109.0>
<'rabbit@node-2'.1.253.0>
<'rabbit@node-2'.1.255.0>
<'rabbit@node-2'.1.257.0>
<'rabbit@node-3'.3.693.0>
<'rabbit@node-2'.1.259.0>
<'rabbit@node-3'.3.4128.0>
<'rabbit@node-3'.3.26698.0>
<'rabbit@node-3'.3.4192.0>
<'rabbit@node-3'.3.401.0>
<'rabbit@node-3'.3.26684.0>
<'rabbit@node-3'.3.26685.0>
<'rabbit@node-2'.1.261.0>
<'rabbit@node-2'.1.570.0>
<'rabbit@node-2'.1.3102.1>
<'rabbit@node-2'.1.31299.0>
<'rabbit@node-3'.3.2686.0>
<'rabbit@node-2'.1.940.0>
<'rabbit@node-3'.3.6045.0>
<'rabbit@node-3'.3.1581.0>
<'rabbit@node-2'.1.263.0>
<'rabbit@node-2'.1.2557.0>
<'rabbit@node-2'.1.897.0>
<'rabbit@node-2'.1.267.0>
<'rabbit@node-2'.1.3110.1>
<'rabbit@node-3'.3.1106.0>
<'rabbit@node-2'.1.269.0>
<'rabbit@node-2'.1.2211.0>
<'rabbit@node-2'.1.271.0>
<'rabbit@node-2'.1.273.0>
<'rabbit@node-2'.1.773.0>
<'rabbit@node-3'.3.4194.0>
<'rabbit@node-2'.1.359.0>
<'rabbit@node-2'.1.275.0>
<'rabbit@node-2'.1.3105.1>
<'rabbit@node-2'.1.3278.1>
<'rabbit@node-2'.1.900.0>
<'rabbit@node-2'.1.277.0>
<'rabbit@node-3'.3.4139.0>
<'rabbit@node-3'.3.716.0>
<'rabbit@node-2'.1.279.0>
<'rabbit@node-2'.1.3303.1>
<'rabbit@node-2'.1.3115.1>
<'rabbit@node-2'.1.283.0>
<'rabbit@node-2'.1.3173.1>
<'rabbit@node-2'.1.285.0>
<'rabbit@node-2'.1.289.0>
<'rabbit@node-2'.1.291.0>
<'rabbit@node-2'.1.293.0>
<'rabbit@node-2'.1.295.0>
<'rabbit@node-3'.3.943.0>
<'rabbit@node-3'.3.26686.0>
<'rabbit@node-3'.3.1100.0>
<'rabbit@node-3'.3.789.0>
<'rabbit@node-2'.1.297.0>
<'rabbit@node-2'.1.299.0>
<'rabbit@node-2'.1.301.0>
<'rabbit@node-2'.1.31320.0>
<'rabbit@node-2'.1.31301.0>
<'rabbit@node-3'.3.1103.0>
<'rabbit@node-3'.3.1143.0>
<'rabbit@node-2'.1.3253.1>
<'rabbit@node-2'.1.3178.1>
<'rabbit@node-2'.1.303.0>
<'rabbit@node-2'.1.906.0>
<'rabbit@node-2'.1.903.0>
<'rabbit@node-2'.1.1744.0>
<'rabbit@node-2'.1.305.0>
<'rabbit@node-2'.1.309.0>
<'rabbit@node-3'.3.916.0>
<'rabbit@node-2'.1.374.0>
<'rabbit@node-2'.1.311.0>
<'rabbit@node-3'.3.4125.0>
<'rabbit@node-2'.1.3312.1>
<'rabbit@node-2'.1.1325.0>
<'rabbit@node-2'.1.749.0>
<'rabbit@node-3'.3.4210.0>
<'rabbit@node-2'.1.31304.0>
<'rabbit@node-2'.1.313.0>
<'rabbit@node-3'.3.4196.0>
<'rabbit@node-2'.1.591.0>
<'rabbit@node-3'.3.4130.0>
<'rabbit@node-2'.1.315.0>
<'rabbit@node-3'.3.26687.0>
<'rabbit@node-3'.3.4137.0>
<'rabbit@node-1'.3.453.0>
<'rabbit@node-3'.3.386.0>
<'rabbit@node-3'.3.5130.0>
<'rabbit@node-2'.1.319.0>
<'rabbit@node-2'.1.321.0>
<'rabbit@node-2'.1.31306.0>
<'rabbit@node-3'.3.1582.0>
<'rabbit@node-2'.1.323.0>
...done.

9. Power off second controller.
10. After 15 minutes:

root@node-3:~# rabbitmqctl  list_queues pid
Listing queues ...
<'rabbit@node-3'.3.1109.0>
<'rabbit@node-3'.3.693.0>
<'rabbit@node-3'.3.4128.0>
<'rabbit@node-3'.3.4192.0>
<'rabbit@node-3'.3.401.0>
<'rabbit@node-3'.3.2686.0>
<'rabbit@node-3'.3.6045.0>
<'rabbit@node-3'.3.1581.0>
<'rabbit@node-3'.3.1106.0>
<'rabbit@node-3'.3.4194.0>
<'rabbit@node-3'.3.4139.0>
<'rabbit@node-3'.3.716.0>
<'rabbit@node-3'.3.943.0>
<'rabbit@node-3'.3.1100.0>
<'rabbit@node-3'.3.789.0>
<'rabbit@node-3'.3.1103.0>
<'rabbit@node-3'.3.1143.0>
<'rabbit@node-3'.3.916.0>
<'rabbit@node-3'.3.4125.0>
<'rabbit@node-3'.3.4210.0>
<'rabbit@node-3'.3.4196.0>
<'rabbit@node-3'.3.4130.0>
<'rabbit@node-3'.3.4137.0>
<'rabbit@node-3'.3.386.0>
<'rabbit@node-3'.3.5130.0>
<'rabbit@node-3'.3.1582.0>
...done.

Only the queues on the third controller remain.
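
The two listings above are easier to compare when reduced to a count of queue master pids per node. A minimal sketch, assuming the "rabbitmqctl list_queues pid" output format shown in this comment; masters_per_node is a hypothetical helper name:

```python
import re
from collections import Counter

# A queue pid is printed as <'rabbit@<node>'.x.y.z>; capture the node name.
PID_NODE = re.compile(r"<'rabbit@([^']+)'")


def masters_per_node(list_queues_output):
    """Count queue master pids per RabbitMQ node in list_queues output."""
    counts = Counter()
    for line in list_queues_output.splitlines():
        match = PID_NODE.match(line.strip())
        if match:  # skips "Listing queues ..." and "...done."
            counts[match.group(1)] += 1
    return counts


sample = """Listing queues ...
<'rabbit@node-2'.1.251.0>
<'rabbit@node-3'.3.1109.0>
<'rabbit@node-2'.1.253.0>
...done.
"""
print(masters_per_node(sample))
```

Running it on the output captured before and after step 9 shows directly which nodes lost their queue masters.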

Changed in fuel:
status:	Fix Committed → Triaged

Anastasia Palkina (apalkina) wrote on 2014-03-17:

#41

fuel-snapshot-2014-03-17_13-02-53.tgz (11.5 MiB, application/x-tar)

Dmitry Borodaenko (angdraug) wrote on 2014-03-17:

#42

Anastasia, please confirm that the above was reproduced with RabbitMQ 3.2 (see https://bugs.launchpad.net/fuel/+bug/1288831).

Vladimir Kuklin (vkuklin) on 2014-03-24

Changed in fuel:
milestone:	4.1 → 5.0
tags:	added: backports-4.1.1

Andrew Woodward (xarses) on 2014-03-24

tags:

removed: ceph

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#43

All queues on all controllers are synchronized:

root@node-20:~# rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids
Listing queues ...
conductor_fanout_22e7d39b35114076866f08c7ce98279b <'rabbit@node-20'.1.1856.0> [<'rabbit@node-21'.2.253.0>, <'rabbit@node-22'.3.237.0>] [<'rabbit@node-21'.2.253.0>, <'rabbit@node-22'.3.237.0>]
cinder-volume <'rabbit@node-20'.1.335.0> [<'rabbit@node-21'.2.251.0>, <'rabbit@node-22'.3.239.0>] [<'rabbit@node-21'.2.251.0>, <'rabbit@node-22'.3.239.0>]
dhcp_agent <'rabbit@node-20'.1.748.0> [<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>] [<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>]
conductor.node-21 <'rabbit@node-21'.2.589.0> [<'rabbit@node-20'.1.4061.0>, <'rabbit@node-22'.3.245.0>] [<'rabbit@node-20'.1.4061.0>, <'rabbit@node-22'.3.245.0>]
q-agent-notifier-tunnel-update_fanout_d4dea51fcd3841428f9a58fcabf62caf <'rabbit@node-20'.1.537.0> [<'rabbit@node-21'.2.259.0>, <'rabbit@node-22'.3.243.0>] [<'rabbit@node-21'.2.259.0>, <'rabbit@node-22'.3.243.0>]
consoleauth.node-20 <'rabbit@node-20'.1.1733.0> [<'rabbit@node-21'.2.261.0>, <'rabbit@node-22'.3.247.0>] [<'rabbit@node-21'.2.261.0>, <'rabbit@node-22'.3.247.0>]
compute.node-23 <'rabbit@node-20'.1.9454.0> [<'rabbit@node-21'.2.7902.0>, <'rabbit@node-22'.3.3554.0>] [<'rabbit@node-21'.2.7902.0>, <'rabbit@node-22'.3.3554.0>]
consoleauth.node-22 <'rabbit@node-22'.3.679.0> [<'rabbit@node-20'.1.7693.0>, <'rabbit@node-21'.2.4708.0>] [<'rabbit@node-20'.1.7693.0>, <'rabbit@node-21'.2.4708.0>]
...
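
The same check can be automated by comparing the slave_pids and synchronised_slave_pids columns for each queue. A minimal sketch, assuming the four-column output format shown above; the helper name is hypothetical, and slaves are compared by node name rather than by full pid:

```python
import re

# name, master pid, [slave pids], [synchronised slave pids]
LINE = re.compile(r"^(\S+)\s+<[^>]*>\s+\[([^\]]*)\]\s+\[([^\]]*)\]\s*$")
NODE = re.compile(r"rabbit@([^']+)'")


def unsynchronised(list_queues_output):
    """Map queue name -> sorted nodes whose slave is not yet synchronised."""
    result = {}
    for line in list_queues_output.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # header, trailer, or a line in another format
        name, slaves, synced = match.groups()
        missing = set(NODE.findall(slaves)) - set(NODE.findall(synced))
        if missing:
            result[name] = sorted(missing)
    return result


sample = (
    "cinder-volume <'rabbit@node-20'.1.335.0> "
    "[<'rabbit@node-21'.2.251.0>, <'rabbit@node-22'.3.239.0>] "
    "[<'rabbit@node-21'.2.251.0>]\n"
    "dhcp_agent <'rabbit@node-20'.1.748.0> "
    "[<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>] "
    "[<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>]\n"
)
print(unsynchronised(sample))
```

An empty result corresponds to the state reported in this comment, where every slave also appears in the synchronised list.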

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#44

After shutting down RabbitMQ on all controllers, upgrading to 3.2.4, and starting it back on one controller (last one to have been stopped, node-22), queues from all nodes are mastered on node-22:

cert <'rabbit@node-22'.1.763.0>
cert.node-20 <'rabbit@node-22'.1.764.0>
cert.node-21 <'rabbit@node-22'.1.915.0>
cert.node-22 <'rabbit@node-22'.1.943.0>
cert_fanout_41cd75baf6de4142ae27ac43d3fe1dd3 <'rabbit@node-22'.1.918.0>
cert_fanout_6e611b8e54ff40e6893f24f0f948da90 <'rabbit@node-22'.1.765.0>
cert_fanout_cad666b18f2d4283b0c9c059dd07f441 <'rabbit@node-22'.1.957.0>
cinder-scheduler <'rabbit@node-22'.1.1018.0>
cinder-scheduler:node-20 <'rabbit@node-22'.1.1024.0>
cinder-scheduler_fanout_e71b120de586417ab457d25e77414fbb <'rabbit@node-22'.1.1026.0>
cinder-volume <'rabbit@node-22'.1.1023.0>
cinder-volume:node-20 <'rabbit@node-22'.1.1025.0>
cinder-volume_fanout_60c40388ef08422f8ceea258e98f5134 <'rabbit@node-22'.1.1027.0>
compute <'rabbit@node-22'.1.1474.0>
compute.node-23 <'rabbit@node-22'.1.1476.0>
compute_fanout_bd7d38999ec6477cb5f0a411684c3bd9 <'rabbit@node-22'.1.1478.0>
...

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#45

After upgrading and restarting RabbitMQ on the remaining controllers, all queues are still mastered on node-22 and are not synchronized anywhere.

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#46

This is a different problem from the original root cause of this bug, so I've created a separate bug for it:
https://bugs.launchpad.net/fuel/+bug/1296922

Changed in fuel:
status:	Triaged → Fix Committed
tags:	removed: backports-4.1.1

Dmitry Borodaenko (angdraug) wrote on 2014-03-24:

#47

The problem with not synchronizing the queues is specific to RabbitMQ 3.x (x-ha-mode no longer has any effect, ha policy should be defined instead).
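
For illustration, an HA policy in RabbitMQ 3.x is defined with "rabbitmqctl set_policy", giving a policy name, a queue-name pattern and a JSON definition. The sketch below only builds the command line. The policy name "ha-all" and the match-everything pattern "^" are assumptions, not values taken from the eventual fix.

```python
import json


def ha_policy_command(name="ha-all", pattern="^"):
    """Build a rabbitmqctl command that mirrors matching queues to all nodes.

    name and pattern are illustrative defaults; "^" matches every queue name.
    """
    definition = json.dumps({"ha-mode": "all"})
    return ["rabbitmqctl", "set_policy", name, pattern, definition]


print(" ".join(ha_policy_command()))
```

The resulting list can be passed to subprocess on a node where rabbitmqctl is available.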

Roman Alekseenkov (ralekseenkov) wrote on 2014-03-25:

#48

Guys, this bug was in "Fix Committed" state for 4.1. We released 4.1, and now this bug's milestone has been changed to 5.0. This is not the way to go, as we want to have a reliable way to query "the list of bugs fixed in 4.1".

As Dmitry said, let's track the new problem under a new bug. This one should be changed back to 4.1, as the original root cause has been fixed in 4.1.

Vladimir Kuklin (vkuklin) on 2014-03-26

tags:

added: backports-4.1.1

Dmitry Borodaenko (angdraug) wrote on 2014-03-26:

#49

The fix for the original issue was released in 4.1. Additional outstanding problems related to controller failover were all raised as separate bugs. Since the 4.1 milestone has been closed, we can't move this bug back under 4.1, but I don't think we should have a backports tag on this, either. There's nothing to backport.

Changed in fuel:
status:	Fix Committed → Fix Released

Andrew Woodward (xarses) on 2014-05-07

summary:

- Moving management vip breaks rabbitmq sessions
+ Pacemaker migration of management vip causes RabbitMQ, MySQL lockups
