Pacemaker migration of management vip causes RabbitMQ, MySQL lockups

Bug #1285449 reported by Dmitry Borodaenko
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Dmitry Borodaenko

Bug Description

  mirantis: "yes"
  release: "4.1"
  build_number: "208"
  build_id: "2014-02-26_00-30-27"
  fuellib_sha: "0a2e5bdc01c1e3bb285acb7b39125101e950ac72"
  nailgun_sha: "ea08cef3e06a72f47cfaa8cd8fe6d034e2cf722e"
  astute_sha: "10cccc87f2ee35510e43c8fa19d2bf916ca1fced"
  ostf_sha: "8e6681b6d06c7cb20a84c1cc740d5f2492fb9d85"
  fuelmain_sha: "7939e28a5b3ab65361991e2bc22a792c7561cf87"

1. Create new environment (Ubuntu, HA, Neutron/GRE, Ceph for everything)
2. Add 3 controller + ceph-osd nodes, 1 compute node
3. Deployment is successful, able to create cinder volumes in Ceph.
4. Force shutdown primary controller.
5. Try to create another cinder volume.

Result: new volume is stuck in "creating" state and never becomes "available".

Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 5.0 → 4.1
importance: High → Critical
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Ryan Moe (rmoe)
status: New → Confirmed
Dmitry Borodaenko (angdraug) wrote : Re: Moving management vip breaks rabbitmq sessions

You don't even need to shut down a node to reproduce this problem; all you have to do is move the management VIP to a different node with the following command (replace node-1 with the hostname of a controller node that doesn't currently hold the VIP):

crm_resource -r vip__management_old --move --node node-1

After that, most OpenStack services become unable to either put messages on RabbitMQ queues, take messages off the queues, or acknowledge the messages.

summary: - Unable to create cinder volume after controller failover
+ Moving management vip breaks rabbitmq sessions
Dmitry Borodaenko (angdraug) wrote :

One typical scenario is nova-compute keeping its RabbitMQ connection and sending service_update messages to nova-conductor, while nova-conductor does not receive them and marks the nova-compute services as down. When that happens, restarting nova-conductor fixes it. The same applies to cinder-volume and cinder-scheduler, which is likely what causes the symptom in the original bug description.

Reconfiguring a service to bypass HAProxy and talk directly to the management IP of one of the controller nodes makes that service unaffected by this problem. However, other services that are still tied to HAProxy, e.g. keystone or neutron, remain affected, so some operations still fail.

Dmitry Borodaenko (angdraug) wrote :

One way to see what's going on with RabbitMQ sessions is to use "ls /proc/<pid>/fd" and "lsof|grep <socket id>" to identify the file descriptors of the RabbitMQ connections, and then "strace -s 2048 -p <pid> 2>&1 |egrep '\((<fd1>|<fd2>)' " to see the send and recv calls on those sockets (strace writes its trace to stderr, hence the 2>&1).
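
The first step of that procedure can be sketched in Python. This is only an illustration under stated assumptions: the helper names (parse_socket_inode, socket_fds, strace_filter) are hypothetical, the code relies on the Linux /proc layout, and matching a socket inode to a RabbitMQ connection still requires lsof (or similar) as described above.

```python
import os
import re

# Link target the Linux kernel reports for a socket fd, e.g. "socket:[12345]".
SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def parse_socket_inode(link_target):
    """Return the socket inode from a /proc/<pid>/fd link target, or None."""
    match = SOCKET_LINK.match(link_target)
    return int(match.group(1)) if match else None


def socket_fds(pid):
    """Map fd number -> socket inode for every socket `pid` holds open (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        inode = parse_socket_inode(target)
        if inode is not None:
            result[int(name)] = inode
    return result


def strace_filter(fds):
    """Build the egrep pattern used above for the given fd numbers."""
    return r"\((%s)" % "|".join(str(fd) for fd in sorted(fds))
```

For example, socket_fds(os.getpid()) lists the sockets of the current process, and strace_filter([12, 7]) yields a pattern that can be passed to egrep to keep only the syscalls on those descriptors.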

Dmitry Borodaenko (angdraug) wrote :

The theories that we have eliminated so far:

1) It's not MySQL connection management: you can see affected processes successfully running MySQL queries over HAProxy-managed connections after moving the VIP.

2) It's not RabbitMQ itself: RabbitMQ connections that bypass HAProxy and go directly to port 5673 on a controller node are not affected by VIP move.

Matthew Mosesohn (raytrac3r) wrote :

In order to speed up diagnosis, could you get a snapshot posted with an env in debug mode attached to the bug?

Dmitry Borodaenko (angdraug) wrote :

You don't really need a diagnostic bundle; this is 100% reproducible: deploy any HA cluster and move the management VIP. That's it.

Still, I'm attaching a snapshot from my environment that has nodes 1-4 in the configuration described in the bug, and nodes 5-8 with CentOS and nova-network with the same problem. Neither had debug enabled during deploy, but the Ubuntu environment is the one where we've done most of our investigations so it will have debug logs for most services.

Dmitry Borodaenko (angdraug) wrote :
Bogdan Dobrelya (bogdando) wrote :

> Reconfiguring services to bypass HAProxy and talk directly to the management IP of one of the controller nodes makes that
> service unaffected by this problem

From fuel-dev list https://lists.launchpad.net/fuel-dev/msg00558.html
It will not help if you shut down the controller. The problem is that you
have hung AMQP sessions, which the kombu driver does not seem to handle
correctly.

So, does this mean we should 1) submit an OS bug about Kombu session expiration, and 2) restart all OpenStack services (only those residing on the controllers) whenever the VIP moves?
Would that be enough to work around the RabbitMQ session issue until it is fixed upstream (I mean the Kombu part)?

Matthew Mosesohn (raytrac3r) wrote :
Sergey Vasilenko (xenolog) wrote :

I think we can do the following:
* apply a dedicated VIP for AMQP
* make a master/slave Pacemaker OCF script for the AMQP service
* create a colocation constraint for the AMQP VIP and the AMQP master resource

Also, either remove AMQP handling from HAProxy, or redirect AMQP requests from HAProxy to the AMQP VIP.

Matthew Mosesohn (raytrac3r) wrote :

Output of rabbitmqctl report from the new master after switching mgmt vip: http://paste.openstack.org/show/LAjPWCzlHKbG4XYFo6Ii/

Matthew Mosesohn (raytrac3r) wrote :

I believe un-haproxying RabbitMQ will simplify things. If RabbitMQ already does this task, we can avoid dead end scenarios.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/77265

Changed in fuel:
assignee: Ryan Moe (rmoe) → Sergey Vasilenko (xenolog)
status: Confirmed → In Progress
Dmitry Borodaenko (angdraug) wrote : Re: Moving management vip breaks rabbitmq sessions

Configuring all OpenStack services to connect directly to rabbitmq on all controller hosts as recommended here:

http://docs.openstack.org/high-availability-guide/content/_configure_openstack_services_to_use_rabbitmq.html

does solve the broken RabbitMQ connections problem, but still leaves the services on the node that has lost the management VIP with hung MySQL connections. Enabling the flush_routes option for the management and public VIPs as described here:

http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/ref.ocfagent.IPaddr2.html

makes the hung MySQL connection less likely to occur, but does not eliminate it altogether.

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)
Dmitry Borodaenko (angdraug) wrote :

This patch should help reduce the 10-minute kernel default TCP connection read timeout to something meaningful:
http://sourceforge.net/p/mysql-python/feature-requests/19/

Dmitry Borodaenko (angdraug) wrote :

Recently released python-mysqldb 1.2.5 has finally introduced tcp read timeout support.

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)
Dmitry Borodaenko (angdraug) wrote :

My current plan for the solution is outlined here:
https://lists.launchpad.net/fuel-dev/msg00566.html

Dmitry Borodaenko (angdraug) wrote :

I'm attaching a high-level overview of the puppet classes that are involved in configuring the rabbit_hosts configuration parameter for OpenStack components in fuel-library, as a helper for the shotgun surgery I have to perform to implement the RabbitMQ configuration change.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/77409

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/77439

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/77521

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)
Matthew Mosesohn (raytrac3r) wrote : Re: Moving management vip breaks rabbitmq sessions

I've confirmed the patch successfully works around a failure of any rabbitmq server or changing virtual IP for both controllers and compute nodes. Waiting on OSCI for the python-mysql fixes. Glance and keystone refuse to reconnect even after 15 minutes, so this patch will help on that end.

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Sergey Vasilenko (xenolog)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/77633

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/77635

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (stable/4.1)

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77636

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (stable/4.1)

Reviewed: https://review.openstack.org/77582
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=46df79e683ec695dce89e4b833b31443a5817f05
Submitter: Jenkins
Branch: stable/4.1

commit 46df79e683ec695dce89e4b833b31443a5817f05
Author: Dmitry Burmistrov <email address hidden>
Date: Mon Mar 3 17:05:00 2014 +0400

    Remove version of MySQL-python from requirements-rpm.txt

    Related-Bug: #1285449

    Change-Id: I26d2a2089229240fed93cbdadc7023ac980bd958

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/77635
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=d4b3f9d7b4608a58651a1f5744643f21d868346c
Submitter: Jenkins
Branch: master

commit d4b3f9d7b4608a58651a1f5744643f21d868346c
Author: Roman Vyalov <email address hidden>
Date: Mon Mar 3 20:27:43 2014 +0400

    Remove version of python-sqlalchemy from requirements-rpm.txt

    Related-Bug: #1285449

    Change-Id: Ic6145e3d47326ba5feab3717cc5ca6becbc62a17

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (stable/4.1)

Reviewed: https://review.openstack.org/77636
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=1385b7e212d30f60d97db84ae3cc3f3c587fc811
Submitter: Jenkins
Branch: stable/4.1

commit 1385b7e212d30f60d97db84ae3cc3f3c587fc811
Author: Roman Vyalov <email address hidden>
Date: Mon Mar 3 20:27:43 2014 +0400

    Remove version of python-sqlalchemy from requirements-rpm.txt

    Related-Bug: #1285449

    Change-Id: Ic6145e3d47326ba5feab3717cc5ca6becbc62a17

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/77643

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Sergey Vasilenko (xenolog)
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Matthew Mosesohn (raytrac3r)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77657

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/77658

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Dmitry Borodaenko (dborodaenko)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/77825

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/77825
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=f55f7edd468378578b597992546113b58748065d
Submitter: Jenkins
Branch: master

commit f55f7edd468378578b597992546113b58748065d
Author: Dmitry Burmistrov <email address hidden>
Date: Mon Mar 3 17:05:00 2014 +0400

    Remove version of MySQL-python from requirements-rpm.txt

    Related-Bug: #1285449

    Change-Id: I26d2a2089229240fed93cbdadc7023ac980bd958

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/77658
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=136d342a9d8df888ae44e981e9dff22827206cd5
Submitter: Jenkins
Branch: stable/4.1

commit 136d342a9d8df888ae44e981e9dff22827206cd5
Author: Dmitry Borodaenko <email address hidden>
Date: Sun Mar 2 00:00:49 2014 -0800

    Set rabbit_hosts to a list of controllers

    Connecting to RabbitMQ via HAProxy results in hung AMQP sessions after
    controller failover or VIP move. To work around this problem, OpenStack
    services have to be configured to connect directly to RabbitMQ services
    on controllers nodes via rabbit_hosts variable, which makes impl_kombu
    cycle through the listed hosts when attempting to reconnect. On
    controller nodes, preference is given to the local RabbitMQ service on
    the same controller node.

    AMQP client configuration for all OpenStack components is centralized in
    osnailyfacter and propagated through Puppet classes in a consistent
    manner. The exception is Neutron that uses its own
    sanitize_neutron_config() function to parse configuration from
    osnailyfacter. That function was extended to generate AMQP hosts list
    consistently with the above, and fixed to correctly sanitize instances
    of Array subclasses.

    No changes were made to Murano manifests since it has its own RPC
    implementation that is inconsistent with the rest of the OpenStack.

    In addition to AMQP client configuration, crm configuration is modified
    to set flush_routes flag and resource stickiness for all virtual IP
    addresses. The flush_routes flag reduces the probability of connections
    becoming hung on nodes that a VIP moves away from, and resource
    stickiness prevent unnecessary movement of VIP resources between
    controllers.

    Change-Id: Ib839032b2f1aa820b4afc64b0a9badf13414d488
    Partial-bug: #1285449
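
The host ordering described in the commit message above (every controller listed, with the local RabbitMQ service preferred on controller nodes) could look roughly like the following Python sketch. It is not the Puppet code from the patch: the function name build_rabbit_hosts is hypothetical, and the default port 5673 is an assumption taken from the earlier comment about bypassing HAProxy.

```python
def build_rabbit_hosts(controllers, local_host=None, port=5673):
    """Return a rabbit_hosts value listing every controller, local one first.

    `port` defaults to 5673 only because that is the direct RabbitMQ port
    mentioned earlier in this report; adjust it for the actual deployment.
    """
    ordered = list(controllers)
    if local_host in ordered:
        # Move the local broker to the front; the others keep their order.
        ordered.remove(local_host)
        ordered.insert(0, local_host)
    return ",".join("%s:%d" % (host, port) for host in ordered)


# On controller node-2 the local broker is tried first:
print(build_rabbit_hosts(["node-1", "node-2", "node-3"], local_host="node-2"))
# On a compute node there is no local broker, so the order is unchanged:
print(build_rabbit_hosts(["node-1", "node-2", "node-3"]))
```

Listing all controllers lets the client cycle to another broker when reconnecting, instead of depending on a single address that may have moved.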

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/77657
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b8f8476345c74798515d72cd3e05bb6d1df5d8d5
Submitter: Jenkins
Branch: stable/4.1

commit b8f8476345c74798515d72cd3e05bb6d1df5d8d5
Author: Matthew Mosesohn <email address hidden>
Date: Mon Mar 3 20:45:55 2014 +0400

    Add read_timeout and infinite retries to MySQL conns

    read_timeout=60 is an explicit parameter added
    for mysqldb to bail connections if no data is
    received for 60s. Depends on MySQLdb 1.2.5
    max_retries=-1 for all connections so that
    APIs don't give up and die

    Change-Id: Ib4a2cdcc287cbc53c18f7500d96f82d8099e0f35
    Partial-Bug: #1285449
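
As an illustration of the read_timeout setting named in the commit message above, the sketch below adds it to a database connection URL. It assumes the parameter is passed to MySQLdb through the query string of an SQLAlchemy-style URL, which the commit message does not spell out. The helper name with_read_timeout and the example URL are hypothetical, and max_retries is a separate service option that is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_read_timeout(connection_url, seconds=60):
    """Return connection_url with read_timeout=<seconds> in its query string."""
    parts = urlsplit(connection_url)
    query = dict(parse_qsl(parts.query))
    # Overwrites any existing read_timeout, keeps other parameters intact.
    query["read_timeout"] = str(seconds)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical connection string, for illustration only.
print(with_read_timeout("mysql://nova:secret@192.168.0.2/nova"))
```

With a read timeout in place, a connection that stops receiving data is dropped after 60 seconds rather than waiting for the much longer kernel default.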

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/77409
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2301fc0e637c8739d12dbb582db306bfa44de817
Submitter: Jenkins
Branch: master

commit 2301fc0e637c8739d12dbb582db306bfa44de817
Author: Dmitry Borodaenko <email address hidden>
Date: Sun Mar 2 00:00:49 2014 -0800

    Set rabbit_hosts to a list of controllers

    Connecting to RabbitMQ via HAProxy results in hung AMQP sessions after
    controller failover or VIP move. To work around this problem, OpenStack
    services have to be configured to connect directly to RabbitMQ services
    on controllers nodes via rabbit_hosts variable, which makes impl_kombu
    cycle through the listed hosts when attempting to reconnect. On
    controller nodes, preference is given to the local RabbitMQ service on
    the same controller node.

    AMQP client configuration for all OpenStack components is centralized in
    osnailyfacter and propagated through Puppet classes in a consistent
    manner. The exception is Neutron that uses its own
    sanitize_neutron_config() function to parse configuration from
    osnailyfacter. That function was extended to generate AMQP hosts list
    consistently with the above, and fixed to correctly sanitize instances
    of Array subclasses.

    No changes were made to Murano manifests since it has its own RPC
    implementation that is inconsistent with the rest of the OpenStack.

    In addition to AMQP client configuration, crm configuration is modified
    to set flush_routes flag and resource stickiness for all virtual IP
    addresses. The flush_routes flag reduces the probability of connections
    becoming hung on nodes that a VIP moves away from, and resource
    stickiness prevent unnecessary movement of VIP resources between
    controllers.

    Change-Id: Ib839032b2f1aa820b4afc64b0a9badf13414d488
    Partial-bug: #1285449

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/77643
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0a24882dbdc32c8605c7f712d1b6f2780615a15f
Submitter: Jenkins
Branch: master

commit 0a24882dbdc32c8605c7f712d1b6f2780615a15f
Author: Matthew Mosesohn <email address hidden>
Date: Mon Mar 3 20:45:55 2014 +0400

    Add read_timeout and infinite retries to MySQL conns

    read_timeout=60 is an explicit parameter added
    for mysqldb to bail connections if no data is
    received for 60s. Depends on MySQLdb 1.2.5
    max_retries=-1 for all connections so that
    APIs don't give up and die

    Change-Id: Ib4a2cdcc287cbc53c18f7500d96f82d8099e0f35
    Partial-Bug: #1285449

Changed in fuel:
status: In Progress → Fix Committed
Bogdan Dobrelya (bogdando) wrote : Re: Moving management vip breaks rabbitmq sessions

Note: the patch makes RabbitMQ listen on the interface connected to the public network as well. In the case of nova-network, this is a security issue. (Neutron namespaces provide isolation, so the patch should be OK in that case?)

Bogdan Dobrelya (bogdando) wrote :

TODO: make sure the unused /etc/haproxy/conf.d/100-rabbitmq.cfg is removed later.

tags: added: in progress
tags: removed: in progress
tags: added: in progress
tags: removed: in progress
Anastasia Palkina (apalkina) wrote :

I reproduced this bug on ISO #235
"build_id": "2014-03-05_07-31-01",
"mirantis": "yes",
"build_number": "235",
"nailgun_sha": "f58aad317829112913f364347b14f1f0518ad371",
"ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa",
"fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b",
"astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a",
"release": "4.1",
"fuellib_sha": "73313007c0914e602246ea41fa5e8ca2dfead9f8"

1. Create new environment (Ubuntu, HA mode)
2. Choose GRE segmentation
3. Choose both Ceph options
4. Add 3 controllers+ceph, 1 compute
5. Start deployment. It was successful
6. Power off primary controller.
7. Try to create cinder volume. It is stuck in "creating" state
8. Power on primary controller

root@node-3:~# rabbitmqctl list_queues pid
Listing queues ...
<'rabbit@node-2'.1.251.0>
<'rabbit@node-2'.1.647.0>
<'rabbit@node-3'.3.1109.0>
<'rabbit@node-2'.1.253.0>
<'rabbit@node-2'.1.255.0>
<'rabbit@node-2'.1.257.0>
<'rabbit@node-3'.3.693.0>
<'rabbit@node-2'.1.259.0>
<'rabbit@node-3'.3.4128.0>
<'rabbit@node-3'.3.26698.0>
<'rabbit@node-3'.3.4192.0>
<'rabbit@node-3'.3.401.0>
<'rabbit@node-3'.3.26684.0>
<'rabbit@node-3'.3.26685.0>
<'rabbit@node-2'.1.261.0>
<'rabbit@node-2'.1.570.0>
<'rabbit@node-2'.1.3102.1>
<'rabbit@node-2'.1.31299.0>
<'rabbit@node-3'.3.2686.0>
<'rabbit@node-2'.1.940.0>
<'rabbit@node-3'.3.6045.0>
<'rabbit@node-3'.3.1581.0>
<'rabbit@node-2'.1.263.0>
<'rabbit@node-2'.1.2557.0>
<'rabbit@node-2'.1.897.0>
<'rabbit@node-2'.1.267.0>
<'rabbit@node-2'.1.3110.1>
<'rabbit@node-3'.3.1106.0>
<'rabbit@node-2'.1.269.0>
<'rabbit@node-2'.1.2211.0>
<'rabbit@node-2'.1.271.0>
<'rabbit@node-2'.1.273.0>
<'rabbit@node-2'.1.773.0>
<'rabbit@node-3'.3.4194.0>
<'rabbit@node-2'.1.359.0>
<'rabbit@node-2'.1.275.0>
<'rabbit@node-2'.1.3105.1>
<'rabbit@node-2'.1.3278.1>
<'rabbit@node-2'.1.900.0>
<'rabbit@node-2'.1.277.0>
<'rabbit@node-3'.3.4139.0>
<'rabbit@node-3'.3.716.0>
<'rabbit@node-2'.1.279.0>
<'rabbit@node-2'.1.3303.1>
<'rabbit@node-2'.1.3115.1>
<'rabbit@node-2'.1.283.0>
<'rabbit@node-2'.1.3173.1>
<'rabbit@node-2'.1.285.0>
<'rabbit@node-2'.1.289.0>
<'rabbit@node-2'.1.291.0>
<'rabbit@node-2'.1.293.0>
<'rabbit@node-2'.1.295.0>
<'rabbit@node-3'.3.943.0>
<'rabbit@node-3'.3.26686.0>
<'rabbit@node-3'.3.1100.0>
<'rabbit@node-3'.3.789.0>
<'rabbit@node-2'.1.297.0>
<'rabbit@node-2'.1.299.0>
<'rabbit@node-2'.1.301.0>
<'rabbit@node-2'.1.31320.0>
<'rabbit@node-2'.1.31301.0>
<'rabbit@node-3'.3.1103.0>
<'rabbit@node-3'.3.1143.0>
<'rabbit@node-2'.1.3253.1>
<'rabbit@node-2'.1.3178.1>
<'rabbit@node-2'.1.303.0>
<'rabbit@node-2'.1.906.0>
<'rabbit@node-2'.1.903.0>
<'rabbit@node-2'.1.1744.0>
<'rabbit@node-2'.1.305.0>
<'rabbit@node-2'.1.309.0>
<'rabbit@node-3'.3.916.0>
<'rabbit@node-2'.1.374.0>
<'rabbit@node-2'.1.311.0>
<'rabbit@node-3'.3.4125.0>
<'rabbit@node-2'.1.3312.1>
<'rabbit@node-2'.1.1325.0>
<'rabbit@node-2'.1.749.0>
<'rabbit@node-3'.3.4210.0>
<'rabbit@node-2'.1.31304.0>
<'rabbit@node-2'.1.313.0>
<'rabbit@node-3'.3.4196.0>
<'rabbit@node-2'.1.591.0>
<'rabbit@node-3'.3.4130.0>
<'rabbit@node-2'.1.315.0>
<'rabbit@node-3'.3.26687.0>
<'rabbit@node-3'.3.4137.0>
<'rabbit@node-1'.3.453.0>
<'rabbit@node-3'.3.386.0>
<'rabbit@node-3'.3.5130.0>
<'rabbit@no...


Changed in fuel:
status: Fix Committed → Triaged
Anastasia Palkina (apalkina) wrote :
Dmitry Borodaenko (angdraug) wrote :

Anastasia, please confirm that the above was reproduced with RabbitMQ 3.2 (see https://bugs.launchpad.net/fuel/+bug/1288831).

Changed in fuel:
milestone: 4.1 → 5.0
tags: added: backports-4.1.1
Andrew Woodward (xarses)
tags: removed: ceph
Dmitry Borodaenko (angdraug) wrote :

All queues on all controllers are synchronized:

root@node-20:~# rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids
Listing queues ...
conductor_fanout_22e7d39b35114076866f08c7ce98279b <'rabbit@node-20'.1.1856.0> [<'rabbit@node-21'.2.253.0>, <'rabbit@node-22'.3.237.0>] [<'rabbit@node-21'.2.253.0>, <'rabbit@node-22'.3.237.0>]
cinder-volume <'rabbit@node-20'.1.335.0> [<'rabbit@node-21'.2.251.0>, <'rabbit@node-22'.3.239.0>] [<'rabbit@node-21'.2.251.0>, <'rabbit@node-22'.3.239.0>]
dhcp_agent <'rabbit@node-20'.1.748.0> [<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>] [<'rabbit@node-21'.2.257.0>, <'rabbit@node-22'.3.241.0>]
conductor.node-21 <'rabbit@node-21'.2.589.0> [<'rabbit@node-20'.1.4061.0>, <'rabbit@node-22'.3.245.0>] [<'rabbit@node-20'.1.4061.0>, <'rabbit@node-22'.3.245.0>]
q-agent-notifier-tunnel-update_fanout_d4dea51fcd3841428f9a58fcabf62caf <'rabbit@node-20'.1.537.0> [<'rabbit@node-21'.2.259.0>, <'rabbit@node-22'.3.243.0>] [<'rabbit@node-21'.2.259.0>, <'rabbit@node-22'.3.243.0>]
consoleauth.node-20 <'rabbit@node-20'.1.1733.0> [<'rabbit@node-21'.2.261.0>, <'rabbit@node-22'.3.247.0>] [<'rabbit@node-21'.2.261.0>, <'rabbit@node-22'.3.247.0>]
compute.node-23 <'rabbit@node-20'.1.9454.0> [<'rabbit@node-21'.2.7902.0>, <'rabbit@node-22'.3.3554.0>] [<'rabbit@node-21'.2.7902.0>, <'rabbit@node-22'.3.3554.0>]
consoleauth.node-22 <'rabbit@node-22'.3.679.0> [<'rabbit@node-20'.1.7693.0>, <'rabbit@node-21'.2.4708.0>] [<'rabbit@node-20'.1.7693.0>, <'rabbit@node-21'.2.4708.0>]
...

Dmitry Borodaenko (angdraug) wrote :

After shutting down RabbitMQ on all controllers, upgrading to 3.2.4, and starting it back up on one controller (the last one to have been stopped, node-22), queues from all nodes are mastered on node-22:

cert <'rabbit@node-22'.1.763.0>
cert.node-20 <'rabbit@node-22'.1.764.0>
cert.node-21 <'rabbit@node-22'.1.915.0>
cert.node-22 <'rabbit@node-22'.1.943.0>
cert_fanout_41cd75baf6de4142ae27ac43d3fe1dd3 <'rabbit@node-22'.1.918.0>
cert_fanout_6e611b8e54ff40e6893f24f0f948da90 <'rabbit@node-22'.1.765.0>
cert_fanout_cad666b18f2d4283b0c9c059dd07f441 <'rabbit@node-22'.1.957.0>
cinder-scheduler <'rabbit@node-22'.1.1018.0>
cinder-scheduler:node-20 <'rabbit@node-22'.1.1024.0>
cinder-scheduler_fanout_e71b120de586417ab457d25e77414fbb <'rabbit@node-22'.1.1026.0>
cinder-volume <'rabbit@node-22'.1.1023.0>
cinder-volume:node-20 <'rabbit@node-22'.1.1025.0>
cinder-volume_fanout_60c40388ef08422f8ceea258e98f5134 <'rabbit@node-22'.1.1027.0>
compute <'rabbit@node-22'.1.1474.0>
compute.node-23 <'rabbit@node-22'.1.1476.0>
compute_fanout_bd7d38999ec6477cb5f0a411684c3bd9 <'rabbit@node-22'.1.1478.0>
...

Dmitry Borodaenko (angdraug) wrote :

After upgrading and restarting RabbitMQ on the remaining controllers, all queues are still mastered on node-22 and are not synchronized anywhere.

Dmitry Borodaenko (angdraug) wrote :

This is a different problem from the original root cause of this bug; I've created a separate bug for it:
https://bugs.launchpad.net/fuel/+bug/1296922

Changed in fuel:
status: Triaged → Fix Committed
tags: removed: backports-4.1.1
Dmitry Borodaenko (angdraug) wrote :

The problem with not synchronizing the queues is specific to RabbitMQ 3.x (x-ha-mode no longer has any effect, ha policy should be defined instead).
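
For reference, defining an HA policy in RabbitMQ 3.x is done with rabbitmqctl set_policy. The Python sketch below only builds that command line. The policy name "ha-all", the catch-all pattern "^", and the choice to mirror every queue to all nodes are assumptions for illustration, not the settings adopted for Fuel (that work is tracked in the separate bug).

```python
import json


def ha_policy_command(name="ha-all", pattern="^", vhost=None):
    """Build a `rabbitmqctl set_policy` argv that mirrors matching queues.

    The name, pattern, and "ha-mode": "all" definition are illustrative
    assumptions, not values taken from this bug report.
    """
    definition = json.dumps({"ha-mode": "all"}, separators=(",", ":"))
    argv = ["rabbitmqctl", "set_policy"]
    if vhost is not None:
        argv += ["-p", vhost]  # apply the policy to a specific vhost
    argv += [name, pattern, definition]
    return argv


print(" ".join(ha_policy_command()))
```

The returned list can be handed to subprocess.check_call on a node where rabbitmqctl is available.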

Roman Alekseenkov (ralekseenkov) wrote :

Guys, this bug was in "Fix Committed" state for 4.1. We released 4.1 and now this bug's milestone has been changed to 5.0. This is not the way to go, as we want to have a reliable way to query "the list of bugs fixed in 4.1".

As Dmitry said, let's track the new problem under a new bug. This one should be changed back to 4.1, as the original root cause has been fixed in 4.1.

tags: added: backports-4.1.1
Dmitry Borodaenko (angdraug) wrote :

The fix for the original issue was released in 4.1. Additional outstanding problems related to controller failover were all raised as separate bugs. Since the 4.1 milestone has been closed, we can't move this bug back under 4.1, but I don't think we should have a backports tag on this, either. There's nothing to backport.

Changed in fuel:
status: Fix Committed → Fix Released
Andrew Woodward (xarses)
summary: - Moving management vip breaks rabbitmq sessions
+ Pacemaker migration of management vip causes RabbitMQ, MySQL lockups