MQ downtime after killing master RabbitMQ

Bug #1559136 reported by Ilya Shakhat
This bug affects 1 person
Affects              Status        Importance  Assigned to
Fuel for OpenStack   Invalid       High        Dmitry Mescheryakov
  Mitaka             Fix Released  High        Dmitry Mescheryakov
  Newton             Invalid       High        Dmitry Mescheryakov

Bug Description

Killing the master RabbitMQ server results in ~30 seconds of downtime for the whole cluster.

Steps to reproduce:
0) Install the latest oslo.messaging simulator on one of the nodes (preferably a compute node). Get the connection parameters (username and password) from the service config files.
1) Find which node runs master by doing `pcs resource`
2) Start simulator server, e.g. "python simulator.py --url rabbit://nova:tGAPNtjHh8yvvkR69MooN1eD@node-58:5673,nova:tGAPNtjHh8yvvkR69MooN1eD@node-111:5673,nova:tGAPNtjHh8yvvkR69MooN1eD@node-123:5673/ --debug true rpc-server"
3) Start simulator client, e.g. "python simulator.py --url rabbit://nova:tGAPNtjHh8yvvkR69MooN1eD@node-58:5673,nova:tGAPNtjHh8yvvkR69MooN1eD@node-111:5673,nova:tGAPNtjHh8yvvkR69MooN1eD@node-123:5673/ --debug true -l 200 rpc-client -p 1 --is-cast true"
4) After several seconds, kill the RabbitMQ process (beam) on the master node

It's observed that killing the process itself doesn't affect throughput much, but the "after-shock" actions do. In the following experiment the downtime was about 30 seconds.

The experiment
~~~~~~~~~~~~~~
Controllers: node-123 (master), node-111 and node-58.
The simulator server is connected to node-123, the client to node-111.

The moment of killing master RabbitMQ process:

client
2016-03-18 13:20:15,488 INFO root client-0 : seq: 13 count: 588 bytes: 1885040
2016-03-18 13:20:16,488 INFO root client-0 : seq: 14 count: 170 bytes: 530547
2016-03-18 13:20:17,488 INFO root client-0 : seq: 15 count: 471 bytes: 1483060

server
2016-03-18 13:20:15,399 INFO root server : seq: 25 count: 586 bytes: 1888334 latency: 0.002 min: 0.002 max: 0.012
2016-03-18 13:20:15,784 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on node-123:5673 closed the connection. Check login credentials: Socket closed
2016-03-18 13:20:16,399 INFO root server : seq: 26 count: 223 bytes: 702102 latency: 0.002 min: 0.002 max: 0.003
2016-03-18 13:20:16,789 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on node-123:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-03-18 13:20:17,399 INFO root server : seq: 27 count: 0 bytes: 0
2016-03-18 13:20:17,802 INFO oslo.messaging._drivers.impl_rabbit Reconnected to AMQP server on node-58:5673 via [amqp] client
2016-03-18 13:20:18,402 INFO root server : seq: 28 count: 840 bytes: 2702260 latency: 0.648 min: 0.064 max: 2.061

So the overall downtime from the server's point of view is 2 seconds.
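Incidentally, the downtime can be quantified directly from the simulator output: each "seq" line reports one second of throughput, so report intervals with "count: 0" are exactly the seconds of zero throughput. A minimal sketch (a hypothetical helper, not part of the simulator):

```python
import re

# Each simulator report line covers one second; "count: 0" means no
# messages were processed during that second.
SEQ_RE = re.compile(r"seq: \d+ count: (\d+)")

def downtime_seconds(log_lines):
    """Count the one-second report intervals with zero throughput."""
    down = 0
    for line in log_lines:
        m = SEQ_RE.search(line)
        if m and int(m.group(1)) == 0:
            down += 1
    return down
```

Counting only "count: 0" intervals gives a lower bound; partially degraded intervals (like seq 26 above) account for the rest of the 2-second estimate.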

The issues start one minute later:

server
2016-03-18 13:21:19,422 INFO root server : seq: 89 count: 0 bytes: 0
2016-03-18 13:21:20,410 ERROR oslo.messaging._drivers.impl_rabbit Failed to process message ... skipping it.

2016-03-18 13:21:26,381 ERROR oslo.messaging._drivers.impl_rabbit AMQP server on node-58:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-03-18 13:21:20,422 INFO root server : seq: 90 count: 10 bytes: 34170 latency: 0.002 min: 0.002 max: 0.002
2016-03-18 13:21:21,425 INFO root server : seq: 91 count: 895 bytes: 2881936 latency: 0.002 min: 0.002 max: 0.010
2016-03-18 13:21:22,426 INFO root server : seq: 92 count: 1015 bytes: 3241000 latency: 0.002 min: 0.001 max: 0.010
2016-03-18 13:21:23,426 INFO root server : seq: 93 count: 1022 bytes: 3269220 latency: 0.002 min: 0.001 max: 0.006
2016-03-18 13:21:24,427 INFO root server : seq: 94 count: 949 bytes: 3050223 latency: 0.002 min: 0.002 max: 0.011
ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2016-03-18 13:21:25,427 INFO root server : seq: 95 count: 899 bytes: 2865040 latency: 0.002 min: 0.001 max: 0.009
2016-03-18 13:21:26,426 INFO root server : seq: 96 count: 0 bytes: 0
------ // -------
2016-03-18 13:21:54,448 INFO root server : seq: 124 count: 0 bytes: 0
2016-03-18 13:21:54,460 INFO oslo.messaging._drivers.impl_rabbit Reconnected to AMQP server on node-58:5673 via [amqp] client
2016-03-18 13:21:55,449 INFO root server : seq: 125 count: 979 bytes: 3117822 latency: 0.002 min: 0.001 max: 0.011

The overall downtime is 30 seconds, with attempts to reconnect to different endpoints (see the attached logs).

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "570"
  build_id: "570"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Ilya Shakhat (shakhat) wrote :
  • Logs (5.8 MiB, application/x-tar)
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

The startup log of RabbitMQ on node-123 shows something unusual: it looks like there was an attempt to start two RabbitMQ instances simultaneously.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

DEPLOYMENT:
 HW (2 x Intel(R) Xeon(R) CPU E5-2680, RAM 256G, HDD 3T), 3 controllers, 7 computes

Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 9.0
Revision history for this message
Ilya Shakhat (shakhat) wrote :

@Roman, why is the importance Medium? The description states that the whole cluster was down for 30 seconds, meaning no service at all during that time.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

@Ilya, that is because so far our expectation has been that a 5-minute recovery is acceptable in case of failover. That is why we consider the requirement of a completely seamless recovery to be 'nice to have'.

Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

version

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Raising importance to High - complete downtime means that no RabbitMQ node is up, meaning HA queues do not survive even a single node death. And they are so precious to us.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/304070

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/306431

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/306431
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/304070
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/324646

Revision history for this message
Dina Belova (dbelova) wrote :

Moving to 9.0-updates after a conversation with Dmitry. The changes are ready, but they require extended reliability testing to ensure the fix actually works; this work is still in progress.

tags: added: move-to-mu
tags: added: 10.0-reviewed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Lowering to Medium because the 30-second downtime fits within the 2-3 minute MTTR, which is by the OCF RA design.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Bogdan, please see comment #10 from me - I raised the importance to High because complete downtime means that no HA queue survives and all messages are lost. Keeping in mind that this is caused by just one node's death, it is not that good; we can do much better.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Marking as Invalid for 10.0, since RabbitMQ's OCF script was moved to the rabbitmq-server package.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/324646
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=fee92985a15fbad8dedaa3dc29b66ddbb6b89182
Submitter: Jenkins
Branch: stable/mitaka

commit fee92985a15fbad8dedaa3dc29b66ddbb6b89182
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Apr 14 20:12:49 2016 +0300

    Rank master score based on start time

    Right now we assign 1000 to the oldest nodes and 1 to the others. That
    creates a problem when the master restarts: no node is promoted until
    it comes back. In that case the returning node will have a score of 1,
    like all the other slaves, and Pacemaker will select to promote it
    again. The node is clean and empty, and afterwards the other slaves
    join it, wiping their data as well. As a result, we lose all the
    messages.

    The new algorithm actually ranks nodes rather than just selecting the
    oldest one. It also maintains the invariant that if node A started
    later than node B, then node A's score must be smaller than that of
    node B. As a result, a freshly started node has no chance of being
    selected in preference to an older node. If several nodes start
    simultaneously, an older node among them might temporarily receive a
    lower score than a younger one, but that is negligible.

    Also remove any action on demote or demote notification - all of
    these duplicate actions done in stop or stop notification. With these
    removed, changing master on a running cluster does not affect RabbitMQ
    cluster in any way - we just declare another node master and that is
    it. It is important for the current change because master score might
    change after initial cluster start up causing master migration from
    one node to another.

    Partial-Bug: #1561894
    Partial-Bug: #1559136
    Change-Id: I2ee629eb07a6145647211da4dbe186aa9c45f61d
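The ranking invariant described in the commit message - a node that started later must always get a smaller master score - can be sketched in a few lines of Python. This is only an illustration of the idea, with hypothetical names; the actual implementation is the OCF resource agent in fuel-library:

```python
def master_scores(start_times, base=1_000_000):
    """Map each node to a promotion score that strictly decreases with
    start time, so a freshly restarted node can never outrank nodes that
    have been up longer. Only the ordering of the scores matters.

    start_times: mapping of node name -> start timestamp (seconds).
    """
    return {node: base - int(ts) for node, ts in start_times.items()}
```

For example, if node-123 (the former master) restarted last, it receives the lowest score, so Pacemaker promotes one of the older slaves instead and the surviving queue contents are preserved.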

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/324647
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=67e9b3d74f22a433da8def35a7c8bfb40f78ae89
Submitter: Jenkins
Branch: stable/mitaka

commit 67e9b3d74f22a433da8def35a7c8bfb40f78ae89
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed May 25 10:48:50 2016 +0300

    Enhance split-brain detection logic

    Previous split brain logic worked as follows: each slave checked
    that it is connected to master. If check fails, slave restarts. The
    ultimate flaw in that logic is that there is little guarantee that
    master is alive at the moment. Moreover, if master dies, it is very
    probable that during the next monitor check slaves will detect its
    death and restart, causing complete RabbitMQ cluster downtime.

    With the new approach master node checks that slaves are connected to
    it and orders them to restart if they are not. The check is performed
    after master node health check, meaning that at least that node
    survives. Also, orders expire in one minute and freshly started node
    ignores orders to restart for three minutes to give cluster time to
    stabilize.

    Also corrected the problem when a node starts and is already
    clustered. In that case the OCF script forgot to start the RabbitMQ
    app, causing a subsequent restart. Now we ensure that the RabbitMQ
    app is running.

    The two introduced attributes rabbit-start-phase-1-time and
    rabbit-ordered-to-restart are made private. In order to allow master
    to set node's order to restart, both ocf_update_private_attr and
    ocf_get_private_attr signatures are expanded to allow passing
    node name.

    Finally, a bug is fixed in ocf_get_private_attr. Unlike crm_attribute,
    attrd_updater returns empty string instead of "(null)", when an
    attribute is not defined on needed node, but is defined on some other
    node. Correspondingly changed code to expect empty string, not a
    "(null)".

    Closes-Bug: #1561894
    Closes-Bug: #1559136

    Change-Id: Ib72794361dac54817975163593ea7e07f7e8b4e1
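The order-and-expiry logic this commit describes - the master orders disconnected slaves to restart, orders expire after one minute, and a freshly started node ignores them for three minutes - can be sketched as follows. The constants and function here are hypothetical illustrations; the real logic lives in the OCF script and communicates through private CIB attributes:

```python
ORDER_TTL = 60        # a master's restart order expires after one minute
STARTUP_GRACE = 180   # a fresh node ignores orders while it stabilizes

def should_restart(now, node_start_time, order_time):
    """Slave-side decision: obey the master's restart order only if the
    order is still fresh and this node is past its startup grace period."""
    if order_time is None:
        return False                       # no order issued
    if now - order_time > ORDER_TTL:
        return False                       # stale order - ignore it
    if now - node_start_time < STARTUP_GRACE:
        return False                       # still stabilizing after start
    return True
```

The grace period is what prevents the restart storm: a node that just rejoined cannot be knocked down again by a leftover order from before its restart.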

tags: added: rabbitmq