Fuel for OpenStack

RabbitMQ crashed after power off of primary controller

Bug #1513511 reported by Anastasia Palkina on 2015-11-05

This bug affects 5 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	Critical	Alexey Lebedeff	Fuel for OpenStack 9.0
7.0.x	Fix Released	Critical	Denis Puchkin	Fuel for OpenStack 7.0-mu-2
8.0.x	Fix Released	Critical	Alexey Lebedeff	Fuel for OpenStack 8.0
Mitaka	Fix Released	Critical	Alexey Lebedeff	Fuel for OpenStack 9.0

Bug Description

1. I have a successful deployment (Ubuntu) with 3 controllers (node-7,6,2), 1 compute and 1 cinder
2. Power off the primary controller (node-7)
3. Wait about 20 minutes
4. Start the OSTF tests. The tests for RabbitMQ fail.

PCS status: http://paste.openstack.org/show/478108/

Also there are many crash reports in /<email address hidden>:

=CRASH REPORT==== 5-Nov-2015::15:37:13 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.4541.0>
    registered_name: []
    exception exit: {{shutdown,
                      {failed_to_start_child,rabbit_mgmt_sup,
                       {'EXIT',
                        {{shutdown,
                          [{{already_started,<5030.920.0>},
                            {child,undefined,rabbit_mgmt_db,
                             {rabbit_mgmt_db,start_link,[]},
                             permanent,4294967295,worker,
                             [rabbit_mgmt_db]}}]},
                         {gen_server2,call,
                          [<0.4563.0>,{init,<0.4561.0>},infinity]}}}}},
                     {rabbit_mgmt_app,start,[normal,[]]}}
      in function application_master:init/4 (application_master.erl, line 133)
    ancestors: [<0.4540.0>]
    messages: [{'EXIT',<0.4542.0>,normal}]
    links: [<0.4540.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 135
  neighbours:

Release 7.0 ISO #301 + fuel-provisioning-scripts-7.0.0-7681.1.gite013fd0.noarch

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxab3RJN0tjWmFYZTg/view?usp=sharing

NOTE:
After almost 1 hour RabbitMQ starts working again. Neither this delay nor the crash reports are normal. Crash reports also continue to appear in the logs.

Anastasia Palkina (apalkina) on 2015-11-05

summary:

- RAbbitMQ crashed after power off of primary controller
+ RabbitMQ crashed after power off of primary controller

Andrey Sledzinskiy (asledzinskiy) wrote on 2015-11-05:

https://bugs.launchpad.net/fuel/+bug/1513512 may be related

Anastasia Palkina (apalkina) on 2015-11-05

description:	updated
description:	updated
description:	updated

Dmitry Klenov (dklenov) on 2015-11-06

tags:	added: area-library
Changed in fuel:
status:	New → Confirmed

Alexey Lebedeff (alebedev-a) wrote on 2015-11-06:

So, current findings:

The stacktrace about "failed_to_start_child,rabbit_mgmt_sup" is actually the expected behaviour after a network split; this is documented at https://github.com/rabbitmq/rabbitmq-management/blob/master/src/rabbit_mgmt_sup_sup.erl#L19
This should not be the source of the startup failure.

Records in /var/log/syslog about "node name already occupied epmd-starter-443584618" are also unrelated. They are caused by incorrect PRNG initialization (à la https://xkcd.com/221/), and should not have any impact on the startup process either.

Still searching for the real cause of the startup failure.

Alexey Lebedeff (alebedev-a) wrote on 2015-11-06:

Actually, it IS about "failed_to_start_child,rabbit_mgmt_sup". Let's wait and see whether upstream confirms my suspicions at https://github.com/rabbitmq/rabbitmq-management/issues/81

Dmitry Mescheryakov (dmitrymex) on 2015-11-06

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Alexey Lebedeff (alebedev-a)

Alexey Lebedeff (alebedev-a) wrote on 2015-11-09:

It's not a regression; rabbit has had this code since 3.3.0 (somewhere after 2014-02-11).

The OCF script temporarily starts rabbitmq as a sort of healthcheck, and the problem manifests there.
As long as there is no alternative, the following workaround will help to avoid the bug:
- First we need to detect whether the node has already joined the cluster (start a temporary beam process, get the list of nodes known to mnesia)
- If the node has not joined yet, we should block all Erlang distribution traffic to/from the outside world (ports from inet_dist_listen_min to inet_dist_listen_max), and it should be done before starting any beam processes, or they can become tainted.
- The block should be removed in "join_to_cluster", just before the call to "rabbitmqctl join_cluster"
- If the node has already joined the cluster, there is no need for blocking
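
The blocking step in the workaround above could be sketched roughly as follows. This is only an illustration in Python, not the actual OCF script change: the function names, the example node names and the port value 41055 are assumptions, and the code only builds iptables argument lists without running them. It also assumes that a node counts as "joined" when mnesia knows about at least one node other than itself.

```python
from typing import Iterable, List


def needs_block(known_nodes: Iterable[str], this_node: str) -> bool:
    """Return True if the node looks unclustered (assumption: a joined node
    has at least one node other than itself in mnesia's node list)."""
    return not any(node != this_node for node in known_nodes)


def dist_block_rules(port_min: int, port_max: int, action: str = "-I") -> List[List[str]]:
    """Build iptables argv lists that drop TCP traffic on the Erlang
    distribution port range (inet_dist_listen_min..inet_dist_listen_max).

    action is "-I" to insert the rules and "-D" to delete them again.
    """
    if not 0 < port_min <= port_max <= 65535:
        raise ValueError("invalid port range")
    if action not in ("-I", "-D"):
        raise ValueError("action must be -I or -D")
    port_range = f"{port_min}:{port_max}"
    return [
        # Incoming connections to the local distribution port(s).
        ["iptables", action, "INPUT", "-p", "tcp", "--dport", port_range, "-j", "DROP"],
        # Outgoing connections to distribution port(s) on other nodes.
        ["iptables", action, "OUTPUT", "-p", "tcp", "--dport", port_range, "-j", "DROP"],
    ]


if __name__ == "__main__":
    # Hypothetical values: node names and port are examples only.
    if needs_block(["rabbit@node-6"], "rabbit@node-6"):
        for argv in dist_block_rules(41055, 41055):
            print(" ".join(argv))
```

In this sketch the rules built with "-I" would be applied before any beam process starts, and the same rules built with "-D" would be applied just before "rabbitmqctl join_cluster".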

The fix should go into our local copy of the OCF script, because upstream will fix the original bug instead.

Alexey Lebedeff (alebedev-a) wrote on 2015-11-11:

So, the final writeup on this issue.

There are 2 bugs in rabbitmq that cause this situation.

The first one is https://github.com/rabbitmq/rabbitmq-server/issues/224 , which was fixed in 3.5.5. A node in this state:
a) is not fully functional, while monitoring could say otherwise
b) continues attempting to communicate with some down nodes, while it shouldn't be doing so
After restarting that node everything returns to normal. But we have no way to reliably detect that a node is in this condition, so the only viable way is upgrading to 3.5.5

The second one is https://github.com/rabbitmq/rabbitmq-management/issues/81 , which is triggered by unnecessary connection attempts from a node with the first bug.
This issue has a proposed fix, but it is not merged into rabbitmq yet.
There are also possible workarounds:
- Fix the first bug, and this one will not be triggered at all (unless you put some effort into it =)
- Disable the management plugin
- Prevent communication with a node that is starting but is not a part of the rabbitmq cluster yet (as described in the previous comment).
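
Since the first bug cannot be detected reliably at runtime, a version comparison is about the only check left. A minimal sketch in Python, assuming the package version string begins with the upstream x.y.z version (the helper names are hypothetical, and a distribution package may carry backported fixes that this check cannot see):

```python
import re
from typing import Tuple

# Upstream release that contains the fix for rabbitmq-server issue 224.
FIXED_IN = (3, 5, 5)


def upstream_version(pkg_version: str) -> Tuple[int, ...]:
    """Extract the leading x.y.z part of a package version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", pkg_version)
    if match is None:
        raise ValueError(f"cannot parse version: {pkg_version!r}")
    return tuple(int(part) for part in match.groups())


def has_server_224_fix(pkg_version: str) -> bool:
    """True if the upstream version is 3.5.5 or newer (backports are ignored)."""
    return upstream_version(pkg_version) >= FIXED_IN


if __name__ == "__main__":
    for version in ("3.5.4-1~u14.04+mos3", "3.5.5", "3.6.0"):
        print(version, has_server_224_fix(version))
```

Comparing integer tuples rather than strings keeps a version such as 3.10.0 ordered after 3.5.5.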

Ksenia Svechnikova (kdemina) wrote on 2015-11-11:

This issue also affects swarm 7.0.system_test.ubuntu.cic_maintenance_mode auto_cic_maintenance_mode, as reboot --force is used for each controller

        Scenario:
            1. Revert snapshot with 3x ['controller', 'mongo'], 2x['compute', 'cinder']
            2. Unexpected reboot (reboot --force)
            3. Wait until the controller switches into maintenance mode
            4. Exit maintenance mode
            5. Check that the controller becomes available

Crash: https://paste.mirantis.net/show/1403/

https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/103/console

Dmitry Pyzhov (dpyzhov) on 2015-11-17

no longer affects:

fuel/8.0.x

Alexey Lebedeff (alebedev-a) wrote on 2015-11-20:

Stacktrace provided by Ksenia Demina is related to another bug, which will be fixed in 3.5.7 - https://github.com/rabbitmq/rabbitmq-server/pull/431

Matthew Mosesohn (raytrac3r) on 2015-11-23

tags:

added: area-mos
removed: area-library

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-10: Fix proposed to packages/trusty/rabbitmq-server (master)

Fix proposed to branch: master
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14592

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-10: Fix proposed to packages/trusty/rabbitmq-server (8.0)

#10

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14594

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-11: Fix merged to packages/trusty/rabbitmq-server (master)

#11

Reviewed: https://review.fuel-infra.org/14592
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: ac4f97955df87131e0047f80161271b0af889bf4
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 16:44:05 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-13: Fix merged to packages/trusty/rabbitmq-server (8.0)

#12

Reviewed: https://review.fuel-infra.org/14594
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: c497fc96aecf165f0891984b3af3ff695002b6f5
Author: Alexey Lebedeff <email address hidden>
Date: Fri Dec 11 11:32:52 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

Alexey Lebedeff (alebedev-a) on 2015-12-14

Changed in fuel:
status:	Confirmed → Fix Committed

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-22: Fix proposed to packages/trusty/rabbitmq-server (7.0)

#13

Fix proposed to branch: 7.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/15436

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-12-24: Fix merged to packages/trusty/rabbitmq-server (7.0)

#14

Reviewed: https://review.fuel-infra.org/15436
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: 7.0

Commit: fb4cdaba3a3b713683a4c16de8a330f6e5eed40e
Author: Alexey Lebedeff <email address hidden>
Date: Tue Dec 22 16:03:54 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

OpenStack Infra (hudson-openstack) wrote on 2015-12-31: Fix proposed to fuel-library (master)

#15

Fix proposed to branch: master
Review: https://review.openstack.org/262754

Sergii Rizvan (srizvan) on 2016-01-11

tags:

added: on-verification

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-01-11:

#16

The fix for the master branch is in review; marked as "In Progress" for MOS 9.0

Sergii Rizvan (srizvan) wrote on 2016-01-14:

#17

Verified on MOS 7.0

Packages:
rabbitmq-server
Version:
3.5.4-1~u14.04+mos3

tags:

removed: on-verification

Fuel Devops McRobotson (fuel-devops-robot) on 2016-01-21

no longer affects:

fuel/future

Alexander Zatserklyany (zatserklyany) on 2016-01-22

tags:

added: on-verification

Alexander Zatserklyany (zatserklyany) wrote on 2016-01-22:

#18

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "464"

Approach 1
-----------------
./utils/jenkins/system_tests.sh -t test -w $(pwd) -j fuelweb_test -i $ISO_PATH -o --group=auto_cic_maintenance_mode -V ${VENV_PATH} -K
...
----------------------------------------------------------------------
Ran 5 tests in 19547.357s

OK
========

Approach 2
-----------------
1. I have a successful deployment (Ubuntu) with 3 controllers (node-1,4,5), 1 compute and 1 cinder
2. Power off the primary controller (node-4)
3. Wait about 20 minutes
4. Start the OSTF tests.

The tests for RabbitMQ did not fail.
pcs status did not show any failed actions

Ksenia Svechnikova (kdemina) on 2016-01-25

tags:

removed: on-verification

OpenStack Infra (hudson-openstack) wrote on 2016-01-26: Fix merged to fuel-library (master)

#19

Reviewed: https://review.openstack.org/262754
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c882b7f9cf74dea07479665d53fe3275e4831d24
Submitter: Jenkins
Branch: master

commit c882b7f9cf74dea07479665d53fe3275e4831d24
Author: Alexey Lebedeff <email address hidden>
Date: Thu Jan 21 15:20:48 2016 +0300

Improve OCF script diagnostics for timed-out 'list_channels'

Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/563

    Currently time-out when running 'rabbitmqctl list_channels' is treated
    as a sign that current node is unhealthy. But it could not be the
    case, as the hanging channel could be actually on some other
    node. Given that currently we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.

    This patch doesn't change any behaviour, only improves logging after
    time-out happens. If time-outs continue to occur (even with latest
    rabbitmq versions or with backported fixes), we could switch to this
    improved list_channels and kill rabbitmq only if stuck channels are
    located on current node. But I hope that all related rabbitmq bugs
    were already closed.

    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511
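
The decision described in the commit message (kill rabbitmq only if the stuck channels are located on the current node) might look like the sketch below. It is not the code from the patch: it is an illustrative Python version which assumes that stuck channels are reported as Erlang pids printed in the form <node.X.Y.Z>, and all function and node names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed pid format: "<rabbit@node-6.1.234.0>" (node name, then three integers).
_PID_RE = re.compile(r"^<(?P<node>[^>]+?)\.\d+\.\d+\.\d+>$")


def node_of_pid(pid: str) -> str:
    """Return the node name embedded in a printed channel pid."""
    match = _PID_RE.match(pid.strip())
    if match is None:
        raise ValueError(f"unexpected pid format: {pid!r}")
    return match.group("node")


def local_stuck_channels(stuck_pids: Iterable[str], local_node: str) -> List[str]:
    """Keep only the stuck channels that live on the local node."""
    return [pid for pid in stuck_pids if node_of_pid(pid) == local_node]


def should_restart_local_node(stuck_pids: Iterable[str], local_node: str) -> bool:
    """Restart only when at least one stuck channel is local."""
    return bool(local_stuck_channels(stuck_pids, local_node))


if __name__ == "__main__":
    stuck = ["<rabbit@node-2.1.234.0>", "<rabbit@node-6.3.4567.0>"]
    print(should_restart_local_node(stuck, "rabbit@node-6"))
    print(should_restart_local_node(stuck, "rabbit@node-7"))
```

With a check like this, a time-out caused by a channel hanging on another node would only be logged, and the local node would be left running.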

OpenStack Infra (hudson-openstack) wrote on 2016-01-26: Fix proposed to fuel-library (stable/8.0)

#20

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/272608

Olena Logvinova (ologvinova) on 2016-02-01

tags:

added: 7.0-mu-2

OpenStack Infra (hudson-openstack) wrote on 2016-02-01: Fix merged to fuel-library (stable/8.0)

#21

Reviewed: https://review.openstack.org/272608
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=98a0698b7e177dee08f972a48fedc817cc9167a7
Submitter: Jenkins
Branch: stable/8.0