RabbitMQ crashed after power off of primary controller

Bug #1513511 reported by Anastasia Palkina
This bug affects 5 people
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  Critical    Alexey Lebedeff
7.0.x               Fix Released  Critical    Denis Puchkin
8.0.x               Fix Released  Critical    Alexey Lebedeff
Mitaka              Fix Released  Critical    Alexey Lebedeff

Bug Description

1. I have a successful deployment (Ubuntu) with 3 controllers (node-7, node-6, node-2), 1 compute and 1 cinder
2. Power off the primary controller (node-7)
3. Wait about 20 minutes
4. Start OSTF tests. The tests for RabbitMQ fail.

PCS status: http://paste.openstack.org/show/478108/

Also there are many crash reports in /<email address hidden>:

=CRASH REPORT==== 5-Nov-2015::15:37:13 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.4541.0>
    registered_name: []
    exception exit: {{shutdown,
                      {failed_to_start_child,rabbit_mgmt_sup,
                       {'EXIT',
                        {{shutdown,
                          [{{already_started,<5030.920.0>},
                            {child,undefined,rabbit_mgmt_db,
                             {rabbit_mgmt_db,start_link,[]},
                             permanent,4294967295,worker,
                             [rabbit_mgmt_db]}}]},
                         {gen_server2,call,
                          [<0.4563.0>,{init,<0.4561.0>},infinity]}}}}},
                     {rabbit_mgmt_app,start,[normal,[]]}}
      in function application_master:init/4 (application_master.erl, line 133)
    ancestors: [<0.4540.0>]
    messages: [{'EXIT',<0.4542.0>,normal}]
    links: [<0.4540.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 135
  neighbours:

Release 7.0 ISO #301 + fuel-provisioning-scripts-7.0.0-7681.1.gite013fd0.noarch

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxab3RJN0tjWmFYZTg/view?usp=sharing

NOTE:
RabbitMQ starts working again after almost 1 hour. This delay and the crash reports are not normal. Crash reports also continue to appear in the logs.

summary: - RAbbitMQ crashed after power off of primary controller
+ RabbitMQ crashed after power off of primary controller
Andrey Sledzinskiy (asledzinskiy) wrote :
description: updated
description: updated
description: updated
Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
status: New → Confirmed
Alexey Lebedeff (alebedev-a) wrote :

So, current findings:

Stack trace about "failed_to_start_child,rabbit_mgmt_sup": it's actually the expected behaviour after a network split, as documented at https://github.com/rabbitmq/rabbitmq-management/blob/master/src/rabbit_mgmt_sup_sup.erl#L19
This should not be the source of the startup failure.

Records in /var/log/syslog about "node name already occupied epmd-starter-443584618" are also unrelated. They are caused by incorrect PRNG initialization (à la https://xkcd.com/221/), and should not have any impact on the startup process either.

Still searching for the real cause of the startup failure.

Alexey Lebedeff (alebedev-a) wrote :

Actually, it IS about "failed_to_start_child,rabbit_mgmt_sup". Let's see whether upstream confirms my suspicions at https://github.com/rabbitmq/rabbitmq-management/issues/81

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Alexey Lebedeff (alebedev-a)
Alexey Lebedeff (alebedev-a) wrote :

It's not a regression: rabbit has had this code since 3.3.0 (somewhere after 2014-02-11).

The OCF script temporarily starts rabbitmq as a sort of healthcheck, and the problem manifests there.
Until there is an alternative, the following workaround will help to avoid the bug:
- First we need to detect whether the node has already joined the cluster (start a temporary beam process, get the list of nodes known to mnesia)
- If the node has not joined yet, we should block all Erlang distribution traffic to/from the outside world (ports from inet_dist_listen_min to inet_dist_listen_max), and this should be done before starting any beam processes, or they can become tainted.
- The block should be removed in "join_to_cluster", just before the call to "rabbitmqctl join_cluster"
- If the node has already joined the cluster, there is no need for blocking
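
As a rough illustration of the workaround steps above (not the actual OCF script change), the decision logic could be sketched in Python. The function names, the rule that a node counts as joined when mnesia knows any node besides itself, the use of iptables REJECT rules and the example port 41055 are all assumptions made for this sketch; only the inet_dist_listen_min/inet_dist_listen_max port range and the "block only if not joined" idea come from the comment.

```python
from typing import List


def is_joined(known_nodes: List[str], self_node: str) -> bool:
    """Assumption: a node is 'joined' if mnesia knows any node besides itself."""
    return any(node != self_node for node in known_nodes)


def dist_rules(action: str, port_min: int, port_max: int) -> List[List[str]]:
    """Build iptables commands covering the Erlang distribution port range.

    action is "-I" to insert the block or "-D" to delete it again.
    The commands are only built here, never executed.
    """
    if action not in ("-I", "-D"):
        raise ValueError("action must be -I or -D")
    if not 0 < port_min <= port_max <= 65535:
        raise ValueError("invalid port range")
    ports = f"{port_min}:{port_max}"
    return [
        ["iptables", action, chain, "-p", "tcp", "--dport", ports, "-j", "REJECT"]
        for chain in ("INPUT", "OUTPUT")
    ]


def rules_before_start(known_nodes: List[str], self_node: str,
                       port_min: int, port_max: int) -> List[List[str]]:
    """Return the blocking rules to apply before any beam process starts."""
    if is_joined(known_nodes, self_node):
        return []  # already a cluster member: no blocking needed
    return dist_rules("-I", port_min, port_max)
```

In this sketch the matching `dist_rules("-D", ...)` commands would be run in "join_to_cluster", right before "rabbitmqctl join_cluster".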

The fix should go into our local copy of the OCF script, because upstream it is the original bug that will be fixed.

Alexey Lebedeff (alebedev-a) wrote :

So, the final writeup on this issue.

There are 2 bugs in rabbitmq that cause this situation.

The first one is https://github.com/rabbitmq/rabbitmq-server/issues/224 , which was fixed in 3.5.5. A node in this state:
a) is not fully functional, while monitoring could say otherwise
b) keeps attempting to communicate with some down nodes, while it shouldn't be doing so
After restarting that node everything normalizes. But we have no way to reliably detect that a node is in such a condition, so the only viable way is upgrading to 3.5.5

The second one is https://github.com/rabbitmq/rabbitmq-management/issues/81 , which is triggered by unnecessary connection attempts from a node with the first bug.
This issue has a proposed fix, but it is not merged into rabbitmq yet.
There are also possible workarounds:
- Fix the first bug, and this one will not be triggered at all (unless you put some effort into it =)
- Disable the management plugin
- Prevent communication with a node that is starting but is not a part of the rabbitmq cluster yet (as described in the previous comment).
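
Since the first bug cannot be detected reliably at runtime, one simple check is whether the installed version is at least 3.5.5. A minimal Python sketch, assuming a Debian-style package version string; the function names are hypothetical, and the result reflects the upstream version only, so a distribution package carrying backported patches may be fixed even when this returns False.

```python
import re
from typing import Tuple

# Upstream release that fixed rabbitmq-server issue 224, per the writeup above.
FIXED_IN = (3, 5, 5)


def upstream_version(package_version: str) -> Tuple[int, ...]:
    """Extract the leading X.Y.Z from a version such as '3.5.4-1~u14.04+mos3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", package_version)
    if match is None:
        raise ValueError(f"unrecognized version: {package_version!r}")
    return tuple(int(part) for part in match.groups())


def has_upstream_fix(package_version: str) -> bool:
    """True if the upstream part of the version is 3.5.5 or newer."""
    return upstream_version(package_version) >= FIXED_IN
```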

Ksenia Svechnikova (kdemina) wrote :

This issue also affects swarm 7.0.system_test.ubuntu.cic_maintenance_mode auto_cic_maintenance_mode, as reboot --force is used for each controller

        Scenario:
            1. Revert snapshot with 3x ['controller', 'mongo'], 2x['compute', 'cinder']
            2. Unexpected reboot (reboot --force)
            3. Wait until controller is switching in maintenance mode
            4. Exit maintenance mode
            5. Check the controller become available

Crash: https://paste.mirantis.net/show/1403/

https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/103/console

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Alexey Lebedeff (alebedev-a) wrote :

Stacktrace provided by Ksenia Demina is related to another bug, which will be fixed in 3.5.7 - https://github.com/rabbitmq/rabbitmq-server/pull/431

tags: added: area-mos
removed: area-library
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (master)

Fix proposed to branch: master
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14592

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/14594

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (master)

Reviewed: https://review.fuel-infra.org/14592
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: ac4f97955df87131e0047f80161271b0af889bf4
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 10 16:44:05 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (8.0)

Reviewed: https://review.fuel-infra.org/14594
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: c497fc96aecf165f0891984b3af3ff695002b6f5
Author: Alexey Lebedeff <email address hidden>
Date: Fri Dec 11 11:32:52 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

Changed in fuel:
status: Confirmed → Fix Committed
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (7.0)

Fix proposed to branch: 7.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/15436

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (7.0)

Reviewed: https://review.fuel-infra.org/15436
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: 7.0

Commit: fb4cdaba3a3b713683a4c16de8a330f6e5eed40e
Author: Alexey Lebedeff <email address hidden>
Date: Tue Dec 22 16:03:54 2015

Backport upstream fix for management plugin

https://github.com/rabbitmq/rabbitmq-management/pull/84

Without this fix node may fail to start after network split.

Change-Id: I901055ea89b88edbbfa5350186ce3ad2d4bc71fb
Closes-Bug: #1513511

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/262754

Sergii Rizvan (srizvan)
tags: added: on-verification
Timur Nurlygayanov (tnurlygayanov) wrote :

The fix for the master branch is on review; marked as "In Progress" for MOS 9.0

Sergii Rizvan (srizvan) wrote :

Verified on MOS 7.0

Packages:
rabbitmq-server
Version:
3.5.4-1~u14.04+mos3

tags: removed: on-verification
no longer affects: fuel/future
tags: added: on-verification
Alexander Zatserklyany (zatserklyany) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "464"

Approach 1
-----------------
./utils/jenkins/system_tests.sh -t test -w $(pwd) -j fuelweb_test -i $ISO_PATH -o --group=auto_cic_maintenance_mode -V ${VENV_PATH} -K
...
----------------------------------------------------------------------
Ran 5 tests in 19547.357s

OK
========

Approach 2
-----------------
1. I have a successful deployment (Ubuntu) with 3 controllers (node-1, node-4, node-5), 1 compute and 1 cinder
2. Power off the primary controller (node-4)
3. Wait about 20 minutes
4. Start OSTF tests.

The tests for RabbitMQ did not fail.
pcs status did not show any failed actions

tags: removed: on-verification
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/262754
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c882b7f9cf74dea07479665d53fe3275e4831d24
Submitter: Jenkins
Branch: master

commit c882b7f9cf74dea07479665d53fe3275e4831d24
Author: Alexey Lebedeff <email address hidden>
Date: Thu Jan 21 15:20:48 2016 +0300

    Improve OCF script diagnostics for timed-out 'list_channels'

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/563

    Currently time-out when running 'rabbitmqctl list_channels' is treated
    as a sign that current node is unhealthy. But it could not be the
    case, as the hanging channel could be actually on some other
    node. Given that currently we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.

    This patch doesn't change any behaviour, only improves logging after
    time-out happens. If time-outs continue to occur (even with latest
    rabbitmq versions or with backported fixes), we could switch to this
    improved list_channels and kill rabbitmq only if stuck channels are
    located on current node. But I hope that all related rabbitmq bugs
    were already closed.

    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511
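
The idea from the commit message, restarting rabbitmq only when the stuck channels live on the current node, could be sketched roughly as follows in Python. This is not the code from the patch or the upstream PR. It assumes the stuck channel pids are already available as strings of the form `<node.X.Y.Z>` (for example `<rabbit@node-7.3.1234.0>`), which is an assumption about the pid format, and all function names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumed pid layout: "<" node name "." three integers ">".
_PID_RE = re.compile(r"^<(.+)\.\d+\.\d+\.\d+>$")


def node_of(pid: str) -> Optional[str]:
    """Return the node name embedded in a pid string, or None if it does not parse."""
    match = _PID_RE.match(pid.strip())
    return match.group(1) if match else None


def stuck_on_local_node(stuck_pids: Iterable[str], local_node: str) -> bool:
    """True if at least one stuck channel is located on local_node."""
    return any(node_of(pid) == local_node for pid in stuck_pids)
```

With this check, a time-out caused only by channels on other nodes would be logged but would not mark the current node as unhealthy.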

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/272608

tags: added: 7.0-mu-2
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/272608
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=98a0698b7e177dee08f972a48fedc817cc9167a7
Submitter: Jenkins
Branch: stable/8.0

commit 98a0698b7e177dee08f972a48fedc817cc9167a7
Author: Alexey Lebedeff <email address hidden>
Date: Thu Jan 21 15:20:48 2016 +0300

    Improve OCF script diagnostics for timed-out 'list_channels'

    Cherry-pick c882b7f9cf74dea07479665d53fe3275e4831d24 from 'master'

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/563

    Currently time-out when running 'rabbitmqctl list_channels' is treated
    as a sign that current node is unhealthy. But it could not be the
    case, as the hanging channel could be actually on some other
    node. Given that currently we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.

    This patch doesn't change any behaviour, only improves logging after
    time-out happens. If time-outs continue to occur (even with latest
    rabbitmq versions or with backported fixes), we could switch to this
    improved list_channels and kill rabbitmq only if stuck channels are
    located on current node. But I hope that all related rabbitmq bugs
    were already closed.

    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511

Sofiia Andriichenko (sandriichenko) wrote :