pacemaker restarts rabbitmq due to 'rabbitmqctl list_channels' timing out

Bug #1515223 reported by Leontii Istomin
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Alexey Lebedeff
7.0.x               Fix Released  High        Rodion Tikunov
8.0.x               Fix Released  High        Alexey Lebedeff
Future              Invalid       Undecided   Alexey Lebedeff

Bug Description

During the boot_and_delete_server_with_secgroups Nova Rally scenario we encountered the following error:
from rally.log: http://paste.openstack.org/show/478520/
from mysql by instance uuid: http://paste.openstack.org/show/478521/
from nova-compute: http://paste.openstack.org/show/478522/
from haproxy by request to neutron: http://paste.openstack.org/show/478524/
from neutron-all on node-197: http://paste.openstack.org/show/478528/
RabbitMQ was stopped on node-77 and node-198, first on node-198: http://paste.openstack.org/show/478530/
from pacemaker.log on node-198: http://paste.openstack.org/show/478531/

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vlan,DVR,Ceph-all,Nova-debug,Nova-quotas,7.0-301-mu1
Controllers:3 Computes:178 Computes+Ceph:20

api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '301'
build_number: '301'
feature_groups:
- mirantis
fuel-agent_sha: 50e90af6e3d560e9085ff71d2950cfbcca91af67
fuel-library_sha: 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c
fuelmain_sha: a65d453215edb0284a2e4761be7a156bb5627677
nailgun_sha: 4162b0c15adb425b37608c787944d1983f543aa8
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-11-11_12-04-03.tar.xz

Tags: area-mos scale
Artem Roma (aroma-x)
Changed in fuel:
assignee: nobody → MOS Nova (mos-nova)
status: New → Confirmed
importance: Undecided → High
milestone: none → 7.0-updates
tags: added: area-mos
Dmitry Mescheryakov (dmitrymex) wrote :

Observations by Leontiy and me: once boot_and_delete_server_with_secgroups starts, RabbitMQ CPU usage rises from 300% to 1800% (visible in atop logs). That is also when 'rabbitmqctl list_channels' starts to time out.

Current plan: we don't yet understand what exactly causes the issue. It is either a large number of messages or large individual messages passing through RabbitMQ. We are going to implement logging of message sizes and reproduce the issue once more.
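
To tell the two hypotheses apart, it helps to record both how many messages pass in a time window and how large they are. A minimal Python sketch, assuming the sizes (in bytes) have already been collected by some hypothetical logging hook; the function name and the 1 MiB threshold are illustrative only and are not part of any patch mentioned in this report:

```python
from statistics import mean


def summarize_message_sizes(sizes, large_threshold=1024 * 1024):
    """Summarize one window of observed message sizes (in bytes).

    `sizes` is assumed to come from a hypothetical logging hook;
    `large_threshold` is an arbitrary illustrative cut-off.
    """
    sizes = list(sizes)
    if not sizes:
        return {"count": 0, "total_bytes": 0, "mean_bytes": 0.0,
                "max_bytes": 0, "large_count": 0}
    return {
        "count": len(sizes),            # high value: many messages
        "total_bytes": sum(sizes),
        "mean_bytes": mean(sizes),
        "max_bytes": max(sizes),        # high value: big messages
        "large_count": sum(1 for s in sizes if s >= large_threshold),
    }
```

A high `count` with a small `max_bytes` would point at message volume, while a low `count` with a non-zero `large_count` would point at message size.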

Changed in fuel:
assignee: MOS Nova (mos-nova) → MOS Oslo (mos-oslo)
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
description: updated
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Alexey Lebedeff (alebedev-a)
Alexey Lebedeff (alebedev-a) wrote :

A check using 'list_channels' with an external timeout is very inaccurate: it can produce false positives, because the bad channels may actually be located on remote nodes.
I'm going to add some diagnostic output to the OCF script near every 'list_channels' invocation, which will show which node is actually responsible for the 'list_channels' timeout.
Once enough data is gathered, and if it proves feasible, we could replace 'list_channels' with this new node-aware diagnostics script.
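
For illustration, a check with an external timeout can be sketched in Python as below. This is not the OCF script (which is a shell script); the helper name, the status strings, and the 30-second limit in the example call are assumptions made for this sketch. It shows why the result is coarse: a timeout only says that the command did not finish, not which node hosts the stuck channel.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` and report whether it finished within `timeout_s` seconds.

    Returns (status, output), where status is "ok", "failed" or "timeout".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the exception.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


# Example call (assumes rabbitmqctl is on PATH; the limit is arbitrary):
# status, out = run_with_timeout(["rabbitmqctl", "list_channels"], 30)
```

A monitor built on this alone would treat every "timeout" as a local failure, which is the false positive described above.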

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/15512

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (master)

Fix proposed to branch: master
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/15513

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (8.0)

Reviewed: https://review.fuel-infra.org/15512
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: 6b45958024465ff64f5466815cf04169da0c9cf5
Author: Alexey Lebedeff <email address hidden>
Date: Thu Dec 24 11:32:21 2015

Backport infinite loop detection

Upstream fix https://github.com/rabbitmq/rabbitmq-common/pull/26 (patch
modified to reflect current state of upstream code).

Sudden death of cluster node could result in a stuck queue process -
this will result in redeclare attempts to hang. With this patch such
condition will be detected - AMQP channel will be closed and error will
be logged. And probably it could help us to discover underlying bug, by
localizing it in time.

And for referenced partial bugs it'll allow us to confirm or reject
hypothesis that it's related.

Change-Id: I09df5c5f2333cc462475798260cdfa9f4f5de654
Partial-Bug: #1515223
Partial-Bug: #1523622

Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Changed in fuel:
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/262754

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/262754
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c882b7f9cf74dea07479665d53fe3275e4831d24
Submitter: Jenkins
Branch: master

commit c882b7f9cf74dea07479665d53fe3275e4831d24
Author: Alexey Lebedeff <email address hidden>
Date: Thu Jan 21 15:20:48 2016 +0300

    Improve OCF script diagnostics for timed-out 'list_channels'

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/563

    Currently time-out when running 'rabbitmqctl list_channels' is treated
    as a sign that current node is unhealthy. But it could not be the
    case, as the hanging channel could be actually on some other
    node. Given that currently we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.

    This patch doesn't change any behaviour, only improves logging after
    time-out happens. If time-outs continue to occur (even with latest
    rabbitmq versions or with backported fixes), we could switch to this
    improved list_channels and kill rabbitmq only if stuck channels are
    located on current node. But I hope that all related rabbitmq bugs
    were already closed.

    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511
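
The possible follow-up mentioned in the commit message above (restart RabbitMQ only if the stuck channels are on the current node) could look roughly like the Python sketch below. It assumes that some diagnostic, hypothetical here, has already produced the names of the nodes hosting stuck channels; the function name and the node names in the usage note are illustrative and are not taken from the OCF script.

```python
def should_restart_local(local_node, nodes_with_stuck_channels):
    """Decide whether the local RabbitMQ should be restarted.

    Returns (restart, remote_nodes): `restart` is True only when the
    local node itself hosts stuck channels; `remote_nodes` lists the
    other affected nodes so they can be logged rather than acted upon.
    """
    stuck = set(nodes_with_stuck_channels)
    remote_nodes = sorted(stuck - {local_node})
    return local_node in stuck, remote_nodes
```

For example, `should_restart_local("rabbit@node-198", ["rabbit@node-77"])` returns `(False, ["rabbit@node-77"])`: the local node is left running and the remote node is only reported.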

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/272608

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/272608
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=98a0698b7e177dee08f972a48fedc817cc9167a7
Submitter: Jenkins
Branch: stable/8.0

commit 98a0698b7e177dee08f972a48fedc817cc9167a7
Author: Alexey Lebedeff <email address hidden>
Date: Thu Jan 21 15:20:48 2016 +0300

    Improve OCF script diagnostics for timed-out 'list_channels'

    Cherry-pick c882b7f9cf74dea07479665d53fe3275e4831d24 from 'master'

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/563

    Currently time-out when running 'rabbitmqctl list_channels' is treated
    as a sign that current node is unhealthy. But it could not be the
    case, as the hanging channel could be actually on some other
    node. Given that currently we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.

    This patch doesn't change any behaviour, only improves logging after
    time-out happens. If time-outs continue to occur (even with latest
    rabbitmq versions or with backported fixes), we could switch to this
    improved list_channels and kill rabbitmq only if stuck channels are
    located on current node. But I hope that all related rabbitmq bugs
    were already closed.

    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511

Ivan Lozgachev (ilozgachev) wrote :

Verified for Fuel 8.0 on ENV-10 Build 482 and ENV-14 Build 496

Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/rabbitmq-server (master)

Change abandoned by Alexey Lebedeff <email address hidden> on branch: master
Review: https://review.fuel-infra.org/15513

Andrew Kalach (akndex)
Changed in fuel:
status: Fix Committed → Fix Released
Rodion Tikunov (rtikunov) wrote :

The patch from comment #10 is already present in 7.0.
So patch https://review.fuel-infra.org/#/c/27554/ is enough to fix the bug.
