Make OCF scripts tolerate rabbitmqctl timeouts to a certain degree. UX: rabbitmq node was stopped by pacemaker

Bug #1479815 reported by Leontiy Istomin on 2015-07-30
Affects: Fuel for OpenStack
  Importance: High, Assigned to: Dmitry Mescheryakov
  6.0.x: High, MOS Maintenance
  6.1.x: High, Michal Rostecki
  7.0.x: High, Dmitry Mescheryakov
  8.0.x: High, Dmitry Mescheryakov
  Future: High, Mikhail Chernik
  Mitaka: High, Dmitry Mescheryakov

Bug Description

During the following rally scenarios rabbitmq was stopped on some controller nodes:
node-33:
29 08:16:45 was running boot_and_list_server_batch_x10 rally scenario
node-39:
29 10:24:13 was running boot_and_delete_server_with_secgroups rally scenario
29 22:07:23 was running boot_and_delete_server_with_neutron_secgroups rally scenario

from pacemaker.log on node-33: http://paste.openstack.org/show/406411/
from pacemaker.log on node-39: http://paste.openstack.org/show/406413/

atop on node-33 at 08:16:41: http://paste.openstack.org/show/406416/
atop on node-39 at 10:24:01: http://paste.openstack.org/show/406418/
atop on node-39 at 22:07:21: http://paste.openstack.org/show/406420/

Cluster configuration:
Baremetal,Ubuntu,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-98
Controllers:3 Computes+Ceph:47

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-30_09-14-44.tar.xz

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
milestone: none → 7.0
Vladimir Kuklin (vkuklin) wrote :

Folks

The pacemaker log shows a load average of 25.x; you cannot expect the cluster to handle such load.

Leontiy Istomin (listomin) wrote :

I have reproduced the issue with rabbitmq-3.5.4:
<30>Aug 2 16:02:56 node-1 lrmd: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 2: /usr/sbin/rabbitmqctl -t 30 -q list_queues memory messages consumer_utilisation
<30>Aug 2 16:02:56 node-1 lrmd: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
Aug 02 16:02:56 [8897] node-1.domain.tld pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_monitor_103000:18822:stderr [ Error: operation list_queues on node 'rabbit@node-1' timed out ]
Aug 02 16:02:57 [8900] node-1.domain.tld crmd: info: throttle_handle_load: Moderate CPU load detected: 17.080000

http://paste.openstack.org/show/406748/

Cluster configuration:
Baremetal,Ubuntu,IBP,HA, Neutron-vxlan,Ceph-all,Nova-debug,Nova-quotas,Sahara,Murano,7.0-custom-764
Controllers:3 Computes+Ceph:47

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-08-03_10-03-53.tar.xz

Leontiy Istomin (listomin) wrote :

I have found that rabbitmq consumes all CPU: http://paste.openstack.org/show/412864/
at the time when rabbitmq isn't reachable: http://paste.openstack.org/show/412867/
I monitored the sockets in use at the time:
Thu Aug 13 11:34:03 UTC 2015
rabbitmq with pid=33083 opened 583 files out of a limit of 102300
rabbitmq with pid=33083 used 581 sockets out of a limit of 92068

So socket limits are not the cause of the issue.
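The per-process file descriptor and socket counts above can be gathered with a quick sketch like the following. This assumes a Linux /proc filesystem; the example uses the current shell's pid ($$) so it is self-contained, where in practice you would substitute the RabbitMQ beam pid (33083 in the comment above):

```shell
# Sketch: count open file descriptors for a process via /proc (Linux only).
# In practice set pid to the RabbitMQ beam process id instead of $$.
pid=$$
open_fds=$(ls /proc/"$pid"/fd | wc -l)
soft_limit=$(ulimit -n)
echo "pid=$pid opened $open_fds files from soft limit $soft_limit"
```

Sockets are a subset of these descriptors; comparing the count against the soft limit is how one rules out fd/socket exhaustion, as done above.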

Leontiy Istomin (listomin) wrote :

I can easily reproduce the issue with the NeutronSecGroupPlugin.create_and_delete_secgroups rally scenario. The test ran from 13 15:05:19 to 13 15:58:40 UTC.

During the test rabbitmq was stopped on node-196: http://paste.openstack.org/show/412901/
dstat at the time on node-196: http://paste.openstack.org/show/412902/
from pacemaker on node-196: http://paste.openstack.org/show/412904/
A screenshot of the management console at the time RabbitMQ was stopped on node-196 (taken from node-172) is attached (rabbit_management.png).
node-197 was stopped by pacemaker about a minute after node-196 was stopped.
node-196 and node-197 were not recovered. node-172 worked successfully for the whole test, and the test passed. The result from the management console is attached (rabbit_management_after_test.png).

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-08-13_15-56-36.tar.xz

Bogdan Dobrelya (bogdando) wrote :

What is the impact? Please elaborate. Were all of the rabbit nodes down or not? Was the control plane affected by complete downtime or not? Was the data plane functioning or not? Setting to Medium unless details are provided.

tags: added: rabbitmq
Changed in fuel:
status: Confirmed → Incomplete
importance: High → Medium
Dmitry Mescheryakov (dmitrymex) wrote :

A RabbitMQ reboot on a single machine causes a several-minute outage on the OpenStack side, because right now oslo.messaging cannot seamlessly fail over to live controllers. As discussed with Bogdan, we will make the OCF scripts tolerate rabbitmqctl timeouts to a certain degree by introducing a fail count. This will help us avoid unneeded RabbitMQ reboots and, as a result, OpenStack outages.
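The fail-count idea can be sketched as follows. This is a minimal illustration of the counting logic only, not the actual OCF script: the function names are invented for the example, and the state file uses mktemp where the real script would use a persistent path.

```shell
# Sketch: tolerate up to MAX_TIMEOUTS consecutive rabbitmqctl timeouts
# before declaring the resource failed (names and state file hypothetical).
MAX_TIMEOUTS=3
COUNT_FILE=$(mktemp)        # real script would persist this across monitor runs
echo 0 > "$COUNT_FILE"

on_rabbitmqctl_timeout() {
    local count
    count=$(( $(cat "$COUNT_FILE") + 1 ))
    echo "$count" > "$COUNT_FILE"
    if [ "$count" -lt "$MAX_TIMEOUTS" ]; then
        echo "timed out $count of max. $MAX_TIMEOUTS time(s) in a row. Doing nothing for now."
        return 0            # report success to pacemaker, skip the restart
    fi
    echo 0 > "$COUNT_FILE"
    return 1                # report failure; pacemaker restarts the resource
}

on_rabbitmqctl_success() {
    echo 0 > "$COUNT_FILE"  # any successful check resets the counter
}
```

With MAX_TIMEOUTS=3, two consecutive timeouts are tolerated and the third one fails the resource, which matches the N-1 tolerated / Nth fails semantics documented in the committed change.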

Changed in fuel:
status: Incomplete → In Progress
importance: Medium → High
Bogdan Dobrelya (bogdando) wrote :
summary: + Make OCF scripts tolerate rabbitmqctl timeouts to a certain degree. UX:
rabbitmq node was stopped by pacemaker
Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Dmitry Mescheryakov (dmitrymex)

Reviewed: https://review.openstack.org/217738
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2707a5ebbff7012a94de77b60fd594f5bcb29e05
Submitter: Jenkins
Branch: master

commit 2707a5ebbff7012a94de77b60fd594f5bcb29e05
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Aug 25 17:38:44 2015 +0300

    Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

    The change makes OCF script ignore small number of timeouts of rabbitmqctl
    for 'heavy' operations: list_channels, get_alarms and list_queues.
    Number of tolerated timeouts in a row is configured through a new variable
    'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
    timeouts are not tolerated at all.

    Bug #1487517 is fixed by extracting declaration of local variables
    'rc_alarms' and 'rc_queues' from assignment operations.

    Text for Operations Guide:

    If on node where RabbitMQ is deployed
    other processes consume significant part of CPU, RabbitMQ starts
    responding slow to queries by 'rabbitmqctl' utility. The utility is
    used by RabbitMQ's OCF script to monitor state of the RabbitMQ.
    When utility fails to return in pre-defined timeout, OCF script
    considers RabbitMQ to be down and restarts it, which might lead to
    a limited (several minutes) OpenStack downtime. Such restarts
    are undesirable as they cause downtime without benefit. To
    mitigate the issue, the OCF script might be told to tolerate
    certain amount of rabbitmqctl timeouts in a row using the following
    command:
      crm_resource --resource p_rabbitmq-server --set-parameter \
          max_rabbitmqctl_timeouts --parameter-value N

    Here N should be replaced with the number of timeouts. For instance,
    if it is set to 3, the OCF script will tolerate two rabbitmqctl
    timeouts in a row, but fail if the third one occurs.

    By default the parameter is set to 1, i.e. rabbitmqctl timeout is not
    tolerated at all. The downside of increasing the parameter is that
    if a real issue occurs which causes rabbitmqctl timeout, OCF script
    will detect that only after N monitor runs and so the restart, which
    might fix the issue, will be delayed.

    To understand that RabbitMQ's restart was caused by rabbitmqctl timeout
    you should examine lrmd.log of the corresponding controller on Fuel
    master node in /var/log/docker-logs/remote/ directory. Here lines like
    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

    indicate rabbitmqctl timeout. The next line will explain if it
    caused restart or not. For example:
    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

    DocImpact: user-guide, operations-guide

    Closes-Bug: #1479815
    Closes-Bug: #1487517
    Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca

Changed in fuel:
status: In Progress → Fix Committed
tags: added: release-notes-done

Reviewed: https://review.openstack.org/221680
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=ad7999d3f8288d56c2167fc99c5371acb030fdb3
Submitter: Jenkins
Branch: master

commit ad7999d3f8288d56c2167fc99c5371acb030fdb3
Author: Alexander Adamov <email address hidden>
Date: Wed Sep 9 13:01:39 2015 +0300

    [OpsGuide]Make RabbitMQ OCF script tolerate timeouts

    Adds resolved issue LP1479815 to Operations Guide.

    Change-Id: I3ebd6d510b1d300a895d5d7905aabeaf4c96c1e4
    Related-Bug: #1479815

Leontiy Istomin (listomin) wrote :

Changing status to Confirmed for 7.0 because the patch has been reverted.

Changed in fuel:
status: Fix Committed → Confirmed
Polina Petriuk (ppetriuk) wrote :

Leontiy, what was the reason for reverting the patch for MOS7.0?

tags: added: customer-found

One more repro on the Scale lab, env 10. lrmd.log from node-100:

2015-09-19T10:28:31.956130+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-09-19T10:28:31.989742+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 1 of max. 1 time(s) in a row and is not responding. The resource is failed.

Polina, the patch itself is not reverted. But due to the bug https://bugs.launchpad.net/fuel/+bug/1496386, enabling the fix for the current issue breaks the RabbitMQ cluster.

Right now we are working on a fix for #1496386. If we do not make it into the 7.0 release, we will revert the documentation change that suggests enabling the fix. Here is the revert: https://review.openstack.org/#/c/224556/ . Leontiy was referring to it.
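To spot these events in collected logs, a grep along these lines works. This is a sketch: on a Fuel master the input would typically be the lrmd.log files under /var/log/docker-logs/remote/ as described in the committed fix; here it filters sample lines modeled on the ones quoted above so the example is self-contained.

```shell
# Sketch: filter lrmd log lines that indicate a rabbitmqctl timeout (exit 137)
# and the OCF script's reaction to it. Replace the here-document with e.g.
# /var/log/docker-logs/remote/<node>/lrmd.log on a real Fuel master.
grep -E "invoked command exited 137|timed out .* time\(s\) in a row" <<'EOF'
2015-09-19T10:28:31 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-09-19T10:28:31 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 1 of max. 1 time(s) in a row and is not responding. The resource is failed.
2015-09-19T10:28:32 info: INFO: p_rabbitmq-server: some unrelated monitor line
EOF
```

The first matched line shows the timeout itself; the second shows whether the fail count led to a restart or was tolerated.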

Vitaly Sedelnik (vsedelnik) wrote :

This bug will be fixed by merging https://review.openstack.org/#/c/225120, which fixes a Critical 7.0 issue, so leaving this issue targeted to the 7.0 milestone.

The fix for the issue that blocked enabling the fix for the current bug has been merged. Hence we are not going to revert our documentation and can consider the current bug closed.

Changed in fuel:
status: Confirmed → Fix Committed
Leontiy Istomin (listomin) wrote :

I've applied https://review.openstack.org/#/c/225120 and performed
crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5
then I restarted rabbitmq by running:
pcs resource disable master_p_rabbitmq-server
waiting for rabbitmq to stop, then:
pcs resource enable master_p_rabbitmq-server
During the boot_and_delete_server_with_neutron_secgroups rally scenario, RabbitMQ broke even in this case due to a "list_queues timeout".

Changed in fuel:
status: Fix Committed → Confirmed

As Leontiy pointed out, the fix does not fully help in the Scale lab's case. After discussing the issue with Leontiy, we decided to continue tracking the Scale lab problem in this bug.

Reviewed: https://review.openstack.org/222614
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Submitter: Jenkins
Branch: stable/6.1

commit a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Aug 25 17:38:44 2015 +0300

    Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

    The change makes OCF script ignore small number of timeouts of rabbitmqctl
    for 'heavy' operations: list_channels, get_alarms and list_queues.
    Number of tolerated timeouts in a row is configured through a new variable
    'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
    timeouts are not tolerated at all.

    Bug #1487517 is fixed by extracting declaration of local variables
    'rc_alarms' and 'rc_queues' from assignment operations.

    Text for Operations Guide:

    If on node where RabbitMQ is deployed
    other processes consume significant part of CPU, RabbitMQ starts
    responding slow to queries by 'rabbitmqctl' utility. The utility is
    used by RabbitMQ's OCF script to monitor state of the RabbitMQ.
    When utility fails to return in pre-defined timeout, OCF script
    considers RabbitMQ to be down and restarts it, which might lead to
    a limited (several minutes) OpenStack downtime. Such restarts
    are undesirable as they cause downtime without benefit. To
    mitigate the issue, the OCF script might be told to tolerate
    certain amount of rabbitmqctl timeouts in a row using the following
    command:
      crm_resource --resource p_rabbitmq-server --set-parameter \
          max_rabbitmqctl_timeouts --parameter-value N

    Here N should be replaced with the number of timeouts. For instance,
    if it is set to 3, the OCF script will tolerate two rabbitmqctl
    timeouts in a row, but fail if the third one occurs.

    By default the parameter is set to 1, i.e. rabbitmqctl timeout is not
    tolerated at all. The downside of increasing the parameter is that
    if a real issue occurs which causes rabbitmqctl timeout, OCF script
    will detect that only after N monitor runs and so the restart, which
    might fix the issue, will be delayed.

    To understand that RabbitMQ's restart was caused by rabbitmqctl timeout
    you should examine lrmd.log of the corresponding controller on Fuel
    master node in /var/log/docker-logs/remote/ directory. Here lines like
    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

    indicate rabbitmqctl timeout. The next line will explain if it
    caused restart or not. For example:
    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

    DocImpact: user-guide, operations-guide

    Closes-Bug: #1479815
    Closes-Bug: #1487517
    Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
    (cherry picked from commit 2707a5ebbff7012a94de77b60fd594f5bcb29e05)

Dmitry Pyzhov (dpyzhov) on 2015-10-12
no longer affects: fuel/8.0.x
Vitaly Sedelnik (vsedelnik) wrote :

Moved from 7.0-mu-1 to 7.0-updates as the root cause is still unknown and we don't expect the fix to be available soon.

tags: added: support
Polina Petriuk (ppetriuk) wrote :

This OCF script is affected by the issue described in https://bugs.launchpad.net/mos/+bug/1503331.

Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-mos

We agreed with Mikhail Chernik that he will investigate whether a 200-node environment is still affected by this issue.

Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Mikhail Chernik (mchernik)
Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Artem Roma (aroma-x) on 2016-01-04
Changed in fuel:
status: New → Confirmed
Roman Podoliaka (rpodolyaka) wrote :

Moving to Incomplete until we hear from Mikhail

Changed in fuel:
status: Confirmed → Incomplete
Mikhail Chernik (mchernik) wrote :

Ran the NeutronSecGroupPlugin.create_and_delete_secgroups Rally scenario on MOS 8.0, ISO 482.

Cluster configuration: Neutron+VLAN with DVR. 200 computes, 3 controllers, Cinder.

No failed connections to RabbitMQ were detected during the test run.

Mikhail Chernik (mchernik) wrote :

Update: the scenario mentioned in the previous comment was executed in 5 threads. When running the same scenario in 20 threads, RabbitMQ began to restart; however, the test completed without errors and at least one RabbitMQ instance remained online.

Roman Podoliaka (rpodolyaka) wrote :

Per discussion with Dmitry: he is going to continue the investigation of this problem on Mikhail's environment. Still, this must not be a blocker for us; thus, tentatively moving this to 8.0-updates.

Changed in fuel:
assignee: Mikhail Chernik (mchernik) → Dmitry Mescheryakov (dmitrymex)
status: Incomplete → Confirmed
tags: added: move-to-mu

(This check is performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

version

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in fuel:
milestone: 9.0 → 10.0
Roman Rufanov (rrufanov) wrote :

Please back-port to 6.0; it will be delivered to customers as a patch or as instructions with steps.

Roman, the backport to 6.x can be found in this CR: https://review.openstack.org/#/c/316053 (it is a cumulative set of changes between 6.x and 9.0)

Guys, after further thinking I am closing this bug. The scope of this issue was to cover intermittent 'rabbitmqctl list_channels' timeouts due to high load. We addressed that with the change https://review.openstack.org/217738

What people observe after that are RabbitMQ bugs which cause list_channels to time out until the RabbitMQ cluster is restarted. That new issue will be tracked in bug https://bugs.launchpad.net/mos/+bug/1566816

Changed in fuel:
status: Confirmed → Invalid
Alexey Stupnikov (astupnikov) wrote :

We no longer support MOS5.1, MOS6.0, MOS6.1
We deliver only Critical/Security fixes to MOS7.0, MOS8.0.
We deliver only High/Critical/Security fixes to MOS9.2.
