Pacemaker tries to start rabbit eternally

Bug #1496386 reported by Ilya Shakhat on 2015-09-16
This bug affects 2 people
Affects              Importance  Assigned to
Fuel for OpenStack   High        Dmitry Mescheryakov
  6.0.x              High        MOS Maintenance
  6.1.x              Critical    Sergii Rizvan
  7.0.x              Critical    Dmitry Mescheryakov
  8.0.x              High        Dmitry Mescheryakov

Bug Description

When a user tries to enable the fix https://review.openstack.org/#/c/217738/ for bug https://bugs.launchpad.net/fuel/+bug/1479815, the OCF script restarts the slave RabbitMQ nodes. At least one of them goes down and the OCF script fails to bring it back up. The issue can only be resolved by manually fixing the failed nodes.

Steps to reproduce:
1. Set a value for the parameter "max_rabbitmqctl_timeouts" by executing the following from a controller (I did it from the master):
    crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5
Pacemaker then tries to restart RabbitMQ; however, on the other (non-master) node RabbitMQ remains in a failed state.

$ pcs resource
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED
     Masters: [ node-2.domain.tld ]
     Slaves: [ node-4.domain.tld ]

According to lrmd.log, Pacemaker attempts to start RabbitMQ but thinks it is already up and a member of the cluster:
2015-09-16T12:56:18.201683+00:00 info: INFO: p_rabbitmq-server: notify: post-start begin.
2015-09-16T12:56:18.206365+00:00 info: INFO: p_rabbitmq-server: my_host(): hostlist is: node-5.domain.tld
2015-09-16T12:56:18.945587+00:00 warning: WARNING: p_rabbitmq-server: notify: We are already clustered with node node-2.domain.tld
2015-09-16T12:56:18.958814+00:00 info: INFO: p_rabbitmq-server: notify: post-start end.
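
For reference, the same state can be inspected from the failed node itself; a diagnostic sketch using standard rabbitmqctl commands:

    # When the Erlang VM is up but the 'rabbit' application is stopped,
    # cluster_status still lists the node under 'nodes' but not under
    # 'running_nodes'.
    rabbitmqctl cluster_status

    # The running application list makes the same distinction explicit:
    # 'rabbit' is absent while the VM itself keeps responding.
    rabbitmqctl eval 'application:which_applications().'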

Ilya Shakhat (shakhat) wrote :

The code that gets the cluster nodes always returns OCF_SUCCESS:
    local c_status=$(${OCF_RESKEY_ctl} eval "mnesia:system_info(${infotype})." 2>/dev/null)
    rc=$?

(https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L458-L459)
Here $? is the exit status of the 'local' variable declaration, not of the rabbitmqctl call, so it is always 0.
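
A minimal shell illustration of the pitfall (the function names here are made up, this is not the actual OCF code):

    # 'local var=$(cmd)' is a single command, so $? afterwards is the exit
    # status of the 'local' builtin (almost always 0), not of cmd.
    broken() {
        local c_status=$(false)
        echo "broken: rc=$?"    # prints rc=0
    }

    # Declaring and assigning separately preserves the command's status.
    fixed() {
        local c_status
        c_status=$(false)
        echo "fixed: rc=$?"     # prints rc=1
    }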

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → Critical
milestone: none → 7.0
Ilya Shakhat (shakhat) wrote :
Vladimir Kuklin (vkuklin) wrote :

This is by design. We have RabbitMQ running, and it tells us that we are part of the same cluster as the node that is marked as the RabbitMQ "master" node.

Eternal restarting of RabbitMQ with failures may happen because of a lack of memory or high CPU load, either of which may lead to timeouts. I would suggest setting this variable before you impose load on the cluster.

Changed in fuel:
status: New → Confirmed
status: Confirmed → New
Ilya Shakhat (shakhat) wrote :

The issue is reproducible on a cluster in a "calm" state: no Rally, no load.

Dmitry Mescheryakov (dmitrymex) wrote :

I was able to reproduce the issue in my environment.

Changed in fuel:
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

Folks,
can you please clarify the steps to reproduce and the user impact? Are you sure it's a blocker for 7.0?

Dmitry Mescheryakov (dmitrymex) wrote :

I'd like to address Vova's question of whether it is actually the attribute change that triggers the restart of the RabbitMQ server. It can be clearly seen in the logs. Below is an example from my environment:

lrmd.log - http://paste.openstack.org/show/464980/
pacemaker.log - http://paste.openstack.org/show/464989/

In lrmd.log it can be seen that the 'stop' operation started all of a sudden at 17:12:06.260777; the previous 'monitor' operation had completed successfully 10 seconds earlier. At the same time, pacemaker.log shows that Pacemaker detected the attribute change at 17:12:05:
Sep 16 17:12:05 [4552] node-3.test.domain.local cib: info: cib_perform_op: + /cib/configuration/resources/master[@id='master_p_rabbitmq-server']/primitive[@id='p_rabbitmq-server']/instance_attributes[@id='p_rabbitmq-server-instance_attributes']/nvpair[@id='p_rabbitmq-server-instance_attributes-max_rabbitmqctl_timeouts']: @value=12

and closer to the end of the snippet (within the same second) it decided to restart RabbitMQ on node-3 and node-5 (the slaves).
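
If anyone wants to double-check what the cluster saw, the parameter can be read back from the CIB; a sketch (exact option spelling may vary between Pacemaker versions):

    # Read the parameter back as Pacemaker stores it
    crm_resource --resource p_rabbitmq-server --get-parameter max_rabbitmqctl_timeouts

    # Or inspect the raw nvpair in the CIB directly
    cibadmin --query --xpath "//nvpair[@id='p_rabbitmq-server-instance_attributes-max_rabbitmqctl_timeouts']"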

Dmitry Mescheryakov (dmitrymex) wrote :

Mike,

after some thinking I consider this bug to be not that critical: it is triggered by changing an attribute, which is not required for normal operation. The attribute change enables the fix https://review.openstack.org/#/c/217738/ , which is disabled by default. While that fix is nice to have, it is not critical either. So IMO we can safely address the issue in 8.0-updates.

The only thing is that we will have to revert the doc change https://review.openstack.org/#/c/221680/ where we suggest using that fix.

Changed in fuel:
importance: Critical → High
Dmitry Mescheryakov (dmitrymex) wrote :

Lowering importance to High; let's discuss it if somebody disagrees.

This issue appeared in the scale lab during testing. The situation happens when Pacemaker kills RabbitMQ due to high load; then, because of this particular issue, RabbitMQ cannot be healed and the whole cloud becomes unusable.

Georgy,

Actually, that exact problem appears when a user executes the command
crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5

In that case some RabbitMQ nodes become broken and the OCF script cannot fix them. The workaround is not to use the command above. That is not nice, as it blocks the use of the fix https://review.openstack.org/#/c/217738/ , but we will have to wait until I fix this particular issue.
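
If the parameter has already been set, it should be possible to drop it again so the OCF script falls back to its default; a sketch (verify the option on your Pacemaker version):

    # Remove the instance attribute that triggers the restart
    crm_resource --resource p_rabbitmq-server --delete-parameter max_rabbitmqctl_timeouts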

description: updated
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
status: Confirmed → In Progress

This issue affects normal usage of the OpenStack cloud at a scale of 200 nodes. It blocks the fix for bug https://bugs.launchpad.net/fuel/+bug/1479815, which is reproducible under normal load on 200 nodes.

This issue should be Critical as it directly affects usability of the cloud at 200 nodes. RabbitMQ without Pacemaker management can tolerate high load, while with Pacemaker enabled, Pacemaker kills RabbitMQ due to timeouts.
The fix for #1479815 is intended to increase the timeout values so that the OpenStack cloud still has HA for RabbitMQ and still tolerates high load spikes. During a high-load period RabbitMQ can respond slowly, but it keeps functioning.

The fix for this issue should allow the end user to change the timeout values for the RabbitMQ management scripts. Right now this is not possible due to this issue.

Moving the issue back to 7.0, as it became critical.

tags: added: rabbitmq

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/226250

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)

Reviewed: https://review.openstack.org/225120
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Submitter: Jenkins
Branch: master

commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

Changed in fuel:
status: In Progress → Fix Committed
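
For context, a rough outline of the approach described in the commit message; this is not the actual fuel-library code, and is_clustered_with_master() is a made-up stand-in for the agent's own membership check:

    # Assumes the standard OCF shell environment (ocf-shellfuncs sourced,
    # OCF_RESKEY_ctl pointing at rabbitmqctl).
    action_notify() {
        local op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
        if [ "$op" = "post-start" ] && is_clustered_with_master; then
            # The node is already clustered, but the 'rabbit' application is
            # always down after action_start finishes, so start it explicitly
            # instead of returning early.
            ${OCF_RESKEY_ctl} start_app
        fi
        return $OCF_SUCCESS
    }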

Reviewed: https://review.openstack.org/226250
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Submitter: Jenkins
Branch: stable/7.0

commit 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on the MOS 7.0 RC4 ISO. The issue is not reproduced.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

tags: removed: on-verification
tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "138"
  build_id: "138"
  fuel-nailgun_sha: "3a745ee87e659b3ba239bbede21e491292646acb"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "84335446172cc6a699252c184076a519ac791ca1"
  fuel-nailgun-agent_sha: "d66f188a1832a9c23b04884a14ef00fc5605ec6d"
  astute_sha: "e99368bd77496870592781f4ba4fb0caacb9f3a7"
  fuel-library_sha: "80c2dcf3e298e576dd50111825041466b0e38d3f"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "4c58b6503fc780be117777182165fd7b037b1a96"

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-mos
tags: added: rca-done

Reviewed: https://review.openstack.org/270314
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=bb5340e01fcb8e85a397f1db87b810696b2fb11e
Submitter: Jenkins
Branch: stable/6.1

commit bb5340e01fcb8e85a397f1db87b810696b2fb11e
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
    (cherry picked from commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca)

Dmitry (dtsapikov) on 2016-02-25
tags: added: on-verification
Dmitry (dtsapikov) wrote :

The bug was not reproduced.
Verified on 6.1 + MU5.

tags: removed: on-verification
Roman Rufanov (rrufanov) wrote :

Reopening. Please back-port to 6.0; it will be delivered as a patch or as an instruction with steps.

tags: added: customer-found
Alexey Stupnikov (astupnikov) wrote :

We no longer support MOS 5.1, MOS 6.0, or MOS 6.1.
We deliver only Critical/Security fixes to MOS 7.0 and MOS 8.0.
We deliver only High/Critical/Security fixes to MOS 9.2.
