Pacemaker tries to start rabbit eternally

Bug #1496386 reported by Ilya Shakhat
This bug affects 2 people
Affects              Status        Importance  Assigned to           Milestone
Fuel for OpenStack   Fix Released  High        Dmitry Mescheryakov
  6.0.x              Won't Fix     High        MOS Maintenance
  6.1.x              Fix Released  Critical    Sergii Rizvan
  7.0.x              Fix Released  Critical    Dmitry Mescheryakov
  8.0.x              Fix Released  High        Dmitry Mescheryakov

Bug Description

When a user tries to enable fix https://review.openstack.org/#/c/217738/ for bug https://bugs.launchpad.net/fuel/+bug/1479815 , the OCF script restarts the slave RabbitMQ nodes. At least one of them goes down and the OCF script fails to bring it back up. The issue can be resolved only by manually fixing the failed nodes.

Steps to reproduce:
1. Set a value for the "max_rabbitmqctl_timeouts" parameter by executing the following from a controller (I did it from the master):
    crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5
Pacemaker tries to restart RabbitMQ; however, on the other (non-master) node RabbitMQ remains in a failed state.

$ pcs resource
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED
     Masters: [ node-2.domain.tld ]
     Slaves: [ node-4.domain.tld ]
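
To confirm the parameter was actually applied, the value can be read back from the cluster configuration. A minimal sketch, assuming the standard crm_resource --get-parameter option is available on the controller:

    # Read back the value set in step 1 (should print 5)
    crm_resource --resource p_rabbitmq-server --get-parameter max_rabbitmqctl_timeouts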

According to lrmd.log, Pacemaker attempts to start RabbitMQ but thinks it is already up and is a member of the cluster:
2015-09-16T12:56:18.201683+00:00 info: INFO: p_rabbitmq-server: notify: post-start begin.
2015-09-16T12:56:18.206365+00:00 info: INFO: p_rabbitmq-server: my_host(): hostlist is: node-5.domain.tld
2015-09-16T12:56:18.945587+00:00 warning: WARNING: p_rabbitmq-server: notify: We are already clustered with node node-2.domain.tld
2015-09-16T12:56:18.958814+00:00 info: INFO: p_rabbitmq-server: notify: post-start end.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

The code that retrieves the cluster nodes always returns OCF_SUCCESS:
    local c_status=$(${OCF_RESKEY_ctl} eval "mnesia:system_info(${infotype})." 2>/dev/null)
    rc=$?

(https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L458-L459)
$? here is the exit status of the 'local' variable assignment, not of the rabbitmqctl command.
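
For reference, a minimal, self-contained illustration of this shell pitfall (made-up function names, not the actual OCF code): when 'local' and the command substitution share one line, $? reports the exit status of the 'local' builtin, which almost always succeeds.

    #!/bin/bash

    buggy() {
        # $? below is the status of 'local', so it is 0 even though 'false' failed
        local value=$(false)
        echo "buggy rc=$?"
    }

    fixed() {
        # Declare first, assign separately: now $? is the status of the command
        local value
        value=$(false)
        echo "fixed rc=$?"
    }

    buggy    # prints: buggy rc=0
    fixed    # prints: fixed rc=1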

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → Critical
milestone: none → 7.0
Revision history for this message
Ilya Shakhat (shakhat) wrote :
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is by design. We have RabbitMQ running, and it tells us that we are part of the same cluster as the node that is marked as the RabbitMQ "master" node.

Endless restarts of RabbitMQ with failures may happen because of a lack of memory or high CPU load, which can lead to timeouts. I would suggest setting this variable before you impose load on the cluster.

Changed in fuel:
status: New → Confirmed
status: Confirmed → New
Revision history for this message
Ilya Shakhat (shakhat) wrote :

The issue is reproducible on a cluster in a "calm" state: no Rally, no load.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I was able to reproduce the issue in my environment.

Changed in fuel:
status: New → Confirmed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Folks,
can you please clarify the steps to reproduce and the user impact? Are you sure it's a blocker for 7.0?

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I'd like to address Vova's question of whether it is actually the attribute change that triggers the restart of the RabbitMQ server. It can be clearly seen in the logs. Below is an example from my environment:

lrmd.log - http://paste.openstack.org/show/464980/
pacemaker.log - http://paste.openstack.org/show/464989/

In lrmd.log it can be seen that a 'stop' operation started out of the blue at 17:12:06.260777. The previous 'monitor' operation had completed successfully 10 seconds before that. At the same time, pacemaker.log shows that the attribute change was detected at 17:12:05:
Sep 16 17:12:05 [4552] node-3.test.domain.local cib: info: cib_perform_op: + /cib/configuration/resources/master[@id='master_p_rabbitmq-server']/primitive[@id='p_rabbitmq-server']/instance_attributes[@id='p_rabbitmq-server-instance_attributes']/nvpair[@id='p_rabbitmq-server-instance_attributes-max_rabbitmqctl_timeouts']: @value=12

and closer to the end of the snippet (within the same second) it decided to restart RabbitMQ on node-3 and node-5 (the slaves).

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Mike,

after some thinking, I consider the bug to be not that critical: it is triggered by changing an attribute, which is not required for normal operation. The attribute change enables fix https://review.openstack.org/#/c/217738/ , which is disabled by default. While that fix is nice to have, it is not critical either. So IMO we can safely address the issue in 8.0-updates.

The only thing is that we will have to revert the doc change https://review.openstack.org/#/c/221680/ , where we suggest using that fix.

Changed in fuel:
importance: Critical → High
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Lowering importance to High; let's discuss if somebody disagrees.

Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

This issue appeared in the scale lab during testing. The situation happens when Pacemaker kills RabbitMQ due to high load and then, because of this particular issue, RabbitMQ cannot be healed and the whole cloud becomes unusable.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Georgy,

Actually, that exact problem appears when the user executes the command
crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5

In that case some RabbitMQ nodes become broken and the OCF script cannot fix them. The workaround is not to use the command above. That is not nice, as it blocks using fix https://review.openstack.org/#/c/217738/ , but we will have to wait until I fix this particular issue.
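
If the parameter has already been set and slaves are stuck in a failed state, one possible recovery sketch (a hypothetical sequence using standard crm_resource options, not the procedure from this report) is to delete the attribute again and clean up the resource so Pacemaker retries the start:

    # Revert the CIB change that triggered the restart
    crm_resource --resource p_rabbitmq-server --delete-parameter max_rabbitmqctl_timeouts

    # Clear the failed state so Pacemaker attempts to start the slaves again
    crm_resource --resource p_rabbitmq-server --cleanup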

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
description: updated
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
status: Confirmed → In Progress
Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

This issue affects normal usage of an OpenStack cloud at a scale of 200 nodes. It blocks the fix for bug https://bugs.launchpad.net/fuel/+bug/1479815 , which is reproducible under normal load on 200 nodes.

This issue should be Critical, as it directly affects the usability of the cloud at 200 nodes. RabbitMQ without Pacemaker management can tolerate high load, whereas with Pacemaker enabled, Pacemaker kills RabbitMQ due to timeouts.
The fix for #1479815 is intended to increase the timeout values so that the OpenStack cloud still has HA for RabbitMQ and still tolerates high load spikes. During a high-load period RabbitMQ can respond slowly, but it keeps functioning.

The fix for this issue should allow the end user to change the timeout values for the RabbitMQ management scripts. Right now that is not possible due to this issue.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Moving the issue back to 7.0, as it has become critical.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
tags: added: rabbitmq
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/226250

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/225120
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Submitter: Jenkins
Branch: master

commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

Changed in fuel:
status: In Progress → Fix Committed
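
The commit message describes the change at a high level. Below is a minimal sketch of the idea, with hypothetical helper and variable names (notify_post_start, is_clustered_with, master_host) rather than the actual fuel-library OCF code:

    # Post-start notify: even when the node reports it is already clustered,
    # the RabbitMQ application may still be stopped after action_start,
    # so start it explicitly.
    notify_post_start() {
        if is_clustered_with "${master_host}"; then
            ocf_log warn "We are already clustered with node ${master_host}"
            # The fix: start the app instead of returning early
            ${OCF_RESKEY_ctl} start_app
        fi
        return $OCF_SUCCESS
    }
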
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/226250
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Submitter: Jenkins
Branch: stable/7.0

commit 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on the MOS 7.0 RC4 ISO. The issue is not reproduced.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "138"
  build_id: "138"
  fuel-nailgun_sha: "3a745ee87e659b3ba239bbede21e491292646acb"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "84335446172cc6a699252c184076a519ac791ca1"
  fuel-nailgun-agent_sha: "d66f188a1832a9c23b04884a14ef00fc5605ec6d"
  astute_sha: "e99368bd77496870592781f4ba4fb0caacb9f3a7"
  fuel-library_sha: "80c2dcf3e298e576dd50111825041466b0e38d3f"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "4c58b6503fc780be117777182165fd7b037b1a96"

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
tags: added: rca-done
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/270314

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/270314
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=bb5340e01fcb8e85a397f1db87b810696b2fb11e
Submitter: Jenkins
Branch: stable/6.1

commit bb5340e01fcb8e85a397f1db87b810696b2fb11e
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
    (cherry picked from commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca)

Dmitry (dtsapikov)
tags: added: on-verification
Revision history for this message
Dmitry (dtsapikov) wrote :

The bug was not reproduced.
Verified on 6.1 + MU5.

tags: removed: on-verification
Revision history for this message
Roman Rufanov (rrufanov) wrote :

Reopening. Please back-port to 6.0; it will be delivered as a patch or as an instruction with steps.

tags: added: customer-found
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

We no longer support MOS 5.1, MOS 6.0, or MOS 6.1.
We deliver only Critical/Security fixes to MOS 7.0 and MOS 8.0.
We deliver only High/Critical/Security fixes to MOS 9.2.
