Pacemaker tries to start rabbit eternally

Bug #1496386 reported by Ilya Shakhat on 2015-09-16
This bug affects 2 people
Affects              Importance  Assigned to
Fuel for OpenStack   High        Dmitry Mescheryakov
  6.0.x              High        MOS Maintenance
  6.1.x              Critical    Sergii Rizvan
  7.0.x              Critical    Dmitry Mescheryakov
  8.0.x              High        Dmitry Mescheryakov

Bug Description

When a user tries to enable the fix https://review.openstack.org/#/c/217738/ for bug https://bugs.launchpad.net/fuel/+bug/1479815, the OCF script restarts the slave RabbitMQ nodes. At least one of them goes down and the OCF script fails to bring it back up. The issue can only be resolved by manually fixing the failed nodes.

Steps to reproduce:
1. Set a value for the parameter "max_rabbitmqctl_timeouts" by executing the following from a controller (I did it from the master):
    crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5
Pacemaker then tries to restart RabbitMQ; however, on the other (non-master) node RabbitMQ remains in a failed state.

$ pcs resource
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED
     Masters: [ node-2.domain.tld ]
     Slaves: [ node-4.domain.tld ]

According to lrmd.log, Pacemaker attempts to start RabbitMQ but thinks it is already up and a member of the cluster:
2015-09-16T12:56:18.201683+00:00 info: INFO: p_rabbitmq-server: notify: post-start begin.
2015-09-16T12:56:18.206365+00:00 info: INFO: p_rabbitmq-server: my_host(): hostlist is: node-5.domain.tld
2015-09-16T12:56:18.945587+00:00 warning: WARNING: p_rabbitmq-server: notify: We are already clustered with node node-2.domain.tld
2015-09-16T12:56:18.958814+00:00 info: INFO: p_rabbitmq-server: notify: post-start end.
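
For reference, the same state can be inspected from the failed node itself; a diagnostic sketch using standard rabbitmqctl commands:

    # When the Erlang VM is up but the 'rabbit' application is stopped,
    # cluster_status still lists the node under 'nodes' but not under
    # 'running_nodes'.
    rabbitmqctl cluster_status

    # The running application list makes the same distinction explicit:
    # 'rabbit' is absent while the VM itself keeps responding.
    rabbitmqctl eval 'application:which_applications().'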

Ilya Shakhat (shakhat) wrote :

The code that gets the cluster nodes always returns OCF_SUCCESS:
    local c_status=$(${OCF_RESKEY_ctl} eval "mnesia:system_info(${infotype})." 2>/dev/null)
    rc=$?

(https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L458-L459)
Here $? is the exit status of the 'local' variable declaration, not of the rabbitmqctl call, so it is always 0.
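
A minimal shell illustration of the pitfall (the function names here are made up, this is not the actual OCF code):

    # 'local var=$(cmd)' is a single command, so $? afterwards is the exit
    # status of the 'local' builtin (almost always 0), not of cmd.
    broken() {
        local c_status=$(false)
        echo "broken: rc=$?"    # prints rc=0
    }

    # Declaring and assigning separately preserves the command's status.
    fixed() {
        local c_status
        c_status=$(false)
        echo "fixed: rc=$?"     # prints rc=1
    }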

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → Critical
milestone: none → 7.0
Ilya Shakhat (shakhat) wrote :
Vladimir Kuklin (vkuklin) wrote :

This is by design. We have RabbitMQ running, and it tells us that we are part of the same cluster as the node that is marked as the RabbitMQ "master" node.

Eternal restarting of RabbitMQ with failures may happen because of a lack of memory or high CPU load, either of which may lead to timeouts. I would suggest setting this variable before you impose load on the cluster.

Changed in fuel:
status: New → Confirmed
status: Confirmed → New
Ilya Shakhat (shakhat) wrote :

The issue is reproducible on a cluster in a "calm" state: no Rally, no load.

Dmitry Mescheryakov (dmitrymex) wrote :

I was able to reproduce the issue in my environment.

Changed in fuel:
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

Folks,
can you please clarify the steps to reproduce and the user impact? Are you sure it's a blocker for 7.0?

Dmitry Mescheryakov (dmitrymex) wrote :

I'd like to address Vova's question of whether it is actually the attribute change that triggers the restart of the RabbitMQ server. It can be clearly seen in the logs. Below is an example from my environment:

lrmd.log - http://paste.openstack.org/show/464980/
pacemaker.log - http://paste.openstack.org/show/464989/

In lrmd.log it can be seen that the 'stop' operation started all of a sudden at 17:12:06.260777; the previous 'monitor' operation had completed successfully 10 seconds earlier. At the same time, pacemaker.log shows that Pacemaker detected the attribute change at 17:12:05:
Sep 16 17:12:05 [4552] node-3.test.domain.local cib: info: cib_perform_op: + /cib/configuration/resources/master[@id='master_p_rabbitmq-server']/primitive[@id='p_rabbitmq-server']/instance_attributes[@id='p_rabbitmq-server-instance_attributes']/nvpair[@id='p_rabbitmq-server-instance_attributes-max_rabbitmqctl_timeouts']: @value=12

and closer to the end of the snippet (within the same second) it decided to restart RabbitMQ on node-3 and node-5 (the slaves).
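
If anyone wants to double-check what the cluster saw, the parameter can be read back from the CIB; a sketch (exact option spelling may vary between Pacemaker versions):

    # Read the parameter back as Pacemaker stores it
    crm_resource --resource p_rabbitmq-server --get-parameter max_rabbitmqctl_timeouts

    # Or inspect the raw nvpair in the CIB directly
    cibadmin --query --xpath "//nvpair[@id='p_rabbitmq-server-instance_attributes-max_rabbitmqctl_timeouts']"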

Dmitry Mescheryakov (dmitrymex) wrote :

Mike,

after some thinking I consider this bug to be not that critical: it is triggered by changing an attribute, which is not required for normal operation. The attribute change enables the fix https://review.openstack.org/#/c/217738/ , which is disabled by default. While that fix is nice to have, it is not critical either. So IMO we can safely address the issue in 8.0-updates.

The only thing is that we will have to revert the doc change https://review.openstack.org/#/c/221680/ where we suggest using that fix.

Changed in fuel:
importance: Critical → High
Dmitry Mescheryakov (dmitrymex) wrote :

Lowering importance to High; let's discuss it if somebody disagrees.

This issue appeared in the scale lab during testing. The situation happens when Pacemaker kills RabbitMQ due to high load; then, because of this particular issue, RabbitMQ cannot be healed and the whole cloud becomes unusable.

Georgy,

Actually, that exact problem appears when a user executes the command
crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 5

In that case some RabbitMQ nodes become broken and the OCF script cannot fix them. The workaround is not to use the command above. That is not nice, as it blocks the use of the fix https://review.openstack.org/#/c/217738/ , but we will have to wait until I fix this particular issue.
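
If the parameter has already been set, it should be possible to drop it again so the OCF script falls back to its default; a sketch (verify the option on your Pacemaker version):

    # Remove the instance attribute that triggers the restart
    crm_resource --resource p_rabbitmq-server --delete-parameter max_rabbitmqctl_timeouts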

description: updated
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
status: Confirmed → In Progress

This issue affects normal usage of the OpenStack cloud at a scale of 200 nodes. It blocks the fix for bug https://bugs.launchpad.net/fuel/+bug/1479815, which is reproducible under normal load on 200 nodes.

This issue should be Critical as it directly affects usability of the cloud at 200 nodes. RabbitMQ without Pacemaker management can tolerate high load, while with Pacemaker enabled, Pacemaker kills RabbitMQ due to timeouts.
The fix for #1479815 is intended to increase the timeout values so that the OpenStack cloud still has HA for RabbitMQ and still tolerates high load spikes. During a high-load period RabbitMQ can respond slowly, but it keeps functioning.

The fix for this issue should allow the end user to change the timeout values for the RabbitMQ management scripts. Right now this is not possible due to this issue.

Moving the issue back to 7.0, as it became critical.

tags: added: rabbitmq

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/226250

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)

Reviewed: https://review.openstack.org/225120
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Submitter: Jenkins
Branch: master

commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

Changed in fuel:
status: In Progress → Fix Committed
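
For context, a rough outline of the approach described in the commit message; this is not the actual fuel-library code, and is_clustered_with_master() is a made-up stand-in for the agent's own membership check:

    # Assumes the standard OCF shell environment (ocf-shellfuncs sourced,
    # OCF_RESKEY_ctl pointing at rabbitmqctl).
    action_notify() {
        local op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
        if [ "$op" = "post-start" ] && is_clustered_with_master; then
            # The node is already clustered, but the 'rabbit' application is
            # always down after action_start finishes, so start it explicitly
            # instead of returning early.
            ${OCF_RESKEY_ctl} start_app
        fi
        return $OCF_SUCCESS
    }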

Reviewed: https://review.openstack.org/226250
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Submitter: Jenkins
Branch: stable/7.0

commit 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432

tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on the MOS 7.0 RC4 ISO. The issue is not reproduced.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

tags: removed: on-verification
tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "138"
  build_id: "138"
  fuel-nailgun_sha: "3a745ee87e659b3ba239bbede21e491292646acb"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "84335446172cc6a699252c184076a519ac791ca1"
  fuel-nailgun-agent_sha: "d66f188a1832a9c23b04884a14ef00fc5605ec6d"
  astute_sha: "e99368bd77496870592781f4ba4fb0caacb9f3a7"
  fuel-library_sha: "80c2dcf3e298e576dd50111825041466b0e38d3f"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "4c58b6503fc780be117777182165fd7b037b1a96"

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov) on 2015-10-22
tags: added: area-mos
tags: added: rca-done

Reviewed: https://review.openstack.org/270314
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=bb5340e01fcb8e85a397f1db87b810696b2fb11e
Submitter: Jenkins
Branch: stable/6.1

commit bb5340e01fcb8e85a397f1db87b810696b2fb11e
Author: Dmitry Mescheryakov <email address hidden>
Date: Fri Sep 18 15:05:03 2015 +0300

    Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.

    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
    (cherry picked from commit c1900b49e6ddcfb84bf5c501c75f3fee80903eca)

Dmitry (dtsapikov) on 2016-02-25
tags: added: on-verification
Dmitry (dtsapikov) wrote :

The bug was not reproduced.
Verified on 6.1 + MU5.

tags: removed: on-verification
Roman Rufanov (rrufanov) wrote :

Reopening. Please back-port to 6.0; it will be delivered as a patch or as an instruction with steps.

tags: added: customer-found
Alexey Stupnikov (astupnikov) wrote :

We no longer support MOS 5.1, MOS 6.0, or MOS 6.1.
We deliver only Critical/Security fixes to MOS 7.0 and MOS 8.0.
We deliver only High/Critical/Security fixes to MOS 9.2.
