Rabbit OCF script doesn't reelect master in case of master node failure

Bug #1490941 reported by Eugene Nikanorov
This bug affects 1 person
Affects              Status         Importance  Assigned to       Milestone
Fuel for OpenStack   Fix Released   High        Matthew Mosesohn
6.0.x                Won't Fix      High        MOS Maintenance
6.1.x                Fix Committed  High        Denis Puchkin
7.0.x                Fix Released   High        Matthew Mosesohn

Bug Description

Rabbit OCF script versions 6.0 and 6.1 both show the same behavior.
When the master node experiences ARP cache issues and RabbitMQ has trouble connecting to localhost, the RabbitMQ OCF script keeps trying to bring RabbitMQ up for minutes, cycling through start/stop/cleanup.

It is expected that the script would instead elect a new RabbitMQ master node and give up on bringing RabbitMQ up on the failing node.

This results in a cloud outage.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

To create the aforementioned problems with connections to localhost, one needs to shrink the ARP cache:
sysctl -w net.ipv4.neigh.default.gc_thresh1=4
sysctl -w net.ipv4.neigh.default.gc_thresh2=8
sysctl -w net.ipv4.neigh.default.gc_thresh3=16
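When reproducing this, it may help to record the current thresholds first so the node can be restored without a reboot. A minimal sketch (the restore values 128/512/1024 are common kernel defaults and are an assumption — check the saved file for the actual ones):

```shell
# Save the current ARP neighbor-table thresholds before shrinking them.
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3 > /tmp/arp-thresholds.orig

# ... run the reproduction ...

# Restore afterwards (128/512/1024 are typical defaults; verify against the saved file).
sysctl -w net.ipv4.neigh.default.gc_thresh1=128
sysctl -w net.ipv4.neigh.default.gc_thresh2=512
sysctl -w net.ipv4.neigh.default.gc_thresh3=1024
```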

description: updated
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I am setting this bug to Incomplete because we need to reproduce it.

Changed in fuel:
status: New → Incomplete
importance: Undecided → High
tags: added: ha rabbitmq scale
Changed in fuel:
milestone: none → 8.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: AFAIK from the original customer ticket, the failed node remained in kernel panic with automatic rebooting disabled.
So we should reproduce the exact case: a node in kernel panic while ping to localhost is unstable due to the ARP cache issues described.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Evgeniy, please confirm: did I understand this correctly? One node must be in kernel panic, and the second one must be affected by the unstable localhost ping issue?

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

No, the node doesn't have to be in a kernel panic state.
The 'unstable localhost ping', however, is critical for the repro.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: Incomplete → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAICT, without a deep investigation yet, beam.smp and epmd maintain multiple connections to localhost (read/write fds), and the rabbit OCF logic relies heavily on these. The whole logic seems to break when localhost connections become unstable. I'm not sure whether we can fix this control plane issue, but I will look into what can be done.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced with a sporadic localhost-blocking script: http://pastebin.com/WprLQDJy. Note: save your rules with "iptables-save > /etc/iptables.rules" before starting this script!
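The pastebin script is not preserved in this report; a minimal sketch of the same idea (an assumption, not the original script) that sporadically drops loopback traffic with iptables could look like:

```shell
#!/bin/bash
# Hypothetical sketch (not the original pastebin script): alternately
# block and unblock loopback traffic to simulate an unstable localhost.
# Requires root. Recover with: iptables-restore < /etc/iptables.rules
while true; do
    iptables -A INPUT -i lo -j DROP   # start dropping localhost packets
    sleep $((RANDOM % 5 + 1))         # keep them blocked for 1-5 seconds
    iptables -D INPUT -i lo -j DROP   # remove the rule, traffic flows again
    sleep $((RANDOM % 10 + 1))        # healthy window of 1-10 seconds
done
```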

Results are:
1) many of the beam.smp connections to epmd may end up half-open (stuck in SYN_SENT)
2) when there are half-open connections, rabbitmqctl may hang (or may not) on each command. When it hangs, it spawns multiple beam.smp processes like:
 28399 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra eval rabbit_misc:which_applications().
 30362 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra eval rabbit_misc:which_applications().
 32415 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra status
and multiple epmd-starter halt processes like:
 2923 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -sname epmd-starter-443584618 -noshell -eval halt()
3) Pacemaker behaves as follows:
- detects the resource failure and kills rabbit's beam process
- is not always able to recover the resource (because rabbitmqctl may hang)
- sometimes rabbitmqctl commands succeed and allow the resource to recover (rejoin the cluster, report running healthy)
- but rabbitmqctl may still hang, so the rabbit app remains in a broken state
- the first timed-out stop action makes the resource unmanaged. This is expected behavior and would normally STONITH the node, resolving the issue.
- some time later, Pacemaker may return the resource to the managed state and report it as running healthy, while rabbitmqctl commands still hang and there are updates in the lrmd log. This is strange and unexpected; with STONITH enabled, the node would have met its expected "death" before reaching this point.
4) Even killing all beam.smp and epmd processes doesn't help. After a reboot everything recovers.

I believe we cannot fix this behaviour w/o pacemaker fencing enabled.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Set to Wishlist, because STONITH is the only solution and it is a feature.

Changed in fuel:
status: Confirmed → Won't Fix
importance: High → Wishlist
Roman Rufanov (rrufanov)
tags: added: support
Revision history for this message
Andrew Woodward (xarses) wrote :

This makes no sense. Why is a localhost ping ever unstable? Why does it require STONITH to resolve? If it causes cluster failure, it must be addressed.

Revision history for this message
Andrew Woodward (xarses) wrote :

This was found by a customer in 6.0; re-opening.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

"why is a localhost ping ever unstable"
It's actually not a localhost ping, but a ping to one of the local IPs (the management one) where endpoints such as rabbit are located.

tags: added: tricky
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

We never fail rabbitmq over to another controller because we have migration-threshold set to INFINITY. Additionally, the failure-timeout is 60. I also think we should set resource-stickiness to 100 for the rabbitmq service. This will encourage the rabbitmq master to move more freely to the least failing host.
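For reference, meta attributes like these can be adjusted on a live cluster with crmsh. A sketch only — the resource name p_rabbitmq-server is an assumption based on Fuel's naming conventions, and the values mirror the comment above, not the final patch:

```shell
# Hypothetical sketch: let Pacemaker migrate the rabbitmq resource after
# repeated failures instead of retrying on the same node forever.
# Resource name p_rabbitmq-server is assumed from Fuel conventions.
crm resource meta p_rabbitmq-server set migration-threshold 10
crm resource meta p_rabbitmq-server set failure-timeout 60
crm resource meta p_rabbitmq-server set resource-stickiness 100
```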

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/229912

Changed in fuel:
status: Won't Fix → In Progress
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

With the patch I have on review, using Bogdan's script, it now correctly re-elects the master. The master moves to the failed state in about 1 minute; promotion takes another 2 minutes. I'm open to more ideas if you can get RabbitMQ to re-elect a new master more quickly. At least with this solution RabbitMQ doesn't stay down indefinitely.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/229912
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=48e2c672077a438cf138b45830d6ecaa49e9444b
Submitter: Jenkins
Branch: master

commit 48e2c672077a438cf138b45830d6ecaa49e9444b
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/230453

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/230453
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b03aa3414035303bb07a011070cd1e554053539c
Submitter: Jenkins
Branch: stable/7.0

commit b03aa3414035303bb07a011070cd1e554053539c
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941
    (cherry picked from commit 48e2c672077a438cf138b45830d6ecaa49e9444b)

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
milestone: 7.0 → 8.0
tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO data:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "141"
  build_id: "141"
  fuel-nailgun_sha: "1479c0b03ad928f2ea2a819fbf8218cff32e51b9"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "cf699820fb0a4d20bef001861e006dc9797b5733"
  fuel-nailgun-agent_sha: "08e0a11cf1f29b705e4b910d9b9db5e9b708b6e3"
  astute_sha: "a090546d43c770ac27ca81c6f8c78ff0ba4a93e0"
  fuel-library_sha: "cd1b4b67d2b00fb10264d6626327688b170f0bf8"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "3303f41f99cf9167da01d503dd5d2c8dab141447"

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0, custom ISO. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "1260"
  build_id: "2015-10-09_12-02-12"
  nailgun_sha: "edbae54d510edbaa1d379e9523febe5a0e5acd41"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "713698e88c6e1e4ed9ebad759a21266890898d57"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0 MU1. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/252965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/252965
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b92876515f67660cc0ada1300f7a4a46204c0376
Submitter: Jenkins
Branch: stable/6.1

commit b92876515f67660cc0ada1300f7a4a46204c0376
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941
    (cherry picked from commit 48e2c672077a438cf138b45830d6ecaa49e9444b)

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

MOS 6.0 is no longer supported; moving to Won't Fix.
