Rabbit OCF script doesn't reelect master in case of master node failure

Bug #1490941 reported by Eugene Nikanorov
This bug affects 1 person
Affects              Status         Importance  Assigned to       Milestone
Fuel for OpenStack   Fix Released   High        Matthew Mosesohn
6.0.x                Won't Fix      High        MOS Maintenance
6.1.x                Fix Committed  High        Denis Puchkin
7.0.x                Fix Released   High        Matthew Mosesohn

Bug Description

Rabbit OCF script versions 6.0 and 6.1 both show the same behavior.
When the master node experiences ARP cache issues and RabbitMQ has trouble connecting to localhost, the RabbitMQ OCF script keeps trying to bring RabbitMQ up for minutes, cycling through start/stop/cleanup.

It is expected that the script would instead elect a new RabbitMQ master node and give up on bringing RabbitMQ up on the failing node.

This results in a cloud outage.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

To create the aforementioned problems with connections to localhost, one needs to shrink the ARP cache:
sysctl -w net.ipv4.neigh.default.gc_thresh1=4
sysctl -w net.ipv4.neigh.default.gc_thresh2=8
sysctl -w net.ipv4.neigh.default.gc_thresh3=16
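When reproducing this, it may help to record the current thresholds first so the node can be restored without a reboot. A minimal sketch (the restore values 128/512/1024 are common kernel defaults and are an assumption — check the saved file for the actual ones):

```shell
# Save the current ARP neighbor-table thresholds before shrinking them.
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3 > /tmp/arp-thresholds.orig

# ... run the reproduction ...

# Restore afterwards (128/512/1024 are typical defaults; verify against the saved file).
sysctl -w net.ipv4.neigh.default.gc_thresh1=128
sysctl -w net.ipv4.neigh.default.gc_thresh2=512
sysctl -w net.ipv4.neigh.default.gc_thresh3=1024
```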

description: updated
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I am setting this bug to Incomplete because we need to reproduce it.

Changed in fuel:
status: New → Incomplete
importance: Undecided → High
tags: added: ha rabbitmq scale
Changed in fuel:
milestone: none → 8.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: AFAIK from the original customer ticket, the failed node remained in kernel panic with automatic rebooting disabled.
So we should reproduce the exact case: a node in kernel panic while ping to localhost is unstable due to the ARP cache issues described.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Evgeniy, please confirm: did I understand this correctly? One node must be in kernel panic, and the second one must be affected by the unstable localhost ping issue?

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

No, the node doesn't have to be in a kernel panic state.
The 'unstable localhost ping', however, is critical for the repro.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: Incomplete → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAICT, without a deep investigation yet, beam.smp and epmd maintain multiple connections to localhost (read/write fds), and the rabbit OCF logic relies heavily on these. The whole logic seems to break when localhost connections become unstable. I'm not sure whether we can fix this control plane issue, but I will look into what can be done.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced with a sporadic localhost-blocking script: http://pastebin.com/WprLQDJy. Note: save your rules with "iptables-save > /etc/iptables.rules" before starting this script!
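The pastebin script is not preserved in this report; a minimal sketch of the same idea (an assumption, not the original script) that sporadically drops loopback traffic with iptables could look like:

```shell
#!/bin/bash
# Hypothetical sketch (not the original pastebin script): alternately
# block and unblock loopback traffic to simulate an unstable localhost.
# Requires root. Recover with: iptables-restore < /etc/iptables.rules
while true; do
    iptables -A INPUT -i lo -j DROP   # start dropping localhost packets
    sleep $((RANDOM % 5 + 1))         # keep them blocked for 1-5 seconds
    iptables -D INPUT -i lo -j DROP   # remove the rule, traffic flows again
    sleep $((RANDOM % 10 + 1))        # healthy window of 1-10 seconds
done
```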

Results are:
1) many of the beam.smp connections to epmd may end up half-open (stuck in SYN_SENT)
2) when there are half-open connections, rabbitmqctl may hang (or may not) on each command. When it hangs, it spawns multiple beam.smp processes like:
 28399 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra eval rabbit_misc:which_applications().
 30362 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra eval rabbit_misc:which_applications().
 32415 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.4/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-1" -s rabbit_control_main -nodename rabbit@node-1 -extra status
and multiple epmd-starter halt processes like:
 2923 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -sname epmd-starter-443584618 -noshell -eval halt()
3) Pacemaker behaves as follows:
- detects the resource failure and kills rabbit's beam process
- is not always able to recover the resource (because rabbitmqctl may hang)
- sometimes rabbitmqctl commands succeed and allow the resource to recover (rejoin the cluster, report running healthy)
- but rabbitmqctl may still hang, so the rabbit app remains in a broken state
- the first timed-out stop action makes the resource unmanaged. This is expected behavior and would normally STONITH the node, resolving the issue.
- some time later, Pacemaker may return the resource to the managed state and report it as running healthy, while rabbitmqctl commands still hang and there are updates in the lrmd log. This is strange and unexpected; with STONITH enabled, the node would have met its expected "death" before reaching this point.
4) Even killing all beam.smp and epmd processes doesn't help. After a reboot everything recovers.

I believe we cannot fix this behaviour w/o pacemaker fencing enabled.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Set to Wishlist, because STONITH is the only solution and it is a feature.

Changed in fuel:
status: Confirmed → Won't Fix
importance: High → Wishlist
Roman Rufanov (rrufanov)
tags: added: support
Revision history for this message
Andrew Woodward (xarses) wrote :

This makes no sense. Why is a localhost ping ever unstable? Why does it require STONITH to resolve? If it causes cluster failure, it must be addressed.

Revision history for this message
Andrew Woodward (xarses) wrote :

This was found by a customer in 6.0; re-opening.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

"why is a localhost ping ever unstable"
It's actually not a localhost ping, but a ping to one of the local IPs (the management one) where endpoints such as rabbit are located.

tags: added: tricky
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

We never fail rabbitmq over to another controller because we have migration-threshold set to INFINITY. Additionally, the failure-timeout is 60. I also think we should set resource-stickiness to 100 for the rabbitmq service. This will encourage the rabbitmq master to move more freely to the least failing host.
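For reference, meta attributes like these can be adjusted on a live cluster with crmsh. A sketch only — the resource name p_rabbitmq-server is an assumption based on Fuel's naming conventions, and the values mirror the comment above, not the final patch:

```shell
# Hypothetical sketch: let Pacemaker migrate the rabbitmq resource after
# repeated failures instead of retrying on the same node forever.
# Resource name p_rabbitmq-server is assumed from Fuel conventions.
crm resource meta p_rabbitmq-server set migration-threshold 10
crm resource meta p_rabbitmq-server set failure-timeout 60
crm resource meta p_rabbitmq-server set resource-stickiness 100
```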

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/229912

Changed in fuel:
status: Won't Fix → In Progress
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

With the patch I have on review, using Bogdan's script, it now correctly re-elects the master. The master moves to the failed state in about 1 minute; promotion takes another 2 minutes. I'm open to more ideas if you can get RabbitMQ to re-elect a new master more quickly. At least with this solution RabbitMQ doesn't stay down indefinitely.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/229912
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=48e2c672077a438cf138b45830d6ecaa49e9444b
Submitter: Jenkins
Branch: master

commit 48e2c672077a438cf138b45830d6ecaa49e9444b
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/230453

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/230453
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b03aa3414035303bb07a011070cd1e554053539c
Submitter: Jenkins
Branch: stable/7.0

commit b03aa3414035303bb07a011070cd1e554053539c
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941
    (cherry picked from commit 48e2c672077a438cf138b45830d6ecaa49e9444b)

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
milestone: 7.0 → 8.0
tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO data:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "141"
  build_id: "141"
  fuel-nailgun_sha: "1479c0b03ad928f2ea2a819fbf8218cff32e51b9"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "cf699820fb0a4d20bef001861e006dc9797b5733"
  fuel-nailgun-agent_sha: "08e0a11cf1f29b705e4b910d9b9db5e9b708b6e3"
  astute_sha: "a090546d43c770ac27ca81c6f8c78ff0ba4a93e0"
  fuel-library_sha: "cd1b4b67d2b00fb10264d6626327688b170f0bf8"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "3303f41f99cf9167da01d503dd5d2c8dab141447"

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0, custom ISO. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "1260"
  build_id: "2015-10-09_12-02-12"
  nailgun_sha: "edbae54d510edbaa1d379e9523febe5a0e5acd41"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "713698e88c6e1e4ed9ebad759a21266890898d57"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0 MU1. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/252965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/252965
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b92876515f67660cc0ada1300f7a4a46204c0376
Submitter: Jenkins
Branch: stable/6.1

commit b92876515f67660cc0ada1300f7a4a46204c0376
Author: Matthew Mosesohn <email address hidden>
Date: Thu Oct 1 17:24:17 2015 +0300

    Enable migration of rabbitmq resource on failure

    RabbitMQ resource in pacemaker was never migrating
    due to numerous failures. Now it will fail under
    the following circumstances:
    * Over 10 failures (was INFINITY)
    * Fails for over 30s (was 360s)

    Also enabled resource stickiness (was 0) to reduce
    likelihood of moving RabbitMQ master back to the
    failed host.

    Change-Id: I016b7b2ca01ede7ebdd07c06f685299b2654ac8a
    Closes-Bug: #1490941
    (cherry picked from commit 48e2c672077a438cf138b45830d6ecaa49e9444b)

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

MOS 6.0 is no longer supported; moving to Won't Fix.
