OSTF 'RabbitMQ availability' test fails after network outage and recovery (cluster_status hangs on controllers)

Bug #1585128 reported by ElenaRossokhina
This bug affects 1 person
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Alexey Lebedeff
Mitaka              Fix Released   High        Dina Belova

Bug Description

Detailed bug description:
fuel-9.0-mos-376-2016-05-19_18-18-59.iso

Steps to reproduce:
1. Create and deploy the following cluster: Neutron VLAN, Cinder/Swift, 3 controller, 2 compute, and 1 cinder node
2. Run OSTF
3. Verify networks
4. Simulate a network outage (see the shell sketch after this list):
For all networks except "admin":
- Locate the bridge associated with the network: "virsh net-dumpxml <network_name>"
- Find all interfaces attached to the bridge using "brctl show <bridge_name>" and note them
- Destroy the network using "virsh net-destroy <network_name>"
5. Restore the network connection after a 5-minute pause:
For all networks except "admin":
- Restore the network using "virsh net-start <network_name>"
- Attach all interfaces to the bridge, according to the data noted in step 4, using "brctl addif <bridge> <iface>"
6. Wait until the OSTF 'HA' suite passes (FAIL)
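
A minimal shell sketch of steps 4-5 as run on the virtualization host; the network, bridge, and interface names below are placeholders for whatever "virsh net-dumpxml" and "brctl show" report on a particular environment:

# Example names only; take the real ones from "virsh net-dumpxml" / "brctl show"
NET=public            # hypothetical libvirt network name
BRIDGE=virbr2         # bridge reported by: virsh net-dumpxml "$NET"
# remember the attached interfaces (adjust the awk to what brctl actually prints)
IFACES=$(brctl show "$BRIDGE" | awk 'NR>1 {print $NF}')

virsh net-destroy "$NET"            # step 4: break the network
sleep 300                           # step 5: ~5 minute pause
virsh net-start "$NET"              # restore the network definition
for IFACE in $IFACES; do            # re-attach the remembered interfaces
    brctl addif "$BRIDGE" "$IFACE"
done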
Expected results:
All steps OK

Actual result:
Step #6 fails:
Time limit exceeded

root@node-1:~# haproxy-status.sh | grep DOWN

'crm status' hangs with the following output: http://paste.openstack.org/show/505909/

Step #5 was executed at 2016-05-24T08:34:33. 'rabbitmqctl cluster_status' has been hanging since then.

See full lrmd and RabbitMQ logs attached in comment #8

Revision history for this message
ElenaRossokhina (esolomina) wrote :

The dump timed out, but I saved the full environment.

tags: added: area-library
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please provide logs; if the dump is failing, a tarball would work as well. Also please provide 'pcs status' output from all controllers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that on the affected node-1 'rabbitmqctl status' works (http://pastebin.com/d70xJ6sp), but 'cluster_status' does indeed hang.
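
On an affected controller the difference is easy to confirm by bounding both commands with a timeout (the 15-second value below is arbitrary):

# "status" returns promptly, "cluster_status" never does; timeout exposes the hang
timeout 15 rabbitmqctl status > /dev/null && echo "status: OK"
timeout 15 rabbitmqctl cluster_status > /dev/null || echo "cluster_status: hung (killed after 15s)"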

tags: added: rabbitmq
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Also, the monitor action cannot detect this type of "failure":
http://pastebin.com/4Sr04E1u

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The same situation is observed on all of the controllers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like some broken Erlang/Mnesia state. I can do nothing here, so I am passing this to MOS Oslo as they have Erlang developers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
summary: OSTF 'RabbitMQ availability' test fails after network outage and
- recovery
+ recovery (cluster_status hangs on controllers)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
description: updated
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

There is a bug in the RabbitMQ autoheal logic. I'm not sure whether it was triggered by the network partitions or by interaction with the OCF script (both of them start/stop rabbit). But I think it doesn't matter: we should simply disable autoheal, because it does exactly the same thing the OCF script does, and the two can interfere with each other.
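
For illustration only, disabling autoheal boils down to changing the cluster_partition_handling policy in rabbitmq.config; a minimal sketch (the actual change is applied through fuel-library, and overwriting the whole file as below would of course drop any other settings):

# Sketch: switch partition handling from "autoheal" to "ignore" so that the
# OCF script is the only entity restarting/resetting nodes after a partition
cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
  {rabbit, [
    {cluster_partition_handling, ignore}
  ]}
].
EOF
# restart the rabbitmq resource afterwards for the setting to take effect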

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/322269

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/322272

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I've manually tested the OCF script's behaviour with autoheal disabled; the OCF script does its job perfectly.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"But I think it doesn't matter - we just should disable autoheal because it does exactly the same thing as OCF script does, and they can interfere with each other."

Sorry, no. The OCF script has nothing to do with partition recovery.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe such a change must be verified first. For example, I have already used a custom Jepsen test to exercise different rabbit modes in the face of network partitions. We could use it as well to compare the number of duplicated/lost/unexpected messages after network partitions a) with built-in autoheal against b) the Pacemaker quorum mode for the OCF resource without autoheal.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Besides that, a scale lab test should be done to verify how the change would impact real workloads.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So, here is a draft of the test plan with which I hope to cover this topic: https://docs.google.com/document/d/1f2CqhwXfH_2dWEQ6nfHjNH8HUfBEAXa1HUlhet0E2vs/edit#

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

The end result of the OCF script's behavior after a partition is EXACTLY the same as that of autoheal: a Mnesia reset of some node in the cluster. And it is a very bad idea to have two different entities doing restarts/resets without any coordination, especially given that they use completely different rules for choosing winners and losers.

So let's look at what these two are doing and what problems they lead to.

OCF script:
- stops rabbit when it detects a network partition (i.e. the node doesn't see the node that Pacemaker currently considers the master)
- after connectivity is restored, it tries to start rabbit
- the start attempt fails because of inconsistent Mnesia, and at that point a Mnesia reset happens

autoheal:
- it's expected that rabbits continue to run even after the network split
- when the network split heals, rabbit arbitrarily decides on the winning partition
- all losing rabbits are stopped and their Mnesia is reset

And this gives us the following problems:
- autoheal is not even always invoked (it happens only after the partition heals, while the OCF script can stop rabbit right after the partition itself)
- autoheal and Pacemaker have different notions of who the winner is, so they can start restarting/resetting different nodes at the same time, and both succeed at their job. This can lead to significant data loss
- autoheal expects to be the only entity responsible for starting/stopping rabbits, so when the OCF script does this at the wrong time, autoheal becomes stuck forever.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326563

no longer affects: fuel/newton
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (master)

Reviewed: https://review.fuel-infra.org/20993
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 0c3f206ee0e7c850c426c8e4c110c26ec8b95135
Author: Alexey Lebedeff <email address hidden>
Date: Tue Jun 7 14:55:43 2016

Merge current state of the HA OCF script

Change-Id: I657719886e9e8fccd6b9d238fe0f93f843da4171
Closes-Bug: 1585128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/326563
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3d8fae9943294ca717e66dfe49ff1efc75061c00
Submitter: Jenkins
Branch: stable/mitaka

commit 3d8fae9943294ca717e66dfe49ff1efc75061c00
Author: Alexey Lebedeff <email address hidden>
Date: Tue Jun 7 18:02:14 2016 +0300

    Check cluster_status liveness during OCF checks

    Upstream PR - https://github.com/rabbitmq/rabbitmq-server/pull/819
    `master`-first policy doesn't apply - OCF script is removed there.

    We've observed an autoheal bug that made cluster_status become stuck
    forever. This will help alleviate the problem before a proper fix for
    autoheal is developed.

    Change-Id: I15c9c5f2257ba7eb6414bf5d1372f5bf2b216e44
    Closes-Bug: 1585128
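
Conceptually, the liveness check added here amounts to running cluster_status under a timeout inside the OCF monitor instead of letting it block forever; a rough sketch of the idea (not the actual OCF code, which is in the upstream PR above; the 30-second value is arbitrary):

# treat a hanging cluster_status as a failure instead of blocking the monitor action
if ! timeout 30 rabbitmqctl cluster_status > /dev/null 2>&1; then
    echo "cluster_status is stuck or failing, reporting the resource as unhealthy"
    exit 1    # a real OCF agent would return OCF_ERR_GENERIC here
fi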

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Fix to master merged in https://review.fuel-infra.org/20993

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I've done some tests and now I'm 100% sure that the OCF script fully replaces the `autoheal` mode of rabbitmq. The `running_nodes` reported by `cluster_status` contains only the nodes with which the given node is consistent. So even without additionally checking the `partitions` field, the OCF script does just the right thing when it sees that the node it considers the master is missing from the `running_nodes` list.
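
For reference, an illustrative cluster_status output (node names are made up); running_nodes lists only the nodes the queried node is consistent with, while partitions lists detected splits:

root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-3']},
 {partitions,[]}]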

tags: added: on-verification
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

Used the scenario from the description.
Result: after waiting ~10 minutes once the networks were recovered, all OSTF tests pass, including 'RabbitMQ availability' and 'RabbitMQ replication':
2016-06-17 16:23:34 DEBUG (ha_base) Result string is resource master_p_rabbitmq-server is running on: node-6.test.domain.local
resource master_p_rabbitmq-server is running on: node-8.test.domain.local
resource master_p_rabbitmq-server is running on: node-7.test.domain.local
2016-06-17 16:23:34 DEBUG (test_rabbit) Current res is resource master_p_rabbitmq-server is running on: node-6.test.domain.local
.....
2016-06-17 16:23:37 DEBUG (ha_base) Result of executing command rabbitmqctl list_channels is Listing channels ...
<email address hidden> nova 0 0
<email address hidden> nova 0 0
<email address hidden> nova 0 0
....
Verified on:
cat /etc/fuel_build_id:
 497
cat /etc/fuel_build_number:
 497
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0

tags: removed: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/322269
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9e8834489e6caec1b7e640ae56d2fbf4bf2e3775
Submitter: Jenkins
Branch: master

commit 9e8834489e6caec1b7e640ae56d2fbf4bf2e3775
Author: Alexey Lebedeff <email address hidden>
Date: Fri May 27 19:47:12 2016 +0300

    Disable rabbitmq cluster partition handling

    Currently the autoheal functionality is a subset of the OCF script
    functionality (autoheal is concerned only with network partitions, while
    the OCF script also handles all sorts of cases where rabbitmq becomes
    stuck). But as both of these functions cause stopping and starting of
    rabbits, this results in unwanted interference, as we've observed in
    1585128, where it triggered a bug in the autoheal code.

    Even if the rabbitmq bug is fixed, it makes no sense to use both
    autoheal and the OCF script (actually, once clustered rabbit stops
    getting stuck during a network split, autoheal will be as good as the
    OCF script).

    Change-Id: I38fbf6d50e9aed35f3c5e3bc1e17de7001304706
    Closes-Bug: 1585128

Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

https://review.openstack.org/#/c/322272/2 is also required for stable/mitaka.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/322272
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=e489354cf8a595233ae6fb2f8c5c4e6ff007814f
Submitter: Jenkins
Branch: stable/mitaka

commit e489354cf8a595233ae6fb2f8c5c4e6ff007814f
Author: Alexey Lebedeff <email address hidden>
Date: Fri May 27 19:47:12 2016 +0300

    Disable rabbitmq cluster partition handling

    Cherry-pick 9e8834489e6caec1b7e640ae56d2fbf4bf2e3775 from master

    Currently the autoheal functionality is a subset of the OCF script
    functionality (autoheal is concerned only with network partitions, while
    the OCF script also handles all sorts of cases where rabbitmq becomes
    stuck). But as both of these functions cause stopping and starting of
    rabbits, this results in unwanted interference, as we've observed in
    1585128, where it triggered a bug in the autoheal code.

    Even if the rabbitmq bug is fixed, it makes no sense to use both
    autoheal and the OCF script (actually, once clustered rabbit stops
    getting stuck during a network split, autoheal will be as good as the
    OCF script).

    Change-Id: I38fbf6d50e9aed35f3c5e3bc1e17de7001304706
    Closes-Bug: 1585128

tags: added: in-stable-mitaka