OSTF 'RabbitMQ availability' test fails after network outage and recovery (cluster_status hangs on controllers)

Bug #1585128 reported by ElenaRossokhina
This bug affects 1 person
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Alexey Lebedeff
Mitaka              Fix Released   High        Dina Belova

Bug Description

Detailed bug description:
fuel-9.0-mos-376-2016-05-19_18-18-59.iso

Steps to reproduce:
1. Create and deploy the following cluster: Neutron VLAN, Cinder/Swift, 3 controller, 2 compute, and 1 cinder node
2. Run OSTF
3. Verify networks
4. Simulate a network outage (see the shell sketch after this list):
For all networks except "admin":
- Locate the bridge associated with the network: "virsh net-dumpxml <network_name>"
- Find all interfaces attached to the bridge using "brctl show <bridge_name>" and note them
- Destroy the network using "virsh net-destroy <network_name>"
5. Restore the network connection after a 5-minute pause:
For all networks except "admin":
- Restore the network using "virsh net-start <network_name>"
- Attach all interfaces to the bridge, according to the data noted in step 4, using "brctl addif <bridge> <iface>"
6. Wait until the OSTF 'HA' suite passes (FAIL)
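
A minimal shell sketch of steps 4-5 as run on the virtualization host; the network, bridge, and interface names below are placeholders for whatever "virsh net-dumpxml" and "brctl show" report on a particular environment:

# Example names only; take the real ones from "virsh net-dumpxml" / "brctl show"
NET=public            # hypothetical libvirt network name
BRIDGE=virbr2         # bridge reported by: virsh net-dumpxml "$NET"
# remember the attached interfaces (adjust the awk to what brctl actually prints)
IFACES=$(brctl show "$BRIDGE" | awk 'NR>1 {print $NF}')

virsh net-destroy "$NET"            # step 4: break the network
sleep 300                           # step 5: ~5 minute pause
virsh net-start "$NET"              # restore the network definition
for IFACE in $IFACES; do            # re-attach the remembered interfaces
    brctl addif "$BRIDGE" "$IFACE"
done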
Expected results:
All steps OK

Actual result:
Step #6 fails:
Time limit exceeded

root@node-1:~# haproxy-status.sh | grep DOWN

'crm status' hangs with the following output: http://paste.openstack.org/show/505909/

Step #5 was executed at 2016-05-24T08:34:33. 'rabbitmqctl cluster_status' has been hanging since then.

See full lrmd and RabbitMQ logs attached in comment #8

Revision history for this message
ElenaRossokhina (esolomina) wrote :

The dump timed out, but I saved the full environment.

tags: added: area-library
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please provide logs; if the dump is failing, a tarball would work as well. Also please provide 'pcs status' output from all controllers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that on the affected node-1 'rabbitmqctl status' works (http://pastebin.com/d70xJ6sp), but 'cluster_status' does indeed hang.
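
On an affected controller the difference is easy to confirm by bounding both commands with a timeout (the 15-second value below is arbitrary):

# "status" returns promptly, "cluster_status" never does; timeout exposes the hang
timeout 15 rabbitmqctl status > /dev/null && echo "status: OK"
timeout 15 rabbitmqctl cluster_status > /dev/null || echo "cluster_status: hung (killed after 15s)"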

tags: added: rabbitmq
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Also, the monitor action cannot detect this type of "failure":
http://pastebin.com/4Sr04E1u

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The same situation is observed on all of the controllers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like some broken Erlang/Mnesia state. I can do nothing here, so I am passing this to MOS Oslo as they have Erlang developers.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
summary: OSTF 'RabbitMQ availability' test fails after network outage and
- recovery
+ recovery (cluster_status hangs on controllers)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
description: updated
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

There is a bug in the RabbitMQ autoheal logic. I'm not sure whether it was triggered by the network partitions or by interaction with the OCF script (both of them start/stop rabbit). But I think it doesn't matter: we should simply disable autoheal, because it does exactly the same thing the OCF script does, and the two can interfere with each other.
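
For illustration only, disabling autoheal boils down to changing the cluster_partition_handling policy in rabbitmq.config; a minimal sketch (the actual change is applied through fuel-library, and overwriting the whole file as below would of course drop any other settings):

# Sketch: switch partition handling from "autoheal" to "ignore" so that the
# OCF script is the only entity restarting/resetting nodes after a partition
cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
  {rabbit, [
    {cluster_partition_handling, ignore}
  ]}
].
EOF
# restart the rabbitmq resource afterwards for the setting to take effect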

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/322269

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/322272

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I've manually tested the OCF script's behaviour with autoheal disabled; the OCF script does its job perfectly.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"But I think it doesn't matter - we just should disable autoheal because it does exactly the same thing as OCF script does, and they can interfere with each other."

Sorry, no. The OCF script has nothing to do with partition recovery.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe such a change must be verified first. For example, I have already used a custom Jepsen test to exercise different rabbit modes in the face of network partitions. We could use it as well to compare the number of duplicated/lost/unexpected messages after network partitions a) with built-in autoheal against b) the Pacemaker quorum mode for the OCF resource without autoheal.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Besides that, a scale lab test should be done to verify how the change would impact real workloads.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So, here is a draft of the test plan with which I hope to cover this topic: https://docs.google.com/document/d/1f2CqhwXfH_2dWEQ6nfHjNH8HUfBEAXa1HUlhet0E2vs/edit#

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

The end result of the OCF script's behavior after a partition is EXACTLY the same as that of autoheal: a Mnesia reset of some node in the cluster. And it is a very bad idea to have two different entities doing restarts/resets without any coordination, especially given that they use completely different rules for choosing winners and losers.

So let's look at what these two are doing and what problems they lead to.

OCF script:
- stops rabbit when it detects a network partition (i.e. the node doesn't see the node that Pacemaker currently considers the master)
- after connectivity is restored, it tries to start rabbit
- the start attempt fails because of inconsistent Mnesia, and at that point a Mnesia reset happens

autoheal:
- it's expected that rabbits continue to run even after the network split
- when the network split heals, rabbit arbitrarily decides on the winning partition
- all losing rabbits are stopped and their Mnesia is reset

And this gives us the following problems:
- autoheal is not even always invoked (it happens only after the partition heals, while the OCF script can stop rabbit right after the partition itself)
- autoheal and Pacemaker have different notions of who the winner is, so they can start restarting/resetting different nodes at the same time, and both succeed at their job. This can lead to significant data loss
- autoheal expects to be the only entity responsible for starting/stopping rabbits, so when the OCF script does this at the wrong time, autoheal becomes stuck forever.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326563

no longer affects: fuel/newton
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/rabbitmq-server (master)

Reviewed: https://review.fuel-infra.org/20993
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 0c3f206ee0e7c850c426c8e4c110c26ec8b95135
Author: Alexey Lebedeff <email address hidden>
Date: Tue Jun 7 14:55:43 2016

Merge current state of the HA OCF script

Change-Id: I657719886e9e8fccd6b9d238fe0f93f843da4171
Closes-Bug: 1585128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/326563
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3d8fae9943294ca717e66dfe49ff1efc75061c00
Submitter: Jenkins
Branch: stable/mitaka

commit 3d8fae9943294ca717e66dfe49ff1efc75061c00
Author: Alexey Lebedeff <email address hidden>
Date: Tue Jun 7 18:02:14 2016 +0300

    Check cluster_status liveness during OCF checks

    Upstream PR - https://github.com/rabbitmq/rabbitmq-server/pull/819
    `master`-first policy doesn't apply - OCF script is removed there.

    We've observed an autoheal bug that made cluster_status become stuck
    forever. This will help alleviate the problem before a proper fix for
    autoheal is developed.

    Change-Id: I15c9c5f2257ba7eb6414bf5d1372f5bf2b216e44
    Closes-Bug: 1585128
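
Conceptually, the liveness check added here amounts to running cluster_status under a timeout inside the OCF monitor instead of letting it block forever; a rough sketch of the idea (not the actual OCF code, which is in the upstream PR above; the 30-second value is arbitrary):

# treat a hanging cluster_status as a failure instead of blocking the monitor action
if ! timeout 30 rabbitmqctl cluster_status > /dev/null 2>&1; then
    echo "cluster_status is stuck or failing, reporting the resource as unhealthy"
    exit 1    # a real OCF agent would return OCF_ERR_GENERIC here
fi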

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Fix to master merged in https://review.fuel-infra.org/20993

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I've done some tests and now I'm 100% sure that the OCF script fully replaces the `autoheal` mode of rabbitmq. The `running_nodes` reported by `cluster_status` contains only the nodes with which the given node is consistent. So even without additionally checking the `partitions` field, the OCF script does just the right thing when it sees that the node it considers the master is missing from the `running_nodes` list.
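
For reference, an illustrative cluster_status output (node names are made up); running_nodes lists only the nodes the queried node is consistent with, while partitions lists detected splits:

root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-3']},
 {partitions,[]}]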

tags: added: on-verification
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

Used the scenario from the description.
Result: after waiting ~10 minutes once the networks were recovered, all OSTF tests pass, including 'RabbitMQ availability' and 'RabbitMQ replication':
2016-06-17 16:23:34 DEBUG (ha_base) Result string is resource master_p_rabbitmq-server is running on: node-6.test.domain.local
resource master_p_rabbitmq-server is running on: node-8.test.domain.local
resource master_p_rabbitmq-server is running on: node-7.test.domain.local
2016-06-17 16:23:34 DEBUG (test_rabbit) Current res is resource master_p_rabbitmq-server is running on: node-6.test.domain.local
.....
2016-06-17 16:23:37 DEBUG (ha_base) Result of executing command rabbitmqctl list_channels is Listing channels ...
<email address hidden> nova 0 0
<email address hidden> nova 0 0
<email address hidden> nova 0 0
....
Verified on:
cat /etc/fuel_build_id:
 497
cat /etc/fuel_build_number:
 497
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0

tags: removed: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/322269
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9e8834489e6caec1b7e640ae56d2fbf4bf2e3775
Submitter: Jenkins
Branch: master

commit 9e8834489e6caec1b7e640ae56d2fbf4bf2e3775
Author: Alexey Lebedeff <email address hidden>
Date: Fri May 27 19:47:12 2016 +0300

    Disable rabbitmq cluster partition handling

    Currently the autoheal functionality is a subset of the OCF script
    functionality (autoheal is concerned only with network partitions, while
    the OCF script also handles all sorts of cases where rabbitmq becomes
    stuck). But as both of these functions cause stopping and starting of
    rabbits, this results in unwanted interference, as we've observed in
    1585128, where it triggered a bug in the autoheal code.

    Even if the rabbitmq bug is fixed, it makes no sense to use both
    autoheal and the OCF script (actually, once clustered rabbit stops
    getting stuck during a network split, autoheal will be as good as the
    OCF script).

    Change-Id: I38fbf6d50e9aed35f3c5e3bc1e17de7001304706
    Closes-Bug: 1585128

Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

https://review.openstack.org/#/c/322272/2 is also required for stable/mitaka.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/322272
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=e489354cf8a595233ae6fb2f8c5c4e6ff007814f
Submitter: Jenkins
Branch: stable/mitaka

commit e489354cf8a595233ae6fb2f8c5c4e6ff007814f
Author: Alexey Lebedeff <email address hidden>
Date: Fri May 27 19:47:12 2016 +0300

    Disable rabbitmq cluster partition handling

    Cherry-pick 9e8834489e6caec1b7e640ae56d2fbf4bf2e3775 from master

    Currently the autoheal functionality is a subset of the OCF script
    functionality (autoheal is concerned only with network partitions, while
    the OCF script also handles all sorts of cases where rabbitmq becomes
    stuck). But as both of these functions cause stopping and starting of
    rabbits, this results in unwanted interference, as we've observed in
    1585128, where it triggered a bug in the autoheal code.

    Even if the rabbitmq bug is fixed, it makes no sense to use both
    autoheal and the OCF script (actually, once clustered rabbit stops
    getting stuck during a network split, autoheal will be as good as the
    OCF script).

    Change-Id: I38fbf6d50e9aed35f3c5e3bc1e17de7001304706
    Closes-Bug: 1585128

tags: added: in-stable-mitaka