RabbitMQ cluster is broken after controller destroy: 'On the controller node-3.test.domain.local, resource master_p_rabbitmq-server is active but failed to start (managed)'

Bug #1524024 reported by Artem Panchenko
This bug affects 1 person
Affects              Status        Importance  Assigned to                 Milestone
Fuel for OpenStack   In Progress   High        Valeriy Saharov
8.0.x                Confirmed     High        Fuel Library (Deprecated)

Bug Description

Fuel version info (8.0 build #264): http://paste.openstack.org/show/481207/

System test 'ha_neutron_destroy_controllers' fails on step '7. Check pacemaker status' after a controller node is destroyed, because the RabbitMQ server is unable to start on one of the 2 surviving controllers:

Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED node-3.test.domain.local
     Masters: [ node-1.test.domain.local ]

root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3']}]}]
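For reference, the output above shows that node-3 came back as a standalone single-node cluster instead of rejoining the other controllers. A node in that state can usually be re-joined by hand with standard rabbitmqctl commands; this is only a generic recovery sketch (rabbit@node-1 stands in for any healthy cluster member), not what the OCF agent actually does:

# run on the split-off node (node-3)
rabbitmqctl stop_app                      # stop the broker application, keep the Erlang VM up
rabbitmqctl reset                         # wipe this node's local cluster state so it can join cleanly
rabbitmqctl join_cluster rabbit@node-1    # rabbit@node-1 is a placeholder for a healthy member
rabbitmqctl start_app
rabbitmqctl cluster_status                # all alive controllers should be listed again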

Here is part of the pacemaker logs:

Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ Error: unable to connect to node 'rabbit@node-3': nodedown ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ DIAGNOSTICS ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ =========== ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ attempted to contact: ['rabbit@node-3'] ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ rabbit@node-3: ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * connected to epmd (port 4369) on node-3 ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * epmd reports: node 'rabbit' not running at all ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ no other nodes on node-3 ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * suggestion: start the node ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ current node details: ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - node name: 'rabbitmq-cli-8527@node-3' ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - cookie hash: soeIWU2jk2YNseTyDSlsEA== ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ cat: /var/run/rabbitmq/pid: No such file or directory ]

Please note that the test waited for a working RabbitMQ cluster (passed OSTF tests) for more than 20 minutes, but it never recovered (not even after 1 hour, when I reverted the environment manually).

Steps to reproduce:

1. Deploy an environment with 3 controllers
2. Destroy the first controller
3. Wait 20 minutes
4. Check pacemaker status on the surviving controllers (see the verification sketch below)

Expected result: all resources are running

Actual result: the 'p_rabbitmq-server' resource is stopped/failed on one of the controllers
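
A minimal way to verify step 4 on each surviving controller, assuming the standard pacemaker CLI tools shipped with Fuel (a sketch, not part of the original test):

# run on each alive controller
crm_mon -1 | grep -A 3 master_p_rabbitmq-server   # every clone instance should be Started or Master
rabbitmqctl cluster_status                        # the broker should list all alive controllers in one cluster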

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Changed in fuel:
status: New → Confirmed
tags: added: swarm-fail-driver
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitriy Ukhlov (dukhlov)
Dmitry Pyzhov (dpyzhov)
tags: removed: swarm-fail-driver
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The attached snapshot is missing the lrmd.log files for the controller nodes, and they are critical for analyzing such issues. Please reproduce the issue and attach a fresh snapshot. Also, please make sure the lrmd logs are included; if they are missing, attach them separately.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Dmitriy Ukhlov (dukhlov) → Fuel QA Team (fuel-qa)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Moving to Invalid for now, since the bug has been in Incomplete status for quite a long time and it did not reproduce during the 417 ISO swarm run.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Logs attached.

Issue reproduced on MOS #417.

Reproduction scenario:
https://mirantis.testrail.com/index.php?/cases/view/542817

The bug does not reproduce in 100% of cases.

Changed in fuel:
status: Invalid → Confirmed
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
milestone: 8.0 → 9.0
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Reproduced in automated tests with the following scenario (a shell sketch follows the list):

1. Log in to the first OpenStack controller node and restart the RabbitMQ service:
service rabbitmq-server restart
2. Wait until RabbitMQ has successfully started on the controller node
3. Repeat steps 1-2 for all controller nodes in your cluster
4. Execute 'nova list', 'keystone user-list', 'glance image-list', 'nova-manage service list', 'keystone service-list' several times and verify that everything works fine.
5. Execute 'rabbitmqctl cluster_status' and verify that all RabbitMQ nodes are in the same cluster.
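
A rough shell rendering of the scenario above, assuming three controllers reachable over SSH as node-1 through node-3 (the hostnames are placeholders):

# hypothetical driver for the rolling-restart scenario
for node in node-1 node-2 node-3; do
    ssh "$node" 'service rabbitmq-server restart'
    # step 2: wait until the local broker answers again
    until ssh "$node" 'rabbitmqctl status >/dev/null 2>&1'; do sleep 5; done
done
# step 4: exercise the control plane several times
nova list; keystone user-list; glance image-list
# step 5: all controllers should appear in a single cluster
rabbitmqctl cluster_status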

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please fix either the test case description or step 1. One should not use 'service foo restart' when foo is under pacemaker control, unless the test case is intentionally performing that "destructive" action.
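
For comparison, a restart driven through pacemaker would look roughly like this (a sketch assuming the crmsh frontend available on Fuel controllers):

# restart the resource under pacemaker control instead of via the init script
crm resource restart master_p_rabbitmq-server
crm_mon -1 | grep p_rabbitmq-server   # confirm all clone instances come back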

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Valeriy Saharov (vsakharov)
status: Confirmed → In Progress