RabbitMQ cluster is broken after controller destroy: 'On the controller node-3.test.domain.local, resource master_p_rabbitmq-server is active but failed to start (managed)'

Bug #1524024 reported by Artem Panchenko
This bug affects 1 person
Affects              Status        Importance  Assigned to                 Milestone
Fuel for OpenStack   In Progress   High        Valeriy Saharov
8.0.x                Confirmed     High        Fuel Library (Deprecated)

Bug Description

Fuel version info (8.0 build #264): http://paste.openstack.org/show/481207/

System test 'ha_neutron_destroy_controllers' fails on step '7. Check pacemaker status' after a controller node is destroyed, because the RabbitMQ server is unable to start on one of the 2 surviving controllers:

Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED node-3.test.domain.local
     Masters: [ node-1.test.domain.local ]

root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3']}]}]
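For reference, the output above shows that node-3 came back as a standalone single-node cluster instead of rejoining the other controllers. A node in that state can usually be re-joined by hand with standard rabbitmqctl commands; this is only a generic recovery sketch (rabbit@node-1 stands in for any healthy cluster member), not what the OCF agent actually does:

# run on the split-off node (node-3)
rabbitmqctl stop_app                      # stop the broker application, keep the Erlang VM up
rabbitmqctl reset                         # wipe this node's local cluster state so it can join cleanly
rabbitmqctl join_cluster rabbit@node-1    # rabbit@node-1 is a placeholder for a healthy member
rabbitmqctl start_app
rabbitmqctl cluster_status                # all alive controllers should be listed again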

Here is part of the pacemaker logs:

Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ Error: unable to connect to node 'rabbit@node-3': nodedown ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ DIAGNOSTICS ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ =========== ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ attempted to contact: ['rabbit@node-3'] ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ rabbit@node-3: ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * connected to epmd (port 4369) on node-3 ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * epmd reports: node 'rabbit' not running at all ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ no other nodes on node-3 ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ * suggestion: start the node ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ current node details: ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - node name: 'rabbitmq-cli-8527@node-3' ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ - cookie hash: soeIWU2jk2YNseTyDSlsEA== ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ ]
Dec 08 04:28:58 [14416] node-3.test.domain.local pacemaker_remoted: notice: operation_finished: p_rabbitmq-server_stop_0:8443:stderr [ cat: /var/run/rabbitmq/pid: No such file or directory ]

Please note that the test waited for a working RabbitMQ cluster (passed OSTF tests) for more than 20 minutes, but it never recovered (not even after 1 hour, when I reverted the environment manually).

Steps to reproduce:

1. Deploy an environment with 3 controllers
2. Destroy the first controller
3. Wait 20 minutes
4. Check pacemaker status on the surviving controllers (see the verification sketch below)

Expected result: all resources are running

Actual result: the 'p_rabbitmq-server' resource is stopped/failed on one of the controllers
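
A minimal way to verify step 4 on each surviving controller, assuming the standard pacemaker CLI tools shipped with Fuel (a sketch, not part of the original test):

# run on each alive controller
crm_mon -1 | grep -A 3 master_p_rabbitmq-server   # every clone instance should be Started or Master
rabbitmqctl cluster_status                        # the broker should list all alive controllers in one cluster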

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Changed in fuel:
status: New → Confirmed
tags: added: swarm-fail-driver
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Dmitriy Ukhlov (dukhlov)
Dmitry Pyzhov (dpyzhov)
tags: removed: swarm-fail-driver
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The attached snapshot is missing the lrmd.log files for the controller nodes, and they are critical for analyzing such issues. Please reproduce the issue and attach a fresh snapshot. Also, please make sure the lrmd logs are included; if they are missing, attach them separately.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Dmitriy Ukhlov (dukhlov) → Fuel QA Team (fuel-qa)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Moving to Invalid for now, since the bug has been in Incomplete status for quite a long time and it did not reproduce during the 417 ISO swarm run.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Logs attached.

Issue reproduced on MOS #417.

Reproduction scenario:
https://mirantis.testrail.com/index.php?/cases/view/542817

The bug does not reproduce in 100% of cases.

Changed in fuel:
status: Invalid → Confirmed
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
milestone: 8.0 → 9.0
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Reproduced in automated tests with the following scenario (a shell sketch follows the list):

1. Log in to the first OpenStack controller node and restart the RabbitMQ service:
service rabbitmq-server restart
2. Wait until RabbitMQ has successfully started on the controller node
3. Repeat steps 1-2 for all controller nodes in your cluster
4. Execute 'nova list', 'keystone user-list', 'glance image-list', 'nova-manage service list', 'keystone service-list' several times and verify that everything works fine.
5. Execute 'rabbitmqctl cluster_status' and verify that all RabbitMQ nodes are in the same cluster.
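
A rough shell rendering of the scenario above, assuming three controllers reachable over SSH as node-1 through node-3 (the hostnames are placeholders):

# hypothetical driver for the rolling-restart scenario
for node in node-1 node-2 node-3; do
    ssh "$node" 'service rabbitmq-server restart'
    # step 2: wait until the local broker answers again
    until ssh "$node" 'rabbitmqctl status >/dev/null 2>&1'; do sleep 5; done
done
# step 4: exercise the control plane several times
nova list; keystone user-list; glance image-list
# step 5: all controllers should appear in a single cluster
rabbitmqctl cluster_status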

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please fix either the test case description or step 1. One should not use 'service foo restart' when foo is under pacemaker control, unless the test case is intentionally performing that "destructive" action.
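
For comparison, a restart driven through pacemaker would look roughly like this (a sketch assuming the crmsh frontend available on Fuel controllers):

# restart the resource under pacemaker control instead of via the init script
crm resource restart master_p_rabbitmq-server
crm_mon -1 | grep p_rabbitmq-server   # confirm all clone instances come back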

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Valeriy Saharov (vsakharov)
status: Confirmed → In Progress