Rabbit node failed to join cluster after shutting down 2 controllers with rabbit master and then turning on the first of them

Bug #1565868 reported by Andrey Sledzinskiy
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Alexey Lebedeff
8.0.x               In Progress    High        Alexey Lebedeff
Mitaka              Fix Released   High        Alexey Lebedeff
Newton              Fix Committed  High        Alexey Lebedeff

Bug Description

iso - 9.0-152

Steps:
1. Create and deploy the following cluster: Neutron VLAN, Ceph for volumes and images, 2 controllers, 1 controller+ceph, 1 compute, 1 compute+ceph
2. Open the Health Check tab and run the sanity, smoke, and HA tests
3. Shut down the controller with the rabbit master (node-1 in this test)
4. Wait for the HA OSTF tests to pass
5. Shut down the next controller with the rabbit master (node-3 in this test)
6. Start the controller from step 3
7. Wait for the HA OSTF tests to pass

Expected result - all tests pass
Actual result - rabbit tests fail, rabbit isn't running on node-2

root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-1' ...
[{nodes,[{disc,['rabbit@messaging-node-1']}]},
 {running_nodes,['rabbit@messaging-node-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{'rabbit@messaging-node-1',[]}]}]

root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-2' ...
[{nodes,[{disc,['rabbit@messaging-node-2','rabbit@messaging-node-3']}]},
 {alarms,[]}]
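
The two outputs above differ in one visible way: node-1 reports a running_nodes entry and node-2 does not. A minimal sketch of checking for that difference follows. It assumes that a missing running_nodes entry means the RabbitMQ application is not running on that node, which matches the node-2 output here but is not a documented contract. The helper names has_running_nodes and classify are hypothetical, and the sample outputs are abridged from this report.

```shell
#!/bin/sh
# Hypothetical helper: reads cluster_status output on stdin and succeeds
# only if it contains a running_nodes entry.
has_running_nodes() {
    grep -q '{running_nodes,'
}

# Sample outputs abridged from this report.
node1_status="[{nodes,[{disc,['rabbit@messaging-node-1']}]},
 {running_nodes,['rabbit@messaging-node-1']},
 {partitions,[]}]"

node2_status="[{nodes,[{disc,['rabbit@messaging-node-2','rabbit@messaging-node-3']}]},
 {alarms,[]}]"

# Hypothetical helper: map one captured output to "running" or "stopped".
classify() {
    if printf '%s\n' "$1" | has_running_nodes; then
        echo running
    else
        echo stopped
    fi
}

node1_state=$(classify "$node1_status")
node2_state=$(classify "$node2_status")
echo "node-1: $node1_state"
echo "node-2: $node2_state"
```

On a live controller the same check could presumably be fed directly, for example with rabbitmqctl cluster_status | has_running_nodes.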

fuel version - http://paste.openstack.org/show/492893/

Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Alexey Lebedeff (alebedev-a)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/301232

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/301234

Alexey Lebedeff (alebedev-a) wrote :

The problem was that the OCF script did not clean up the mnesia directory at all. The pacemaker log contained many entries about resetting mnesia, but the /var/lib/rabbitmq/rabbit@<nodename> directory held many files whose modification times predated every logged reset, which should not happen if the reset had actually worked.

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/301232
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6e1d58557d54b0140f763733953f69a2c23396ac
Submitter: Jenkins
Branch: master

commit 6e1d58557d54b0140f763733953f69a2c23396ac
Author: Alexey Lebedeff <email address hidden>
Date: Mon Apr 4 19:44:09 2016 +0300

    Fix half-hearted attempt to erase mnesia in OCF RA

    ocf_run does $("$@"), so "${MNESIA_FILES}/*" wasn't expanded and mnesia
    directory wasn't actually cleaned up

    It's safe to remove that directory completely - it will be re-created
    automatically by mnesia.

    Upstream https://github.com/rabbitmq/rabbitmq-server/pull/724

    Change-Id: I0aa47f61e03c99ee6ebb56b833463cdf4ccd243e
    Closes-Bug: 1565868
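
The quoting problem described in the commit message can be reproduced in isolation. The sketch below is not the OCF script itself: run_verbatim is a hypothetical stand-in for ocf_run that, like the commit message describes, executes its arguments through $("$@"), and the directory and file names are made up for the demonstration. It shows that a quoted glob reaches rm as a literal path and deletes nothing, while removing the directory itself works.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_run: runs its arguments verbatim and
# captures their output, so no further glob expansion takes place.
run_verbatim() {
    output=$("$@" 2>&1)
}

# Throwaway directory with two fake table files (names are made up).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/mnesia"
touch "$demo_dir/mnesia/a.DCD" "$demo_dir/mnesia/b.DCD"

# Quoted glob: rm receives the literal path ".../mnesia/*", which matches
# no file, and -f hides the resulting error.
run_verbatim rm -rf "$demo_dir/mnesia/*"
remaining_after_glob=$(ls "$demo_dir/mnesia" | wc -l | tr -d ' ')

# Removing the directory itself needs no glob expansion at all.
run_verbatim rm -rf "$demo_dir/mnesia"
if [ -e "$demo_dir/mnesia" ]; then dir_exists=yes; else dir_exists=no; fi

echo "files left after quoted glob: $remaining_after_glob"
echo "directory exists after full removal: $dir_exists"
rm -rf "$demo_dir"
```

Because rm -f exits successfully even when nothing matches, a wrapper like this reports no failure, which is consistent with the pacemaker log showing resets that never touched the files.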

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/304711

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/304711
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a2d9314432c5d7c31c18ffb981e9cf2c20898607
Submitter: Jenkins
Branch: stable/mitaka

commit a2d9314432c5d7c31c18ffb981e9cf2c20898607
Author: Alexey Lebedeff <email address hidden>
Date: Mon Apr 4 19:44:09 2016 +0300

    Fix half-hearted attempt to erase mnesia in OCF RA

    Cherry-picked from master (6e1d58557d54b0140f763733953f69a2c23396ac)

    ocf_run does $("$@"), so "${MNESIA_FILES}/*" wasn't expanded and mnesia
    directory wasn't actually cleaned up

    It's safe to remove that directory completely - it will be re-created
    automatically by mnesia.

    Upstream https://github.com/rabbitmq/rabbitmq-server/pull/724

    Change-Id: I0aa47f61e03c99ee6ebb56b833463cdf4ccd243e
    Closes-Bug: 1565868

Mikhail Samoylov (msamoylov) wrote :

Re-opened for fuel 9.0 iso 432:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.ha_neutron_destructive/130/consoleFull
Failed test:
Destroy two controllers and check pacemaker status is correct

    Scenario:
        1. Revert environment
        2. Destroy first controller
        3. Check pacemaker status
        4. Run OSTF
        5. Revert environment
        6. Destroy second controller
        7. Check pacemaker status
        8. Run OSTF

http://paste.openstack.org/show/507642/

Vladimir Kuklin (vkuklin) wrote :

The OSTF output shows that this is a different issue: RabbitMQ is not the only service whose state we cannot get.

  - Check state of haproxy backends on controllers (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check data replication over mysql (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check if amount of tables in databases is the same on each node (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check galera environment state (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check pacemaker status (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - RabbitMQ availability (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - RabbitMQ replication (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct

Vladimir Kuklin (vkuklin) wrote :

Please create a new bug for this.

Andrey Sledzinskiy (asledzinskiy) wrote :

Today I created a bug for a keystone authorization failure - https://bugs.launchpad.net/mos/+bug/1588767
I think it's the same problem that Mikhail mentioned.

tags: added: on-verification
Sofiia Andriichenko (sandriichenko) wrote :

ISO: mos 481
Steps:
1. Create and deploy the following cluster: Neutron VLAN, Ceph for volumes and images, 2 controllers, 1 controller+ceph, 1 compute, 1 compute+ceph
2. Open the Health Check tab and run the sanity, smoke, and HA tests

Expected result:
 all tests pass

Actual result:
OSTF tests fail:
Check data replication over mysql - on step 1. Check that mysql is running on all controller or database nodes.
Check if amount of tables in databases is the same on each node - on step 2. Request list of tables for os databases on each node.
Check galera environment state - on step 2. Ssh on each node containing database and request state of galera node
RabbitMQ availability - on step 1. Retrieve cluster status for each controller.

with error - Time limit exceeded while waiting for get status from galera node to finish. Please refer to OpenStack logs for more details.

snapshot: https://drive.google.com/a/mirantis.com/file/d/0BxPLDs6wcpbDVDBIcEkzeUpkVms/view?usp=sharing

tags: removed: on-verification
Dmitry Mescheryakov (dmitrymex) wrote :

I have investigated Sofia's environment. The tests listed by Sofia failed repeatedly. After some digging I found that the root cause was that node-1, ..., node-4 were not resolving to IP addresses on the master node. After I added them to /etc/hosts, the tests started to pass.

Sofia, please create a new bug for the issue you have found. I am returning the current one to Fix Committed.
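
A check for the name resolution gap described above could look roughly like the sketch below. It is an illustration only: missing_hosts is a hypothetical helper, the sample file and its addresses are made up, and it inspects a hosts-format file rather than the system resolver. On a real master node, getent hosts "$name" would exercise the actual lookup path instead.

```shell
#!/bin/sh
# Hypothetical helper: print each given name that has no entry in the
# given hosts-format file.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; compare the name against every field after
        # the address column.
        if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Sample file with made-up addresses covering only two of the four nodes.
sample=$(mktemp)
printf '10.109.0.3 node-1\n10.109.0.4 node-2\n' > "$sample"

# xargs joins the reported names onto one space-separated line.
missing=$(missing_hosts "$sample" node-1 node-2 node-3 node-4 | xargs)
echo "unresolved: $missing"
rm -f "$sample"
```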

tags: added: on-verificatione
Sofiia Andriichenko (sandriichenko) wrote :
Oleksiy Molchanov (omolchanov) wrote :
tags: added: on-verification
removed: on-verificatione
tags: removed: on-verification
tags: added: on-verification
tags: removed: on-verification
Nastya Urlapova (aurlapova) wrote :
Alexey Galkin (agalkin) wrote :

Verified on 9.0.mos-all #495 RC2.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/8.0)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/8.0
Review: https://review.opendev.org/301234
Reason: This repo is retired now, no further work will get merged.
