Fuel for OpenStack

Rabbit node fails to join the cluster after shutting down 2 controllers with the rabbit master and then turning the first of them back on

Bug #1565868 reported by Andrey Sledzinskiy on 2016-04-04

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Alexey Lebedeff	Fuel for OpenStack 10.0
8.0.x	In Progress	High	Alexey Lebedeff	Fuel for OpenStack 8.0-updates
Mitaka	Fix Released	High	Alexey Lebedeff	Fuel for OpenStack 9.0
Newton	Fix Committed	High	Alexey Lebedeff	Fuel for OpenStack 10.0

Bug Description

iso - 9.0-152

Steps:
1. Create and deploy the following cluster: Neutron VLAN, Ceph for volumes and images, 2 controllers, 1 controller+ceph, 1 compute, 1 compute+ceph
2. Open the Health Check tab and run the sanity, smoke, and HA tests
3. Shut down the controller with the rabbit master (node-1 in this test)
4. Wait until the HA OSTF tests pass
5. Shut down the next controller with the rabbit master (node-3 in this test)
6. Start the controller from step 3
7. Wait until the HA OSTF tests pass

Expected result - all tests pass
Actual result - rabbit tests fail, rabbit isn't running on node-2

root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-1' ...
[{nodes,[{disc,['rabbit@messaging-node-1']}]},
{running_nodes,['rabbit@messaging-node-1']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]},
{alarms,[{'rabbit@messaging-node-1',[]}]}]

root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@messaging-node-2' ...
[{nodes,[{disc,['rabbit@messaging-node-2','rabbit@messaging-node-3']}]},
{alarms,[]}]

fuel version - http://paste.openstack.org/show/492893/

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-04-04:

fail_error_ha_ceph_neutron_rabbit_master_destroy-fuel-snapshot-2016-04-04_02-23-20.tar.xz (61.7 MiB, application/octet-stream)

Alexey Lebedeff (alebedev-a) on 2016-04-04

Changed in fuel:
assignee:	MOS Oslo (mos-oslo) → Alexey Lebedeff (alebedev-a)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-04-04: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/301232

OpenStack Infra (hudson-openstack) wrote on 2016-04-04: Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/301234

Alexey Lebedeff (alebedev-a) wrote on 2016-04-05:

The problem was that the OCF script was not cleaning up the mnesia directory at all. The pacemaker log contained a lot of entries about resetting mnesia, but the /var/lib/rabbitmq/rabbit@<nodename> directory held a lot of files with modification times predating every log record about a reset, which shouldn't happen if the reset had actually worked.
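
The check described here can be sketched in shell, assuming a reference file whose modification time stands in for the time of a logged reset. The function name `stale_files`, the demo directory, the file names, and the timestamps are all hypothetical; on a real node the directory to inspect would be /var/lib/rabbitmq/rabbit@<nodename>.

```shell
#!/bin/sh
# stale_files DIR REF: list regular files under DIR that were NOT modified
# after the reference file REF (hypothetical helper, for illustration only).
stale_files() {
    find "$1" -type f ! -newer "$2"
}

# Throwaway demo tree; nothing here touches a real RabbitMQ installation.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir "$demo/mnesia"

touch -t 201604010000 "$demo/mnesia/old.DCD"   # predates the "reset"
touch -t 201604040300 "$demo/reset.stamp"      # assumed time of a logged reset
touch -t 201604040400 "$demo/mnesia/new.DCD"   # written after the "reset"

# Any file listed here survived a reset that should have erased it.
stale=$(stale_files "$demo/mnesia" "$demo/reset.stamp")
echo "$stale"
```

If the reset had worked, the listing would be empty. In this demo only old.DCD is reported.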

OpenStack Infra (hudson-openstack) wrote on 2016-04-08: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/301232
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6e1d58557d54b0140f763733953f69a2c23396ac
Submitter: Jenkins
Branch: master

commit 6e1d58557d54b0140f763733953f69a2c23396ac
Author: Alexey Lebedeff <email address hidden>
Date: Mon Apr 4 19:44:09 2016 +0300

Fix half-hearted attempt to erase mnesia in OCF RA

ocf_run does $("$@"), so "${MNESIA_FILES}/*" wasn't expanded and mnesia
directory wasn't actually cleaned up

It's safe to remove that directory completely - it will be re-created
automatically by mnesia.

Upstream https://github.com/rabbitmq/rabbitmq-server/pull/724

Change-Id: I0aa47f61e03c99ee6ebb56b833463cdf4ccd243e
Closes-Bug: 1565868
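
The quoting problem named in the commit message can be reproduced outside the resource agent. The sketch below is not the OCF script itself: `run_like_ocf` is a hypothetical stand-in that, like the `$("$@")` pattern the commit describes, runs its arguments exactly as passed, and `MNESIA_FILES` points at a throwaway demo directory.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_run: it executes its arguments as-is via "$@"
# and captures the output, so no further glob expansion takes place.
run_like_ocf() {
    output=$("$@" 2>&1)
}

demo_root=$(mktemp -d)
trap 'rm -rf "$demo_root"' EXIT
MNESIA_FILES="$demo_root/mnesia"   # demo path, not the real mnesia directory
mkdir "$MNESIA_FILES"
touch "$MNESIA_FILES/schema.DAT" "$MNESIA_FILES/queues.DCD"

# Quoted glob: rm receives the literal string ".../mnesia/*", which names no
# existing file, and -f hides the error. Nothing is deleted.
run_like_ocf rm -rf "${MNESIA_FILES}/*"
left_after_glob=$(ls "$MNESIA_FILES" | wc -l | tr -d ' ')

# Removing the directory itself needs no expansion at all.
run_like_ocf rm -rf "${MNESIA_FILES}"
if [ -d "$MNESIA_FILES" ]; then dir_after_rm=present; else dir_after_rm=gone; fi

echo "files left after quoted glob: $left_after_glob"
echo "directory after full removal: $dir_after_rm"
```

Both demo files survive the quoted-glob call, while removing the directory itself succeeds. This matches the approach the commit describes, which relies on mnesia re-creating the directory.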

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2016-04-12: Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/304711

OpenStack Infra (hudson-openstack) wrote on 2016-05-05: Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/304711
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a2d9314432c5d7c31c18ffb981e9cf2c20898607
Submitter: Jenkins
Branch: stable/mitaka

commit a2d9314432c5d7c31c18ffb981e9cf2c20898607
Author: Alexey Lebedeff <email address hidden>
Date: Mon Apr 4 19:44:09 2016 +0300

Fix half-hearted attempt to erase mnesia in OCF RA

Cherry-picked from master (6e1d58557d54b0140f763733953f69a2c23396ac)

ocf_run does $("$@"), so "${MNESIA_FILES}/*" wasn't expanded and mnesia
directory wasn't actually cleaned up

It's safe to remove that directory completely - it will be re-created
automatically by mnesia.

Upstream https://github.com/rabbitmq/rabbitmq-server/pull/724

Change-Id: I0aa47f61e03c99ee6ebb56b833463cdf4ccd243e
Closes-Bug: 1565868

Mikhail Samoylov (msamoylov) wrote on 2016-06-03:

Re-opened for fuel 9.0 iso 432:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.ha_neutron_destructive/130/consoleFull
Failed test:
Destroy two controllers and check pacemaker status is correct

    Scenario:
        1. Revert environment
        2. Destroy first controller
        3. Check pacemaker status
        4. Run OSTF
        5. Revert environment
        6. Destroy second controller
        7. Check pacemaker status
        8. Run OSTF

http://paste.openstack.org/show/507642/

Vladimir Kuklin (vkuklin) wrote on 2016-06-03:

The OSTF output shows that this is a different issue: RabbitMQ state is not the only thing we cannot get.

  - Check state of haproxy backends on controllers (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check data replication over mysql (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check if amount of tables in databases is the same on each node (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check galera environment state (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - Check pacemaker status (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - RabbitMQ availability (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
  - RabbitMQ replication (failure) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct

Vladimir Kuklin (vkuklin) wrote on 2016-06-03:

#10

Please create a new bug for this.

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-03:

#11

Today I created a bug about a keystone authorization failure: https://bugs.launchpad.net/mos/+bug/1588767
I think it's the same problem as the one Mikhail mentioned.

Sofiia Andriichenko (sandriichenko) on 2016-06-14

tags:

added: on-verification

Sofiia Andriichenko (sandriichenko) wrote on 2016-06-15:

#12

ISO: mos 481
Steps:
1. Create and deploy the following cluster: Neutron VLAN, Ceph for volumes and images, 2 controllers, 1 controller+ceph, 1 compute, 1 compute+ceph
2. Open the Health Check tab and run the sanity, smoke, and HA tests

Expected result:
all tests pass

Actual result:
OSTF tests fail:
Check data replication over mysql - on step 1. Check that mysql is running on all controller or database nodes.
Check if amount of tables in databases is the same on each node - on step 2. Request list of tables for os databases on each node.
Check galera environment state - on step 2. Ssh on each node containing database and request state of galera node
RabbitMQ availability - on step 1. Retrieve cluster status for each controller.

with error - Time limit exceeded while waiting for get status from galera node to finish. Please refer to OpenStack logs for more details.

snapshot: https://drive.google.com/a/mirantis.com/file/d/0BxPLDs6wcpbDVDBIcEkzeUpkVms/view?usp=sharing

tags:

removed: on-verification

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-15:

#13

I have investigated Sofia's environment. The tests listed by Sofia repeatedly failed. After some digging I have found out that the root cause is that node-1, ..., node-4 were not resolving to IP addresses on the master node. After I added them to /etc/hosts, the tests started to pass.

Sofia, please create a new bug for the issue you have found. I am returning the current one to Fix Committed.

Oleksiy Molchanov (omolchanov) on 2016-06-15

tags:

added: on-verificatione

Sofiia Andriichenko (sandriichenko) wrote on 2016-06-15:

#14

The verification is blocked by
https://bugs.launchpad.net/fuel/+bug/1592876

Oleksiy Molchanov (omolchanov) wrote on 2016-06-16:

#15

I was testing on #465.

http://paste.openstack.org/show/516506/

tags:

added: on-verification
removed: on-verificatione

Oleksiy Molchanov (omolchanov) on 2016-06-16

tags:

removed: on-verification

Ekaterina Shutova (eshutova) on 2016-06-16

tags:

added: on-verification

Ekaterina Shutova (eshutova) on 2016-06-16

tags:

removed: on-verification

Nastya Urlapova (aurlapova) wrote on 2016-06-27:

#16

Tests are green: https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.ha_destructive_ceph_neutron/153/testReport/(root)/ha_ceph_neutron_rabbit_master_destroy/ + https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.ha_neutron_destructive/154/testReport/(root)/ha_neutron_destroy_controllers/ha_neutron_destroy_controllers/

ISO #535

Alexey Galkin (agalkin) wrote on 2016-06-27:

#17

Verified on 9.0.mos-all #495 RC2.

OpenStack Infra (hudson-openstack) wrote on 2019-12-18: Change abandoned on fuel-library (stable/8.0)

#18

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/8.0
Review: https://review.opendev.org/301234
Reason: This repo is retired now, no further work will get merged.
