RabbitMQ has failed after resetting the primary controller. Channel error on connection <0.2742.0>

Bug #1529875 reported by Anastasia Palkina
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Kyrylo Galanov

Bug Description

1. Create new environment
2. Choose Neutron, VLAN
3. Choose Ceph for images
4. Add 3 controllers, 1 compute, 1 cinder, and 3 Ceph nodes
5. Move Storage network to eth1
6. Move Management network to eth2 and untag it
7. Deploy the environment. It was successful.
8. Run OSTF tests. They were successful.
9. Reset the primary controller (node-3).
10. Wait about 25 minutes and start the OSTF tests.
The "RabbitMQ availability" and "RabbitMQ replication" tests failed.

There is an error in /<email address hidden>:

=ERROR REPORT==== 29-Dec-2015::13:41:06 ===
Channel error on connection <0.2742.0> (192.168.0.9:49558 -> 192.168.0.10:5673, vhost: '/', user: 'nova'), channel 1:
{amqp_error,not_found,
            "no exchange 'reply_2040bcdfd9894ee2b9fab1887ec469b3' in vhost '/'",
            'exchange.declare'}
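
A quick way to confirm the missing exchange on the affected controller (a sketch, using the default '/' vhost from the error above):

# list exchanges in the '/' vhost and look for the oslo.messaging reply exchange
$ rabbitmqctl list_exchanges -p / name type | grep reply_
# the exchange 'reply_2040bcdfd9894ee2b9fab1887ec469b3' named in the error
# is expected to be missing from this output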

PCS Status: http://paste.openstack.org/show/482814/

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaUEx1WVI3SlZvSzg/view?usp=sharing

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "362"
  build_id: "362"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)
Revision history for this message
Kyrylo Galanov (kgalanov) wrote :

Deleting the mnesia database on a slave node fixes the issue:
$ rm -fr /var/lib/rabbitmq/mnesia/*
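
For reference, the same cleanup can be done with rabbitmqctl plus an explicit re-join; a minimal sketch, assuming rabbit@node-3 is the surviving node to join back to and that pacemaker is not racing the manual steps:
$ rabbitmqctl stop_app
$ rabbitmqctl reset                         # wipes the local mnesia database, same effect as the rm above
$ rabbitmqctl join_cluster rabbit@node-3    # node-3 is the recovered controller in this report
$ rabbitmqctl start_app
$ rabbitmqctl cluster_status                # verify the node is back in the cluster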

Need to investigate why it was broken.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Similar behavior to bug 1529861:

The deploy succeeded; then one of the nodes reports itself as up in corosync, but it is actually down and there is no rabbitmq-server process running.
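
A quick sanity check for that discrepancy on the node in question (a sketch, using the standard tools on the controllers):
$ crm_mon -1 | grep -A3 p_rabbitmq-server   # what pacemaker/corosync believes
$ ps -ef | grep '[b]eam.smp'                # whether an Erlang VM is actually running
$ rabbitmqctl status                        # fails if rabbitmq-server is really down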

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Not a dup. There were no network partitions or unmanaged resources. The issue is that only node-3 was actually running, while the rest were down but wrongly shown in pcs status as Slaves.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

What I found is that the uptime value was not reset for node-3 after it was rebooted, which is wrong:

node-4 lrmd:
before node-3 reset
2015-12-29T13:34:44.242420+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4490) with node-9.domain.tld (4488)
2015-12-29T13:34:44.317439+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4490) with node-3.domain.tld (7142)
after node-3 reset
2015-12-29T13:38:23.253014+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4709) with node-9.domain.tld (4707)
2015-12-29T13:38:23.359804+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4709) with node-3.domain.tld (7361)
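
One way to cross-check whether the compared value reflects a fresh boot (a sketch; the OCF agent keeps its own bookkeeping, which may differ from these raw numbers):
$ cat /proc/uptime                                   # OS uptime in seconds, resets on reboot
$ rabbitmqctl eval 'erlang:statistics(wall_clock).'  # first tuple element: ms since the Erlang VM started
If node-3 still shows the pre-reboot uptime in the agent's comparison after both of these have reset, the stale value would have to come from the agent's bookkeeping rather than from the node itself.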

Changed in fuel:
status: New → Confirmed
tags: added: area-library ha rabbitmq tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Okay, so here we had the following:
node-3: behaved OK, recovered at 2015-12-29T13:41:22.853829.

node-9: seg-faulted all the time, failing to join node-3 in an endless loop, even though its mnesia got reset every time after a join attempt failed. Its status in pacemaker was correct all the time according to the pengine logs. Note that the join attempts also reported exit code 2, so there were not only segfaults. But it is not clear why it was not able to join node-3 and form a cluster.

node-4: the strangest behavior. The last join attempt failed and was followed by a stop with code 139 (a segfault) at 2015-12-29T13:41:05.873048.
After that it stayed down, but pacemaker reported it as a running Slave (this is also the subject of the main bug 1472230).
No details on why; the crmd log only contains this:
2015-12-29T13:38:13.842800+00:00 notice: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-3.domain.tld[3] - state is now member (was lost)
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:37.348010+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=289, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
2015-12-29T13:39:23.846079+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=293, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:39:47.070403+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=294, rc=0, cib-update=0, confirmed=true)
(repeats)
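
To see how far pacemaker's view diverged from the real state, the operation results can be pulled out of the node's logs; a sketch, with crmd.log / lrmd.log standing in for wherever the snapshot keeps them:
$ grep 'process_lrm_event.*p_rabbitmq-server' crmd.log | grep -v 'rc=0'   # monitor results other than ok (rc=7 = not running)
$ grep 'p_rabbitmq-server' lrmd.log | grep -iE 'segfault|139'             # the failed stop with code 139 noted above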
