RabbitMQ has failed after resetting the primary controller. Channel error on connection <0.2742.0>

Bug #1529875 reported by Anastasia Palkina
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Kyrylo Galanov

Bug Description

1. Create new environment
2. Choose Neutron, VLAN
3. Choose Ceph for images
4. Add 3 controllers, 1 compute, 1 cinder, and 3 Ceph nodes
5. Move Storage network to eth1
6. Move Management network to eth2 and untag it
7. Deploy the environment. It was successful.
8. Run OSTF tests. They were successful.
9. Reset the primary controller (node-3).
10. Wait about 25 minutes and start the OSTF tests.
The "RabbitMQ availability" and "RabbitMQ replication" tests failed.

There is an error in /<email address hidden>:

=ERROR REPORT==== 29-Dec-2015::13:41:06 ===
Channel error on connection <0.2742.0> (192.168.0.9:49558 -> 192.168.0.10:5673, vhost: '/', user: 'nova'), channel 1:
{amqp_error,not_found,
            "no exchange 'reply_2040bcdfd9894ee2b9fab1887ec469b3' in vhost '/'",
            'exchange.declare'}
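
A quick way to confirm the missing exchange on the affected controller (a sketch, using the default '/' vhost from the error above):

# list exchanges in the '/' vhost and look for the oslo.messaging reply exchange
$ rabbitmqctl list_exchanges -p / name type | grep reply_
# the exchange 'reply_2040bcdfd9894ee2b9fab1887ec469b3' named in the error
# is expected to be missing from this output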

PCS Status: http://paste.openstack.org/show/482814/

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaUEx1WVI3SlZvSzg/view?usp=sharing

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "362"
  build_id: "362"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)
Revision history for this message
Kyrylo Galanov (kgalanov) wrote :

Deleting the mnesia database on a slave node fixes the issue:
$ rm -fr /var/lib/rabbitmq/mnesia/*
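
For reference, the same cleanup can be done with rabbitmqctl plus an explicit re-join; a minimal sketch, assuming rabbit@node-3 is the surviving node to join back to and that pacemaker is not racing the manual steps:
$ rabbitmqctl stop_app
$ rabbitmqctl reset                         # wipes the local mnesia database, same effect as the rm above
$ rabbitmqctl join_cluster rabbit@node-3    # node-3 is the recovered controller in this report
$ rabbitmqctl start_app
$ rabbitmqctl cluster_status                # verify the node is back in the cluster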

Need to investigate why it was broken.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Similar behavior to bug 1529861:

The deploy succeeded; then one of the nodes reports itself as up in corosync, but it is actually down and there is no rabbitmq-server process running.
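
A quick sanity check for that discrepancy on the node in question (a sketch, using the standard tools on the controllers):
$ crm_mon -1 | grep -A3 p_rabbitmq-server   # what pacemaker/corosync believes
$ ps -ef | grep '[b]eam.smp'                # whether an Erlang VM is actually running
$ rabbitmqctl status                        # fails if rabbitmq-server is really down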

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Not a dup. There were no network partitions or unmanaged resources. The issue is that only node-3 was actually running, while the rest were down but wrongly shown in pcs status as Slaves.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

What I found is that the uptime value was not reset for node-3 after it was rebooted, which is wrong:

node-4 lrmd:
before node-3 reset
2015-12-29T13:34:44.242420+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4490) with node-9.domain.tld (4488)
2015-12-29T13:34:44.317439+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4490) with node-3.domain.tld (7142)
after node-3 reset
2015-12-29T13:38:23.253014+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4709) with node-9.domain.tld (4707)
2015-12-29T13:38:23.359804+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (4709) with node-3.domain.tld (7361)
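
One way to cross-check whether the compared value reflects a fresh boot (a sketch; the OCF agent keeps its own bookkeeping, which may differ from these raw numbers):
$ cat /proc/uptime                                   # OS uptime in seconds, resets on reboot
$ rabbitmqctl eval 'erlang:statistics(wall_clock).'  # first tuple element: ms since the Erlang VM started
If node-3 still shows the pre-reboot uptime in the agent's comparison after both of these have reset, the stale value would have to come from the agent's bookkeeping rather than from the node itself.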

Changed in fuel:
status: New → Confirmed
tags: added: area-library ha rabbitmq tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Okay, so here we had the following:
node-3: behaved OK, recovered at 2015-12-29T13:41:22.853829.

node-9: seg-faulted all the time, failing to join node-3 in an endless loop, even though its mnesia got reset every time after a join attempt failed. Its status in pacemaker was correct all the time according to the pengine logs. Note that the join attempts also reported exit code 2, so there were not only segfaults. But it is not clear why it was not able to join node-3 and form a cluster.

node-4: the strangest behavior. The last join attempt failed and was followed by a stop with code 139 (a segfault) at 2015-12-29T13:41:05.873048.
After that it stayed down, but pacemaker reported it as a running Slave (this is also the subject of the main bug 1472230).
No details on why; the crmd log only contains this:
2015-12-29T13:38:13.842800+00:00 notice: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-3.domain.tld[3] - state is now member (was lost)
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:37.348010+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=289, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
2015-12-29T13:39:23.846079+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=293, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:39:47.070403+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=294, rc=0, cib-update=0, confirmed=true)
(repeats)
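
To see how far pacemaker's view diverged from the real state, the operation results can be pulled out of the node's logs; a sketch, with crmd.log / lrmd.log standing in for wherever the snapshot keeps them:
$ grep 'process_lrm_event.*p_rabbitmq-server' crmd.log | grep -v 'rc=0'   # monitor results other than ok (rc=7 = not running)
$ grep 'p_rabbitmq-server' lrmd.log | grep -iE 'segfault|139'             # the failed stop with code 139 noted above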
