RabbitMQ OCF script requires manual intervention in rare cases

Bug #1394635 reported by Egor Kotko
This bug affects 2 people
Affects               Status      Importance   Assigned to                 Milestone
Fuel for OpenStack    Confirmed   Medium       Vladimir Kuklin
  5.1.x               Confirmed   Medium       Vladimir Kuklin
  6.0.x               Confirmed   Medium       Vladimir Kuklin
  6.1.x               Confirmed   Medium       Fuel Library (Deprecated)

Bug Description

{"build_id": "2014-11-19_21-56-43", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "25", "auth_required": true, "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "fce051a6d013b1c30aa07320d225f9af734545de", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-19_21-56-43", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "25", "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "fce051a6d013b1c30aa07320d225f9af734545de", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "5611c516362bea0fd47fcb5376a9f22dcfbb8307"}}}, "fuellib_sha": "5611c516362bea0fd47fcb5376a9f22dcfbb8307"}

Steps to reproduce:
1. Deploy cluster with configuration (on Hardware lab):
Centos HA, Neutron VLAN, 5 Controllers, 7 Computes
2. Execute on Primary controller "shutdown -h now"
3. Wait ~20 min

Expected result:
Cluster will be in correct state

Actual result:
Cluster is in incorrect state.
See the attached log (rabbit@node-3).

After shutting down node-2

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4',
                'rabbit@node-5','rabbit@node-7']}]}]
...done.
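
Note that rabbit@node-2 still appears as a disc member even though the node has been powered off. For reference, a minimal sketch of how such a dead member is normally dropped by hand (assuming rabbit@node-2 is the failed node, as in the output above, and the commands are run on a surviving member such as node-3):

# check which members the surviving node still remembers
rabbitmqctl cluster_status
# remove the dead member from the cluster metadata
rabbitmqctl forget_cluster_node rabbit@node-2
# verify it no longer appears under {nodes,...}
rabbitmqctl cluster_status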

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please explain the impact of this bug in more detail and change the summary to something more specific than just "incorrect work".

Changed in fuel:
status: New → Incomplete
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please also explain why this is only targeted for 5.1.1 and not also for 6.0.

Revision history for this message
Egor Kotko (ykotko) wrote : Re: Handshake_timeout of rabbit after shutdown primary controller

I targeted it only at 5.1.1 because I have only hit it on the 5.1.1 ISO - I will check this case on 6.0 too.
The RabbitMQ log contains several types of errors, for example:

=ERROR REPORT==== 20-Nov-2014::13:46:24 ===
closing AMQP connection <0.6966.0> (192.168.0.12:38912 -> 192.168.0.5:5673):
{handshake_timeout,handshake}

=ERROR REPORT==== 20-Nov-2014::14:22:27 ===
AMQP connection <0.1919.0> (running), channel 0 - error:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

I also have a live environment with this issue:
http://172.16.39.130:8000/#cluster/1/nodes
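
The handshake_timeout above is the broker-side limit for a client to complete the AMQP handshake (10 seconds by default). A quick sketch of how to confirm the effective value on a node, assuming the rabbitmqctl shipped with this release supports the environment sub-command:

# dump the broker's effective application environment and filter for the
# handshake timeout (value is in milliseconds)
rabbitmqctl environment | grep handshake_timeout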

summary: - Incorrect work of rabbit after shutdown primary controller
+ Handshake_timeout of rabbit after shutdown primary controller
Changed in fuel:
status: Incomplete → Confirmed
milestone: 5.1.1 → 6.0
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Thanks for figuring out the targeted release and updating the summary! I still don't see explicit confirmation of the impact of this bug: which OpenStack control plane operations are impacted, and in what way, by this RabbitMQ error? Is it reliably reproducible or highly intermittent? Is there a workaround?

Revision history for this message
Egor Kotko (ykotko) wrote :

I have hit it again on a virtual lab. A live environment is accessible here: http://172.18.164.133:8000/
After failover it is possible to get problems with Nova: sometimes there are very long timeouts on instance or security group creation, or an instance can stay in the BUILD state indefinitely.

{"build_id": "2014-11-20_21-01-00", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "28", "auth_required": true, "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "51087c92a50be982071a074ff2bea01f1a5ddb76", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-20_21-01-00", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "28", "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "51087c92a50be982071a074ff2bea01f1a5ddb76", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "b3d9f0e203f2f0faf3763e871a8dc31570777fed"}}}, "fuellib_sha": "b3d9f0e203f2f0faf3763e871a8dc31570777fed"}

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Checked with Yegor - did not find any issues with AMQP for the "http://172.18.164.133:8000/" environment - there are some performance issues, but none of them are related to AMQP failover. We are going to try to reproduce this bug one more time using real hardware.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Kirill Omelchenko (kirill-omelchenko) wrote :

I hit a variant of this issue on a virtual env (5.1.1 - #45).

3x Controllers, 2x Computes, 2x CEPH-storage

- after a successful setup, shut down the primary controller.

As a result, crm status reports errors:
[root@node-2 ~]# crm status
Last updated: Mon Dec 1 10:14:55 2014
Last change: Mon Dec 1 10:14:18 2014 via crm_attribute on node-3.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-3.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
17 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 Clone Set: clone_ping_vip__public_old [ping_vip__public_old]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-2.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_openstack-heat-engine [p_openstack-heat-engine]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]

Failed actions:
    ping_vip__public_old_monitor_20000 on node-2.test.domain.local 'unknown error' (1): call=64, status=Timed Out, last-rc-change='Fri Nov 28 16:45:15 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-2.test.domain.local 'unknown error' (1): call=90, status=complete, last-rc-change='Fri Nov 28 14:54:30 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-3.test.domain.local 'unknown error' (1): call=105, status=complete, last-rc-change='Fri Nov 28 14:53:22 2014', queued=0ms, exec=0ms

This impacts instance creation and all related tests/actions, both via OSTF and manually.
Diagnostic snapshot: https://copy.com/4KpLdOkhteZMHiOm
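
The "Failed actions" entries above remain in crm status output until they are explicitly cleared. If the underlying resources are in fact healthy again after the failover, a hedged sketch of clearing the stale records (resource and node names taken from the output above) is:

# re-probe each resource and drop its stale failure records on the affected node
crm resource cleanup ping_vip__public_old node-2.test.domain.local
crm resource cleanup p_mysql node-2.test.domain.local
crm resource cleanup p_mysql node-3.test.domain.local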

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Kirill Omelchenko (komelchenko) wrote :

Seems to be an intermittent bug. Will try to reproduce it again.

Changed in fuel:
importance: High → Medium
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This issue is really hard to reproduce. The problem is that one of the nodes goes into a constant loop trying to join the cluster while the other nodes are trying to forget it. There is no known solution for this, but a workaround is to do the following (see the consolidated sketch after the steps):

stop rabbitmq on all controller nodes:

crm resource stop master_p_rabbitmq-server

then on each controller node remove mnesia database for rabbitmq:

rm -rf /var/lib/rabbitmq/mnesia

start rabbitmq again:

crm resource start master_p_rabbitmq-server
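
A consolidated sketch of the workaround above as one script, assuming the controllers are reachable as node-2, node-3 and node-4 over passwordless SSH (both the hostnames and the SSH access are assumptions, not part of this report), run from any one controller:

#!/bin/bash
# Workaround sketch: wipe RabbitMQ's mnesia state cluster-wide and let the
# OCF script reassemble the cluster. Adjust the hostnames to your environment.
CONTROLLERS="node-2 node-3 node-4"

# 1. Stop the RabbitMQ master/slave resource on all controllers.
crm resource stop master_p_rabbitmq-server
# crude fixed wait; confirm with 'crm status' that the resource is fully
# stopped everywhere before touching mnesia
sleep 60

# 2. Remove the mnesia database on every controller so each node rejoins cleanly.
for node in $CONTROLLERS; do
    ssh "$node" 'rm -rf /var/lib/rabbitmq/mnesia'
done

# 3. Start the resource again and check that the cluster reassembles.
crm resource start master_p_rabbitmq-server
crm status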

Changed in fuel:
status: Confirmed → Won't Fix
milestone: 6.0 → 6.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

BTW, it could be a RabbitMQ bug itself - we need to investigate this more thoroughly for the 5.1.2, 6.0.1 and 6.1 releases.

Changed in fuel:
status: Won't Fix → Confirmed
no longer affects: fuel/6.1.x
tags: added: release-notes
summary: - Handshake_timeout of rabbit after shutdown primary controller
+ Handshake_timeout of rabbit after destructive actions
summary: - Handshake_timeout of rabbit after destructive actions
+ Handshake_timeout of rabbit after connectivity issues
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Handshake_timeout of rabbit after connectivity issues

Please note that this issue could also be fixed from the Oslo.messaging side; see x-cancel-on-ha-failover: https://bugs.launchpad.net/nova/+bug/856764/comments/70

summary: - Handshake_timeout of rabbit after connectivity issues
+ RabbitMQ OCF script requires manual intervention in rare cases