RabbitMQ OCF script requires manual intervention in rare cases

Bug #1394635 reported by Egor Kotko
This bug affects 2 people
Affects               Status      Importance   Assigned to                 Milestone
Fuel for OpenStack    Confirmed   Medium       Vladimir Kuklin
  5.1.x               Confirmed   Medium       Vladimir Kuklin
  6.0.x               Confirmed   Medium       Vladimir Kuklin
  6.1.x               Confirmed   Medium       Fuel Library (Deprecated)

Bug Description

{"build_id": "2014-11-19_21-56-43", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "25", "auth_required": true, "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "fce051a6d013b1c30aa07320d225f9af734545de", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-19_21-56-43", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "25", "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "fce051a6d013b1c30aa07320d225f9af734545de", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "5611c516362bea0fd47fcb5376a9f22dcfbb8307"}}}, "fuellib_sha": "5611c516362bea0fd47fcb5376a9f22dcfbb8307"}

Steps to reproduce:
1. Deploy cluster with configuration (on Hardware lab):
Centos HA, Neutron VLAN, 5 Controllers, 7 Computes
2. Execute on Primary controller "shutdown -h now"
3. Wait ~20 min

Expected result:
Cluster will be in correct state

Actual result:
Cluster is in incorrect state.
See the attached log (rabbit@node-3).

After shutting down node-2

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4',
                'rabbit@node-5','rabbit@node-7']}]}]
...done.
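
Note that rabbit@node-2 still appears as a disc member even though the node has been powered off. For reference, a minimal sketch of how such a dead member is normally dropped by hand (assuming rabbit@node-2 is the failed node, as in the output above, and the commands are run on a surviving member such as node-3):

# check which members the surviving node still remembers
rabbitmqctl cluster_status
# remove the dead member from the cluster metadata
rabbitmqctl forget_cluster_node rabbit@node-2
# verify it no longer appears under {nodes,...}
rabbitmqctl cluster_status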

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please explain the impact of this bug in more detail and change the summary to something more specific than just "incorrect work".

Changed in fuel:
status: New → Incomplete
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please also explain why this is only targeted for 5.1.1 and not also for 6.0.

Revision history for this message
Egor Kotko (ykotko) wrote : Re: Handshake_timeout of rabbit after shutdown primary controller

I targeted it only at 5.1.1 because I have only hit it on the 5.1.1 ISO - I will check this case on 6.0 too.
The RabbitMQ log contains several types of errors, for example:

=ERROR REPORT==== 20-Nov-2014::13:46:24 ===
closing AMQP connection <0.6966.0> (192.168.0.12:38912 -> 192.168.0.5:5673):
{handshake_timeout,handshake}

=ERROR REPORT==== 20-Nov-2014::14:22:27 ===
AMQP connection <0.1919.0> (running), channel 0 - error:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

I also have a live environment with this issue:
http://172.16.39.130:8000/#cluster/1/nodes
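
The handshake_timeout above is the broker-side limit for a client to complete the AMQP handshake (10 seconds by default). A quick sketch of how to confirm the effective value on a node, assuming the rabbitmqctl shipped with this release supports the environment sub-command:

# dump the broker's effective application environment and filter for the
# handshake timeout (value is in milliseconds)
rabbitmqctl environment | grep handshake_timeout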

summary: - Incorrect work of rabbit after shutdown primary controller
+ Handshake_timeout of rabbit after shutdown primary controller
Changed in fuel:
status: Incomplete → Confirmed
milestone: 5.1.1 → 6.0
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Thanks for figuring out the targeted release and updating the summary! I still don't see explicit confirmation of the impact of this bug: which OpenStack control plane operations are impacted, and in what way, by this RabbitMQ error? Is it reliably reproducible or highly intermittent? Is there a workaround?

Revision history for this message
Egor Kotko (ykotko) wrote :

I have hit it again on a virtual lab. A live environment is accessible here: http://172.18.164.133:8000/
After failover it is possible to get problems with Nova: sometimes there are very long timeouts on instance or security group creation, or an instance can stay in the BUILD state indefinitely.

{"build_id": "2014-11-20_21-01-00", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "28", "auth_required": true, "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "51087c92a50be982071a074ff2bea01f1a5ddb76", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-20_21-01-00", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "28", "api": "1.0", "nailgun_sha": "7580f6341a726c2019f880ae23ff3f1c581fd850", "production": "docker", "fuelmain_sha": "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3", "astute_sha": "51087c92a50be982071a074ff2bea01f1a5ddb76", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "b3d9f0e203f2f0faf3763e871a8dc31570777fed"}}}, "fuellib_sha": "b3d9f0e203f2f0faf3763e871a8dc31570777fed"}

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Checked with Yegor - did not find any issues with AMQP for the "http://172.18.164.133:8000/" environment - there are some performance issues, but none of them are related to AMQP failover. We are going to try to reproduce this bug one more time using real hardware.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Kirill Omelchenko (kirill-omelchenko) wrote :

I hit a variant of this issue on a virtual env (5.1.1 - #45).

3x Controllers, 2x Computes, 2x CEPH-storage

- after a successful setup, shut down the primary controller.

As a result, crm status reports errors:
[root@node-2 ~]# crm status
Last updated: Mon Dec 1 10:14:55 2014
Last change: Mon Dec 1 10:14:18 2014 via crm_attribute on node-3.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-3.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
17 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 Clone Set: clone_ping_vip__public_old [ping_vip__public_old]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-2.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_openstack-heat-engine [p_openstack-heat-engine]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]

Failed actions:
    ping_vip__public_old_monitor_20000 on node-2.test.domain.local 'unknown error' (1): call=64, status=Timed Out, last-rc-change='Fri Nov 28 16:45:15 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-2.test.domain.local 'unknown error' (1): call=90, status=complete, last-rc-change='Fri Nov 28 14:54:30 2014', queued=0ms, exec=0ms
    p_mysql_monitor_120000 on node-3.test.domain.local 'unknown error' (1): call=105, status=complete, last-rc-change='Fri Nov 28 14:53:22 2014', queued=0ms, exec=0ms

This impacts instance creation and all related tests/actions, both via OSTF and manually.
Diagnostic snapshot: https://copy.com/4KpLdOkhteZMHiOm
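
The "Failed actions" entries above remain in crm status output until they are explicitly cleared. If the underlying resources are in fact healthy again after the failover, a hedged sketch of clearing the stale records (resource and node names taken from the output above) is:

# re-probe each resource and drop its stale failure records on the affected node
crm resource cleanup ping_vip__public_old node-2.test.domain.local
crm resource cleanup p_mysql node-2.test.domain.local
crm resource cleanup p_mysql node-3.test.domain.local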

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Kirill Omelchenko (komelchenko) wrote :

Seems to be an intermittent bug. Will try to reproduce it again.

Changed in fuel:
importance: High → Medium
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This issue is really hard to reproduce. The problem is that one of the nodes goes into a constant loop trying to join the cluster while the other nodes are trying to forget it. There is no known solution for this, but a workaround is to do the following (see the consolidated sketch after the steps):

stop rabbitmq on all controller nodes:

crm resource stop master_p_rabbitmq-server

then on each controller node remove mnesia database for rabbitmq:

rm -rf /var/lib/rabbitmq/mnesia

start rabbitmq again:

crm resource start master_p_rabbitmq-server
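
A consolidated sketch of the workaround above as one script, assuming the controllers are reachable as node-2, node-3 and node-4 over passwordless SSH (both the hostnames and the SSH access are assumptions, not part of this report), run from any one controller:

#!/bin/bash
# Workaround sketch: wipe RabbitMQ's mnesia state cluster-wide and let the
# OCF script reassemble the cluster. Adjust the hostnames to your environment.
CONTROLLERS="node-2 node-3 node-4"

# 1. Stop the RabbitMQ master/slave resource on all controllers.
crm resource stop master_p_rabbitmq-server
# crude fixed wait; confirm with 'crm status' that the resource is fully
# stopped everywhere before touching mnesia
sleep 60

# 2. Remove the mnesia database on every controller so each node rejoins cleanly.
for node in $CONTROLLERS; do
    ssh "$node" 'rm -rf /var/lib/rabbitmq/mnesia'
done

# 3. Start the resource again and check that the cluster reassembles.
crm resource start master_p_rabbitmq-server
crm status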

Changed in fuel:
status: Confirmed → Won't Fix
milestone: 6.0 → 6.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

BTW, it could be a RabbitMQ bug itself - we need to investigate this more thoroughly for the 5.1.2, 6.0.1 and 6.1 releases.

Changed in fuel:
status: Won't Fix → Confirmed
no longer affects: fuel/6.1.x
tags: added: release-notes
summary: - Handshake_timeout of rabbit after shutdown primary controller
+ Handshake_timeout of rabbit after destructive actions
summary: - Handshake_timeout of rabbit after destructive actions
+ Handshake_timeout of rabbit after connectivity issues
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Handshake_timeout of rabbit after connectivity issues

Please note that this issue could also be fixed from the Oslo.messaging side; see x-cancel-on-ha-failover: https://bugs.launchpad.net/nova/+bug/856764/comments/70

summary: - Handshake_timeout of rabbit after connectivity issues
+ RabbitMQ OCF script requires manual intervention in rare cases