Fuel for OpenStack

Deployment fails during controllers removal: execution of '/usr/sbin/rabbitmq-plugins list -E -m' command expired

Bug #1529952 reported by Artem Panchenko on 2015-12-29

This bug report is a duplicate of: Bug #1472230: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain.

This bug affects 1 person

Affects             Status       Importance  Assigned to                Milestone
Fuel for OpenStack  In Progress  High        Kyrylo Galanov             Fuel for OpenStack 9.0
8.0.x               Confirmed    High        Fuel Library (Deprecated)  Fuel for OpenStack 8.0

Bug Description

Deployment fails during controller removal because the puppet task 'rabbitmq.pp' returns an error on the primary controller after the 2 other controllers are removed:

2015-12-29 02:51:56 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (info): Starting to evaluate the resource
2015-12-29 02:51:56 +0000 Puppet (debug): Executing '/usr/sbin/rabbitmq-plugins list -E -m'
2015-12-29 02:52:06 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (err): Could not evaluate: execution expired
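
The timestamps in the log above show how long Puppet waited before it gave up on the command. As a rough illustration, the following Python sketch computes that interval from the two relevant log lines. The helper names `parse_timestamp` and `seconds_until_expiry` are hypothetical and are not part of Fuel or Puppet.

```python
from datetime import datetime

# The two relevant lines, copied from the Puppet log in this report.
LOG_LINES = [
    "2015-12-29 02:51:56 +0000 Puppet (debug): Executing '/usr/sbin/rabbitmq-plugins list -E -m'",
    "2015-12-29 02:52:06 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (err): Could not evaluate: execution expired",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS +ZZZZ' timestamp of a log line."""
    # The timestamp is made of the first three whitespace-separated fields.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")


def seconds_until_expiry(lines):
    """Seconds between the 'Executing' line and the 'execution expired' line."""
    started = next(parse_timestamp(l) for l in lines if "Executing" in l)
    expired = next(parse_timestamp(l) for l in lines if "execution expired" in l)
    return (expired - started).total_seconds()


elapsed = seconds_until_expiry(LOG_LINES)
print(elapsed)
```

For the lines quoted in this report the interval is 10 seconds, far shorter than the manual run of the same command, which was still hanging after more than 12 minutes.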

All commands that try to connect to RabbitMQ hang on the primary controller, for example:

root@node-1:~# time /usr/sbin/rabbitmq-plugins list -E -m^C
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
real 12m52.051s
user 0m0.653s
sys 0m0.180s
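
When reproducing a hang like the one above, it can help to bound the command with a timeout instead of interrupting it by hand. The following is a minimal Python sketch, not part of Fuel. The helper `run_with_timeout` is hypothetical, and a sleeping Python child process stands in for the hanging command so that the example can run anywhere.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, returncode); returncode is None on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed once this many seconds pass
        )
    except subprocess.TimeoutExpired:
        return True, None
    return False, proc.returncode


# Stand-in for a hanging command (assumption: sleeps much longer than the timeout).
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
timed_out, returncode = run_with_timeout(hanging_cmd, 1)
print(timed_out, returncode)
```

On an affected node, the command list would instead be `["/usr/sbin/rabbitmq-plugins", "list", "-E", "-m"]`, with a timeout chosen to match the check being reproduced.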

Also, the RabbitMQ daemon is dead on all controllers:

Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Stopped: [ node-2.test.domain.local node-4.test.domain.local ]
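
To check this condition in a script rather than by eye, the node names can be pulled out of the status text. The Python sketch below assumes the status output has exactly the shape quoted above. The function `stopped_nodes` is a hypothetical helper, not a Pacemaker or Fuel API.

```python
import re

# Status excerpt copied from this report.
STATUS = """Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Stopped: [ node-2.test.domain.local node-4.test.domain.local ]"""


def stopped_nodes(status_text):
    """Return the node names listed on 'Stopped: [ ... ]' lines."""
    nodes = []
    for line in status_text.splitlines():
        match = re.match(r"\s*Stopped:\s*\[(.*)\]\s*$", line)
        if match:
            # Node names inside the brackets are separated by whitespace.
            nodes.extend(match.group(1).split())
    return nodes


print(stopped_nodes(STATUS))
```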

Steps to reproduce:

1. Create a cluster
2. Add 1 controller node
3. Deploy the cluster
4. Check swift, and invoke swift-rings-rebalance.sh on the primary controller if the check failed
5. Add 2 controller nodes
6. Deploy changes
7. Check swift, and invoke swift-rings-rebalance.sh on the primary controller if the check failed
8. Run OSTF
9. Add 2 controller nodes and 1 compute node
10. Deploy changes
11. Check swift, and invoke swift-rings-rebalance.sh on all the controllers
12. Run OSTF
13. Delete 2 controllers
14. Deploy changes

Expected result: nodes are successfully removed, cluster is operational
Actual result: nodes are removed, but re-deployment of the remaining controller fails and the cluster has 'error' status

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkSzhMeW9EcGkxZ2c/view?usp=sharing

Tags:

Artem Panchenko (apanchenko-8) on 2015-12-29

Changed in fuel:
milestone:	none → 9.0

Ivan Ponomarev (ivanzipfer) on 2015-12-30

tags:	added: area-library
Changed in fuel:
status:	New → Confirmed

Kyrylo Galanov (kgalanov) on 2016-01-05

tags:	added: team-bugfix
Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)

Kyrylo Galanov (kgalanov) on 2016-01-05

Changed in fuel:
status:	Confirmed → In Progress

Kyrylo Galanov (kgalanov) wrote on 2016-01-06:

In step 13 the primary controller is deleted, which may cause the RabbitMQ failure.

Bogdan Dobrelya (bogdando) on 2016-01-08

tags:	added: life-cycle-management

Bogdan Dobrelya (bogdando) wrote on 2016-01-08:

The logs show the same fork bomb patterns as in bug #1472230. There are also multiple corosync warnings such as "warning: qb_ipcs_event_sendv: new_event_notification (9966-9288-14): Broken pipe (32)".

Tatyanka (tatyana-leontovich) wrote on 2016-01-21:

Reproduced on the 429 ISO.
Scenario:
https://mirantis.testrail.com/index.php?/tests/view/2465653&group_by=tests:status_id&group_order=asc&group_id=8