Deployment fails during controllers removal: execution of '/usr/sbin/rabbitmq-plugins list -E -m' command expired

Bug #1529952 reported by Artem Panchenko
This bug affects 1 person
Affects             Status       Importance  Assigned to                Milestone
Fuel for OpenStack  In Progress  High        Kyrylo Galanov
8.0.x               Confirmed    High        Fuel Library (Deprecated)

Bug Description

Deployment fails during controller removal because the puppet task 'rabbitmq.pp' returns an error on the primary controller after the 2 other controllers are removed:

2015-12-29 02:51:56 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (info): Starting to evaluate the resource
2015-12-29 02:51:56 +0000 Puppet (debug): Executing '/usr/sbin/rabbitmq-plugins list -E -m'
2015-12-29 02:52:06 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (err): Could not evaluate: execution expired
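The timestamps in the excerpt above show how long Puppet waited before giving up. A minimal sketch that computes that interval from the two relevant log lines; the helper names (`parse_ts`, `seconds_until_expired`) are illustrative only and not part of Puppet or Fuel:

```python
from datetime import datetime

# Two lines copied from the Puppet log excerpt in this report.
LOG = """\
2015-12-29 02:51:56 +0000 Puppet (debug): Executing '/usr/sbin/rabbitmq-plugins list -E -m'
2015-12-29 02:52:06 +0000 /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management] (err): Could not evaluate: execution expired
"""


def parse_ts(line):
    # Assumes every line starts with a 25-character timestamp
    # of the form "YYYY-MM-DD HH:MM:SS +0000".
    return datetime.strptime(line[:25], "%Y-%m-%d %H:%M:%S %z")


def seconds_until_expired(log_text):
    """Return seconds between the last 'Executing' line and the
    first following 'execution expired' line, or None if not found."""
    started = None
    for line in log_text.splitlines():
        if "Executing '" in line:
            started = parse_ts(line)
        elif "execution expired" in line and started is not None:
            return (parse_ts(line) - started).total_seconds()
    return None


print(seconds_until_expired(LOG))
```

For this excerpt the interval is 10 seconds, while the manual run in this report was interrupted only after more than 12 minutes.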

All commands that try to connect to RabbitMQ hang on the primary controller, for example:

root@node-1:~# time /usr/sbin/rabbitmq-plugins list -E -m^C
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
real 12m52.051s
user 0m0.653s
sys 0m0.180s
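When reproducing the hang it can help to bound the command's run time instead of interrupting it by hand. A minimal sketch using only the Python standard library; `run_bounded` is a hypothetical helper, and the 30-second value in the usage comment is an arbitrary assumption, not a value taken from Fuel or Puppet:

```python
import subprocess


def run_bounded(cmd, timeout_s):
    """Run cmd and return (returncode, output).

    Returns (None, "") if the command does not finish within
    timeout_s seconds; subprocess.run kills the child in that case.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout


# Usage on a controller node (command taken from this report):
#   rc, out = run_bounded(
#       ["/usr/sbin/rabbitmq-plugins", "list", "-E", "-m"], 30)
#   if rc is None:
#       print("rabbitmq-plugins did not answer within 30 s")
```

A `None` return code here only shows that the command hung; it does not say why RabbitMQ is unresponsive.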

The RabbitMQ daemon is also dead on all controllers:

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Stopped: [ node-2.test.domain.local node-4.test.domain.local ]
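To check the same condition across many snapshots, the stopped nodes can be extracted from the status text. A rough sketch that assumes the output has exactly the `Stopped: [ ... ]` shape shown above; the function name `stopped_nodes` is made up for illustration, and other status formats are not handled:

```python
import re

# Status text copied from this report.
STATUS = """\
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Stopped: [ node-2.test.domain.local node-4.test.domain.local ]
"""


def stopped_nodes(status_text):
    """Collect node names listed inside every 'Stopped: [ ... ]' group."""
    nodes = []
    for match in re.finditer(r"Stopped:\s*\[([^\]]*)\]", status_text):
        # Node names are separated by whitespace inside the brackets.
        nodes.extend(match.group(1).split())
    return nodes


for node in stopped_nodes(STATUS):
    print(node)
```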

Steps to reproduce:

1. Create cluster
2. Add 1 controller node
3. Deploy the cluster
4. Check swift, and invoke swift-rings-rebalance.sh
   on primary controller if check failed
5. Add 2 controller nodes
6. Deploy changes
7. Check swift, and invoke swift-rings-rebalance.sh
   on primary controller if check failed
8. Run OSTF
9. Add 2 controller nodes and 1 compute node
10. Deploy changes
11. Check swift, and invoke swift-rings-rebalance.sh
    on all the controllers
12. Run OSTF
13. Delete 2 controllers
14. Deploy changes

Expected result: nodes are successfully removed, cluster is operational
Actual result: nodes are removed, but re-deployment of the remaining controller fails and the cluster has 'error' status

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkSzhMeW9EcGkxZ2c/view?usp=sharing

Changed in fuel:
milestone: none → 9.0
tags: added: area-library
Changed in fuel:
status: New → Confirmed
tags: added: team-bugfix
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)
Changed in fuel:
status: Confirmed → In Progress
Kyrylo Galanov (kgalanov) wrote :

In step 13 the primary controller is deleted. That may cause the RabbitMQ failure.

tags: added: life-cycle-management
Bogdan Dobrelya (bogdando) wrote :

The logs show the same fork bomb patterns as in bug #1472230. There are also multiple corosync issues: "warning: qb_ipcs_event_sendv: new_event_notification (9966-9288-14): Broken pipe (32)"

Tatyanka (tatyana-leontovich) wrote :