Redeployment failed with error "Rabbitmq/Rabbitmq_user[nova]) Could not evaluate: Command is still failing after 180 seconds expired!"

Bug #1472135 reported by Egor Kotko
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

{"build_id": "2015-07-02_07-14-17", "build_number": "22", "release_versions": {"2014.2.2-7.0": {"VERSION": {"build_id": "2015-07-02_07-14-17", "build_number": "22", "api": "1.0", "fuel-library_sha": "f25d5da0a95b000fe654e03e08eb09dbfdd2caaa", "nailgun_sha": "f7a87bf0039727678bffe160991d643433d2031c", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "4669156830daa3bb39573ad4a821bb1a0e2702eb", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "f25d5da0a95b000fe654e03e08eb09dbfdd2caaa", "nailgun_sha": "f7a87bf0039727678bffe160991d643433d2031c", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "4669156830daa3bb39573ad4a821bb1a0e2702eb", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}

Steps to reproduce:
1. Create cluster
2. Add 1 controller node
3. Deploy the cluster
4. Add 2 controller nodes
5. Deploy changes
6. Run network verification
7. Add 2 controller nodes, 1 compute node
8. Deploy changes
9. Run network verification
10. Run OSTF ha, sanity, smoke
11. Delete the primary and the last added controllers
12. Deploy changes
13. Run OSTF ha, sanity, smoke

Actual result:
Deployment failed on step "12" with errors in puppet log:
http://paste.openstack.org/show/351118/

Egor Kotko (ykotko) wrote :
Changed in fuel:
status: New → Confirmed
assignee: Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)
Oleksiy Molchanov (omolchanov) wrote :

@Bogdan, can you check? I can see that rabbitmq on node-5 was not able to operate, but I cannot determine the root cause.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

This looks like a flaw in the promote logic of the rabbitmq pacemaker RA.
Logs show that node-2 was selected for promotion, but "something went wrong" and post-promote exited in an unexpected state: http://paste.openstack.org/show/VK4oWR2SroyJK5Pvh5WQ/, see lines 18-22.

The corresponding OCF script code is https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L1415-L1419, and it exited after line 1455 below. As a result, post-promote exited with the rabbit app running but, it seems, completely broken (in the logs you can see that list_channels reported error 2).
Other rabbit nodes failed to join this master and failed to operate normally, which caused the node removal operation to fail as well.

So this is definitely a bug and should be fixed. Post-promote should exit with a generic error when the rabbit app is running but list_channels reports errors.
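
To make the proposed behaviour concrete, here is a rough sketch of the decision only, written in Python rather than the RA's shell. It is not the actual OCF script: the function name and its two inputs are hypothetical, and the handling of the "app not running" case is an assumption, since the comment above only covers the running case. The exit code values are the standard OCF ones.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def post_promote_exit_code(app_running: bool, list_channels_rc: int) -> int:
    """Hypothetical helper: pick the exit code for the post-promote step.

    app_running      -- whether the rabbit app reports itself as running
    list_channels_rc -- exit code of the list_channels check (0 means OK)
    """
    if not app_running:
        # Assumption: a stopped app is also reported as a generic failure.
        return OCF_ERR_GENERIC
    if list_channels_rc != 0:
        # Running but broken: the case seen in this bug (list_channels error 2).
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

With the state from the logs (app running, list_channels returning 2), this sketch returns OCF_ERR_GENERIC instead of letting post-promote finish as if the master were healthy.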

Oleksiy Molchanov (omolchanov) wrote :

@Bogdan, thanks for the update
