Fuel for OpenStack

Redeployment failed with error "Rabbitmq/Rabbitmq_user[nova]) Could not evaluate: Command is still failing after 180 seconds expired!"

Bug #1472135 reported by Egor Kotko on 2015-07-07

This bug report is a duplicate of: Bug #1472230: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain.

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Confirmed	High	Bogdan Dobrelya	Fuel for OpenStack 7.0

Bug Description

{"build_id": "2015-07-02_07-14-17", "build_number": "22", "release_versions": {"2014.2.2-7.0": {"VERSION": {"build_id": "2015-07-02_07-14-17", "build_number": "22", "api": "1.0", "fuel-library_sha": "f25d5da0a95b000fe654e03e08eb09dbfdd2caaa", "nailgun_sha": "f7a87bf0039727678bffe160991d643433d2031c", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "4669156830daa3bb39573ad4a821bb1a0e2702eb", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "f25d5da0a95b000fe654e03e08eb09dbfdd2caaa", "nailgun_sha": "f7a87bf0039727678bffe160991d643433d2031c", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "4669156830daa3bb39573ad4a821bb1a0e2702eb", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}

Steps to reproduce:
            1. Create cluster
            2. Add 1 controller node
            3. Deploy the cluster
            4. Add 2 controller nodes
            5. Deploy changes
            6. Run network verification
            7. Add 2 controller nodes, 1 compute node
            8. Deploy changes
            9. Run network verification
            10. Run OSTF ha, sanity, smoke
            11. Delete the primary and the last added controllers.
            12. Deploy changes
            13. Run OSTF ha, sanity, smoke

Actual result:
Deployment failed on step "12" with errors in puppet log:
http://paste.openstack.org/show/351118/

Egor Kotko (ykotko) wrote on 2015-07-07:

snapshot_link (93 bytes, text/plain)

Oleksiy Molchanov (omolchanov) on 2015-07-07

Changed in fuel:
status:	New → Confirmed
assignee:	Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)

Oleksiy Molchanov (omolchanov) wrote on 2015-07-07:

@Bogdan, can you check? I can see that rabbitmq on node-5 was not able to operate, but I cannot determine the root cause.

Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → Bogdan Dobrelya (bogdando)

Bogdan Dobrelya (bogdando) wrote on 2015-07-07:

This looks like a flaw in the promote logic of the rabbitmq pacemaker RA.
Logs show that node-2 was selected for promotion, but "something went wrong" and post-promote exited in an unexpected state: http://paste.openstack.org/show/VK4oWR2SroyJK5Pvh5WQ/, see lines 18-22.

The corresponding OCF script code is https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L1415-L1419 and it exited after line 1455 below. As a result, post-promote exited with the rabbit app running but, it seems, completely broken (the logs show that list_channels reported error 2).
Other rabbit nodes failed to join this master and failed to operate normally, causing the node removal operation to fail as well.

So this situation is definitely a bug and should be fixed. Post-promote should exit with a generic error when the rabbit app is running but list_channels reports errors.
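
To make the proposed behaviour concrete, here is a minimal Python sketch of the exit-code decision only. It is not the RA's actual shell code: the function name `post_promote_exit_code` and its parameters are hypothetical, and treating a non-running app as a generic error is an assumption, since only the "running but list_channels fails" case is discussed above. The numeric values follow the standard OCF return codes (`OCF_SUCCESS` = 0, `OCF_ERR_GENERIC` = 1).

```python
# Standard OCF resource agent return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def post_promote_exit_code(app_running: bool, list_channels_rc: int) -> int:
    """Hypothetical helper: pick the exit code for the post-promote step.

    app_running      -- whether the rabbit app reports itself as running
    list_channels_rc -- exit status of the list_channels health probe
    """
    if not app_running:
        # Assumption: a non-running app is also reported as a generic error.
        return OCF_ERR_GENERIC
    if list_channels_rc != 0:
        # The case from this bug: the app is running, but the probe fails,
        # so the node must not be reported as a healthy master.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

With the values seen in the logs (app running, list_channels returning 2), this sketch yields `OCF_ERR_GENERIC` instead of letting post-promote finish as if the master were healthy.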

Oleksiy Molchanov (omolchanov) wrote on 2015-07-08:

@Bogdan, thanks for the update.