[nailgun] During an HA deployment, a controller node can have a failure which may lead to failures

Bug #1456318 reported by Alex Schultz
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Alex Schultz

Bug Description

While researching bug 1455570, I found that although the rabbitmq task failed on one of the controller nodes, the deployment was allowed to continue until it eventually tried to run a post_deployment task on that same controller node.

I found that fail_if_error is not set to true for controllers[0] as part of the DeploymentHASerializer[1]. This is a problem for deployments with Ceph roles, because the restart_rados tasks may fail if the deployment did not complete properly on the controllers; bug 1455570 is a perfect example of that. Either controllers must not be allowed to fail, or actions in the post_deployment phase must not include failed nodes.

[0] http://paste.openstack.org/show/227384/
[1] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_serializers.py#L1624
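
To illustrate the second option above (keeping failed nodes out of post_deployment actions), here is a minimal sketch. It is not nailgun code: the function name post_deployment_targets, the node dictionaries, and the "status" values are all assumptions made for the example.

```python
# Hypothetical sketch: choose targets for a post_deployment task while
# skipping any node whose deployment already failed. The node layout and
# the "error" status string are assumptions, not nailgun's data model.

def post_deployment_targets(nodes, roles):
    """Return the uids of nodes in `roles` that did not end in error."""
    return [
        node["uid"]
        for node in nodes
        if node["role"] in roles and node.get("status") != "error"
    ]


nodes = [
    {"uid": "1", "role": "primary-controller", "status": "ready"},
    {"uid": "2", "role": "controller", "status": "error"},  # e.g. rabbitmq task failed
    {"uid": "3", "role": "compute", "status": "ready"},
]

# A task such as restart_rados would then only be sent to healthy controllers.
targets = post_deployment_targets(nodes, {"primary-controller", "controller"})
```

With this approach the deployment could still finish on the remaining nodes, at the cost of leaving a broken controller in the cluster.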

summary: - [astute] During an HA deployment, a controller node can have a failure which may lead to failures
+ [nailgun] During an HA deployment, a controller node can have a failure which may lead to failures
tags: added: module-nailgun
removed: module-astute
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184118

Changed in fuel:
assignee: nobody → Alex Schultz (alex-schultz)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/184118
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=630cf5361d218954c9c8373b6ada9823962b926f
Submitter: Jenkins
Branch: master

commit 630cf5361d218954c9c8373b6ada9823962b926f
Author: Alex Schultz <email address hidden>
Date: Mon May 18 15:09:20 2015 -0500

    Add controller to deployment critical roles list

    Currently if a failure happens on a controller as part of an HA
    deployment, the failure is ignored and the deployment continues on the
    rest of the nodes. This can lead to errors in the post deployment
    phase. One particular instance is when Ceph is being utilized and
    we attempt to restart the rados processes on all of the controllers.

    In order to prevent errors that may occur in the post deployment
    phase, we need to ensure that all actions were successful on all of
    the regular controller nodes as well as the primary-controller role.

    Change-Id: I466c9f13f373d9a7a5e05e365c173895c87a56a2
    Closes-Bug: 1456318
    Related-Bug: 1455570
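
The idea in the commit message, treating both controller roles as critical so that a failure on either one stops the deployment, can be sketched roughly as follows. This is an illustration only and not the merged change: CRITICAL_ROLES and set_critical_nodes are hypothetical names, and the node dictionaries are a simplified stand-in for the serialized deployment data. Only the fail_if_error flag and the role names come from this report.

```python
# Hypothetical sketch: mark every node whose role is in a critical-roles
# list with fail_if_error=True, so a failure on that node is treated as
# fatal for the whole deployment instead of being ignored.

CRITICAL_ROLES = frozenset({"primary-controller", "controller"})


def set_critical_nodes(nodes, critical_roles=CRITICAL_ROLES):
    """Return copies of the node dicts with fail_if_error set per role."""
    return [
        dict(node, fail_if_error=node["role"] in critical_roles)
        for node in nodes
    ]


nodes = [
    {"uid": "1", "role": "primary-controller"},
    {"uid": "2", "role": "controller"},
    {"uid": "3", "role": "compute"},
]

marked = set_critical_nodes(nodes)
flags = {node["uid"]: node["fail_if_error"] for node in marked}
```

In this sketch a compute node can still fail without aborting the run, while any controller failure is surfaced before the post deployment phase starts.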

Changed in fuel:
status: In Progress → Fix Committed