Fuel for OpenStack

[nailgun] During an HA deployment, a controller node can have a failure which may lead to failures

Bug #1456318 reported by Alex Schultz on 2015-05-18

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Alex Schultz	Fuel for OpenStack 6.1

Bug Description

While researching bug 1455570, I found that even though the rabbitmq task failed on one of the controller nodes, the deployment was allowed to continue until it eventually tried to run a post_deployment task on the controller node where rabbitmq had failed.

I found that fail_if_error is not set to true for controllers [0] in the DeploymentHASerializer [1]. This is a problem for deployments with Ceph roles, because the restart_rados task may fail if Ceph was not properly deployed on the controllers; bug 1455570 is a perfect example of that. Either controllers must not be allowed to fail, or actions in the post_deployment phase must not include failed nodes.

[0] http://paste.openstack.org/show/227384/
[1] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/deployment_serializers.py#L1624
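
The second option above (keeping failed nodes out of post_deployment actions) could look roughly like the sketch below. This is only an illustration, not nailgun or astute code: the node dictionaries, the "status" field, and the post_deployment_targets helper are all hypothetical names chosen for the example.

```python
# Hypothetical node records; the field names are assumptions for illustration
# and do not mirror the real nailgun data model.
CONTROLLER_ROLES = {"primary-controller", "controller"}


def post_deployment_targets(nodes):
    """Return uids of controller nodes that did not fail during deployment."""
    return [
        node["uid"]
        for node in nodes
        if node["role"] in CONTROLLER_ROLES and node["status"] != "error"
    ]


nodes = [
    {"uid": "1", "role": "primary-controller", "status": "ready"},
    {"uid": "2", "role": "controller", "status": "error"},  # e.g. rabbitmq task failed
    {"uid": "3", "role": "controller", "status": "ready"},
    {"uid": "4", "role": "compute", "status": "ready"},
]

# Only healthy controllers would receive a task such as restart_rados.
print(post_deployment_targets(nodes))
```

With this approach the deployment would still finish with a broken controller, which is why treating controllers as nodes that must not fail is the stricter alternative.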

Tags:

Alex Schultz (alex-schultz) on 2015-05-18

summary:	- [astute] During an HA deployment, a controller node can have a failure
	+ [nailgun] During an HA deployment, a controller node can have a failure which may lead to failures
tags:	added: module-nailgun removed: module-astute

OpenStack Infra (hudson-openstack) wrote on 2015-05-18: Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184118

Changed in fuel:
assignee:	nobody → Alex Schultz (alex-schultz)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-05-19: Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/184118
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=630cf5361d218954c9c8373b6ada9823962b926f
Submitter: Jenkins
Branch: master

commit 630cf5361d218954c9c8373b6ada9823962b926f
Author: Alex Schultz <email address hidden>
Date: Mon May 18 15:09:20 2015 -0500

Add controller to deployment critical roles list

    Currently if a failure happens on a controller as part of an HA
    deployment, the failure is ignored and the deployment continues on the
    rest of the nodes. This can lead to errors in the post deployment
    phase. One particular instance is when Ceph is being utilized and
    we attempt to restart the rados processes on all of the controllers.

    In order to prevent errors that may occur in the post deployment
    phase, we need to ensure that all actions were successful on all of
    the regular controller nodes as well as the primary-controller role.

    Change-Id: I466c9f13f373d9a7a5e05e365c173895c87a56a2
    Closes-Bug: 1456318
    Related-Bug: 1455570
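
As a rough illustration of what the commit describes, the sketch below marks nodes as fail_if_error when their role appears in a critical roles list. It is a simplified stand-in and not the actual fuel-web change: the function name, the node dictionaries, and the exact contents of the lists are assumptions, and the real serializer may list additional roles.

```python
# Simplified sketch; names and list contents are assumptions for illustration.
CRITICAL_ROLES_BEFORE = ["primary-controller"]
CRITICAL_ROLES_AFTER = ["primary-controller", "controller"]


def set_critical_nodes(nodes, critical_roles):
    """Flag each node so that a failure on a critical role stops the deployment."""
    for node in nodes:
        node["fail_if_error"] = node["role"] in critical_roles
    return nodes


def make_nodes():
    return [
        {"uid": "1", "role": "primary-controller"},
        {"uid": "2", "role": "controller"},
        {"uid": "3", "role": "compute"},
    ]


before = set_critical_nodes(make_nodes(), CRITICAL_ROLES_BEFORE)
after = set_critical_nodes(make_nodes(), CRITICAL_ROLES_AFTER)

# Print which roles are treated as critical under each list.
print([n["role"] for n in before if n["fail_if_error"]])
print([n["role"] for n in after if n["fail_if_error"]])
```

Under the first list a failure on a regular controller would be tolerated; under the second it would be treated the same as a primary-controller failure.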

Changed in fuel:
status:	In Progress → Fix Committed
