Fuel for OpenStack

MOS9: when running "deploy-changes", Fuel sometimes marks random nodes for provisioning

Bug #1657927 reported by ZHI BING WANG on 2017-01-20

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	High	Fuel Sustaining	Fuel for OpenStack 9.2

Bug Description

I have seen this at least twice on a customer site in different DCs.

For example, in a deployed environment, they wanted to run deploy-changes to fix one compute host that was in error status. However, Fuel marked 4 other nodes, which were in ready status and already had OpenStack installed, to be provisioned.
Fuel does not allow running "deployment" only; we have to use "provision + deployment". As a result, we have to PXE boot and re-provision those 4 nodes in order to fix the one node in error status.

This is a very serious problem.

Steps I was able to reproduce it with:
1. Start with an environment in ready status.
2. Take one node offline.
3. Run deploy-changes to add or delete a node (either works).
4. The deployment fails with an error because of the offline node.
5. Run deploy-changes again; Fuel re-provisions the offline node.

ZHI BING WANG (zwang) on 2017-01-20

description:	updated

ZHI BING WANG (zwang) on 2017-01-20

description:	updated

Julia Aranovich (jkirnosova) on 2017-01-20

Changed in fuel:
assignee:	nobody → Fuel Sustaining (fuel-sustaining-team)
milestone:	none → 9.2

ZHI BING WANG (zwang) wrote on 2017-01-23:

I have seen it again. It should be easy to reproduce:
1. Start with an environment in ready status.
2. Take one node offline.
3. Run deploy-changes to add or delete a node (either works).
4. The deployment fails with an error because of the offline node.
5. Run deploy-changes again; Fuel re-provisions the offline node.

Evgeniy L (rustyrobot) wrote on 2017-01-23:

This is a problem with older versions of Fuel. It seems to be fixed now, but that has to be double-checked.

https://github.com/openstack/fuel-astute/blob/9.0.1/lib/astute/task_deployment.rb#L205-L211

Sam Stoelinga (sammiestoel) wrote on 2017-01-23:

I had asked Zhibing to reproduce this because it may result in data loss. He had seen this at AT&T before. I've set the priority to High because of the possible data loss impact.

tags:	added: customer-found
Changed in fuel:
importance:	Undecided → High
description:	updated

Evgeniy L (rustyrobot) wrote on 2017-01-23:

Sam, I was able to debug the issue. It happens if a node is inaccessible via MCollective during deployment. There is a hardcoded value that moves the node to error_type=provision, which makes Nailgun re-provision the node during the second deployment run.

The issue is fixed in 9.1 and newer.

https://github.com/openstack/fuel-astute/blob/9.0.1/lib/astute/task_deployment.rb#L205-L211
https://github.com/openstack/fuel-astute/blob/9.1/lib/astute/task_deployment.rb#L315-L318
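
The failure mode described in the last comment can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not the Astute or Nailgun code: the Node class and the fail_offline_nodes and nodes_to_provision functions are hypothetical names invented for illustration. The only details taken from the thread are that an unreachable node was given error_type=provision and that this value caused re-provisioning on the next run. The "deploy" alternative in the second run is likewise an assumption used for contrast and is not a description of the actual 9.1 change.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record; not the Nailgun model."""
    uid: str
    status: str = "ready"              # "ready" or "error"
    error_type: Optional[str] = None   # e.g. "provision" or "deploy"
    online: bool = True                # reachable via the messaging layer


def fail_offline_nodes(nodes: List[Node], error_type: str = "provision") -> None:
    """Mark every unreachable node as failed and record the given error_type.

    The default mirrors the hardcoded value described in the thread.
    """
    for node in nodes:
        if not node.online:
            node.status = "error"
            node.error_type = error_type


def nodes_to_provision(nodes: List[Node]) -> List[str]:
    """Return the uids that a later deploy-changes run would provision again."""
    return [
        n.uid for n in nodes
        if n.status == "error" and n.error_type == "provision"
    ]


# First scenario: node 2 is offline and gets the hardcoded "provision" error.
cluster = [Node("1"), Node("2", online=False), Node("3")]
fail_offline_nodes(cluster)
reprovisioned = nodes_to_provision(cluster)

# Second scenario (assumed alternative): record a deployment error instead.
cluster_alt = [Node("1"), Node("2", online=False), Node("3")]
fail_offline_nodes(cluster_alt, error_type="deploy")
reprovisioned_alt = nodes_to_provision(cluster_alt)

print(reprovisioned, reprovisioned_alt)
```

In this model the offline node is selected for provisioning only in the first scenario, which matches the reported symptom: a node that already had OpenStack installed is wiped on the second deploy-changes run simply because it was unreachable during the first.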