MOS 9: running "deploy-changes" sometimes makes Fuel mark random nodes for provisioning

Bug #1657927 reported by ZHI BING WANG
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Invalid | High | Fuel Sustaining |

Bug Description

I have seen this at least twice on a customer site in different DCs.

For example, in a deployed environment, they wanted to run deploy-changes to fix one compute host that was in error status. However, Fuel marked 4 other nodes, which were in ready status, to be provisioned. Those 4 nodes already had OpenStack installed.
Fuel does not allow running "deployment" only; we have to use "provision + deployment". So, in order to fix one node in error status, we had to PXE boot and re-provision those 4 nodes.

This is a very serious problem.

Steps I was able to reproduce it with:
1. Start with an environment in ready status.
2. Make one node offline.
3. Run deploy-changes to add a new node or delete a node (either works).
4. The deployment fails with an error because of the offline node.
5. Run deploy-changes again; Fuel re-provisions that node.
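
Since the reproduction hinges on a node being offline when deploy-changes runs, a pre-flight check on the node list may help avoid triggering it. The following is a minimal sketch, not part of Fuel: the helper `offline_nodes` is hypothetical, and the record fields `name`, `status` and `online` are an assumption about what a node listing from the Fuel API looks like.

```python
from typing import Iterable, List, Mapping


def offline_nodes(nodes: Iterable[Mapping]) -> List[str]:
    """Return the names of nodes whose 'online' flag is missing or false.

    Hypothetical helper: the field names are assumed, not taken from the report.
    """
    return [node["name"] for node in nodes if not node.get("online", False)]


# Sample data made up for illustration; in practice it would come from the
# environment's node listing.
nodes = [
    {"name": "controller-1", "status": "ready", "online": True},
    {"name": "compute-1", "status": "ready", "online": False},
    {"name": "compute-2", "status": "ready", "online": True},
]

blocked = offline_nodes(nodes)
if blocked:
    print("Do not run deploy-changes yet; offline nodes: " + ", ".join(blocked))
else:
    print("All nodes online.")
```

Bringing the listed nodes back online before running deploy-changes would avoid step 4 of the reproduction, and with it the re-provisioning in step 5.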

ZHI BING WANG (zwang)
description: updated
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 9.2
ZHI BING WANG (zwang) wrote :

I have seen it again. It should be easy to reproduce:
1. Start with an environment in ready status.
2. Make one node offline.
3. Run deploy-changes to add a new node or delete a node (either works).
4. The deployment fails with an error because of the offline node.
5. Run deploy-changes again; Fuel re-provisions that node.

Evgeniy L (rustyrobot) wrote :

The problem is in older versions of Fuel. It seems to be fixed now, but that has to be double-checked.

https://github.com/openstack/fuel-astute/blob/9.0.1/lib/astute/task_deployment.rb#L205-L211

Sam Stoelinga (sammiestoel) wrote :

I asked Zhibing to reproduce this, as it may result in data loss. He had seen this at AT&T before. I've set the importance to High because of the possible data loss impact.

tags: added: customer-found
Changed in fuel:
importance: Undecided → High
description: updated
Evgeniy L (rustyrobot) wrote :

Sam, I was able to debug the issue. It happens if a node is inaccessible via MCollective during deployment. There is a hardcoded value that moves the node to error_type=provision, which makes Nailgun re-provision the node during the second deployment run.
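
The mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Astute or Nailgun code: the `Node` class, `mark_unreachable` and `nodes_to_provision` are hypothetical names, the selection rule (error status plus error_type "provision", or a pending addition) is an assumption about how nodes get picked for provisioning, and "deploy" is used purely as an example of a non-provision error type, not as a description of the actual 9.1 change.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified node record."""
    name: str
    status: str = "ready"
    error_type: Optional[str] = None
    pending_addition: bool = False


def mark_unreachable(node: Node, error_type: str) -> None:
    """Record a deployment failure for a node that could not be reached."""
    node.status = "error"
    node.error_type = error_type


def nodes_to_provision(nodes: List[Node]) -> List[str]:
    """Assumed rule: provision new nodes and nodes with a provision-type error."""
    return [
        n.name
        for n in nodes
        if n.pending_addition
        or (n.status == "error" and n.error_type == "provision")
    ]


cluster = [
    Node("compute-1"),  # already deployed, then goes offline
    Node("compute-2"),  # already deployed, stays online
    Node("compute-3", status="discover", pending_addition=True),  # being added
]

# Failure recorded with the hardcoded provision error type:
mark_unreachable(cluster[0], "provision")
with_provision_error = nodes_to_provision(cluster)

# Same failure recorded with a non-provision error type instead:
mark_unreachable(cluster[0], "deploy")
with_deploy_error = nodes_to_provision(cluster)

print(with_provision_error)
print(with_deploy_error)
```

In the first case the already-deployed compute-1 is selected for provisioning alongside the new node, which matches the data loss risk raised in this report. In the second case only the new node is selected.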

The issue is fixed in 9.1 and newer.

https://github.com/openstack/fuel-astute/blob/9.0.1/lib/astute/task_deployment.rb#L205-L211
https://github.com/openstack/fuel-astute/blob/9.1/lib/astute/task_deployment.rb#L315-L318

http://paste.openstack.org/show/596157/

Oleksiy Molchanov (omolchanov) wrote :

Marking as Invalid, as it is already fixed in 9.1+

Changed in fuel:
status: New → Invalid