[RFE] Overcloud deploy resiliency

Bug #1633299 reported by Joe Talerico
This bug affects 1 person
Affects: Ironic
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When the user kicks off a deployment, the nodes begin to install the overcloud image. Once the image is written to disk, the node should restart and boot the overcloud image. However, there is no mechanism to verify that the nodes have actually rebooted into the overcloud image.

Since there isn't anything to check that the nodes have booted the overcloud image, deployments are left to fail if they get stuck.

What I have seen cause deployments to get stuck:

1) The node just doesn't reboot (seen on HP, Dell and Supermicro)
2) The node reboots, but the dreaded RAID battery has died or something similar has happened (the node should be rescheduled)

For #1, I simply run "ironic node-set-power-state <uuid> off", wait 30 seconds, and then run the same command with "on". This typically gets the deployment moving again.
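The manual workaround for #1 could be scripted roughly as follows. This is only a sketch: it shells out to the same ironic CLI command quoted above, and the function name power_cycle and its run/sleep parameters are made up for illustration (they let the sequence be dry-run without touching hardware).

```python
import subprocess
import time


def power_cycle(node_uuid, wait_seconds=30,
                run=subprocess.check_call, sleep=time.sleep):
    """Power a node off, wait, then power it back on via the ironic CLI.

    `run` and `sleep` are hypothetical injection points so the sequence
    can be exercised without a real node; by default they call the CLI
    and actually wait.
    """
    # Same command as in the report, first with "off" ...
    run(["ironic", "node-set-power-state", node_uuid, "off"])
    # ... wait the 30 seconds mentioned above (configurable) ...
    sleep(wait_seconds)
    # ... then with "on".
    run(["ironic", "node-set-power-state", node_uuid, "on"])
```

Calling power_cycle("<uuid>") with the defaults requires the ironic client to be installed and credentials to be loaded in the environment.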

For #2, I am pretty much dead in the water unless I can get the RAID battery or the RAID message cleared in time for the deployment. In this case, I think we should reschedule and put the node in maintenance mode.

This RFE started from: http://lists.openstack.org/pipermail/openstack-dev/2016-October/105388.html

Tags: rfe
tags: added: rfe
Joe Talerico (jtaleric)
no longer affects: tripleo
Joe Talerico (jtaleric) wrote :

Tracking issues we discussed on IRC...

multi-tenancy - provisioning network is removed from the instance once the reboot has occurred (at the switch level). This is due to security.

nova console-log - I suggested we could insert a hash into the metadata output, which we would then search for in the nova console-log to determine whether the node is in the desired state. However, nova console-log might not be supported across hypervisors.
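A rough sketch of the console-log idea, to make it concrete. Everything here is an assumption for illustration: the marker format, the function names, and fetch_console_log, which stands in for whatever returns the console text (for example a wrapper around "nova console-log"). How the marker gets into the instance's metadata output is not shown.

```python
import secrets
import time


def new_marker():
    """Return a random token to embed in the node's metadata output."""
    # The "deploy-marker-" prefix is a hypothetical naming choice.
    return "deploy-marker-" + secrets.token_hex(16)


def wait_for_marker(fetch_console_log, marker, timeout=600, interval=10,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll the console log until `marker` appears or `timeout` expires.

    `fetch_console_log` is a caller-supplied callable returning the
    current console text. Returns True if the marker was seen, False
    on timeout.
    """
    deadline = clock() + timeout
    while True:
        if marker in fetch_console_log():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A False result would be the signal to retry the power cycle or reschedule the deployment; that decision is left to the caller.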

So, with multi-tenancy, could we keep the provisioning network until after the reboot to ensure the desired state? Once we know the host is in the desired state, we would remove the port and access to the provisioning network. This does leave a potential security risk.

Jim Rollenhagen (jim-rollenhagen) wrote :

This doesn't seem possible to do in a generic way, because checking if the node is "up" is completely image/deploy/node dependent. So, we must reject this RFE.

Orchestration tools above nova can handle this case instead.

Changed in ironic:
status: New → Won't Fix
Ruby Loo (rloo) wrote :

Thanks for submitting this. We discussed this in today's ironic meeting [1], so please take a look if you want to know what folks thought.

wrt #1, folks think that is a bug.

wrt #2, that is outside the scope of Ironic.

wrt comment #1, keeping provisioning network until after the reboot -- NO, big security issue.

[1] starting at 17:19:32, http://eavesdrop.openstack.org/meetings/ironic/2016/ironic.2016-11-28-17.00.log.html
