Ironic

[RFE] Overcloud deploy resiliency

Bug #1633299 reported by Joe Talerico on 2016-10-14

This bug affects 1 person

Affects   Status      Importance   Assigned to   Milestone
Ironic    Won't Fix   Undecided    Unassigned

Bug Description

When the user kicks off a deployment, the nodes begin to install the overcloud image. Once the image is written to disk, the node should restart and boot the overcloud image. However, there is no mechanism to verify that the nodes have actually rebooted into the overcloud image.

Since there isn't anything to check that the nodes have booted the overcloud image, deployments are left to fail if they get stuck.

What I have seen cause deployments to get stuck:

1) The node just doesn't reboot (seen on HP, Dell and Supermicro)
2) The node reboots, but the dreaded RAID battery has died or something similar (the node should be rescheduled)

For #1, I simply issue an ironic node-set-power-state <uuid> off, wait 30 seconds, and then issue an on. This typically gets the deployment moving again.
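The manual workaround for #1 can be sketched as a small script. This is only an illustration, not something Ironic provides: the function name power_cycle and its run/sleep parameters are assumptions made for the example, and it assumes the ironic CLI is installed and credentials are already loaded in the environment.

```python
import subprocess
import time


def power_cycle(node_uuid, wait_seconds=30,
                run=subprocess.check_call, sleep=time.sleep):
    """Power a node off, wait, then power it back on via the ironic CLI.

    `run` and `sleep` are injectable (hypothetical parameters) so the
    sequence can be exercised without touching real hardware.
    """
    base = ["ironic", "node-set-power-state", node_uuid]
    run(base + ["off"])   # same command as in the report
    sleep(wait_seconds)   # the report waits 30 seconds between off and on
    run(base + ["on"])
```

Calling power_cycle("<uuid>") with the defaults shells out to the CLI twice and raises if either command exits non-zero.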

For #2, I am pretty much dead in the water unless I can get the RAID battery or RAID message cleared in time for the deployment. In this case, I think we should reschedule and put the node in maintenance mode.

This RFE started from: http://lists.openstack.org/pipermail/openstack-dev/2016-October/105388.html

Tags: rfe

Lucas Alvares Gomes (lucasagomes) on 2016-10-14

tags: added: rfe

Joe Talerico (jtaleric) on 2016-10-18

no longer affects: tripleo

Joe Talerico (jtaleric) wrote on 2016-10-18:

Tracking issues we discussed on IRC...

multi-tenancy - the provisioning network is removed from the instance (at the switch level) once the reboot has occurred. This is done for security reasons.

nova console-log - I suggested we could insert a hash into the metadata output and then search for it in the nova console-log output to determine whether the node is in the desired state. However, nova console-log might not be supported across all hypervisors.
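A rough sketch of that idea follows. It is not an existing Ironic or nova feature: make_boot_marker, wait_for_marker, and the fetch_console callable are hypothetical names, and how the marker ends up in the console output (for example, printed by a boot-time script reading the metadata) is assumed rather than specified here.

```python
import secrets
import time


def make_boot_marker():
    """Return a random token to embed in the instance metadata."""
    return "boot-marker-" + secrets.token_hex(16)


def wait_for_marker(fetch_console, marker, attempts=10, interval=30,
                    sleep=time.sleep):
    """Poll the console output until the marker appears.

    `fetch_console` is any callable returning the current console text,
    e.g. a wrapper around `nova console-log <server>` (assumed).
    Returns True if the marker was seen, False after `attempts` tries.
    """
    for attempt in range(attempts):
        if marker in fetch_console():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before asking for the console again
    return False
```

A False result would be the signal to power cycle the node or reschedule it; that decision is left to whatever tool is driving the deployment.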

So, with multi-tenancy, could we keep the provisioning network attached until after the reboot to confirm the desired state? Once we know the host is in the desired state, we would remove the port and its access to the provisioning network. This does leave a potential security risk.

Jim Rollenhagen (jim-rollenhagen) wrote on 2016-11-28:

This doesn't seem possible to do in a generic way, because checking if the node is "up" is completely image/deploy/node dependent. So, we must reject this RFE.

Orchestration tools above nova can handle this case instead.

Changed in ironic:
status:	New → Won't Fix

Ruby Loo (rloo) wrote on 2016-11-28:

Thanks for submitting this. We discussed this in today's ironic meeting [1], so please take a look if you want to know what folks thought.

wrt #1, folks think that is a bug.

wrt #2, that is outside the scope of Ironic.

wrt comment #1, keeping provisioning network until after the reboot -- NO, big security issue.

[1] starting at 17:19:32, http://eavesdrop.openstack.org/meetings/ironic/2016/ironic.2016-11-28-17.00.log.html
