Node has to be powered off before putting in maintenance when clean fails

Bug #1672877 reported by Sai Sindhur Malleni
This bug affects 1 person
Affects: Ironic
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When using Ironic automated cleaning, a node that fails cleaning is put into the "clean failed" state and into maintenance. As a result, Ironic loses track of the node's power state and reports power off even if the node is actually turned on. This causes problems that are very difficult to debug from an os-collect-config/os-net-config perspective, because the node that failed cleaning might still hold an IP address that the undercloud will hand out when provisioning new machines. The "clean failed" node keeps trying to get metadata from the undercloud, and the new node that gets the same IP address either does not get metadata or gets it only intermittently. This leads to failed deployments when machines are recycled from one workload to another.

Dmitry Tantsur (divius) wrote :

Hi! This bug cannot be fixed for two reasons:

1. If we power off on failed cleaning, we're losing the ability to debug it AND we can potentially brick the node, if it was doing something dangerous.

2. Not syncing power state during maintenance is one of the main features of maintenance.

Now, the whole situation may improve with the specific faults work https://review.openstack.org/#/c/334113/, but as it is we cannot really fix it.

Jim Rollenhagen (jim-rollenhagen) wrote :

What if we checked power state immediately when cleaning fails, so the database is up to date?

Or is this bug about getting out of sync if someone turns on the node while it's in maintenance?
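
A minimal sketch of the first suggestion above (read the power state once at the moment cleaning fails, then enter maintenance). This is a toy model, not Ironic code: the `Node` class, `handle_clean_failure`, and the `read_power` callback are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

POWER_ON = "power on"
POWER_OFF = "power off"


@dataclass
class Node:
    # Toy stand-in for a node record; not Ironic's real object model.
    provision_state: str = "cleaning"
    maintenance: bool = False
    # Possibly stale value stored in the database.
    power_state: Optional[str] = POWER_OFF


def handle_clean_failure(node: Node, read_power: Callable[[], str]) -> None:
    """Record the real power state once, then put the node in maintenance."""
    # Hypothetical query to the power driver/BMC, done before maintenance
    # stops any further power syncing.
    node.power_state = read_power()
    node.provision_state = "clean failed"
    node.maintenance = True


node = Node()
# The stored value says "power off", but the machine is actually running.
handle_clean_failure(node, read_power=lambda: POWER_ON)
print(node.provision_state, node.maintenance, node.power_state)
```

With this ordering the stored value reflects the state at failure time, although it can still go stale if someone powers the node on or off while it is in maintenance.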

Sai Sindhur Malleni (smalleni) wrote :

It would be helpful to at least have the last known power state show up before putting it in maintenance, instead of showing power off.

Dmitry Tantsur (divius) wrote :

I'd rather set power state to None on entering maintenance, and keep it this way, until the node gets out of it. Then we won't lie to users if any power actions are done to the node.
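
A rough sketch of this proposal, under the assumption that "None" simply means "unknown" for as long as the node is in maintenance. The `Node` class, `set_maintenance`, `sync_power_state`, and the `read_power` callback are hypothetical and do not correspond to Ironic's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # Toy stand-in for a node record; not Ironic's real object model.
    maintenance: bool = False
    power_state: Optional[str] = "power off"


def set_maintenance(node: Node, enabled: bool,
                    read_power: Callable[[], str]) -> None:
    """Clear the power state on entering maintenance, re-read it on leaving."""
    node.maintenance = enabled
    node.power_state = None if enabled else read_power()


def sync_power_state(node: Node, read_power: Callable[[], str]) -> None:
    """Periodic sync: skipped entirely while the node is in maintenance."""
    if node.maintenance:
        return
    node.power_state = read_power()


node = Node()
set_maintenance(node, True, read_power=lambda: "power on")
sync_power_state(node, read_power=lambda: "power on")
during = node.power_state  # still None: nothing is claimed about the node

set_maintenance(node, False, read_power=lambda: "power on")
after = node.power_state   # real state is read again on leaving maintenance
print(during, after)
```

The point of the design is that an unknown value is reported as unknown, rather than as a stale "power off".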

Sai Sindhur Malleni (smalleni) wrote :

dtantsur++ Can we triage this and set a priority at your convenience, please?

Michael Turek (mjturek)
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
Sam Betts (sambetts) wrote :

This bug seems less about the power state, and more about the node holding onto an IP address in the flat network that it shouldn't have: the neutron port gets deleted, so from neutron's perspective the IP is "free". Perhaps the solution should be that, when cleaning fails, ironic doesn't delete the neutron port for that node, keeping the IP assignment and preventing it from being assigned to a different server and then conflicting.
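
To illustrate the idea, here is a toy allocator (not neutron or ironic code) in which deleting a port frees its address for immediate reuse. The `FlatNetwork` class, `finish_cleaning`, the node names, and the documentation-range addresses are all assumptions made for the sketch.

```python
class FlatNetwork:
    """Toy IP allocator: deleting a port frees its address for reuse."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.ports = {}  # node name -> allocated IP

    def create_port(self, node):
        ip = self.free.pop(0)
        self.ports[node] = ip
        return ip

    def delete_port(self, node):
        # The freed address goes back to the front of the pool, so the
        # next port created is likely to receive it.
        self.free.insert(0, self.ports.pop(node))


def finish_cleaning(net, node, succeeded):
    # Proposed behaviour: release the port only when cleaning succeeded,
    # so a "clean failed" node that is still running keeps its address.
    if succeeded:
        net.delete_port(node)


net = FlatNetwork(["192.0.2.10", "192.0.2.11"])
net.create_port("node-a")
finish_cleaning(net, "node-a", succeeded=False)
new_ip = net.create_port("node-b")
print(net.ports["node-a"], new_ip)
```

Because the failed node's port is kept, the next node receives a different address and the two machines do not compete for metadata on the same IP.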
