OpenStack Compute (nova)

unavailable ironic nodes being scheduled to

Bug #1503453 reported by Mark Silence on 2015-10-06

This bug affects 1 person

Affects                    Status         Importance   Assigned to     Milestone
OpenStack Compute (nova)   Fix Released   Medium       Jesse J. Cook
Mitaka                     Fix Released   Medium       Jay Faulkner

Bug Description

When the compute resource tracker checks nodes, the ironic driver compares each node's state against a list of states for which it should return resources. This prevents nova from scheduling to nodes that are in an ironic state where they are not available, such as during our cleaning process.

The logic around this check ( https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L334-L351 ) looks for existing instances on the node, and if they aren't found it then looks at the conditions for returning the node as unavailable.

The problem is when you have an orphaned instance on your node, one which ironic sees as present but nova does not (usually nova lists it as having been deleted).

The instance detection will return true, causing the memory_mb_used and memory_mb values to be set to the retrieved value from instance_info['memory_mb'].

The check for _node_resources_unavailable will not run because it is in an elif branch. This means that even if the node is in maintenance state, we won't notice, and we won't return all zeros for resources as we normally would.
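
The branch structure described above can be sketched as follows. This is a simplified stand-in, not the driver's code: the `Node` class, the `node_resource` function, and the reduced `node_resources_unavailable` check are hypothetical, and only memory is modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an ironic node record."""
    memory_mb: int
    maintenance: bool = False
    instance_uuid: Optional[str] = None
    instance_info: dict = field(default_factory=dict)


def node_resources_unavailable(node: Node) -> bool:
    # Reduced to the maintenance flag for this sketch.
    return node.maintenance


def node_resource(node: Node) -> dict:
    memory_mb = node.memory_mb
    memory_mb_used = 0
    if node.instance_uuid:
        # Ironic sees an instance: report the node as fully used.
        memory_mb = memory_mb_used = node.instance_info["memory_mb"]
    elif node_resources_unavailable(node):
        # Only reached when no instance was detected.
        memory_mb = 0
    return {"memory_mb": memory_mb, "memory_mb_used": memory_mb_used}


# A node in maintenance that still carries an orphaned instance.
orphaned = Node(memory_mb=8192, maintenance=True,
                instance_uuid="orphan-uuid",
                instance_info={"memory_mb": 8192})
report = node_resource(orphaned)
```

In this sketch the maintenance flag never influences `report`, because the first branch already matched.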

Once the resource tracker calls _update_usage_from_instance, it will not find an instance associated with the node from nova's point of view and will return all of the memory as available instead, causing builds to be scheduled to this node.

Ironic will then fail the build attempt due to it showing an instance already associated with the node.
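
The effect on the resource tracker can be illustrated with a toy calculation. The function below is a hypothetical reduction of the accounting described above, not nova's `_update_usage_from_instance`: it assumes usage is rebuilt only from the instances nova itself tracks on the node.

```python
def free_memory_mb(reported_memory_mb, nova_instances):
    """Recompute free memory from nova's own instance records.

    Hypothetical reduction: the usage reported by the driver is ignored
    and rebuilt from the instances nova tracks on the node.
    """
    used = sum(inst["memory_mb"] for inst in nova_instances)
    return reported_memory_mb - used


# The driver reported an 8192 MB node because ironic still holds an
# instance, but nova has deleted its record and tracks nothing there.
free = free_memory_mb(8192, nova_instances=[])
```

With no tracked instance to subtract, the whole node looks free to the scheduler even though ironic will reject a deployment to it.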

Mark Silence (madasi) on 2015-10-06

Changed in nova:
assignee:	nobody → Mark Silence (madasi)

Markus Zoeller (markus_z) (mzoeller) on 2015-10-07

tags:

added: ironic

Michael Still (mikal) on 2015-10-08

Changed in nova:
status:	New → Triaged
importance:	Undecided → Medium

Mark Silence (madasi) wrote on 2015-10-08:

My initial thought was to swap the logical order of the _node_resources_used and _node_resources_unavailable checks, so that the check for instances only happens after we check for unavailable conditions. However, I think that would cause the same situation as bug #1502177: if you have a maintenance node with an active instance that nova does know about, nova would set the usage itself from the instance record, subtract it from the 0 total resources we sent due to the maintenance state, and report negative free space.
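
The negative free space can be shown with a small worked example. The function name and the 8192 MB figure are made up for illustration; the arithmetic follows the description in this comment.

```python
def reported_memory_mb(node_memory_mb, maintenance):
    # Swapped order: the unavailable check wins, so a node in maintenance
    # reports zero capacity even if it hosts an instance.
    return 0 if maintenance else node_memory_mb


total = reported_memory_mb(8192, maintenance=True)
used_by_nova = 8192  # taken from the instance record nova still has
free = total - used_by_nova
```

Here `total` is 0 and `free` comes out at -8192 MB, which is the negative free space described above.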

It looks like, if the ironic driver implemented the get_per_instance_usage() call, the compute's resource tracker would properly account for orphaned instances and stop reporting them as available capacity. However, I think we would need to pass an ironic node identifier, since this call probably addresses a whole compute under nova's one compute == one host assumption. This would mean changing the function signature and thus the driver API, which is not a trivial change.
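
For illustration, orphan accounting of this kind could look roughly like the sketch below. `FakeDriver`, `find_orphans`, and the layout of the returned dictionary are assumptions made for the example; only the method name get_per_instance_usage() comes from the discussion. The method takes no node argument, which is the limitation raised in this comment.

```python
class FakeDriver:
    """Hypothetical driver exposing per-instance usage as the hypervisor sees it."""

    def __init__(self, usage):
        self._usage = usage

    def get_per_instance_usage(self):
        # Maps instance UUID -> usage record; no node argument is accepted.
        return dict(self._usage)


def find_orphans(driver, tracked_uuids):
    """Return usage records for instances the driver sees but nova does not track."""
    usage = driver.get_per_instance_usage()
    return [usage[uuid] for uuid in sorted(set(usage) - set(tracked_uuids))]


driver = FakeDriver({
    "known-uuid": {"uuid": "known-uuid", "memory_mb": 4096},
    "orphan-uuid": {"uuid": "orphan-uuid", "memory_mb": 8192},
})
orphans = find_orphans(driver, tracked_uuids={"known-uuid"})
orphan_memory_mb = sum(o["memory_mb"] for o in orphans)
```

Counting `orphan_memory_mb` as used would keep the orphan's memory out of the free pool, but without a node identifier the tracker cannot tell which ironic node the usage belongs to.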

Trying to see what the best way to do this is.

Mark Silence (madasi) on 2015-10-08

Changed in nova:
assignee:	Mark Silence (madasi) → nobody

Michael Davies (mrda) wrote on 2015-10-13:

Discussed in IRC with jroll and madasi. Solution still up for discussion.

One comment, though: "should we add nodename to get_per_instance_usage() is super contentious".

If anyone has the bandwidth to pick this up, feel free.

Jim Rollenhagen (jim-rollenhagen) wrote on 2016-02-18:

I've been told that adding nodename as an argument to get_per_instance_usage() won't be happening, as folks are trying to get away from instance.node. Anyone else have ideas on fixing this?

Jesse J. Cook (jesse-j-cook) on 2016-04-15

Changed in nova:
assignee:	nobody → Jesse J. Cook (jesse-j-cook)

OpenStack Infra (hudson-openstack) wrote on 2016-04-16: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/306670

Changed in nova:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-05-09: Fix merged to nova (master)

Reviewed: https://review.openstack.org/306670
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=016b810f675b20e8ce78f4c82dc9c679c0162b7a
Submitter: Jenkins
Branch: master

commit 016b810f675b20e8ce78f4c82dc9c679c0162b7a
Author: Jesse J. Cook <email address hidden>
Date: Sat Apr 16 00:35:34 2016 +0000

Unavailable hosts have no resources for use

If a host's:

      * resources are unavailable
      * in a unusable state

the system should:

      * report 0 resources
      * show 0 resources
      * not be scheduled to

Change-Id: Ia1c2f6f161dde1e23acce85a54566d07805d13df
Closes-Bug: 1503453
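
One way to meet the requirements in this commit message is to make the unavailable check unconditional instead of an elif. The sketch below is an assumption about the general shape of such a fix, not the merged change (see the linked review for the actual diff); `Node` and `node_resource` are hypothetical and only memory is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an ironic node record."""
    memory_mb: int
    maintenance: bool = False
    instance_uuid: Optional[str] = None


def node_resource(node: Node) -> dict:
    memory_mb = node.memory_mb
    memory_mb_used = 0
    if node.instance_uuid:
        # An instance is present: the whole node counts as used.
        memory_mb_used = memory_mb
    # Plain `if`, not `elif`: this runs even when an instance was detected.
    if node.maintenance:
        memory_mb = 0
        memory_mb_used = 0
    return {"memory_mb": memory_mb, "memory_mb_used": memory_mb_used}


orphaned = Node(memory_mb=8192, maintenance=True, instance_uuid="orphan-uuid")
report = node_resource(orphaned)
```

With the check no longer skipped, the maintenance node with an orphaned instance reports zero capacity in this sketch.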

Changed in nova:
status:	In Progress → Fix Released

Jay Faulkner (jason-oldos) wrote on 2016-05-27:

There's still another case of this bug not handled:

If the Ironic node is in a state that'd typically be considered usable (such as AVAILABLE), but has an instance uuid (an invalid state, but possible nonetheless), nova will still schedule to it. I will push a patch that should resolve this case as well.

OpenStack Infra (hudson-openstack) wrote on 2016-05-27: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/321907

Michael Still (mikal) on 2016-05-29

tags:

added: liberty-backport-potential mitaka-backport-potential

Tony Breeds (o-tony) wrote on 2016-05-29:

Certainly appropriate for Mitaka, but it isn't a critical/security bug, so dropping liberty.

tags:

removed: ironic liberty-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2016-05-31: Fix merged to nova (master)

Reviewed: https://review.openstack.org/321907
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ffe3093786b137f46646c4cf5fb3ba6b597418a9
Submitter: Jenkins
Branch: master

commit ffe3093786b137f46646c4cf5fb3ba6b597418a9
Author: Jay Faulkner <email address hidden>
Date: Thu May 26 17:31:24 2016 -0700

Ironic nodes with instance_uuid are not available

    Currently, if a node is in AVAILABLE or NOSTATE (legacy), regardless of
    if it has an instance_uuid it is considered able to be scheduled to.
    However, it's impossible for a deployment to succeed to an ironic node
    with instance_uuid populated. We should not schedule to nodes in this
    state.

Change-Id: I41e0c8f1f8a91e11180a6edd72907cf76fe4b235
Closes-bug: 1503453
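
The rule stated in this commit message can be sketched as a predicate. The state constants, the `Node` class, and the reduced set of conditions are assumptions for illustration and are not copied from the merged patch.

```python
from dataclasses import dataclass
from typing import Optional

AVAILABLE = "available"  # assumed value of the AVAILABLE provision state
NOSTATE = None           # assumed legacy value: no provision state recorded


@dataclass
class Node:
    """Hypothetical stand-in for an ironic node record."""
    provision_state: Optional[str]
    maintenance: bool = False
    instance_uuid: Optional[str] = None


def node_resources_unavailable(node: Node) -> bool:
    schedulable_states = (AVAILABLE, NOSTATE)
    return (node.maintenance
            or node.provision_state not in schedulable_states
            # A populated instance_uuid blocks deployment even when the
            # provision state looks schedulable.
            or node.instance_uuid is not None)


stale = Node(provision_state=AVAILABLE, instance_uuid="orphan-uuid")
clean = Node(provision_state=AVAILABLE)
```

In this sketch `stale` is treated as unavailable while `clean` stays schedulable.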

OpenStack Infra (hudson-openstack) wrote on 2016-05-31: Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/323196

OpenStack Infra (hudson-openstack) wrote on 2016-05-31: Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/323477

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-06-02: Fix included in openstack/nova 14.0.0.0b1

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.

Matt Riedemann (mriedem) on 2016-06-07

tags:

added: ironic
removed: mitaka-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2016-06-09: Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/323477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=154aaace9f4cf26b24ccf2edbfbbf1d9dc404f13
Submitter: Jenkins
Branch: stable/mitaka

commit 154aaace9f4cf26b24ccf2edbfbbf1d9dc404f13
Author: Jesse J. Cook <email address hidden>
Date: Sat Apr 16 00:35:34 2016 +0000

Unavailable hosts have no resources for use

If a host's:

      * resources are unavailable
      * in a unusable state

the system should:

      * report 0 resources
      * show 0 resources
      * not be scheduled to

    Change-Id: Ia1c2f6f161dde1e23acce85a54566d07805d13df
    Closes-Bug: 1503453
    (cherry picked from commit 016b810f675b20e8ce78f4c82dc9c679c0162b7a)

OpenStack Infra (hudson-openstack) wrote on 2016-06-09: Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/323196
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ed1c2da1d5aa3e645afac6fdbb64c11cb151664
Submitter: Jenkins
Branch: stable/mitaka

commit 8ed1c2da1d5aa3e645afac6fdbb64c11cb151664
Author: Jay Faulkner <email address hidden>
Date: Thu May 26 17:31:24 2016 -0700

Ironic nodes with instance_uuid are not available

    Change-Id: I41e0c8f1f8a91e11180a6edd72907cf76fe4b235
    Closes-bug: 1503453
    (cherry picked from commit ffe3093786b137f46646c4cf5fb3ba6b597418a9)

Doug Hellmann (doug-hellmann) wrote on 2016-06-14: Fix included in openstack/nova 13.1.0

This issue was fixed in the openstack/nova 13.1.0 release.
