unavailable ironic nodes being scheduled to

Bug #1503453 reported by Mark Silence
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | Medium | Jesse J. Cook |
Mitaka | Fix Released | Medium | Jay Faulkner |

Bug Description

When the compute resource tracker checks nodes, the ironic driver compares each node against a list of states for which it should return resources. This prevents nova from scheduling to nodes that are in unavailable ironic states, such as our cleaning process.

The logic around this check ( https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L334-L351 ) looks for existing instances on the node, and if they aren't found it then looks at the conditions for returning the node as unavailable.

The problem is when you have an orphaned instance on your node, one which ironic sees as present but nova does not (usually nova lists it as having been deleted).

The instance detection will return true, causing the memory_mb_used and memory_mb values to be set to the retrieved value from instance_info['memory_mb'].

The check for _node_resources_unavailable will not run, as it is an elif. This means that even if this node is in maintenance state, we won't notice, and we won't return all zeros for resources as we normally would.

Once the resource tracker calls _update_usage_from_instance, it will not find an instance associated with the node from nova's point of view and will return all of the memory as available instead, causing builds to be scheduled to this node.

Ironic will then fail the build attempt due to it showing an instance already associated with the node.
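
The sequence described above can be reproduced with a small, simplified model. This is only a sketch of the described control flow, not the nova code: the `Node` class, `driver_report`, and `tracker_free_memory` are hypothetical names, and only memory is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an ironic node (not the real object)."""
    memory_mb: int
    maintenance: bool = False
    instance_uuid: Optional[str] = None


def driver_report(node):
    """Return (memory_mb, memory_mb_used) using the if/elif order described."""
    memory_mb = node.memory_mb
    memory_mb_used = 0
    if node.instance_uuid is not None:
        # Instance detected on the node: report the memory as used.
        memory_mb_used = memory_mb
    elif node.maintenance:
        # Only reached when no instance was detected.
        memory_mb = 0
    return memory_mb, memory_mb_used


def tracker_free_memory(node, nova_instance_uuids):
    """Model the tracker recomputing usage from instances nova knows about."""
    memory_mb, _ = driver_report(node)
    known = node.instance_uuid in nova_instance_uuids
    used = memory_mb if known else 0
    return memory_mb - used


# Orphaned instance on a maintenance node: nova has no record of it.
orphaned = Node(memory_mb=1024, maintenance=True, instance_uuid="orphan")
free_mb = tracker_free_memory(orphaned, nova_instance_uuids=set())
```

In this model `free_mb` is the node's full memory even though the node is in maintenance, which is the condition that lets the scheduler pick it.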

Tags: ironic
Mark Silence (madasi)
Changed in nova:
assignee: nobody → Mark Silence (madasi)
tags: added: ironic
Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Mark Silence (madasi) wrote :

My initial thought was to swap the logical order of the _node_resources_used and _node_resources_unavailable checks, so that the check for instances only happens after we check for unavailable conditions. However, I think that would cause the same situation as bug #1502177: if you have a maintenance node with an active instance that nova does know about, nova would set the usage itself from the instance record, subtract it from the 0 total resources we sent due to the maintenance state, and report negative free space.
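
The arithmetic behind that concern can be sketched as follows. This is an illustration under assumptions, not nova's accounting code: `driver_total_swapped` and `tracker_free` are hypothetical helpers, and only memory is considered.

```python
def driver_total_swapped(memory_mb, maintenance):
    """Hypothetical driver with the checks swapped: maintenance wins."""
    return 0 if maintenance else memory_mb


def tracker_free(total_mb, nova_instance_mb):
    """Free memory = reported total minus usage from nova's instance records."""
    return total_mb - sum(nova_instance_mb)


# Maintenance node with one active 1024 MB instance that nova knows about.
total = driver_total_swapped(memory_mb=1024, maintenance=True)
free = tracker_free(total, nova_instance_mb=[1024])
```

Here the reported total is 0 while nova still subtracts the instance's 1024 MB, so `free` comes out negative.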

It looks like if the ironic driver implements the get_per_instance_usage() call, then the compute's resource tracker would properly account for orphaned instances and stop reporting them as available capacity. However, I think we would need to pass an ironic node identifier since this probably addresses a compute under nova's one compute == one host assumption. This would mean changing the function signature and thus the driver API, which is not a trivial change.

Trying to see what the best way to do this is.

Mark Silence (madasi)
Changed in nova:
assignee: Mark Silence (madasi) → nobody
Michael Davies (mrda) wrote :

Discussed in IRC with jroll and madasi. Solution still up for discussion.

One comment though "should we add nodename to get_per_instance_usage() is super contentious".

If anyone has the bandwidth to pick this up, feel free.

Jim Rollenhagen (jim-rollenhagen) wrote :

I've been told that adding nodename as an argument to get_per_instance_usage() won't be happening, as folks are trying to get away from instance.node. Anyone else have ideas on fixing this?

Changed in nova:
assignee: nobody → Jesse J. Cook (jesse-j-cook)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/306670

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/306670
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=016b810f675b20e8ce78f4c82dc9c679c0162b7a
Submitter: Jenkins
Branch: master

commit 016b810f675b20e8ce78f4c82dc9c679c0162b7a
Author: Jesse J. Cook <email address hidden>
Date: Sat Apr 16 00:35:34 2016 +0000

    Unavailable hosts have no resources for use

    If a host's:

      * resources are unavailable
      * in a unusable state

    the system should:

      * report 0 resources
      * show 0 resources
      * not be scheduled to

    Change-Id: Ia1c2f6f161dde1e23acce85a54566d07805d13df
    Closes-Bug: 1503453

Changed in nova:
status: In Progress → Fix Released
Jay Faulkner (jason-oldos) wrote :

There's still another case of this bug not handled:

If the Ironic node is in a state that'd typically be considered usable (such as AVAILABLE), but has an instance uuid (an invalid state, but possible nonetheless), nova will still schedule to it. I will push a patch that should resolve this case as well.
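
A minimal sketch of the kind of check this implies, assuming a simplified node model. The `Node` class, the `node_unavailable` helper, and the state constants are hypothetical illustrations and are not taken from the patch itself.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values for illustration: AVAILABLE as a string, NOSTATE as None.
AVAILABLE = "available"
NOSTATE = None


@dataclass
class Node:
    """Hypothetical stand-in for an ironic node."""
    provision_state: Optional[str]
    maintenance: bool = False
    instance_uuid: Optional[str] = None


def node_unavailable(node):
    """Return True if the node should not be offered for scheduling."""
    good_states = (AVAILABLE, NOSTATE)
    return (
        node.maintenance
        or node.provision_state not in good_states
        # A node in a "good" state that still carries an instance_uuid
        # cannot accept a new deployment, so treat it as unavailable.
        or node.instance_uuid is not None
    )


stale = Node(provision_state=AVAILABLE, instance_uuid="left-over")
is_stale_unavailable = node_unavailable(stale)
```

This only models eligibility for new builds; how usage is reported for nodes with legitimately active instances is a separate question.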

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/321907

Michael Still (mikal)
tags: added: liberty-backport-potential mitaka-backport-potential
Tony Breeds (o-tony) wrote :

Certainly appropriate for Mitaka, but it isn't a critical/security bug, so dropping liberty.

tags: removed: ironic liberty-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/321907
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ffe3093786b137f46646c4cf5fb3ba6b597418a9
Submitter: Jenkins
Branch: master

commit ffe3093786b137f46646c4cf5fb3ba6b597418a9
Author: Jay Faulkner <email address hidden>
Date: Thu May 26 17:31:24 2016 -0700

    Ironic nodes with instance_uuid are not available

    Currently, if a node is in AVAILABLE or NOSTATE (legacy), regardless of
    if it has an instance_uuid it is considered able to be scheduled to.
    However, it's impossible for a deployment to succeed to an ironic node
    with instance_uuid populated. We should not schedule to nodes in this
    state.

    Change-Id: I41e0c8f1f8a91e11180a6edd72907cf76fe4b235
    Closes-bug: 1503453

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/323196

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/323477

Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/nova 14.0.0.0b1

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.

Matt Riedemann (mriedem)
tags: added: ironic
removed: mitaka-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/323477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=154aaace9f4cf26b24ccf2edbfbbf1d9dc404f13
Submitter: Jenkins
Branch: stable/mitaka

commit 154aaace9f4cf26b24ccf2edbfbbf1d9dc404f13
Author: Jesse J. Cook <email address hidden>
Date: Sat Apr 16 00:35:34 2016 +0000

    Unavailable hosts have no resources for use

    If a host's:

      * resources are unavailable
      * in a unusable state

    the system should:

      * report 0 resources
      * show 0 resources
      * not be scheduled to

    Change-Id: Ia1c2f6f161dde1e23acce85a54566d07805d13df
    Closes-Bug: 1503453
    (cherry picked from commit 016b810f675b20e8ce78f4c82dc9c679c0162b7a)

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/323196
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ed1c2da1d5aa3e645afac6fdbb64c11cb151664
Submitter: Jenkins
Branch: stable/mitaka

commit 8ed1c2da1d5aa3e645afac6fdbb64c11cb151664
Author: Jay Faulkner <email address hidden>
Date: Thu May 26 17:31:24 2016 -0700

    Ironic nodes with instance_uuid are not available

    Currently, if a node is in AVAILABLE or NOSTATE (legacy), regardless of
    if it has an instance_uuid it is considered able to be scheduled to.
    However, it's impossible for a deployment to succeed to an ironic node
    with instance_uuid populated. We should not schedule to nodes in this
    state.

    Change-Id: I41e0c8f1f8a91e11180a6edd72907cf76fe4b235
    Closes-bug: 1503453
    (cherry picked from commit ffe3093786b137f46646c4cf5fb3ba6b597418a9)

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 13.1.0

This issue was fixed in the openstack/nova 13.1.0 release.
