ironic baremetal node ownership not checked with early vif plugging

Bug #1766301 reported by Julia Kreger
Affects                           Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)          Fix Released  Medium      Jim Rollenhagen
OpenStack Compute (nova) queens   Fix Released  Undecided   Unassigned

Bug Description

It is possible for the scheduler to tell nova that a baremetal node can support multiple instances when in reality that is not the case. The issue is that the ironic virt driver does not check nor assert that the node is in use, which is only a problem before the virt driver has claimed the node. Because VIF plugging must be completed before block device mapping, https://github.com/openstack/nova/blob/stable/queens/nova/virt/ironic/driver.py#L1809 can cause resource exhaustion without checking whether a node is locked.

Depending on scheduling, we can end up having multiple vif plugging requests for the same node. All actions beyond the first vif plugging will fail if only one port is assigned to the node.
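The race described above can be sketched with a minimal, self-contained Python simulation (all names here are illustrative, not nova's actual code): a node with a single physical port accepts the first VIF attach and rejects the second, mirroring the "not enough free physical ports" failure below.

```python
import threading

class FakeNode:
    """Illustrative stand-in for an ironic node with one physical port."""
    def __init__(self):
        self.free_ports = 1
        self._lock = threading.Lock()
        self.results = []

    def attach_vif(self, vif_id):
        # Mirrors ironic's behavior: an attach succeeds only while a
        # free physical port remains on the node.
        with self._lock:
            if self.free_ports > 0:
                self.free_ports -= 1
                self.results.append((vif_id, "attached"))
            else:
                self.results.append(
                    (vif_id, "not enough free physical ports (HTTP 400)"))

node = FakeNode()
# Two builds scheduled onto the same node, each plugging its own VIF
# before any hardware reservation has been asserted.
threads = [threading.Thread(target=node.attach_vif, args=(vif,))
           for vif in ("vif-a", "vif-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one attach succeeds; the other fails the way the logs show.
```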

This manifests as:

Message: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device., Code: 500

With logging from nova-compute:
2018-04-19 19:49:06.832 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Error preparing deploy for instance 964f0d93-e9c4-4067-af8e-cd6e25fb6b59 on baremetal node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8.: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.compute.manager [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Failure prepping block device: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)

HTTP logging:

192.168.24.1 - - [19/Apr/2018:19:49:00 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:04 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 400 177 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:05 -0400] "PATCH /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 HTTP/1.1" 200 1239 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:06 -0400] "DELETE /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs/7d0a6b40-50b3-489b-ae1e-0840e0608253 HTTP/1.1" 400 225 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:20 -0400] "POST /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60/vifs HTTP/1.1" 204 - "-" "python-ironicclient"

Red Hat Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1560690

How to reproduce:

Bulk schedule baremetal instances, ideally with resource class scheduling by flavor disabled, which results in a high likelihood that the same physical baremetal node will be selected more than once. This can be reproduced fairly easily with TripleO when no resource class is defined on the flavor.

Tags: ironic
Changed in nova:
assignee: nobody → Julia Kreger (juliaashleykreger)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/563714

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/563722

Changed in nova:
assignee: Julia Kreger (juliaashleykreger) → Jim Rollenhagen (jim-rollenhagen)
Changed in nova:
assignee: Jim Rollenhagen (jim-rollenhagen) → Julia Kreger (juliaashleykreger)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.openstack.org/563714
Reason: Superseded by suggested patch.

melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
Changed in nova:
assignee: Julia Kreger (juliaashleykreger) → Jim Rollenhagen (jim-rollenhagen)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/563722
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e45c5ec819cfba3a45367bfe7dd853769c80d816
Submitter: Zuul
Branch: master

commit e45c5ec819cfba3a45367bfe7dd853769c80d816
Author: Jim Rollenhagen <email address hidden>
Date: Mon Apr 23 13:53:00 2018 -0400

    ironic: add instance_uuid before any other spawn activity

    In the baremetal use case, it is necessary to allow a remote
    source of truth assert information about a physical node,
    as opposed to a client asserting desired configuration.

    As such, the prepare_networks_before_block_device_mapping virt
    driver method was added, however in doing so network VIF attaching
    was moved before actual hardware reservation. As ironic only
    supports users to have a number of VIFs limited by the number
    of network interfaces in the physical node, VIF attach actions
    cannot be performed without first asserting control over the node.

    Adding an "instance_uuid" upfront allows other ironic API consumers
    to be aware that a node is now in use. Alternatively it also allows
    nova to become aware prior to adding network information on the node
    that the node may already be in use by an external user.

    Co-Authored-By: Julia Kreger <email address hidden>

    Change-Id: I87f085589bb663c519650f307f25d087c88bbdb1
    Closes-Bug: #1766301
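The approach described in the commit message can be sketched as a compare-and-set on instance_uuid before any VIF work begins (a simplified illustration under assumed names, not nova's actual driver code): writing instance_uuid acts as a reservation, so a raced build fails fast instead of exhausting the node's physical ports.

```python
import threading

class FakeNode:
    """Illustrative ironic node where instance_uuid doubles as a reservation."""
    def __init__(self):
        self.instance_uuid = None
        self._lock = threading.Lock()

    def set_instance_uuid(self, uuid):
        # Mirrors the PATCH that writes instance_uuid: it only succeeds
        # if the node is not already claimed by another instance.
        with self._lock:
            if self.instance_uuid is not None:
                raise RuntimeError("node already in use (HTTP 409)")
            self.instance_uuid = uuid

def prepare_networks(node, instance_uuid, vifs):
    """Claim the node *before* plugging VIFs, as the fix does."""
    node.set_instance_uuid(instance_uuid)  # fails fast on a raced node
    # ... only now would VIF attach calls be issued against ironic ...
    return vifs

node = FakeNode()
prepare_networks(node, "inst-1", ["vif-a"])      # first build wins the node
try:
    prepare_networks(node, "inst-2", ["vif-b"])  # raced build is rejected early
    raced_failed = False
except RuntimeError:
    raced_failed = True
```

The key design point, per the commit message, is that the reservation also makes the claim visible to other ironic API consumers, not just to nova.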

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b3

This issue was fixed in the openstack/nova 18.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/743493

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/743493
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6be5a27b361ebaa36975459a2a32bf4330ce2660
Submitter: Zuul
Branch: stable/queens

commit 6be5a27b361ebaa36975459a2a32bf4330ce2660
Author: Jim Rollenhagen <email address hidden>
Date: Mon Apr 23 13:53:00 2018 -0400

    ironic: add instance_uuid before any other spawn activity

    In the baremetal use case, it is necessary to allow a remote
    source of truth assert information about a physical node,
    as opposed to a client asserting desired configuration.

    As such, the prepare_networks_before_block_device_mapping virt
    driver method was added, however in doing so network VIF attaching
    was moved before actual hardware reservation. As ironic only
    supports users to have a number of VIFs limited by the number
    of network interfaces in the physical node, VIF attach actions
    cannot be performed without first asserting control over the node.

    Adding an "instance_uuid" upfront allows other ironic API consumers
    to be aware that a node is now in use. Alternatively it also allows
    nova to become aware prior to adding network information on the node
    that the node may already be in use by an external user.

    Co-Authored-By: Julia Kreger <email address hidden>

    Conflicts:
        nova/tests/unit/compute/test_compute_mgr.py

    NOTE(melwitt): The conflict is because change
    I755b6fdddc9d754326cd9c81b6880581641f73e8 is not in Queens.

    Closes-Bug: #1766301
    Change-Id: I87f085589bb663c519650f307f25d087c88bbdb1
    (cherry picked from commit e45c5ec819cfba3a45367bfe7dd853769c80d816)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
