PCI passthrough reschedule race condition

Bug #1860555 reported by Mark Goddard
44
This bug affects 8 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Mark Goddard

Bug Description

Steps to reproduce
------------------

Create multiple instances concurrently using a flavor with a PCI passthrough request (--property "pci_passthrough:alias"="<alias>:<count>"), and a scheduler hint with some anti-affinity constraint.

Expected result
---------------

The instances are created successfully, and each have the expected number of PCI devices attached.

Actual result
-------------

Sometimes, instances may fail during creation, or may be created with more PCI devices than requested.

Environment
-----------

Nova 18.2.2 (rocky), CentOS 7, libvirt, deployed by kolla-ansible.

Analysis
--------

If an instance with PCI passthrough devices is rescheduled (e.g. due to
affinity violation), the instance can end up with extra PCI devices attached.
If the devices selected on the original and subsequent compute nodes have the
same address, the instance will fail to create, with the following error:

libvirtError: internal error: Device 0000:89:00.0 is already in use

However, if the devices are different, and all available on the first and
second compute nodes, the VM may end up with additional hostdevs.

On investigation, when the node is rescheduled, the instance object passed to
the conductor RPC API contains the PCI devices that should have been freed.
This is because the claim object holds a clone of the instance that is used to
perform the abort on failure [1][2], and the PCI devices removed from its list are not
reflected in the original object. There is a secondary issue that the PCI
manager was not passing through the instance to the PCI object's free() method
in all cases [3], resulting in the PCI device not being removed from the
instance.pci_devices list.

I have two alternative fixes for this issue, but they will need a little time to work their way out of an organisation. Essentially:

1. pass the original instance (not the clone) to the abort function in the Claim.
2. refresh the instance from DB when rescheduling

The former is a more general solution, but I don't know the reasons for using a clone in the first place. The second works for reschedules, but may leave a hole for resize or migration. I haven't reproduced the issue in those cases but it seems possible that it would be present.

[1] https://opendev.org/openstack/nova/src/branch/master/nova/compute/claims.py#L64
[2] https://opendev.org/openstack/nova/src/branch/master/nova/compute/claims.py#L83
[3] https://opendev.org/openstack/nova/src/branch/master/nova/pci/manager.py#L309

Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/710847

Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/710848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/760354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Mark Goddard <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/710847
Reason: Let's go with https://review.opendev.org/c/openstack/nova/+/710848 instead

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/760354
Committed: https://opendev.org/openstack/nova/commit/a2d77845ab247f1b09e2ae4f32f9421c3f50b98d
Submitter: "Zuul (22348)"
Branch: master

commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

    Functional test test_boot_reschedule_with_proper_pci_device_count

    Lets first ensure we have a test that proves we have bad behaviour,
    then follow up with the fix and the test tweak to prove it.

    On the first compute node it fails due to group policy error.
    On the second compute node instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926407
Committed: https://opendev.org/openstack/nova/commit/f8b98390dc99f6cb0101c88223eb840e0d1c7124
Submitter: "Zuul (22348)"
Branch: master

commit f8b98390dc99f6cb0101c88223eb840e0d1c7124
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

    Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    got from the compute manager. But the PCI claim logic does not use the
    copy but use the original instance object. However the abort claim logic
    including the abort PCI claim logic worked on the copy only. Therefore the
    claimed PCI devices are visible to the compute manager in the
    instance.pci_decives list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.2)

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/926554

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/926555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2024.1)

Related fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/926556

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/926557

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/926558

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/926559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/710848
Reason: fix merged in https://review.opendev.org/c/openstack/nova/+/926407

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926556
Committed: https://opendev.org/openstack/nova/commit/7b0c0cf591a6baf20cc7abd09e7fecc558cf6116
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

    Functional test test_boot_reschedule_with_proper_pci_device_count

    Lets first ensure we have a test that proves we have bad behaviour,
    then follow up with the fix and the test tweak to prove it.

    On the first compute node it fails due to group policy error.
    On the second compute node instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926557
Committed: https://opendev.org/openstack/nova/commit/a0be2c97a642b48d4512bd40e136d0eb7c0696f6
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

    Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    got from the compute manager. But the PCI claim logic does not use the
    copy but use the original instance object. However the abort claim logic
    including the abort PCI claim logic worked on the copy only. Therefore the
    claimed PCI devices are visible to the compute manager in the
    instance.pci_decives list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926554
Committed: https://opendev.org/openstack/nova/commit/4dadb087b65f3650fa75f21e6f3505d1e98b1f89
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 4dadb087b65f3650fa75f21e6f3505d1e98b1f89
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

    Functional test test_boot_reschedule_with_proper_pci_device_count

    Lets first ensure we have a test that proves we have bad behaviour,
    then follow up with the fix and the test tweak to prove it.

    On the first compute node it fails due to group policy error.
    On the second compute node instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)
    (cherry picked from commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926555
Committed: https://opendev.org/openstack/nova/commit/f9b2fdd8692a5f973f560867cebb6db0ee998ba9
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit f9b2fdd8692a5f973f560867cebb6db0ee998ba9
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

    Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    got from the compute manager. But the PCI claim logic does not use the
    copy but use the original instance object. However the abort claim logic
    including the abort PCI claim logic worked on the copy only. Therefore the
    claimed PCI devices are visible to the compute manager in the
    instance.pci_decives list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)
    (cherry picked from commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926558
Committed: https://opendev.org/openstack/nova/commit/4ae8aa82d606b8d5cf134b0eaae0d81ccac6f13a
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 4ae8aa82d606b8d5cf134b0eaae0d81ccac6f13a
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

    Functional test test_boot_reschedule_with_proper_pci_device_count

    Lets first ensure we have a test that proves we have bad behaviour,
    then follow up with the fix and the test tweak to prove it.

    On the first compute node it fails due to group policy error.
    On the second compute node instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)
    (cherry picked from commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116)
    (cherry picked from commit 4dadb087b65f3650fa75f21e6f3505d1e98b1f89)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/926559
Committed: https://opendev.org/openstack/nova/commit/6c3d72f4776eb8f1e1b8857d1b7120c12abffdfe
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 6c3d72f4776eb8f1e1b8857d1b7120c12abffdfe
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

    Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    got from the compute manager. But the PCI claim logic does not use the
    copy but use the original instance object. However the abort claim logic
    including the abort PCI claim logic worked on the copy only. Therefore the
    claimed PCI devices are visible to the compute manager in the
    instance.pci_decives list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)
    (cherry picked from commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6)
    (cherry picked from commit f9b2fdd8692a5f973f560867cebb6db0ee998ba9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 30.0.0.0rc1

This issue was fixed in the openstack/nova 30.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.