Bug #1860555 “PCI passthrough reschedule race condition” : Bugs : OpenStack Compute (nova)

Stephen Finucane (stephenfinucane) on 2020-01-28

Changed in nova:
importance:	Undecided → High

OpenStack Infra (hudson-openstack) wrote on 2020-03-02: Fix proposed to nova (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/710847

Changed in nova:
assignee:	nobody → Mark Goddard (mgoddard)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2020-03-02:

#2

Fix proposed to branch: master
Review: https://review.opendev.org/710848

OpenStack Infra (hudson-openstack) wrote on 2020-10-29: Related fix proposed to nova (master)

#3

Related fix proposed to branch: master
Review: https://review.opendev.org/760354

OpenStack Infra (hudson-openstack) wrote on 2023-05-18: Change abandoned on nova (master)

#4

Change abandoned by "Mark Goddard <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/710847
Reason: Let's go with https://review.opendev.org/c/openstack/nova/+/710848 instead

OpenStack Infra (hudson-openstack) wrote on 2024-08-13: Related fix merged to nova (master)

#5

Reviewed: https://review.opendev.org/c/openstack/nova/+/760354
Committed: https://opendev.org/openstack/nova/commit/a2d77845ab247f1b09e2ae4f32f9421c3f50b98d
Submitter: "Zuul (22348)"
Branch: master

commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

Functional test test_boot_reschedule_with_proper_pci_device_count

Let's first ensure we have a test that proves we have bad behaviour,
then follow up with the fix and the test tweak to prove it.

On the first compute node the boot fails due to a group policy error.
On the second compute node the instance should have exactly one PCI device.

Related-Bug: #1860555
Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Fix merged to nova (master)

#6

Reviewed: https://review.opendev.org/c/openstack/nova/+/926407
Committed: https://opendev.org/openstack/nova/commit/f8b98390dc99f6cb0101c88223eb840e0d1c7124
Submitter: "Zuul (22348)"
Branch: master

commit f8b98390dc99f6cb0101c88223eb840e0d1c7124
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    it gets from the compute manager. The PCI claim logic, however, does not
    use the copy; it uses the original instance object. The abort claim logic,
    including the abort PCI claim logic, worked only on the copy. Therefore the
    claimed PCI devices remain visible to the compute manager in the
    instance.pci_devices list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

Closes-Bug: #1860555
Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673

Changed in nova:
status:	In Progress → Fix Released
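
The copy-versus-original mismatch described in the commit message above can be illustrated with a minimal sketch. This is not nova's code: `Instance`, `BuggyClaim`, `FixedClaim`, `boot_with_reschedule` and the device names are all hypothetical stand-ins, and the "fixed" variant shows only one possible way to make claim and abort agree on the same object, which may differ from what the merged change does.

```python
import copy


class Instance:
    """Hypothetical stand-in for an instance; it only tracks PCI devices."""

    def __init__(self):
        self.pci_devices = []


class BuggyClaim:
    """Claims the device on the original instance, aborts on a private copy."""

    def __init__(self, instance, device):
        # The claim keeps its own copy of the instance ...
        self.instance = copy.deepcopy(instance)
        self.device = device
        # ... but the PCI claim step mutates the caller's original object.
        instance.pci_devices.append(device)

    def abort(self):
        # Cleanup only looks at the copy, which never held the device,
        # so the caller's instance keeps the stale entry.
        if self.device in self.instance.pci_devices:
            self.instance.pci_devices.remove(self.device)


class FixedClaim:
    """Claim and abort operate on the same instance object (assumed fix)."""

    def __init__(self, instance, device):
        self.instance = instance
        self.device = device
        instance.pci_devices.append(device)

    def abort(self):
        if self.device in self.instance.pci_devices:
            self.instance.pci_devices.remove(self.device)


def boot_with_reschedule(claim_cls):
    """Fail on a first host, succeed on a second; return the device count."""
    instance = Instance()
    first = claim_cls(instance, "host1-dev")  # hypothetical device name
    first.abort()                             # first host rejects the boot
    claim_cls(instance, "host2-dev")          # reschedule to a second host
    return len(instance.pci_devices)


if __name__ == "__main__":
    print("buggy:", boot_with_reschedule(BuggyClaim))
    print("fixed:", boot_with_reschedule(FixedClaim))
```

In this toy model the buggy claim leaves two devices on the instance after the reschedule, while the consistent variant leaves exactly one, which is the condition the functional test in comment #5 is described as checking.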

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Related fix proposed to nova (stable/2023.2)

#7

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/926554

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Fix proposed to nova (stable/2023.2)

#8

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/926555

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Related fix proposed to nova (stable/2024.1)

#9

Related fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/926556

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Fix proposed to nova (stable/2024.1)

#10

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/926557

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Related fix proposed to nova (stable/2023.1)

#11

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/926558

OpenStack Infra (hudson-openstack) wrote on 2024-08-19: Fix proposed to nova (stable/2023.1)

#12

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/926559

OpenStack Infra (hudson-openstack) wrote on 2024-08-26: Change abandoned on nova (master)

#13

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/710848
Reason: fix merged in https://review.opendev.org/c/openstack/nova/+/926407

OpenStack Infra (hudson-openstack) wrote on 2024-08-26: Related fix merged to nova (stable/2024.1)

#14

Reviewed: https://review.opendev.org/c/openstack/nova/+/926556
Committed: https://opendev.org/openstack/nova/commit/7b0c0cf591a6baf20cc7abd09e7fecc558cf6116
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

Functional test test_boot_reschedule_with_proper_pci_device_count

Let's first ensure we have a test that proves we have bad behaviour,
then follow up with the fix and the test tweak to prove it.

On the first compute node the boot fails due to a group policy error.
On the second compute node the instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)

OpenStack Infra (hudson-openstack) wrote on 2024-08-26: Fix merged to nova (stable/2024.1)

#15

Reviewed: https://review.opendev.org/c/openstack/nova/+/926557
Committed: https://opendev.org/openstack/nova/commit/a0be2c97a642b48d4512bd40e136d0eb7c0696f6
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    it gets from the compute manager. The PCI claim logic, however, does not
    use the copy; it uses the original instance object. The abort claim logic,
    including the abort PCI claim logic, worked only on the copy. Therefore the
    claimed PCI devices remain visible to the compute manager in the
    instance.pci_devices list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)

OpenStack Infra (hudson-openstack) wrote on 2024-08-28: Related fix merged to nova (stable/2023.2)

#16

Reviewed: https://review.opendev.org/c/openstack/nova/+/926554
Committed: https://opendev.org/openstack/nova/commit/4dadb087b65f3650fa75f21e6f3505d1e98b1f89
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 4dadb087b65f3650fa75f21e6f3505d1e98b1f89
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

Functional test test_boot_reschedule_with_proper_pci_device_count

Let's first ensure we have a test that proves we have bad behaviour,
then follow up with the fix and the test tweak to prove it.

On the first compute node the boot fails due to a group policy error.
On the second compute node the instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)
    (cherry picked from commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116)

OpenStack Infra (hudson-openstack) wrote on 2024-08-28: Fix merged to nova (stable/2023.2)

#17

Reviewed: https://review.opendev.org/c/openstack/nova/+/926555
Committed: https://opendev.org/openstack/nova/commit/f9b2fdd8692a5f973f560867cebb6db0ee998ba9
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit f9b2fdd8692a5f973f560867cebb6db0ee998ba9
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    it gets from the compute manager. The PCI claim logic, however, does not
    use the copy; it uses the original instance object. The abort claim logic,
    including the abort PCI claim logic, worked only on the copy. Therefore the
    claimed PCI devices remain visible to the compute manager in the
    instance.pci_devices list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)
    (cherry picked from commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6)

OpenStack Infra (hudson-openstack) wrote on 2024-08-30: Related fix merged to nova (stable/2023.1)

#18

Reviewed: https://review.opendev.org/c/openstack/nova/+/926558
Committed: https://opendev.org/openstack/nova/commit/4ae8aa82d606b8d5cf134b0eaae0d81ccac6f13a
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 4ae8aa82d606b8d5cf134b0eaae0d81ccac6f13a
Author: sdmitriev1 <email address hidden>
Date: Sun Dec 12 16:45:59 2021 +0000

Functional test test_boot_reschedule_with_proper_pci_device_count

Let's first ensure we have a test that proves we have bad behaviour,
then follow up with the fix and the test tweak to prove it.

On the first compute node the boot fails due to a group policy error.
On the second compute node the instance should have exactly one PCI device.

    Related-Bug: #1860555
    Change-Id: Ia122fff268c8f45ad3e5a3071d2cb7c990cb2c1d
    (cherry picked from commit a2d77845ab247f1b09e2ae4f32f9421c3f50b98d)
    (cherry picked from commit 7b0c0cf591a6baf20cc7abd09e7fecc558cf6116)
    (cherry picked from commit 4dadb087b65f3650fa75f21e6f3505d1e98b1f89)

OpenStack Infra (hudson-openstack) wrote on 2024-08-30: Fix merged to nova (stable/2023.1)

#19

Reviewed: https://review.opendev.org/c/openstack/nova/+/926559
Committed: https://opendev.org/openstack/nova/commit/6c3d72f4776eb8f1e1b8857d1b7120c12abffdfe
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 6c3d72f4776eb8f1e1b8857d1b7120c12abffdfe
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 15 13:06:39 2024 +0200

Fix PCI passthrough cleanup on reschedule

    The resource tracker Claim object works on a copy of the instance object
    it gets from the compute manager. The PCI claim logic, however, does not
    use the copy; it uses the original instance object. The abort claim logic,
    including the abort PCI claim logic, worked only on the copy. Therefore the
    claimed PCI devices remain visible to the compute manager in the
    instance.pci_devices list even after the claim is aborted.

    There was another bug in the PCIDevice object where the instance object
    wasn't passed to the free() function and therefore the
    instance.pci_devices list wasn't updated when the device was freed.

    Closes-Bug: #1860555
    Change-Id: Iff343d4d78996cd17a6a584fefa7071c81311673
    (cherry picked from commit f8b98390dc99f6cb0101c88223eb840e0d1c7124)
    (cherry picked from commit a0be2c97a642b48d4512bd40e136d0eb7c0696f6)
    (cherry picked from commit f9b2fdd8692a5f973f560867cebb6db0ee998ba9)

OpenStack Infra (hudson-openstack) wrote on 2024-09-20: Fix included in openstack/nova 30.0.0.0rc1

#20

This issue was fixed in the openstack/nova 30.0.0.0rc1 release candidate.
