archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'

Bug #1899541 reported by melanie witt
This bug affects 1 person
Affects                    Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released  Medium      melanie witt
Queens                     In Progress   Undecided   melanie witt
Rocky                      In Progress   Undecided   melanie witt
Stein                      Fix Released  Undecided   melanie witt
Train                      Fix Released  Undecided   melanie witt
Ussuri                     Fix Released  Undecided   melanie witt
Victoria                   Fix Released  Undecided   melanie witt

Bug Description

This is based on a bug reported downstream [1] where after a random amount of time, update_available_resource began to fail with the following trace on nodes with PCI devices:

  "traceback": [
    "Traceback (most recent call last):",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    " rt.update_available_resource(context, nodename)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    " self._update_available_resource(context, resources)",
    " File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    " return f(*args, **kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    " self._update(context, cn)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
    " self.pci_tracker.save(context)",
    " File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
    " dev.save()",
    " File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
    " ctxt, self, fn.__name__, args, kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
    " objmethod=objmethod, args=args, kwargs=kwargs)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
    " retry=self.retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
    " timeout=timeout, retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
    " retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
    " raise result",
    "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",

The trace above shows an attempt to insert a nearly empty record (almost every field NULL) into the pci_devices table. Inspection of the code shows this can happen when we fail to look up the pci_devices record we want to update and then try to create a new one [2]:

@pick_context_manager_writer
def pci_device_update(context, node_id, address, values):
    query = model_query(context, models.PciDevice, read_deleted="no").\
                    filter_by(compute_node_id=node_id).\
                    filter_by(address=address)
    if query.update(values) == 0:
        device = models.PciDevice()
        device.update(values)
        context.session.add(device)
    return query.one()
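
To make the failure mode concrete, here is a minimal sketch of the same update-or-insert pattern using only the standard-library sqlite3 module. It is not nova code: the reduced schema, the reproduce() helper, the sample address and the 'inst-1' UUID are all assumptions made for illustration, and a plain DELETE stands in for the archive sweep.

```python
import sqlite3

# Reduced, hypothetical schema: only the columns needed for the illustration.
SCHEMA = """
CREATE TABLE pci_devices (
    id INTEGER PRIMARY KEY,
    compute_node_id INTEGER NOT NULL,
    address TEXT NOT NULL,
    status TEXT,
    instance_uuid TEXT,
    deleted INTEGER NOT NULL DEFAULT 0
)
"""


def pci_device_update(conn, node_id, address, values):
    """Update the matching live row, or insert a new row built from `values`."""
    sets = ", ".join(f"{col} = ?" for col in values)
    cur = conn.execute(
        f"UPDATE pci_devices SET {sets} "
        "WHERE compute_node_id = ? AND address = ? AND deleted = 0",
        [*values.values(), node_id, address],
    )
    if cur.rowcount == 0:
        # No row matched, so fall back to INSERT. `values` holds only the
        # changed fields, so compute_node_id and address are left NULL.
        cols = ", ".join(values)
        marks = ", ".join("?" for _ in values)
        conn.execute(
            f"INSERT INTO pci_devices ({cols}) VALUES ({marks})",
            list(values.values()),
        )


def reproduce():
    """Return the IntegrityError text raised once the row has been swept away."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO pci_devices (compute_node_id, address, status, instance_uuid) "
        "VALUES (1, '0000:81:00.0', 'allocated', 'inst-1')"
    )
    # Stand-in for the archive job removing the row as instance "residue".
    conn.execute("DELETE FROM pci_devices WHERE instance_uuid = 'inst-1'")
    try:
        # Freeing the device only passes the fields that change.
        pci_device_update(
            conn, 1, "0000:81:00.0",
            {"status": "available", "instance_uuid": None},
        )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None
```

Calling reproduce() returns the text of a NOT NULL constraint violation, which is the SQLite analogue of the MySQL "Column 'compute_node_id' cannot be null" error in the trace.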

It turns out that when a request came in to delete an instance that had a PCI device allocated, and the archive_deleted_rows cron job fired at just the right (wrong) moment, the job would sweep away the pci_devices record matching the instance_uuid, because archive treats any table with an 'instance_uuid' column as instance "residue" needing cleanup.

So after the pci_devices record was swept away, we tried to update the resource tracker as part of the _complete_deletion method in the compute manager and that failed because we could not locate the pci_devices record to free the PCI device (null out the instance_uuid field).

What we need to do here is stop treating the pci_devices table records as instance residue. The records in pci_devices are not tied to instance lifecycles at all and they are managed independently by the PCI trackers.
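
The idea can be sketched as follows. This is a simplified illustration, not the actual patch: residue_tables(), the exempt parameter and the reduced EXAMPLE_SCHEMA mapping are hypothetical names, and the real archive code works against the SQLAlchemy table metadata rather than a plain dict.

```python
# Hypothetical, reduced view of the schema: table name -> set of column names.
EXAMPLE_SCHEMA = {
    "instances": {"uuid", "deleted"},
    "instance_faults": {"instance_uuid", "deleted"},
    "block_device_mapping": {"instance_uuid", "deleted"},
    "pci_devices": {"instance_uuid", "compute_node_id", "address", "deleted"},
}


def residue_tables(table_columns, exempt=("pci_devices",)):
    """Return tables whose rows may be purged along with a deleted instance.

    A table qualifies if it has an 'instance_uuid' column, unless it is listed
    in `exempt` because its rows are not tied to the instance lifecycle.
    """
    return sorted(
        name
        for name, columns in table_columns.items()
        if "instance_uuid" in columns and name not in exempt
    )
```

With the exemption in place, residue_tables(EXAMPLE_SCHEMA) no longer includes pci_devices, while passing exempt=() reproduces the old behaviour in which every table with an 'instance_uuid' column is swept.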

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
[2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409

Tags: compute db pci
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/757656

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/757656
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1c256cf774693e2395ae8fe4a7a2f416a7aeb03a
Submitter: Zuul
Branch: master

commit 1c256cf774693e2395ae8fe4a7a2f416a7aeb03a
Author: melanie witt <email address hidden>
Date: Mon Oct 12 22:27:52 2020 +0000

    Prevent archiving of pci_devices records because of 'instance_uuid'

    Currently in the archive_deleted_rows code, we will attempt to clean up
    "residue" of deleted instance records by assuming any table with a
    'instance_uuid' column represents data tied to an instance's lifecycle
    and delete such records.

    This behavior poses a problem in the case where an instance has a PCI
    device allocated and someone deletes the instance. The 'instance_uuid'
    column in the pci_devices table is used to track the allocation
    association of a PCI with an instance. There is a small time window
    during which the instance record has been deleted but the PCI device
    has not yet been freed from a database record perspective as PCI
    devices are freed during the _complete_deletion method in the compute
    manager as part of the resource tracker update call.

    Records in the pci_devices table are anyway not related to the
    lifecycle of instances so they should not be considered residue to
    clean up if an instance is deleted. This adds a condition to avoid
    archiving pci_devices on the basis of an instance association.

    Closes-Bug: #1899541

    Change-Id: Ie62d3566230aa3e2786d129adbb2e3570b06e4c6

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/758837

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/760977

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/760978

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/760984

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/760985

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/760987

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.2.0

This issue was fixed in the openstack/nova 22.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.1

This issue was fixed in the openstack/nova 21.2.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.6.1

This issue was fixed in the openstack/nova 20.6.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/c/openstack/nova/+/760984
Committed: https://opendev.org/openstack/nova/commit/da91b19d8be3b9cad8f713a3218a08e2d50238c8
Submitter: "Zuul (22348)"
Branch: stable/stein

commit da91b19d8be3b9cad8f713a3218a08e2d50238c8
Author: melanie witt <email address hidden>
Date: Mon Oct 12 22:27:52 2020 +0000

    Prevent archiving of pci_devices records because of 'instance_uuid'

    Currently in the archive_deleted_rows code, we will attempt to clean up
    "residue" of deleted instance records by assuming any table with a
    'instance_uuid' column represents data tied to an instance's lifecycle
    and delete such records.

    This behavior poses a problem in the case where an instance has a PCI
    device allocated and someone deletes the instance. The 'instance_uuid'
    column in the pci_devices table is used to track the allocation
    association of a PCI with an instance. There is a small time window
    during which the instance record has been deleted but the PCI device
    has not yet been freed from a database record perspective as PCI
    devices are freed during the _complete_deletion method in the compute
    manager as part of the resource tracker update call.

    Records in the pci_devices table are anyway not related to the
    lifecycle of instances so they should not be considered residue to
    clean up if an instance is deleted. This adds a condition to avoid
    archiving pci_devices on the basis of an instance association.

    Closes-Bug: #1899541

    Conflicts:
        nova/db/sqlalchemy/api.py
        nova/tests/functional/db/test_archive.py

    NOTE(melwitt): The conflicts are because change
    I9725f752f8aef8066f7c9705e87610cad887bf8e (refactor nova-manage
    archive_deleted_rows) and change
    Id16c3d91d9ce5db9ffd125b59fffbfedf4a6843d (nova-manage db
    archive_deleted_rows is not multi-cell aware) are not in Stein.

    Change-Id: Ie62d3566230aa3e2786d129adbb2e3570b06e4c6
    (cherry picked from commit 1c256cf774693e2395ae8fe4a7a2f416a7aeb03a)
    (cherry picked from commit 09784db62fcd01124a101c4c69cab6e71e1ac781)
    (cherry picked from commit 79df36fecf8c8be5ae9d59397882ac844852043e)
    (cherry picked from commit e3bb6119cf2d0a503768979312aea4d10cf85cda)

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/760987
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova stein-eol

This issue was fixed in the openstack/nova stein-eol release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/760985
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.
