archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'

Bug #1899541 reported by melanie witt on 2020-10-12
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
melanie witt
Queens
Undecided
melanie witt
Rocky
Undecided
melanie witt
Stein
Undecided
melanie witt
Train
Undecided
melanie witt
Ussuri
Undecided
melanie witt
Victoria
Undecided
melanie witt

Bug Description

This is based on a bug reported downstream [1] where after a random amount of time, update_available_resource began to fail with the following trace on nodes with PCI devices:

  "traceback": [
    "Traceback (most recent call last):",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    " rt.update_available_resource(context, nodename)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    " self._update_available_resource(context, resources)",
    " File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    " return f(*args, **kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    " self._update(context, cn)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
    " self.pci_tracker.save(context)",
    " File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
    " dev.save()",
    " File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
    " ctxt, self, fn.__name__, args, kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
    " objmethod=objmethod, args=args, kwargs=kwargs)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
    " retry=self.retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
    " timeout=timeout, retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
    " retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
    " raise result",
    "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",

Here ^ we see an attempt to insert a nearly empty (NULL fields) record into the pci_devices table. Inspection of the code shows that the way this can occur is if we fail to lookup the pci_devices record we want and then we try to create a new one [2]:

@pick_context_manager_writer
def pci_device_update(context, node_id, address, values):
    query = model_query(context, models.PciDevice, read_deleted="no").\
                    filter_by(compute_node_id=node_id).\
                    filter_by(address=address)
    if query.update(values) == 0:
        device = models.PciDevice()
        device.update(values)
        context.session.add(device)
    return query.one()

Turns out what was happening was when a request came in to delete an instance that had allocated a PCI device, if the archive_deleted_rows cron job fired at just the right (wrong) moment, it would sweep away the pci_devices record matching the instance_uuid because archive is treating any table with an 'instance_uuid' column as instance "residue" needing cleanup.

So after the pci_devices record was swept away, we tried to update the resource tracker as part of the _complete_deletion method in the compute manager and that failed because we could not locate the pci_devices record to free the PCI device (null out the instance_uuid field).

What we need to do here is not to treat the pci_devices table records as instance residue. The records in pci_devices are not tied to instance lifecycles at all and they are managed independently by the PCI trackers.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
[2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409

Fix proposed to branch: master
Review: https://review.opendev.org/757656

Changed in nova:
status: New → In Progress

Reviewed: https://review.opendev.org/757656
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1c256cf774693e2395ae8fe4a7a2f416a7aeb03a
Submitter: Zuul
Branch: master

commit 1c256cf774693e2395ae8fe4a7a2f416a7aeb03a
Author: melanie witt <email address hidden>
Date: Mon Oct 12 22:27:52 2020 +0000

    Prevent archiving of pci_devices records because of 'instance_uuid'

    Currently in the archive_deleted_rows code, we will attempt to clean up
    "residue" of deleted instance records by assuming any table with a
    'instance_uuid' column represents data tied to an instance's lifecycle
    and delete such records.

    This behavior poses a problem in the case where an instance has a PCI
    device allocated and someone deletes the instance. The 'instance_uuid'
    column in the pci_devices table is used to track the allocation
    association of a PCI with an instance. There is a small time window
    during which the instance record has been deleted but the PCI device
    has not yet been freed from a database record perspective as PCI
    devices are freed during the _complete_deletion method in the compute
    manager as part of the resource tracker update call.

    Records in the pci_devices table are anyway not related to the
    lifecycle of instances so they should not be considered residue to
    clean up if an instance is deleted. This adds a condition to avoid
    archiving pci_devices on the basis of an instance association.

    Closes-Bug: #1899541

    Change-Id: Ie62d3566230aa3e2786d129adbb2e3570b06e4c6

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 22.2.0 release.

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers