archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'

Bug #1899541 reported by melanie witt on 2020-10-12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
melanie witt

Bug Description

This is based on a bug reported downstream [1] where after a random amount of time, update_available_resource began to fail with the following trace on nodes with PCI devices:

  "traceback": [
    "Traceback (most recent call last):",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/\", line 7447, in update_available_resource_for_node",
    " rt.update_available_resource(context, nodename)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/\", line 706, in update_available_resource",
    " self._update_available_resource(context, resources)",
    " File \"/usr/lib/python2.7/site-packages/oslo_concurrency/\", line 274, in inner",
    " return f(*args, **kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/\", line 782, in _update_available_resource",
    " self._update(context, cn)",
    " File \"/usr/lib/python2.7/site-packages/nova/compute/\", line 926, in _update",
    " File \"/usr/lib/python2.7/site-packages/nova/pci/\", line 92, in save",
    " File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/\", line 210, in wrapper",
    " ctxt, self, fn.__name__, args, kwargs)",
    " File \"/usr/lib/python2.7/site-packages/nova/conductor/\", line 245, in object_action",
    " objmethod=objmethod, args=args, kwargs=kwargs)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/\", line 174, in call",
    " retry=self.retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/\", line 131, in _send",
    " timeout=timeout, retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/\", line 559, in send",
    " retry=retry)",
    " File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/\", line 550, in _send",
    " raise result",
    "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at:",

Above, we see an attempt to insert a nearly empty record (almost every field NULL) into the pci_devices table. Inspection of the code shows that this can occur if we fail to look up the pci_devices record we want and then try to create a new one [2]:

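def pci_device_update(context, node_id, address, values):
    query = model_query(context, models.PciDevice, read_deleted="no").\
                    filter_by(compute_node_id=node_id).\
                    filter_by(address=address)
    if query.update(values) == 0:
        # No live row matched the node and address, so a new one is created.
        device = models.PciDevice()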
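
To make the failure mode concrete, here is a minimal, self-contained sketch of the same update-then-insert fallback using the standard library's sqlite3 instead of nova's DB API. The schema is trimmed down, and update_or_create and free_device_outcome are hypothetical names used only for illustration; this is not nova's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down, assumed schema: only the columns needed for the illustration.
conn.execute(
    "CREATE TABLE pci_devices ("
    " id INTEGER PRIMARY KEY,"
    " compute_node_id INTEGER NOT NULL,"
    " address TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " instance_uuid TEXT,"
    " deleted INTEGER NOT NULL DEFAULT 0)"
)


def update_or_create(conn, node_id, address, values):
    """Hypothetical stand-in for the update-then-insert fallback."""
    assignments = ", ".join(f"{col} = ?" for col in values)
    cur = conn.execute(
        f"UPDATE pci_devices SET {assignments} "
        "WHERE compute_node_id = ? AND address = ? AND deleted = 0",
        [*values.values(), node_id, address],
    )
    if cur.rowcount == 0:
        # Nothing matched: insert only the supplied values, leaving every
        # other column (including compute_node_id) NULL.
        cols = ", ".join(values)
        marks = ", ".join("?" for _ in values)
        conn.execute(
            f"INSERT INTO pci_devices ({cols}) VALUES ({marks})",
            list(values.values()),
        )


def free_device_outcome(conn, node_id, address):
    """Try to free the device and report what happened as a string."""
    freed = {"status": "available", "instance_uuid": None}
    try:
        update_or_create(conn, node_id, address, freed)
    except sqlite3.IntegrityError as exc:
        return f"failed: {exc}"
    return "updated"


conn.execute(
    "INSERT INTO pci_devices (compute_node_id, address, status, instance_uuid)"
    " VALUES (1, '0000:81:00.0', 'allocated', 'uuid-1')"
)
# Normal case: the row exists, so the UPDATE matches and frees the device.
first = free_device_outcome(conn, 1, "0000:81:00.0")

# Simulate the record having been swept away before the device is freed.
conn.execute("DELETE FROM pci_devices")
second = free_device_outcome(conn, 1, "0000:81:00.0")
```

In this sketch the first call updates the existing row, while the second finds nothing to update, falls through to the insert, and trips the NOT NULL constraint, which mirrors the shape of the IntegrityError in the traceback above.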

It turned out that when a request came in to delete an instance that had a PCI device allocated, and the archive_deleted_rows cron job fired at just the right (wrong) moment, the job would sweep away the pci_devices record matching the instance_uuid, because archive treats any table with an 'instance_uuid' column as instance "residue" needing cleanup.

So after the pci_devices record was swept away, we tried to update the resource tracker as part of the _complete_deletion method in the compute manager and that failed because we could not locate the pci_devices record to free the PCI device (null out the instance_uuid field).

What we need to do here is not to treat the pci_devices table records as instance residue. The records in pci_devices are not tied to instance lifecycles at all and they are managed independently by the PCI trackers.
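
As a rough illustration of that idea (not the actual nova change), the sketch below finds every table with an 'instance_uuid' column and cleans up rows for deleted instances, except for tables on an exclusion list. It uses sqlite3 and plain deletes for simplicity, whereas the real archive code moves rows elsewhere; NOT_INSTANCE_RESIDUE, tables_with_instance_uuid, sweep_residue, and the two-table schema are assumptions made for the example.

```python
import sqlite3

# Hypothetical exclusion list: tables whose 'instance_uuid' column records an
# allocation rather than data owned by the instance.
NOT_INSTANCE_RESIDUE = {"pci_devices"}


def tables_with_instance_uuid(conn):
    """Return the names of tables that have an 'instance_uuid' column."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    found = []
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is its name.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]
        if "instance_uuid" in columns:
            found.append(name)
    return found


def sweep_residue(conn, deleted_instance_uuids, skip=NOT_INSTANCE_RESIDUE):
    """Delete rows tied to deleted instances; return rows removed per table."""
    removed = {}
    marks = ", ".join("?" for _ in deleted_instance_uuids)
    for table in tables_with_instance_uuid(conn):
        if table in skip:
            # Managed independently of instance lifecycles: leave it alone.
            continue
        cur = conn.execute(
            f"DELETE FROM {table} WHERE instance_uuid IN ({marks})",
            list(deleted_instance_uuids),
        )
        removed[table] = cur.rowcount
    return removed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_faults (id INTEGER PRIMARY KEY, instance_uuid TEXT)")
conn.execute(
    "CREATE TABLE pci_devices"
    " (id INTEGER PRIMARY KEY, address TEXT, instance_uuid TEXT)")
conn.execute("INSERT INTO instance_faults (instance_uuid) VALUES ('uuid-1')")
conn.execute(
    "INSERT INTO pci_devices (address, instance_uuid)"
    " VALUES ('0000:81:00.0', 'uuid-1')")

removed = sweep_residue(conn, ["uuid-1"])
pci_rows_left = conn.execute("SELECT COUNT(*) FROM pci_devices").fetchone()[0]
```

With the exclusion in place, the instance-owned row is cleaned up while the pci_devices row survives, so a later attempt to free the device would still find its record. Passing skip=set() reproduces the problematic behavior described in this report.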


Fix proposed to branch: master

Changed in nova:
status: New → In Progress

Submitter: Zuul
Branch: master

commit 1c256cf774693e2395ae8fe4a7a2f416a7aeb03a
Author: melanie witt <email address hidden>
Date: Mon Oct 12 22:27:52 2020 +0000

    Prevent archiving of pci_devices records because of 'instance_uuid'

    Currently in the archive_deleted_rows code, we will attempt to clean up
    "residue" of deleted instance records by assuming any table with a
    'instance_uuid' column represents data tied to an instance's lifecycle
    and delete such records.

    This behavior poses a problem in the case where an instance has a PCI
    device allocated and someone deletes the instance. The 'instance_uuid'
    column in the pci_devices table is used to track the allocation
    association of a PCI device with an instance. There is a small time window
    during which the instance record has been deleted but the PCI device
    has not yet been freed from a database record perspective as PCI
    devices are freed during the _complete_deletion method in the compute
    manager as part of the resource tracker update call.

    Records in the pci_devices table are anyway not related to the
    lifecycle of instances so they should not be considered residue to
    clean up if an instance is deleted. This adds a condition to avoid
    archiving pci_devices on the basis of an instance association.

    Closes-Bug: #1899541

    Change-Id: Ie62d3566230aa3e2786d129adbb2e3570b06e4c6

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 22.2.0 release.

This issue was fixed in the openstack/nova release candidate.
