There is a race where nova-conductor will delete the instance mapping while nova-api is trying to update the queued_for_delete field for the instance mapping record. When that happens, nova-conductor deletes the instance mapping after nova-api has retrieved it for the intended update, and then nova-api fails with StaleDataError when it tries to save the instance mapping record to the database. We see the following log in screen-n-cond.txt[1]:
Jun 01 14:33:57.487787 ubuntu-bionic-rax-iad-0016890725 nova-conductor[14142]: DEBUG nova.conductor.manager [None req-e73643cb-efb2-445d-a6dc-5c6fd956c989 tempest-ServersNegativeTestJSON-1435542876 tempest-ServersNegativeTestJSON-1435542876] [instance: d87b9767-d6ac-4c23-ad5b-d1fd139f1662] While scheduling instance, the build request was already deleted. {{(pid=15387) schedule_and_build_instances /opt/stack/new/nova/nova/conductor/manager.py:1515}}
which triggers nova-conductor to delete the instance mapping [2]. Then we fail in the delete path while trying to update queued_for_delete [3].
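To make the interleaving concrete, here is a standalone SQLAlchemy sketch (illustrative only, not nova code; the model is a stand-in for the real instance_mappings table, and it assumes SQLAlchemy 1.4+). One session plays nova-api and another plays nova-conductor: the conductor's delete lands between the api's read and write, so the api's flush emits an UPDATE that matches zero rows and SQLAlchemy raises StaleDataError:

    from sqlalchemy import Boolean, Column, Integer, create_engine
    from sqlalchemy.orm import Session, declarative_base
    from sqlalchemy.orm.exc import StaleDataError

    Base = declarative_base()


    class InstanceMapping(Base):
        __tablename__ = 'instance_mappings'
        id = Column(Integer, primary_key=True)
        queued_for_delete = Column(Boolean, default=False)


    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)

    with Session(engine) as setup:
        setup.add(InstanceMapping(id=1))
        setup.commit()

    # "nova-api" reads the mapping record, intending to update it later.
    api = Session(engine, expire_on_commit=False)
    mapping = api.get(InstanceMapping, 1)
    api.commit()

    # "nova-conductor" deletes the mapping in between.
    with Session(engine) as conductor:
        conductor.delete(conductor.get(InstanceMapping, 1))
        conductor.commit()

    # nova-api now saves its update: the UPDATE matches 0 rows, so the
    # flush raises StaleDataError ("expected to update 1 row(s); 0 were
    # matched").
    mapping.queued_for_delete = True
    try:
        api.commit()
    except StaleDataError as exc:
        print(exc)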
I think we could fix this with a try-except to catch StaleDataError and then raise InstanceMappingNotFound to treat it as a missing instance mapping.
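A minimal sketch of that idea, assuming the translation happens where nova-api saves the mapping (the wrapper name is hypothetical, and the uuid kwarg is assumed from InstanceMappingNotFound's message format; the real patch would go in the save path itself):

    from sqlalchemy.orm import exc as orm_exc

    from nova import exception


    def save_mapping_translating_stale(instance_mapping):
        # Hypothetical wrapper around InstanceMapping.save(): if
        # nova-conductor deleted the mapping row between our read and this
        # write, the flush raises StaleDataError, which we re-raise as
        # InstanceMappingNotFound so callers handle it like any other
        # missing mapping.
        try:
            instance_mapping.save()
        except orm_exc.StaleDataError:
            raise exception.InstanceMappingNotFound(
                uuid=instance_mapping.instance_uuid)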
I found that this happens when a server is requested to be deleted while it's in the middle of booting, as seen in the ServersNegativeTestJSON code:
    @classmethod
    def resource_setup(cls):
        super(ServersNegativeTestJSON, cls).resource_setup()
        server = cls.create_test_server(wait_until='ACTIVE')
        cls.server_id = server['id']

        server = cls.create_test_server()
        cls.client.delete_server(server['id'])
        waiters.wait_for_server_termination(cls.client, server['id'])
        cls.deleted_server_id = server['id']
[1] https://zuul.opendev.org/t/openstack/build/58647aab9847469cb1dc474a7e7a1e6d/log/logs/screen-n-cond.txt#2379
[2] https://github.com/openstack/nova/blob/2061ce1125039f3595999457da3a6ad3c202ea2a/nova/conductor/manager.py#L1514-L1524
[3] https://github.com/openstack/nova/blob/2061ce1125039f3595999457da3a6ad3c202ea2a/nova/compute/api.py#L2423-L2434