Comment 1 for bug 1946339

Balazs Gibizer (balazs-gibizer) wrote : Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

It seems this happens pretty frequently on the gate[1]. I ran the whole functional suite locally for days with random ordering but was not able to reproduce it, so it is really hard to figure out what is happening.

The first exception, nova.exception.UnexpectedResourceProviderNameForPCIRequest, is expected; it is part of the test case. Then the logs show that we reverted the allocation in placement as the VM creation failed due to the first exception.

Then I see the following in the stack trace:

LOG.warning("Failed to revert task state for instance.

So we end up at [2]. Then the compute thread seems to end. Then the parallel conductor thread times out in self.conductor_compute_rpcapi.build_instances:

File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor
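
To make the shape of that timeout concrete, here is a minimal plain-Python sketch (just a queue and a thread, not oslo.messaging or nova code, and the names are made up): one side blocks waiting for a reply, and if the thread that should answer ends without ever posting one, the wait simply times out.

import queue
import threading

reply_queue = queue.Queue()

def never_replies():
    # Stand-in for the thread that should answer the call: it bails out
    # (e.g. after an unhandled error) without ever posting a reply.
    return

threading.Thread(target=never_replies).start()

try:
    reply_queue.get(timeout=1)
except queue.Empty:
    # With oslo.messaging's fake driver this is the point where
    # MessagingTimeout ("No reply on topic conductor") surfaces instead.
    print("no reply within the timeout")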

But it seems that at that point the DB was already deleted by the test env, leading to errors like:

sqlite3.OperationalError: no such table: instance_faults

I don't see how this can be timing related, but other than some kind of race condition between the compute and the conductor I have no other idea.
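
To illustrate the DB side of that suspicion, here is a small self-contained sketch (plain sqlite3 and threading, not the nova test fixtures, and the table/thread names are only for illustration) of how a thread that outlives the test can hit exactly this kind of error: the teardown drops the schema while the "compute" thread is still busy, and its late query then fails with "no such table".

import sqlite3
import threading
import time

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE instance_faults (id INTEGER, message TEXT)")

def late_compute_work():
    # The compute thread finishes its error handling slowly, after the
    # test has already moved on to cleanup.
    time.sleep(0.2)
    try:
        conn.execute("INSERT INTO instance_faults VALUES (1, 'boom')")
    except sqlite3.OperationalError as exc:
        print("late compute thread hit:", exc)

worker = threading.Thread(target=late_compute_work)
worker.start()

# Test teardown removes the schema while the worker is still running.
conn.execute("DROP TABLE instance_faults")

worker.join()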

[1] https://paste.opendev.org/show/809926/
[2] https://github.com/openstack/nova/blob/fdfdba265833d237e22676f9a223ab8ca0fe1e03/nova/compute/manager.py#L183