Comment 6 for bug 1946339

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ): Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

I managed to create a stable reproduction locally. \o/

1) duplicate the failing tests

diff --git a/nova/tests/functional/test_servers_resource_request.py b/nova/tests/functional/test_servers_resource_request.py
index e2746b3669..d678067c18 100644
--- a/nova/tests/functional/test_servers_resource_request.py
+++ b/nova/tests/functional/test_servers_resource_request.py
@@ -2572,6 +2572,18 @@ class ServerMoveWithPortResourceRequestTest(
         self._delete_server_and_check_allocations(
             server, qos_normal_port, qos_sriov_port)

+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails0(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails1(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails2(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails3(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+

2) run the tests with: tox -e functional-py38 "test_description_errors|test_unshelve_offloaded_server_with_qos_port_pci_update_fails" -- --serial

This makes one of the test_unshelve_offloaded_server_with_qos_port_pci_update_fails test cases fail with the reported error.

What I know so far:
* nova.tests.functional.test_servers.ServersTestV219.test_description_errors() starts a new instance but does not wait for it to become ACTIVE. The test case passes and finishes.

* But the build_and_run_instance RPC call is still running in the compute service, in a greenlet that is building the instance. The service.kill() at the end of the test case does not kill the running / waiting greenlets. I proved this by dumping a guru meditation report (GMR) at the end of the test run, after service.kill() was called by the Fixture.cleanup: the build_and_run_instance greenlet was still visible there.
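
The greenlet-survival behavior above can be sketched with a toy model. This is only an analogy: plain OS threads stand in for eventlet greenlets, and the hypothetical MiniService stands in for the nova compute service. The point is just that kill() only flags the service as dead; work already spawned keeps running.

```python
import threading
import time

class MiniService:
    """Toy stand-in for a compute service: kill() stops the service,
    but it does not cancel workers that were already spawned."""

    def __init__(self):
        self.alive = True
        self.workers = []

    def spawn(self, fn):
        # Analogue of the service spawning a greenlet for an RPC call.
        t = threading.Thread(target=fn, daemon=True)
        self.workers.append(t)
        t.start()

    def kill(self):
        # Only marks the service dead; in-flight workers keep running.
        self.alive = False

events = []

def build_and_run_instance():
    time.sleep(0.2)  # simulate the long-running instance build
    events.append("worker still ran after kill()")

svc = MiniService()
svc.spawn(build_and_run_instance)
svc.kill()       # the "test case" finishes here
time.sleep(0.5)  # later tests would be running by now
print(events)    # -> ["worker still ran after kill()"]
```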

* Meanwhile, other fixture cleanups drop the database out from under the still-running compute service, which leads to a ComputeNodeNotFound error. You can see this by simply adding a self.fail() at the end of the test_description_errors test case and running it: ComputeNodeNotFound will appear in the logs.
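
The cleanup ordering in this bullet can be mimicked with stdlib sqlite3 (purely an analogy; nova uses its own DB fixtures): a query issued after the schema has been torn down fails, the same shape as the compute service's node lookup raising ComputeNodeNotFound.

```python
import sqlite3

# A "compute service" holds a connection and has data it expects to find.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (id INTEGER)")
conn.execute("INSERT INTO compute_nodes VALUES (1)")

# Fixture cleanup runs: the schema is dropped behind the worker's back.
conn.execute("DROP TABLE compute_nodes")

# The leaked worker queries the DB afterwards and blows up,
# analogous to the ComputeNodeNotFound seen in the logs.
try:
    conn.execute("SELECT id FROM compute_nodes").fetchall()
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = f"lookup failed: {exc}"
print(outcome)  # -> lookup failed: no such table: compute_nodes
```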

* After test_description_errors passes, the test executor moves on to the next test cases.

* After ~60 seconds an RPC timeout fires in the leaked greenlet from test_description_errors(), and it *somehow* interferes with the currently running test, making that test fail.
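
One generic way a leaked worker can fail a later test is through shared process-global state. The sketch below (stdlib threads, hypothetical names) shows that shape; it is only an illustration, not a confirmed account of the actual interference mechanism here.

```python
import threading
import time

log = []  # shared state that both "tests" touch, e.g. captured logs

def leaked_worker():
    # Worker leaked by test 1: it outlives its test, then hits its
    # timeout and writes into shared state while test 2 is running.
    time.sleep(0.2)  # stands in for the ~60 second RPC timeout
    log.append("ERROR from test 1's leaked worker")

def test_one():
    threading.Thread(target=leaked_worker, daemon=True).start()
    # test 1 "passes" without ever waiting for the worker

def test_two():
    log.clear()       # test 2 starts with what it thinks is clean state
    time.sleep(0.4)   # the leaked worker fires during this window
    return list(log)  # test 2 now sees an error it never produced

test_one()
polluted = test_two()
print(polluted)  # -> ["ERROR from test 1's leaked worker"]
```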

What I don't know yet is the exact mechanism of the interference.