This is exactly the same scenario as https://review.opendev.org/c/openstack/oslo.messaging/+/663074, except with the amqp1 driver instead of rabbit.
Because threading.Thread is monkeypatched to be an eventlet greenthread instead of a proper pthread, the heartbeat thread gets parked by the mod_wsgi worker whenever no request is actively being processed. If the worker goes without servicing an api request for longer than the heartbeat timeout, the connection dies and has to be re-established for the next request. There is also a second, related but distinct issue (more below) that I will lump in here, since it is all affected by the same code paths.
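To illustrate the failure mode, here is a stdlib-only toy (not oslo.messaging or eventlet code): a greenthread is cooperative, so a "heartbeat" modeled as a generator only makes progress when something pumps the hub. When the worker goes idle and nothing pumps, wall-clock time passes but no heartbeats fire, so the peer's idle timeout expires.

```python
import time

def heartbeat(state):
    # Stands in for the green heartbeat thread: it only makes
    # progress when the hub resumes it.
    while True:
        state["beats"] += 1
        yield  # hand control back to the "hub"

state = {"beats": 0}
hb = heartbeat(state)

# While requests are flowing, the hub gets pumped and heartbeats fire:
for _ in range(3):
    next(hb)                 # hub resumes the heartbeat greenthread
assert state["beats"] == 3

# Worker goes idle: nothing pumps the hub, so no matter how much
# wall-clock time passes, the heartbeat never runs again.
time.sleep(0.05)             # simulate the idle period
assert state["beats"] == 3   # frozen; the peer's idle timeout will expire
```

A real pthread would not have this problem, because the OS scheduler, not the hub, decides when it runs.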
Here's what happens in the logs:
2021-08-25 15:22:47.675 30 DEBUG pyngus.sockets [-] Socket closed read_socket_input /usr/lib/python3.6/site-packages/pyngus/sockets.py:69
2021-08-25 15:22:47.675 30 DEBUG pyngus.connection [-] close_input read done close_input /usr/lib/python3.6/site-packages/pyngus/connection.py:589
2021-08-25 15:22:47.675 30 DEBUG pyngus.connection [-] close output write done close_output /usr/lib/python3.6/site-packages/pyngus/connection.py:639
2021-08-25 15:22:47.675 30 ERROR pyngus.connection [-] Connection failed: Condition('amqp:connection:framing-error', 'SSL Failure: Unknown error')
2021-08-25 15:22:47.676 30 INFO oslo_messaging._drivers.amqp1_driver.controller [-] Delaying reconnect attempt for 1 seconds
2021-08-25 15:22:47.681 30 DEBUG nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Action: 'create', calling method: <bound method ServersController.create of <nova.api.openstack.compute.servers.ServersController object at 0x7f36f008dcc0>>, body: {"server": {"name": "tempest-TestServerMultinode-server-1696971020", "imageRef": "6721bfe9-1a61-4929-8310-54567bc76ad9", "flavorRef": "6442770a-92e6-40c6-b79b-e5a42cd61ddd", "availability_zone": "nova:compute-1.alejandro.ftw", "networks": [{"uuid": "69d27ce5-e51b-4637-ba56-ec3ba7ec9693"}]}} _process_stack /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:511
The part above is from when the worker wakes up from its extended nap. The first thing scheduled by the eventlet event loop notices that the underlying socket for the pyngus connection has died, because the router closed the other end on heartbeat timeout:
2021-08-25 15:21:07.279018 +0000 SERVER (info) [C37] Connection from 172.17.0.185:36200 (to 0.0.0.0:31459) failed: amqp:resource-limit-exceeded local-idle-timeout expired
2021-08-25 15:21:07.280411 +0000 ROUTER (info) [C37][L192] Link lost: del=0 presett=0 psdrop=0 acc=0 rej=0 rel=0 mod=0 delay1=0 delay10=0
After the connection death above, we see the actual wsgi request that caused the wakeup in the first place (req-790df3b4-1db4-4203-b5a1-2409e5da5fe0).
The next thing to happen is the separate-yet-related bug that I mentioned previously:
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller [-] Error processing task: 'NoneType' object has no attribute 'resolve': AttributeError: 'NoneType' object has no attribute 'resolve'
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller Traceback (most recent call last):
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 1086, in _process_tasks
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self._tasks.get(False)._execute(self)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 122, in _execute
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller controller.send(self)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 987, in send
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self.reply_link, self.addresser)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 343, in attach
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self._address = addresser.resolve(self._destination, self._service)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller AttributeError: 'NoneType' object has no attribute 'resolve'
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller
Because the connection is down at this point, Controller.addresser is set to None (that is the first thing Controller._handle_connection_loss does). That None value is then passed into the amqp1 eventloop thread, which subsequently crashes as above.
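A minimal sketch of this second bug (hypothetical class and method shapes, not the actual controller.py code): the connection-loss handler nulls out the addresser, but a task already queued for the eventloop thread still dereferences it. Guarding the attach path so the task can be retried after reconnect avoids crashing the task processor:

```python
class SendTask:
    """Toy stand-in for a queued RPC send task."""
    def __init__(self, destination, service):
        self._destination = destination
        self._service = service
        self._address = None

    def attach(self, addresser):
        # The buggy path crashed here when addresser was None:
        #   self._address = addresser.resolve(...)  -> AttributeError
        if addresser is None:
            return False  # connection down; caller should requeue/retry
        self._address = addresser.resolve(self._destination, self._service)
        return True


class Addresser:
    """Toy addresser; the real one maps targets to AMQP addresses."""
    def resolve(self, destination, service):
        return "amq.topic/%s/%s" % (service, destination)


task = SendTask("conductor", "rpc")
assert task.attach(None) is False           # no crash mid-reconnect
assert task.attach(Addresser()) is True     # succeeds once reconnected
assert task._address == "amq.topic/rpc/conductor"
```

The real fix needs to decide what to do with the deferred task (requeue vs. fail fast), but the key point is that a None addresser must not reach the resolve call.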
At this point the api request is hopelessly lost. It shows up in the logs again later, when it finally fails completely:
2021-08-25 15:23:19.423 30 WARNING oslo_messaging._drivers.amqp1_driver.controller [-] RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Unexpected exception in API method: oslo_messaging.exceptions.MessageDeliveryFailure: RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py", line 658, in wrapped
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi [Previous line repeated 9 more times]
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/openstack/compute/servers.py", line 782, in create
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi **create_kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 2046, in create
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi requested_hypervisor_hostname=requested_hypervisor_hostname)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 1584, in _create_instance
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi tags=tags)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/conductor/api.py", line 137, in schedule_and_build_instances
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi block_device_mapping, tags)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 413, in schedule_and_build_instances
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi cctxt.cast(context, 'schedule_and_build_instances', **kw)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 152, in cast
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi transport_options=self.transport_options)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi transport_options=transport_options)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 295, in wrap
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(self, *args, **kws)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 357, in send
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi raise reply
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi oslo_messaging.exceptions.MessageDeliveryFailure: RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi
2021-08-25 15:23:19.428 30 INFO nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessageDeliveryFailure'>
2021-08-25 15:23:19.430 30 DEBUG nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Returning 500 to user: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessageDeliveryFailure'> __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:928
So there are probably two points of action here:
1. Be like the rabbit driver and have the amqp1 eventloop (optionally?) run in a real pthread so it can heartbeat as necessary. I need to look into the detailed mechanics a bit more, but that's the general idea.
2. Be more careful when passing Controller.addresser into the amqp1 eventloop thread so we don't crash it unnecessarily when it's None.
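As a stdlib-only sketch of point 1 (an assumed design, not the proposed patch): if the heartbeat loop runs in a real OS thread, it keeps firing even while the interpreter's main thread naps or grinds through a request, which is exactly what a monkeypatched greenthread cannot do:

```python
import threading
import time

beats = 0
stop = threading.Event()

def heartbeat(interval=0.01):
    # A real pthread: scheduled by the OS, not by eventlet's hub.
    global beats
    while not stop.is_set():
        beats += 1            # stands in for sending an AMQP heartbeat frame
        time.sleep(interval)

t = threading.Thread(target=heartbeat, daemon=True)
t.start()

# The main thread "naps" (or is parked by mod_wsgi) for a while;
# the OS still schedules the heartbeat thread underneath it.
time.sleep(0.2)
stop.set()
t.join()
assert beats >= 2   # heartbeats kept flowing during the nap
```

The mechanics of mixing a native thread with an eventlet-patched process need care (e.g. using the unpatched threading primitives), which is part of what needs looking into.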
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/817080