amqp1 driver does not heartbeat when run under mod_wsgi

Bug #1941652 reported by John Eckersberg
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: John Eckersberg

Bug Description

This is exactly the same scenario as https://review.opendev.org/c/openstack/oslo.messaging/+/663074, except with the amqp1 driver instead of rabbit.

Because threading.Thread is monkeypatched to be an eventlet greenthread instead of a proper pthread, the driver's eventloop thread gets parked by the mod_wsgi worker whenever no active request is being processed. If the worker does not service an API request for longer than the heartbeat timeout, the connection dies and has to be re-established for the next request. There is also a second, related yet distinct issue (more below) that I will lump in here since it is affected by the same code paths.
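
For illustration, here is a tiny standalone snippet (not oslo.messaging code, assuming eventlet is importable) showing what the monkeypatching does: once monkey_patch() has run, threading hands out eventlet green threads, which only execute when the wsgi worker yields to the eventlet hub.

    import eventlet
    eventlet.monkey_patch()

    import threading
    from eventlet import patcher

    # thread primitives are now eventlet's green versions
    print(patcher.is_monkey_patched('thread'))   # True
    # threading.Thread is the green Thread class, not the stdlib pthread wrapper
    print(threading.Thread)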

Here's what happens in the logs:

2021-08-25 15:22:47.675 30 DEBUG pyngus.sockets [-] Socket closed read_socket_input /usr/lib/python3.6/site-packages/pyngus/sockets.py:69
2021-08-25 15:22:47.675 30 DEBUG pyngus.connection [-] close_input read done close_input /usr/lib/python3.6/site-packages/pyngus/connection.py:589
2021-08-25 15:22:47.675 30 DEBUG pyngus.connection [-] close output write done close_output /usr/lib/python3.6/site-packages/pyngus/connection.py:639
2021-08-25 15:22:47.675 30 ERROR pyngus.connection [-] Connection failed: Condition('amqp:connection:framing-error', 'SSL Failure: Unknown error')
2021-08-25 15:22:47.676 30 INFO oslo_messaging._drivers.amqp1_driver.controller [-] Delaying reconnect attempt for 1 seconds
2021-08-25 15:22:47.681 30 DEBUG nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Action: 'create', calling method: <bound method ServersController.create of <nova.api.openstack.compute.servers.ServersController object at 0x7f36f008dcc0>>, body: {"server": {"name": "tempest-TestServerMultinode-server-1696971020", "imageRef": "6721bfe9-1a61-4929-8310-54567bc76ad9", "flavorRef": "6442770a-92e6-40c6-b79b-e5a42cd61ddd", "availability_zone": "nova:compute-1.alejandro.ftw", "networks": [{"uuid": "69d27ce5-e51b-4637-ba56-ec3ba7ec9693"}]}} _process_stack /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:511

The excerpt above is from when the worker wakes up from its extended nap. The first thing to get scheduled by the eventlet event loop notices that the underlying socket for the pyngus connection has died, because the router had already closed the other end on heartbeat timeout:

2021-08-25 15:21:07.279018 +0000 SERVER (info) [C37] Connection from 172.17.0.185:36200 (to 0.0.0.0:31459) failed: amqp:resource-limit-exceeded local-idle-timeout expired
2021-08-25 15:21:07.280411 +0000 ROUTER (info) [C37][L192] Link lost: del=0 presett=0 psdrop=0 acc=0 rej=0 rel=0 mod=0 delay1=0 delay10=0

Back in the first excerpt, after the connection is already dead, we see the actual WSGI request that caused the wakeup in the first place (req-790df3b4-1db4-4203-b5a1-2409e5da5fe0).

The next thing to happen is the separate-yet-related bug that I mentioned previously:

2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller [-] Error processing task: 'NoneType' object has no attribute 'resolve': AttributeError: 'NoneType' object has no attribute 'resolve'
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller Traceback (most recent call last):
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 1086, in _process_tasks
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self._tasks.get(False)._execute(self)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 122, in _execute
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller controller.send(self)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 987, in send
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self.reply_link, self.addresser)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqp1_driver/controller.py", line 343, in attach
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller self._address = addresser.resolve(self._destination, self._service)
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller AttributeError: 'NoneType' object has no attribute 'resolve'
2021-08-25 15:22:48.479 30 ERROR oslo_messaging._drivers.amqp1_driver.controller

Because the connection is currently down, Controller.addresser is set to None (that is the first thing Controller._handle_connection_loss does). The send path then passes that None value into the amqp1 eventloop thread, which subsequently crashes as shown above.
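
To make that sequence concrete, here is a toy reproduction (ToyController/ToySender are invented stand-ins; only the resolve(destination, service) call mirrors the real call site in Sender.attach):

    class ToySender:
        def attach(self, reply_link, addresser):
            # like Sender.attach(): resolve the destination through the addresser
            return addresser.resolve("conductor", "rpc")


    class ToyController:
        def __init__(self):
            self.addresser = object()   # stand-in for a live Addresser
            self.reply_link = object()  # still "active", so send() goes ahead

        def _handle_connection_loss(self):
            self.addresser = None       # first thing the real method does

        def send(self, task):
            # the real send() builds a Sender and attaches it straight away
            ToySender().attach(self.reply_link, self.addresser)


    c = ToyController()
    c._handle_connection_loss()  # connection just died
    c.send("queued task")        # AttributeError: 'NoneType' object has no attribute 'resolve'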

At this point that API request is hopelessly lost. Later we see it show up in the logs again when it finally fails completely:

2021-08-25 15:23:19.423 30 WARNING oslo_messaging._drivers.amqp1_driver.controller [-] RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Unexpected exception in API method: oslo_messaging.exceptions.MessageDeliveryFailure: RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py", line 658, in wrapped
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi [Previous line repeated 9 more times]
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/api/openstack/compute/servers.py", line 782, in create
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi **create_kwargs)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 2046, in create
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi requested_hypervisor_hostname=requested_hypervisor_hostname)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/compute/api.py", line 1584, in _create_instance
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi tags=tags)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/conductor/api.py", line 137, in schedule_and_build_instances
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi block_device_mapping, tags)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 413, in schedule_and_build_instances
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi cctxt.cast(context, 'schedule_and_build_instances', **kw)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 152, in cast
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi transport_options=self.transport_options)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi transport_options=transport_options)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 295, in wrap
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi return func(self, *args, **kws)
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_amqp1.py", line 357, in send
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi raise reply
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi oslo_messaging.exceptions.MessageDeliveryFailure: RPC Cast message sent to <Target topic=conductor, namespace=compute_task, version=1.17> failed: timed out
2021-08-25 15:23:19.424 30 ERROR nova.api.openstack.wsgi
2021-08-25 15:23:19.428 30 INFO nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessageDeliveryFailure'>
2021-08-25 15:23:19.430 30 DEBUG nova.api.openstack.wsgi [req-790df3b4-1db4-4203-b5a1-2409e5da5fe0 5bcfa66bc22d4c16a5d6499a51a0041e 1552dc859dbf4b9f9bb78c2e83aa928d - default default] Returning 500 to user: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessageDeliveryFailure'> __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/wsgi.py:928

So there are probably two points of action here:

1. Be like the rabbit driver and have the amqp1 eventloop (optionally?) run in a real pthread so it can heartbeat as needed; see the sketch after this list. I need to look into the detailed mechanics a bit more, but that is the general idea.
2. Be more careful when passing Controller.addresser into the amqp1 eventloop thread so we don't crash it unnecessarily when it is None.
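
A minimal sketch of what item 1 could look like, assuming eventlet is available. eventlet.patcher.original() really does return the unpatched stdlib module, but start_native_heartbeat_loop() and its run_loop argument are invented names for illustration, not the driver's API.

    from eventlet import patcher

    # the original, un-green threading module, unaffected by monkeypatching
    _real_threading = patcher.original('threading')


    def start_native_heartbeat_loop(run_loop):
        """Run run_loop() in an OS-level thread so AMQP 1.0 heartbeats keep
        flowing while the wsgi greenthreads are parked between requests."""
        t = _real_threading.Thread(target=run_loop, name='amqp1-eventloop')
        t.daemon = True
        t.start()
        return t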

Changed in oslo.messaging:
status: New → In Progress
Changed in oslo.messaging:
assignee: nobody → John Eckersberg (jeckersb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/817080
Committed: https://opendev.org/openstack/oslo.messaging/commit/02a38f507d8f0c377a2ef468e3497b5e897f1b09
Submitter: "Zuul (22348)"
Branch: master

commit 02a38f507d8f0c377a2ef468e3497b5e897f1b09
Author: John Eckersberg <email address hidden>
Date: Mon Nov 8 15:19:45 2021 -0500

    amqp1: fix race when reconnecting

    Currently this is how reconnect works:

    - pyngus detects failure and invokes callback
      Controller.connection_failed() which in turn calls
      Controller._handle_connection_loss()

    - The first thing that _handle_connection_loss does is to set
      self.addresser to None (important later)

    - Then it defers _do_reconnect after a delay (normally 1 second)

    - (1 second passes)

    - _do_reconnect calls _hard_reset which resets the controller state

    However, there is a race here. This can happen:

    - The above, up until it defers and waits for 1 second

    - Controller.send() is invoked on a task

    - A new Sender is created, and critically because self.reply_link
      still exists and is active, we call sender.attach and pass in
      self.addresser. Remember _handle_connection_loss sets
      self.addresser to None.

    - Eventually Sender.attach throws an AttributeError because it
      attempts to call addresser.resolve() but addresser is None

    The reason this happens is because although the connection is dead,
    the controller state is still half-alive because _hard_reset hasn't
    been called yet since it's deferred one second in _do_reconnect.

    The fix here is to move _hard_reset out of _do_reconnect and directly
    into _handle_connection_loss. The eventloop is woken up immediately
    to process _hard_reset but _do_reconnect is still deferred as before
    so as to retain the desired reconnect backoff behavior.

    Closes-Bug: #1941652
    Change-Id: Ife62a7d76022908f0dc6a77f1ad607cb2fbd3e8f
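
Roughly, the reorder described above amounts to the following sketch; the Processor stub and every helper name here are assumptions for illustration, not the actual oslo.messaging code.

    import threading


    class Processor:
        """Stand-in for the driver eventloop: run callbacks now or later."""
        def wakeup(self, cb):
            cb()                                # run immediately on "the eventloop"

        def defer(self, cb, delay):
            threading.Timer(delay, cb).start()  # run after the backoff delay


    class Controller:
        def __init__(self):
            self.processor = Processor()
            self.addresser = object()

        def _handle_connection_loss(self, reason):
            self.addresser = None
            # Fix: reset the half-alive controller state right away...
            self.processor.wakeup(self._hard_reset)
            # ...while the reconnect itself stays deferred, keeping the backoff.
            self.processor.defer(self._do_reconnect, delay=1)

        def _hard_reset(self):
            print("controller state reset immediately")

        def _do_reconnect(self):
            # Before the fix, _hard_reset() was called here, leaving a window
            # in which send() could still attach with addresser=None.
            print("reconnecting after the backoff")


    Controller()._handle_connection_loss("socket closed")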

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/821764

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.11.1

This issue was fixed in the openstack/oslo.messaging 12.11.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/821764
Committed: https://opendev.org/openstack/oslo.messaging/commit/d57eccd862c1a008f9b1f868c0832e938a75415e
Submitter: "Zuul (22348)"
Branch: stable/xena

commit d57eccd862c1a008f9b1f868c0832e938a75415e
Author: John Eckersberg <email address hidden>
Date: Mon Nov 8 15:19:45 2021 -0500

    amqp1: fix race when reconnecting

    Currently this is how reconnect works:

    - pyngus detects failure and invokes callback
      Controller.connection_failed() which in turn calls
      Controller._handle_connection_loss()

    - The first thing that _handle_connection_loss does is to set
      self.addresser to None (important later)

    - Then it defers _do_reconnect after a delay (normally 1 second)

    - (1 second passes)

    - _do_reconnect calls _hard_reset which resets the controller state

    However, there is a race here. This can happen:

    - The above, up until it defers and waits for 1 second

    - Controller.send() is invoked on a task

    - A new Sender is created, and critically because self.reply_link
      still exists and is active, we call sender.attach and pass in
      self.addresser. Remember _handle_connection_loss sets
      self.addresser to None.

    - Eventually Sender.attach throws an AttributeError because it
      attempts to call addresser.resolve() but addresser is None

    The reason this happens is because although the connection is dead,
    the controller state is still half-alive because _hard_reset hasn't
    been called yet since it's deferred one second in _do_reconnect.

    The fix here is to move _hard_reset out of _do_reconnect and directly
    into _handle_connection_loss. The eventloop is woken up immediately
    to process _hard_reset but _do_reconnect is still deferred as before
    so as to retain the desired reconnect backoff behavior.

    Closes-Bug: #1941652
    Change-Id: Ife62a7d76022908f0dc6a77f1ad607cb2fbd3e8f
    (cherry picked from commit 02a38f507d8f0c377a2ef468e3497b5e897f1b09)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/839530

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 12.9.4

This issue was fixed in the openstack/oslo.messaging 12.9.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/wallaby)

Change abandoned by "Daniel Bengtsson <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/839530
Reason: The wallaby release is no longer maintained, so we can abandon that patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by "John Eckersberg <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/810510
Reason: amqp1 driver is deprecated, this patch doesn't work as-is, and I have no time/priority to fix it.
