impl_kafka calls logging methods from tpool.execute causing deadlock

Bug #1981093 reported by Guillaume Espanel
This bug affects 5 people
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

We recently experienced an issue with nova-compute "locking up" after encountering an error while sending notifications to Kafka. It stops logging anything and stops performing tasks (rebooting an instance, for example).

The last message logged by nova-compute is:

===
2022-07-06 19:48:30.634 62228 ERROR oslo_messaging._drivers.impl_kafka [-] Produce message failed: KafkaError{code=TOPIC_AUTHORIZATION_FAILED,val=29,str="Unable to produce message: Broker: Topic authorization failed"}: cimpl.KafkaException: KafkaError{code=TOPIC_AUTHORIZATION_FAILED,val=29,str="Unable to produce message: Broker: Topic authorization failed"}
===

The guru meditation report shows many greenthreads waiting on a lock related to logging, for example:

===
...
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:1308 in info
python3[62228]: `self._log(INFO, msg, args, **kwargs)`
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:1444 in _log
python3[62228]: `self.handle(record)`
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:1454 in handle
python3[62228]: `self.callHandlers(record)`
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:1516 in callHandlers
python3[62228]: `hdlr.handle(record)`
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:863 in handle
python3[62228]: `self.acquire()`
python3[62228]: /opt/openstack/nova/lib/python3.6/logging/__init__.py:814 in acquire
python3[62228]: `self.lock.acquire()`
python3[62228]: /opt/openstack/nova/lib/python3.6/threading.py:147 in acquire
python3[62228]: `rc = self._block.acquire(blocking, timeout)`
python3[62228]: /opt/openstack/nova/lib/python3.6/site-packages/eventlet/semaphore.py:115 in acquire
python3[62228]: `hubs.get_hub().switch()`
python3[62228]: /opt/openstack/nova/lib/python3.6/site-packages/eventlet/hubs/hub.py:297 in switch
python3[62228]: `return self.greenlet.switch()`
===

After looking through eventlet's bug tracker, I found a bug [1] where calling logging functions from within tpool.execute can cause subsequent calls to logging to hang. It turns out we do exactly that in impl_kafka.py [2]: a logging function can be called from within tpool.execute, which triggers this eventlet bug.
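
Schematically, the problematic pattern is a logging call inside the function handed to tpool.execute. A minimal Python sketch (hypothetical names, not the actual impl_kafka code):

===
import logging

from eventlet import tpool

LOG = logging.getLogger(__name__)


def _produce_message(producer, topic, message):
    try:
        producer.produce(topic, message)
    except Exception:
        # This runs in a native tpool worker thread. Under eventlet
        # monkey-patching the logging handler lock is a green lock, and
        # acquiring it from a native thread can leave it stuck, so every
        # later logging call blocks forever (eventlet issue #432 [1]).
        LOG.exception("Produce message failed")
        raise


def notify(producer, topic, message):
    # tpool.execute offloads the blocking produce() call to a native
    # thread so it does not block the eventlet hub.
    tpool.execute(_produce_message, producer, topic, message)
===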

This seems to have happened to other projects as well: Swift hit it and addressed it in this patch [3], but I am not sure we can take the same approach in oslo.messaging.

[1] https://github.com/eventlet/eventlet/issues/432
[2] https://github.com/openstack/oslo.messaging/blob/5a43d4548a8cab82222d8d4d0fddc246a1f1fa32/oslo_messaging/_drivers/impl_kafka.py#L297
[3] https://review.opendev.org/c/openstack/swift/+/493636

Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix proposed to oslo.messaging (master)
Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/851852
Committed: https://opendev.org/openstack/oslo.messaging/commit/43f2224aacb668aa51de3d1274ff8939d8aa73ae
Submitter: "Zuul (22348)"
Branch: master

commit 43f2224aacb668aa51de3d1274ff8939d8aa73ae
Author: Guillaume Espanel <email address hidden>
Date: Mon Jul 11 10:58:32 2022 +0200

    Remove logging from ProducerConnection._produce_message

    In impl_kafka, _produce_message is run in a tpool.execute
    context but it was also calling logging functions.
    This could cause subsequent calls to logging functions to
    deadlock.

    This patch moves the logging calls out of the tpool.execute scope.

    Change-Id: I81167eea0a6b1a43a88baa3bc383af684f4b1345
    Closes-bug: #1981093
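
The resulting pattern presumably looks something like the sketch below (hypothetical names, not the actual diff): the tpool-executed function no longer logs, and any exception it raises is re-raised by tpool.execute in the calling greenthread, where logging is safe.

===
import logging

from eventlet import tpool

LOG = logging.getLogger(__name__)


def _produce_message(producer, topic, message):
    # No logging in here: this body runs in a native tpool worker thread.
    producer.produce(topic, message)


def notify(producer, topic, message):
    try:
        # tpool.execute re-raises worker-thread exceptions here, in the
        # calling greenthread, where the logging lock can be taken safely.
        tpool.execute(_produce_message, producer, topic, message)
    except Exception:
        LOG.exception("Produce message failed")
        raise
===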

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/oslo.messaging 14.1.0

This issue was fixed in the openstack/oslo.messaging 14.1.0 release.
