impl_kafka calls logging methods from tpool.execute causing deadlock
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.messaging | Fix Released | Undecided | Unassigned |
Bug Description
We recently experienced an issue with nova-compute "locking up" after it hit an error while sending notifications to Kafka. It stops logging anything and stops performing tasks (rebooting an instance, for example).
The last message logged by nova-compute is:
===
2022-07-06 19:48:30.634 62228 ERROR oslo_messaging.
===
The guru meditation report shows many greenthreads waiting on a lock related to logging, for example:
===
...
python3[62228]: /opt/openstack/...
python3[62228]: `self._log(INFO, msg, args, **kwargs)`
python3[62228]: /opt/openstack/...
python3[62228]: `self.handle(record)`
python3[62228]: /opt/openstack/...
python3[62228]: `self.callHandlers(record)`
python3[62228]: /opt/openstack/...
python3[62228]: `hdlr.handle(record)`
python3[62228]: /opt/openstack/...
python3[62228]: `self.acquire()`
python3[62228]: /opt/openstack/...
python3[62228]: `self.lock.acquire()`
python3[62228]: /opt/openstack/...
python3[62228]: `rc = self._block.acquire(blocking, timeout)`
python3[62228]: /opt/openstack/...
python3[62228]: `hubs.get_hub().switch()`
python3[62228]: /opt/openstack/...
python3[62228]: `return self.greenlet.switch()`
===
After looking a little through eventlet's bugs, I found this one [1]: calling logging functions from within tpool.execute can cause subsequent calls to logging to hang. It turns out we do exactly that in impl_kafka.py [2]: we can end up calling a logging function from within tpool.execute, thereby triggering this eventlet bug.
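To illustrate, here is a minimal standalone sketch of the pattern (my own illustration, not the oslo.messaging code; whether it actually hangs depends on the eventlet version):
===
# Minimal sketch, assuming eventlet monkey-patching and stdlib logging.
# A function run through tpool.execute logs from the native worker
# thread; a later log call from the main greenthread can then block
# forever on the logging handler lock, as in the traceback above.
import eventlet
eventlet.monkey_patch()

import logging
from eventlet import tpool

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)


def send_to_kafka():
    # Runs in a native tpool worker thread; logging here takes the
    # (monkey-patched, green) handler lock from a foreign thread.
    LOG.error("Produce message failed")  # simulating the error path


tpool.execute(send_to_kafka)

# Back in the main greenthread: this call can hang in self.lock.acquire().
LOG.info("this may never return")
===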
This seems to have happened in other projects as well: Swift hit the same issue and addressed it in this patch [3], but I am not sure we can take the same approach in oslo.messaging.
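One possible direction (a sketch only, with illustrative names; this is neither the Swift patch [3] nor the proposed oslo.messaging change) would be to keep the function handed to tpool.execute free of logging calls and log from the calling greenthread instead:
===
import logging

from eventlet import tpool

LOG = logging.getLogger(__name__)


def _produce(producer, topic, message):
    # No logging in here: this body runs in a tpool native thread.
    # "producer" is assumed to be a confluent-kafka Producer.
    producer.produce(topic, message)
    producer.poll(0)


def send_notification(producer, topic, message):
    try:
        tpool.execute(_produce, producer, topic, message)
    except Exception:
        # The exception propagates back to this greenthread, where
        # taking the logging lock is safe.
        LOG.exception("Failed to produce message to Kafka")
===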
[1] https:/
[2] https:/
[3] https:/
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/851852