Eventlet 0.19.0 with Python 2.7: main-thread greenthread blocks

Bug #2023023 reported by lijie.xie
Affects: oslo.service
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I don't know much about oslo.service internals, so I hope to get help from the community. Thank you.

OS: CentOS 8
Package versions:
eventlet (0.19.0)
oslo.log (3.16.1)
oslo.service (1.16.1)

I start my process with oslo.service and call eventlet.monkey_patch(thread=False).
For convenience, I have pasted the main code logic below:

import threading

import eventlet
import oslo_messaging as messaging
from oslo_log import log as logging
from oslo_service import periodic_task

CONF = config.CONF  # config is a project-local module
LOG = logging.getLogger(__name__)

class NodeManager(periodic_task.PeriodicTasks):

    target = messaging.Target(version='1.0')

    def __init__(self, host=None, service_name='node-agent'):
        ...
        super(NodeManager, self).__init__(CONF)

    def add_tasks(self, tg):
        # These two run as eventlet greenthreads in the main OS thread.
        tg.add_thread(self.do_set_compute_ready_to_run)
        tg.add_thread(self.do_update_host_info)
        # This one is a native OS thread: threading is not
        # monkey-patched because of thread=False.
        thread_heartbeat = threading.Thread(target=self.do_heartbeat)
        thread_heartbeat.start()

    def do_update_host_info(self):
        while True:
            LOG.info('Do action: do_update_host_info')
            try:
                self.cluster_obj.update_agent_info()
                self.cluster_obj.update_remote_hosts()
                self.cluster_obj.update_cluster_info()
            except Exception as ex:
                LOG.warning('Unexpected: do update host info failed, '
                        'err: %s.', ex)
            eventlet.sleep(20)

    def do_heartbeat(self):
        while True:
            LOG.info('Do action: do_heartbeat')
            try:
                self.cluster_obj.do_heartbeat(self.host_obj)
            except Exception as ex:
                LOG.warning('Unexpected error: do heartbeat %s.', ex)
            eventlet.sleep(10)
tg.add_thread spawns do_set_compute_ready_to_run (which usually finishes quickly) and do_update_host_info (long-running) as eventlet greenthreads; do_heartbeat (long-running) uses a native thread. All of them use oslo_log to record key logs.
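The entry point is not pasted above; it is roughly of the following shape (a sketch with hypothetical names; main() and NodeAgentService are placeholders, not the real code):

import eventlet
# Patch the standard library except threading, so threading.Thread
# stays a real OS thread; this matches the monkey_patch call above.
eventlet.monkey_patch(thread=False)

from oslo_service import service

def main():
    # service.launch runs the service; wait() blocks until it exits.
    launcher = service.launch(CONF, NodeAgentService())
    launcher.wait()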

Sometimes the do_update_host_info greenthread blocks and stops producing log output, while the do_heartbeat native thread keeps running fine.
I cannot reproduce this problem reliably; it usually takes days or even months to happen.

I found that when the greenthread hangs, self.cluster_obj.update_agent_info() completes normally, but self.cluster_obj.update_remote_hosts() produces no logs; that is, the line LOG.info('Get remote hosts: %s.', mgmt_remotes) in the following code is never reached:

class ClusterInfo(object):
    def __init__(self, hostha_apiclient=None):
        ...

    def update_remote_hosts(self):
        mgmt_remotes = self.get_heartbeat_remote_hosts()

        for network, client in self.heartbeat_client.items():
            if network == constant.MGMT_NET and mgmt_remotes:
                client.update_remote_hosts(mgmt_remotes)

    def get_heartbeat_remote_hosts(self):
        mgmt_remotes = {}
        for controller in self.controller_agents:
            hostname = controller['hostname']
            mgmt_ip = controller.get('br-mgmt')
            if mgmt_ip:
                mgmt_remotes.update({hostname: mgmt_ip})
        LOG.info('Get remote hosts: %s.', mgmt_remotes)
        return mgmt_remotes

I suspect that when greenthreads are mixed with native threads there is a deadlock in logging, but I don't know how to analyze it.
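My rough understanding of the hazard, written up as a minimal sketch (illustration code I made up, not code from the service): with thread=False, the logging handlers keep real OS-level locks, and those locks are shared between greenthreads and native threads. While a native thread holds a handler's lock, any greenthread that logs blocks inside lock.acquire(), and because the lock is native, that blocks the whole OS thread running the eventlet hub, so every greenthread stalls:

import eventlet
eventlet.monkey_patch(thread=False)

import logging
import threading

# Un-patched time module, so the native thread really blocks in C.
native_time = eventlet.patcher.original('time')

logging.basicConfig()
LOG = logging.getLogger('demo')
handler = logging.getLogger().handlers[0]

def native_worker():
    # Hold the handler's lock (a real threading.RLock, since
    # threading was not monkey-patched) for 30 seconds.
    with handler.lock:
        native_time.sleep(30)

def green_worker():
    # This tries to take the same OS lock; acquire() blocks the main
    # OS thread, which is also the hub, so all greenthreads stall.
    LOG.warning('printed only after native_worker releases the lock')

threading.Thread(target=native_worker).start()
native_time.sleep(1)  # let native_worker grab the lock first
eventlet.spawn(green_worker).wait()

In the real service such a stall should normally be brief, so this alone may not explain a permanent hang, but it shows how a native thread can freeze the hub through the shared logging locks.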

I dumped the thread stacks over a few minutes and straced the process:
Stack for thread 140586048190208
File "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File "", line 167, in run
File "/usr/lib64/python2.7/code.py", line 243, in interact
more = self.push(line)
File "/usr/lib64/python2.7/code.py", line 265, in push
more = self.runsource(source, self.filename)
File "/usr/lib64/python2.7/code.py", line 87, in runsource
self.runcode(code)
File "/usr/lib64/python2.7/code.py", line 103, in runcode
exec code in self.locals
File "", line 3, in

Stack for thread 140586056582912
File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
self.wait(sleep_time)
File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
presult = self.do_poll(seconds)
File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
return self.poll.poll(seconds)

Stack for thread 140586521663296
File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
self.wait(sleep_time)
File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
presult = self.do_poll(seconds)
File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
return self.poll.poll(seconds)

[root@node-1 /]# strace -t -p 38688
strace: Process 38688 attached
20:34:13 epoll_pwait(3, [], 1023, 22085, NULL, 8) = 0
20:34:35 epoll_pwait(3, [], 1023, 59999, NULL, 8) = 0
20:35:35 epoll_pwait(3, [], 1023, 59999, NULL, 8) = 0
20:36:35 epoll_pwait(3, ^Cstrace: Process 38688 detached
<detached ...>
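For reference, per-thread dumps like the ones above can be produced with something along these lines (a sketch; not necessarily the exact code I used):

import sys
import traceback

def dump_all_thread_stacks():
    # sys._current_frames() maps every OS thread id to its current
    # frame; this covers native threads and the hub's thread, but a
    # blocked greenthread only shows up as the hub waiting in poll.
    for thread_id, frame in sys._current_frames().items():
        print('Stack for thread %d' % thread_id)
        print(''.join(traceback.format_stack(frame)))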
