Comment 11 for bug 1518430

Revision history for this message
John Eckersberg (jeckersb) wrote :

Copy/paste from https://bugzilla.redhat.com/show_bug.cgi?id=1384183#c10:

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically, the three methods, _heartbeat_start, _heartbeat_stop,
and _heartbeat_thread_job.

In _heartbeat_start, it creates a threading.Event (this event ends up
being the root of the problem), a threading.Thread for the heartbeat
thread, and then starts the heartbeat thread.

The heartbeat thread does work, and then at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is
stopped, this is where it receives that and shuts down.

With the default heartbeat configuration, _heartbeat_wait_timeout ends
up being 15.0 seconds. This sounds reasonable, but explodes under the
hood.

Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of
code:

    # Balancing act: We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive. The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)

Now, the problem is that _sleep here is time.sleep, which has been
previously monkey-patched by eventlet to be eventlet.greenthread.sleep
instead.

When you call sleep() in eventlet, it adds a timer onto an internal
heap that it maintains[2]. When the main event loop runs, it checks
whatever the shortest timer is, and uses that delta as the argument
for waiting (this ultimately ends up to epoll_wait). Here's the
crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times between
0.0005..0.05, these tiny timers are almost always the next timer to
fire, and sleep_time above ends up being negative (or small enough to
practically be zero anyways). The self.wait() call propagates down to
epoll_wait() and you get the behavior we're seeing.

So, the root of the problem is that using threading.Event inside of
eventlet is bad.

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348