liberty: ~busy loop on epoll_wait being called with zero timeout

Bug #1518430 reported by JuanJo Ciarlante on 2015-11-20
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Medium
Unassigned
Kilo
Medium
Unassigned
Liberty
Medium
Unassigned
Mitaka
Medium
Unassigned
Newton
Medium
Unassigned
oslo.messaging
Undecided
John Eckersberg
python-oslo.messaging (Ubuntu)
Medium
Unassigned
Xenial
Medium
Unassigned
Yakkety
Medium
Unassigned
Zesty
Medium
Unassigned

Bug Description

Context: openstack juju/maas deploy using 1510 charms release
on trusty, with:
  openstack-origin: "cloud:trusty-liberty"
  source: "cloud:trusty-updates/liberty

* Several openstack nova- and neutron- services, at least:
nova-compute, neutron-server, nova-conductor,
neutron-openvswitch-agent,neutron-vpn-agent
show almost busy looping on epoll_wait() calls, with zero timeout set
most frequently.
- nova-compute (chose it b/cos single proc'd) strace and ltrace captures:
  http://paste.ubuntu.com/13371248/ (ltrace, strace)

As comparison, this is how it looks on a kilo deploy:
- http://paste.ubuntu.com/13371635/

* 'top' sample from a nova-cloud-controller unit from
   this completely idle stack:
  http://paste.ubuntu.com/13371809/

FYI *not* seeing this behavior on keystone, glance, cinder,
ceilometer-api.

As this issue is present on several components, it likely comes
from common libraries (oslo concurrency?), fyi filed the bug to
nova itself as a starting point for debugging.

Note: The description in the following bug gives a good overview of the issue and points to a possible fix for oslo.messaging: https://bugs.launchpad.net/mos/+bug/1380220

JuanJo Ciarlante (jjo) on 2015-11-20
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed

I could confirm that I could observe this bug definitely on Liberty on following services: neutron-server, nova-conductor. I couldn't confirm that for other because it requires additional monitoring/debugging.
Steps to reproduce for neutron-server:

1.Deploy new cloud.
2. Measure CPU and MEM usage by neutron-server and take a look on strace of any pid which is associated with RPC worker.
3. Create 200-400 networks with routers.
4. Spawn 10000 vms (not at once, you could just spawn and delete them by 10-20 pcs).
5. Delete everything (networks, routers, vms).
6. Measure CPU/MEM usage and take a look on strace of neutron-server RPC worker.

Expected behavior:
Everything should be the same as at step 2 with non-significant changes in CPU/MEM.

Observer behavior:
Huge CPU load (1-2 cores), huge MEMORY load (6-7Gb) without doing anything useful at the moment ;) Stracing of RPC thread will show ~1k per second of EPOLL_WAITs with 0 timeout.

Here is some graphs of resources usage by neutron-server even after full cleanup (no networks presents in DB)

Dina Belova (dbelova) wrote :

For MOS it's tracked separately - https://bugs.launchpad.net/mos/+bug/1380220 - removing from this bug

no longer affects: mos
description: updated
JuanJo Ciarlante (jjo) wrote :

For experimenting purposes / measuring the difference against
what would be a (better) behaving epoll_wait usage, I created:
https://github.com/jjo/dl-hack-lp1518430
, which implements hooking epoll_wait() and select()
(via LD_PRELOAD) to limit the rate of calls with zero timeouts.

WfM'd on an experimental stack, on one node hosting all OS
API LXCs and nova-compute, I measured a
700K/s -> 120K/s reduction of epoll_wait calls (as shown
by: sysdig -c topscalls).

Sean Dague (sdague) wrote :

This is really an oslo.messaging issue. Not valid for Nova upstream.

Changed in nova:
status: New → Invalid
Chuck Short (zulcss) on 2016-10-05
Changed in nova (Ubuntu):
status: Confirmed → Invalid
Mehdi Abaakouk (sileht) wrote :

Is this still valid on Newton/Mitaka ?

Changed in oslo.messaging:
status: New → Won't Fix
status: Won't Fix → Incomplete
John Eckersberg (jeckersb) wrote :

@sileht, looks like it is still valid - https://bugzilla.redhat.com/show_bug.cgi?id=1384183

Changed in oslo.messaging:
assignee: nobody → John Eckersberg (jeckersb)
status: Incomplete → In Progress
John Eckersberg (jeckersb) wrote :

Copy/paste from https://bugzilla.redhat.com/show_bug.cgi?id=1384183#c10:

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically, the three methods, _heartbeat_start, _heartbeat_stop,
and _heartbeat_thread_job.

In _heartbeat_start, it creates a threading.Event (this event ends up
being the root of the problem), a threading.Thread for the heartbeat
thread, and then starts the heartbeat thread.

The heartbeat thread does work, and then at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is
stopped, this is where it receives that and shuts down.

With the default heartbeat configuration, _heartbeat_wait_timeout ends
up being 15.0 seconds. This sounds reasonable, but explodes under the
hood.

Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of
code:

    # Balancing act: We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive. The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)

Now, the problem is that _sleep here is time.sleep, which has been
previously monkey-patched by eventlet to be eventlet.greenthread.sleep
instead.

When you call sleep() in eventlet, it adds a timer onto an internal
heap that it maintains[2]. When the main event loop runs, it checks
whatever the shortest timer is, and uses that delta as the argument
for waiting (this ultimately ends up to epoll_wait). Here's the
crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times between
0.0005..0.05, these tiny timers are almost always the next timer to
fire, and sleep_time above ends up being negative (or small enough to
practically be zero anyways). The self.wait() call propagates down to
epoll_wait() and you get the behavior we're seeing.

So, the root of the problem is that using threading.Event inside of
eventlet is bad.

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348

Changed in oslo.messaging:
assignee: John Eckersberg (jeckersb) → Gevorg Davoian (gdavoian)
Changed in oslo.messaging:
assignee: Gevorg Davoian (gdavoian) → nobody
Changed in oslo.messaging:
assignee: nobody → John Eckersberg (jeckersb)

Reviewed: https://review.openstack.org/386656
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a6c193f3eba62cdcbfe04d0fa93e95352bcfb1c3
Submitter: Jenkins
Branch: master

commit a6c193f3eba62cdcbfe04d0fa93e95352bcfb1c3
Author: John Eckersberg <email address hidden>
Date: Fri Oct 14 11:02:47 2016 -0400

    rabbit: Avoid busy loop on epoll_wait with heartbeat+eventlet

    Calling threading.Event.wait() when using eventlet results in a busy
    loop calling epoll_wait, because the Python 2.x
    threading.Condition.wait() implementation busy-waits by calling
    sleep() with very small values (0.0005..0.05s). Because sleep() is
    monkey-patched by eventlet, this results in many very short timers
    being added to the eventlet hub, and forces eventlet to constantly
    epoll_wait looking for new data unecessarily.

    This utilizes a new Event from eventletutils which conditionalizes the
    event primitive depending on whether or not eventlet is being used.
    If it is, eventlet.event.Event is used instead of threading.Event.
    The eventlet.event.Event implementation does not suffer from the same
    busy-wait sleep problem. If eventlet is not used, the previous
    behavior is retained.

    Change-Id: I5c211092d282e724d1c87ce4d06b6c44b592e764
    Depends-On: Id33c9f8c17102ba1fe24c12b053c336b6d265501
    Closes-bug: #1518430

Changed in oslo.messaging:
status: In Progress → Fix Released
no longer affects: nova (Ubuntu Trusty)
no longer affects: nova (Ubuntu Xenial)
no longer affects: nova (Ubuntu Yakkety)
no longer affects: nova (Ubuntu Zesty)

Reviewed: https://review.openstack.org/394963
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=f4bf955879747338b0ab7a6ca41d0dea74508353
Submitter: Jenkins
Branch: stable/newton

commit f4bf955879747338b0ab7a6ca41d0dea74508353
Author: John Eckersberg <email address hidden>
Date: Fri Oct 14 11:02:47 2016 -0400

    rabbit: Avoid busy loop on epoll_wait with heartbeat+eventlet

    Calling threading.Event.wait() when using eventlet results in a busy
    loop calling epoll_wait, because the Python 2.x
    threading.Condition.wait() implementation busy-waits by calling
    sleep() with very small values (0.0005..0.05s). Because sleep() is
    monkey-patched by eventlet, this results in many very short timers
    being added to the eventlet hub, and forces eventlet to constantly
    epoll_wait looking for new data unecessarily.

    This utilizes a new Event from eventletutils which conditionalizes the
    event primitive depending on whether or not eventlet is being used.
    If it is, eventlet.event.Event is used instead of threading.Event.
    The eventlet.event.Event implementation does not suffer from the same
    busy-wait sleep problem. If eventlet is not used, the previous
    behavior is retained.

    For Newton backport, this bundles the Event from eventletutils
    directly in oslo.messaging under the _utils module. It is taken from:

    https://review.openstack.org/#/c/389739/

    combined with the followup fix:

    https://review.openstack.org/#/c/394460/

    Change-Id: I5c211092d282e724d1c87ce4d06b6c44b592e764
    Depends-On: Id33c9f8c17102ba1fe24c12b053c336b6d265501
    Closes-bug: #1518430
    (cherry picked from commit a6c193f3eba62cdcbfe04d0fa93e95352bcfb1c3)

This issue was fixed in the openstack/oslo.messaging 5.10.1 release.

Reviewed: https://review.openstack.org/395018
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=d690cac08ca0f6c18518e7760d8bdcfd14cdee84
Submitter: Jenkins
Branch: stable/mitaka

commit d690cac08ca0f6c18518e7760d8bdcfd14cdee84
Author: John Eckersberg <email address hidden>
Date: Fri Oct 14 11:02:47 2016 -0400

    rabbit: Avoid busy loop on epoll_wait with heartbeat+eventlet

    Calling threading.Event.wait() when using eventlet results in a busy
    loop calling epoll_wait, because the Python 2.x
    threading.Condition.wait() implementation busy-waits by calling
    sleep() with very small values (0.0005..0.05s). Because sleep() is
    monkey-patched by eventlet, this results in many very short timers
    being added to the eventlet hub, and forces eventlet to constantly
    epoll_wait looking for new data unecessarily.

    This utilizes a new Event from eventletutils which conditionalizes the
    event primitive depending on whether or not eventlet is being used.
    If it is, eventlet.event.Event is used instead of threading.Event.
    The eventlet.event.Event implementation does not suffer from the same
    busy-wait sleep problem. If eventlet is not used, the previous
    behavior is retained.

    For Newton backport, this bundles the Event from eventletutils
    directly in oslo.messaging under the _utils module. It is taken from:

    https://review.openstack.org/#/c/389739/

    combined with the followup fix:

    https://review.openstack.org/#/c/394460/

    Change-Id: I5c211092d282e724d1c87ce4d06b6c44b592e764
    Depends-On: Id33c9f8c17102ba1fe24c12b053c336b6d265501
    Closes-bug: #1518430
    (cherry picked from commit a6c193f3eba62cdcbfe04d0fa93e95352bcfb1c3)

This issue was fixed in the openstack/oslo.messaging 5.13.0 release.

Changed in python-oslo.messaging (Ubuntu Zesty):
status: New → Fix Released
Corey Bryant (corey.bryant) wrote :

I can't see where this bug exists on base Trusty/Icehouse so I've dropped those targets for Ubuntu.

no longer affects: cloud-archive/icehouse
no longer affects: python-oslo.messaging (Ubuntu Trusty)
Corey Bryant (corey.bryant) wrote :

This has been uploaded to ubuntu zesty, yakkety, xenial and is awaiting sru team review for yakkety and xenial. This has also been uploaded to kilo-staging and liberty-staging for the ubuntu cloud archive. These should all be available in -proposed soon for testing.

Corey Bryant (corey.bryant) wrote :

Ubuntu SRU details:
[Description]
Avoid busy loop on epoll_wait with heartbeat+eventlet in rabbit driver.

[Test Case]
See bug description.

[Regression Potential]
Minimal as the patches are cherry picked from upstream stable branches.

This issue was fixed in the openstack/oslo.messaging 5.10.1 release.

Hello JuanJo, or anyone else affected,

Accepted python-oslo.messaging into liberty-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:liberty-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-liberty-needed to verification-liberty-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-liberty-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-liberty-needed
affects: nova → ubuntu-translations
no longer affects: ubuntu-translations
no longer affects: nova (Ubuntu)
Brian Murray (brian-murray) wrote :

Hello JuanJo, or anyone else affected,

Accepted python-oslo.messaging into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/5.10.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in python-oslo.messaging (Ubuntu Yakkety):
status: New → Fix Committed
tags: added: verification-needed
Changed in python-oslo.messaging (Ubuntu Xenial):
status: New → Fix Committed
Brian Murray (brian-murray) wrote :

Hello JuanJo, or anyone else affected,

Accepted python-oslo.messaging into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/python-oslo.messaging/4.6.1-2ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-archive:
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Xenial):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Zesty):
importance: Undecided → Medium
tags: added: verification-done
removed: verification-needed
Corey Bryant (corey.bryant) wrote :

Marked as verifcation-done as this has passed tests on xenial and yakkety. We perform regression testing by juju deploying two openstack clouds, one against -proposed and one against -updates, and we then run tempest integration smoke tests[1] to verify those deployed clouds, and evaluate any differences in results. [1] https://github.com/openstack/tempest/blob/master/README.rst

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 5.10.0-0ubuntu2

---------------
python-oslo.messaging (5.10.0-0ubuntu2) yakkety; urgency=medium

  * d/p/rabbit-avoid-busy-loop.patch: Cherry pick patch from upstream
    to avoid rabbit driver busy loop on epoll_wait with heartbeat+eventlet
    (LP: #1518430).

 -- Corey Bryant <email address hidden> Fri, 02 Dec 2016 12:39:50 -0500

Changed in python-oslo.messaging (Ubuntu Yakkety):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-oslo.messaging - 4.6.1-2ubuntu2

---------------
python-oslo.messaging (4.6.1-2ubuntu2) xenial; urgency=medium

  * d/p/rabbit-avoid-busy-loop.patch: Cherry pick patch from upstream
    to avoid rabbit driver busy loop on epoll_wait with heartbeat+eventlet
    (LP: #1518430).

 -- Corey Bryant <email address hidden> Fri, 02 Dec 2016 12:51:26 -0500

Changed in python-oslo.messaging (Ubuntu Xenial):
status: Fix Committed → Fix Released
tags: added: verification-liberty-done
removed: verification-liberty-needed
James Page (james-page) wrote :

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package python-oslo.messaging - 5.10.0-0ubuntu2~cloud0
---------------

 python-oslo.messaging (5.10.0-0ubuntu2~cloud0) xenial-newton; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (5.10.0-0ubuntu2) yakkety; urgency=medium
 .
   * d/p/rabbit-avoid-busy-loop.patch: Cherry pick patch from upstream
     to avoid rabbit driver busy loop on epoll_wait with heartbeat+eventlet
     (LP: #1518430).

James Page (james-page) wrote :

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package python-oslo.messaging - 2.5.0-1ubuntu2~cloud1
---------------

 python-oslo.messaging (2.5.0-1ubuntu2~cloud1) trusty-liberty; urgency=medium
 .
   * d/p/rabbit-avoid-busy-loop.patch: Cherry pick patch from upstream
     stable/mitaka branch to avoid rabbit driver busy loop on epoll_wait with
     heartbeat+eventlet (LP: #1518430).

Hello JuanJo, or anyone else affected,

Accepted python-oslo.messaging into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
tags: added: verification-mitaka-done
removed: verification-mitaka-needed

The verification of the Stable Release Update for python-oslo.messaging has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package python-oslo.messaging - 4.6.1-2ubuntu2~cloud0
---------------

 python-oslo.messaging (4.6.1-2ubuntu2~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 python-oslo.messaging (4.6.1-2ubuntu2) xenial; urgency=medium
 .
   * d/p/rabbit-avoid-busy-loop.patch: Cherry pick patch from upstream
     to avoid rabbit driver busy loop on epoll_wait with heartbeat+eventlet
     (LP: #1518430).

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.