OpenStack services excessively poll socket events when oslo.messaging is used

Bug #1380220 reported by Roman Podoliaka
This bug affects 3 people
Affects              Status         Importance  Assigned to
Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x             Fix Committed  Medium      MOS Oslo
  5.1.x              Won't Fix      Medium      MOS Oslo
  6.0.x              Won't Fix      Medium      MOS Oslo
  6.1.x              Won't Fix      Medium      MOS Oslo
  8.0.x              Won't Fix      Medium      MOS Oslo
  9.x                Fix Released   Medium      MOS Oslo

Bug Description

On a newly deployed cluster, after creating some load (e.g. running Rally scenarios), top shows that many of the OpenStack services start to consume CPU time even when they are *idle* (no user activity): http://paste.openstack.org/show/120460/

This is caused by the fact that those services excessively poll open sockets (http://paste.openstack.org/show/120461/) using a very small timeout value (close to 0, while the eventlet default is 60).

Further investigation showed that services which didn't use oslo.messaging weren't affected.

It turns out that the CPython 2.6/2.7 implementation of condition variables plays badly with the eventlet event loop. oslo.messaging has a place in the code (https://gerrit.mirantis.com/gitweb?p=openstack/oslo.messaging.git;a=blob;f=oslo/messaging/_drivers/impl_rabbit.py;h=dfed27851a36143e31448c77772e2a77597c94c6;hb=45d0e2742aa29c242f027de5edb54ba3db95cc33#l857) where it tries to put the current thread to sleep until some condition becomes true, passing a sane timeout value (24.0 s). Unfortunately, CPython provides its own implementation of condition variables and doesn't use the corresponding pthreads calls. In CPython 2.6/2.7, wait(timeout) on a condition variable is implemented as polling after a short sleep in a loop (https://github.com/akheron/cpython/blob/2.7/Lib/threading.py#L344-L369). Those sleeps of 0.0005 to 0.05 seconds are the values eventually passed to poll()/epoll_wait() in eventlet, causing the process to wake up much more often than it really should (as there are no socket events to process). And user space <-> kernel space switches are expensive.
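
For reference, the polling loop in the linked threading.py boils down to roughly the following (a paraphrase for illustration, not the exact upstream code):

    import time

    def wait_with_polling(waiter_lock, timeout):
        # Rough paraphrase of CPython 2.7 Condition.wait(timeout): instead of
        # blocking in the kernel on a pthread condition variable, the
        # interpreter spins on a non-blocking acquire with exponentially
        # growing sleeps, capped at 50 ms.
        endtime = time.time() + timeout
        delay = 0.0005  # 500 us initial sleep
        while True:
            if waiter_lock.acquire(False):      # non-blocking try-lock
                return True                     # we were notified
            remaining = endtime - time.time()
            if remaining <= 0:
                return False                    # timed out
            delay = min(delay * 2, remaining, 0.05)
            # Under eventlet monkey-patching this sleep becomes a hub timer,
            # i.e. an epoll_wait()/poll() call with a millisecond-scale timeout.
            time.sleep(delay)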

FWIW, PyPy and CPython 3.2+ shouldn't have this bug, but their compatibility with eventlet is an open question.

There are at least two ways to fix this:

1) backport changes to thread.c and threading.py from CPython 3.2 to CPython 2.6/2.7, build and use custom packages

2) add a workaround to oslo.messaging (don't use a condition variable in that particular place)

The former might affect CPython stability and would need to be thoroughly tested, so the latter seems to be a 'good enough' workaround for now.
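
A minimal sketch of option 2), assuming the wait is switched to eventlet's own primitives, which block on the hub with a single timer instead of polling (an illustration only, not the actual oslo.messaging patch; the names below are hypothetical):

    import eventlet
    from eventlet.event import Event
    from eventlet.timeout import Timeout

    eventlet.monkey_patch()

    # Hypothetical replacement for the Condition-based wait in impl_rabbit:
    connection_ready = Event()

    def waiter():
        # Blocks on the hub with a single 24 s timer; no busy-polling of sockets.
        with Timeout(24.0, False):  # exception=False: give up silently on timeout
            connection_ready.wait()

    def notifier():
        if not connection_ready.ready():
            connection_ready.send()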

Changed in mos:
milestone: none → 6.0
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Test snippet to demonstrate the issue with eventlet/CPython CVs implementation: http://xsnippet.org/360230/raw/
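
In case the snippet is unavailable, something along these lines demonstrates the same behaviour on CPython 2.x (an illustrative reconstruction, not necessarily the original snippet):

    # Attach strace (e.g. `strace -e trace=epoll_wait -p <pid>`) and observe
    # epoll_wait() being called with millisecond-scale timeouts even though
    # the condition variable wait below was given a 30-second timeout.
    import eventlet
    eventlet.monkey_patch()

    import threading

    cond = threading.Condition()

    def sleeper():
        with cond:
            cond.wait(timeout=30)  # never notified; CPython 2.x polls in a loop

    eventlet.spawn(sleeper)
    eventlet.sleep(60)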

Igor Marnat (imarnat)
tags: added: scale
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Lowering this to Medium, as it doesn't really affect users, but is only a waste of CPU time (and electricity? :) )

Changed in mos:
importance: High → Medium
summary: - OpenStack services consume a lot of CPU time when oslo.messaging is used
+ OpenStack services excessively poll socket events when oslo.messaging is
+ used
description: updated
Changed in mos:
milestone: 6.0 → 6.0.1
Changed in mos:
status: Triaged → Won't Fix
Changed in mos:
milestone: 6.0.1 → 6.1
status: Won't Fix → Triaged
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Seems like we will not fix this bug in 6.1 either. I am setting it to Won't Fix for 6.1 and assigning it to the Oslo team. Let's revisit it next release.

Changed in mos:
milestone: 6.1 → 7.0
Revision history for this message
Leontii Istomin (listomin) wrote :

Will this be fixed by adding the rabbitmq_heartbeat parameter to the OpenStack services' configs?
If so, then we can mark this one as a duplicate of https://bugs.launchpad.net/mos/+bug/1430894

Revision history for this message
Viktor Serhieiev (vsergeyev) wrote :

If I understand correctly, no, it won't. But it seems that this bug should be gone in 7.x, when we switch to oslo.messaging 1.9.0.

Revision history for this message
Serge Kovaleff (serge-kovaleff) wrote :

When do we expect the first version of 7.0, so that we can verify this?

Revision history for this message
Viktor Serhieiev (vsergeyev) wrote :

Folks, please verify: does this affect 7.0?

Changed in mos:
assignee: MOS Oslo (mos-oslo) → MOS QA Team (mos-qa)
Changed in mos:
milestone: 7.0 → 8.0
Dmitry Pyzhov (dpyzhov)
tags: added: area-qa
Revision history for this message
Andrey Epifanov (aepifanov) wrote :

This bug still affects our customers and happens on newer MOS versions.

tags: added: ct1 customer-found support
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Andrey, could you please specify the MOS version(s) where the issue is still present for customers?

Also, it is not clear why the issue is assigned to the QA team; we can't fix it.
So I've added the newer MOS versions as well to make sure it will be fixed in the latest releases.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

It is medium priority and can only be fixed in updates for MOS 8.0 / MOS 9.0, since it is a customer-found issue.

Dina Belova (dbelova)
tags: added: move-to-mu
tags: added: 10.0-reviewed
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Won't Fix for 8.0, as this is a medium-importance bug without a solution.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Perhaps this can be fixed by https://review.openstack.org/#/c/386656/

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix https://review.openstack.org/#/c/386656/ was merged downstream via https://review.fuel-infra.org/#/c/29072/, so the issue is fixed in 9.2.

Revision history for this message
Michael Semenov (msemenov) wrote :

Verified on the latest 9.2 RC1 scale certification run.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix was merged into 10.0 by merge commit https://review.fuel-infra.org/#/c/30554/
