wait_child() without greenthread.sleep() uses excessive CPU

Bug #1254089 reported by Scott DAngelo on 2013-11-22
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Scott DAngelo

Bug Description

https://bugs.launchpad.net/oslo/+bug/1095346 tracked a fix in service.py: wait() to prevent the cinder-volume process from hogging the CPU while looping in _wait_child()
However, the wait() function will also loop in _wait_child() when the parent catched SIGTERM and the parent is waiting to reap the children. This also causes excessive CPU usage if the child does not die quickly for some reason.

I've an instance of cinder-volume that has been running for several days in this wait state. An strace of the parent shows:
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
this is because the parent has caught SIGTERM is and is looping on wait_child without the eventlet.greenthread.sleep(.01)

during the normal running state of cinder-volume the parent does an epoll for the eventlet.greenthread.sleep(.01):
wait4(0, 0x7fff25185430, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 9984}) = 0 (Timeout)
wait4(0, 0x7fff25185430, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 9984}) = 0 (Timeout)

My CPU is pegged at 100% with this process (shown in top):
27313 cinder 20 0 103m 25m 4080 R 100 0.0 56343:11 cinder-volume

To fix this, the second call to _wait_child() needs the same sleep as the first:
    def wait(self):
        """Loop waiting on children to die and respawning as necessary."""
        while self.running:
            wrap = self._wait_child()
            if not wrap:
                # Yield to other threads if no children have exited
                # Sleep for a short time to avoid excessive CPU usage
                # (see bug #1095346)

            LOG.info(_('wait wrap.failed %s'), wrap.failed)
            while (self.running and len(wrap.children) < wrap.workers
                   and not wrap.failed):

        if self.sigcaught:
            signame = {signal.SIGTERM: 'SIGTERM',
                       signal.SIGINT: 'SIGINT'}[self.sigcaught]
            LOG.info(_('Caught %s, stopping children'), signame)

        for pid in self.children:
                os.kill(pid, signal.SIGTERM)
            except OSError as exc:
                if exc.errno != errno.ESRCH:

        # Wait for children to die
        if self.children:
            LOG.info(_('Waiting on %d children to exit'), len(self.children))
            while self.children:

Patch is :
--- a/cinder/service.py
+++ b/cinder/service.py
@@ -323,7 +323,10 @@ class ProcessLauncher(object):
         if self.children:
             LOG.info(_('Waiting on %d children to exit'), len(self.children))
             while self.children:
- self._wait_child()
+ wrap = self._wait_child()
+ if not wrap:
+ eventlet.greenthread.sleep(.01)
+ continue

Changed in cinder:
assignee: nobody → Scott DAngelo (scott-dangelo)

Fix proposed to branch: master
Review: https://review.openstack.org/57971

Changed in cinder:
status: New → In Progress

Reviewed: https://review.openstack.org/57971
Committed: http://github.com/openstack/cinder/commit/273d8765115f0936d24549db8698dbe7e8d2ec1a
Submitter: Jenkins
Branch: master

commit 273d8765115f0936d24549db8698dbe7e8d2ec1a
Author: scott-dangelo <email address hidden>
Date: Fri Nov 22 16:28:07 2013 +0000

    Add greenthread.sleep() to parent wait()

    Add an eventlet.greenthread.sleep() to wait() to keep
    cinder-volume parent from looping after catching SIGTERM
    and hogging the CPU. This was already added to the running
    state but is missing from the SIGTERM caught state when
    the parent waits to reap the child.

    Change-Id: Ia871e31db5bf9ec4e86f926f8f6c4f0f4ecb7925
    Closes-Bug: #1254089

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-12-04
Changed in cinder:
milestone: none → icehouse-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-04-17
Changed in cinder:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers