wait_child() without greenthread.sleep() uses excessive CPU

Bug #1254089 reported by Scott DAngelo
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Undecided
Scott DAngelo

Bug Description

https://bugs.launchpad.net/oslo/+bug/1095346 tracked a fix in service.py: wait() to prevent the cinder-volume process from hogging the CPU while looping in _wait_child()
However, the wait() function will also loop in _wait_child() when the parent catched SIGTERM and the parent is waiting to reap the children. This also causes excessive CPU usage if the child does not die quickly for some reason.

I've an instance of cinder-volume that has been running for several days in this wait state. An strace of the parent shows:
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
wait4(0, 0x7fff081b7780, WNOHANG, NULL) = 0
......forever....
this is because the parent has caught SIGTERM is and is looping on wait_child without the eventlet.greenthread.sleep(.01)

during the normal running state of cinder-volume the parent does an epoll for the eventlet.greenthread.sleep(.01):
wait4(0, 0x7fff25185430, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 9984}) = 0 (Timeout)
wait4(0, 0x7fff25185430, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 9984}) = 0 (Timeout)

My CPU is pegged at 100% with this process (shown in top):
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27313 cinder 20 0 103m 25m 4080 R 100 0.0 56343:11 cinder-volume

To fix this, the second call to _wait_child() needs the same sleep as the first:
    def wait(self):
        """Loop waiting on children to die and respawning as necessary."""
        while self.running:
            wrap = self._wait_child()
            if not wrap:
                # Yield to other threads if no children have exited
                # Sleep for a short time to avoid excessive CPU usage
                # (see bug #1095346)
                eventlet.greenthread.sleep(.01)
                continue

            LOG.info(_('wait wrap.failed %s'), wrap.failed)
            while (self.running and len(wrap.children) < wrap.workers
                   and not wrap.failed):
                self._start_child(wrap)

        if self.sigcaught:
            signame = {signal.SIGTERM: 'SIGTERM',
                       signal.SIGINT: 'SIGINT'}[self.sigcaught]
            LOG.info(_('Caught %s, stopping children'), signame)

        for pid in self.children:
            try:
                os.kill(pid, signal.SIGTERM)
            except OSError as exc:
                if exc.errno != errno.ESRCH:
                    raise

        # Wait for children to die
        if self.children:
            LOG.info(_('Waiting on %d children to exit'), len(self.children))
            while self.children:
                self._wait_child()

Patch is :
--- a/cinder/service.py
+++ b/cinder/service.py
@@ -323,7 +323,10 @@ class ProcessLauncher(object):
         if self.children:
             LOG.info(_('Waiting on %d children to exit'), len(self.children))
             while self.children:
- self._wait_child()
+ wrap = self._wait_child()
+ if not wrap:
+ eventlet.greenthread.sleep(.01)
+ continue

Changed in cinder:
assignee: nobody → Scott DAngelo (scott-dangelo)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/57971

Changed in cinder:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/57971
Committed: http://github.com/openstack/cinder/commit/273d8765115f0936d24549db8698dbe7e8d2ec1a
Submitter: Jenkins
Branch: master

commit 273d8765115f0936d24549db8698dbe7e8d2ec1a
Author: scott-dangelo <email address hidden>
Date: Fri Nov 22 16:28:07 2013 +0000

    Add greenthread.sleep() to parent wait()

    Add an eventlet.greenthread.sleep() to wait() to keep
    cinder-volume parent from looping after catching SIGTERM
    and hogging the CPU. This was already added to the running
    state but is missing from the SIGTERM caught state when
    the parent waits to reap the child.

    Change-Id: Ia871e31db5bf9ec4e86f926f8f6c4f0f4ecb7925
    Closes-Bug: #1254089

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
milestone: none → icehouse-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.