LoopingCall sleep causes graceful process shutdown delay

Bug #1660210 reported by Brent Tang
This bug affects 3 people
Affects        Status         Importance   Assigned to     Milestone
oslo.service   Fix Released   Undecided    Allain Legacy

Bug Description

When trying to do a SIGTERM graceful shutdown of various OpenStack services (Cinder, Neutron, etc.), it was noticed that the processes can take a minute or more to shut down.

Looking into this, the delay seems to be caused by the loopingcall.py _run_loop method, which sleeps for the entire time it wants to wait (both for the initial delay and for the time until the next interval). Because that sleep can last 60 seconds or even 5 minutes in some cases, the process takes that long to die (unless it is killed first), even though it isn't doing any processing or cleanup during that time.
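A minimal sketch of the behaviour described above. It uses plain threads instead of the eventlet greenthreads that oslo.service runs on, and the SleepingLoop class, measure_shutdown helper, and timings are all hypothetical; it only illustrates why a stop request goes unnoticed until a full-length sleep returns.

```python
import threading
import time


class SleepingLoop:
    """Hypothetical stand-in for a looping call that sleeps the whole interval."""

    def __init__(self, interval):
        self.interval = interval
        self._running = True
        self._thread = threading.Thread(target=self._run_loop)

    def start(self):
        self._thread.start()

    def stop(self):
        # Only flips a flag; nothing wakes the sleeping thread.
        self._running = False

    def wait(self):
        self._thread.join()

    def _run_loop(self):
        while self._running:
            # The flag is not looked at again until the sleep returns.
            time.sleep(self.interval)


def measure_shutdown(interval):
    """Return the seconds between calling stop() and the loop thread exiting."""
    loop = SleepingLoop(interval)
    loop.start()
    time.sleep(0.05)  # give the loop time to enter its sleep
    began = time.monotonic()
    loop.stop()
    loop.wait()
    return time.monotonic() - began
```

Calling measure_shutdown(60) would block for close to a minute, which matches the shutdown delay reported here.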

There might be a better way to do this, but it seems better to make the sleep/wait dependent on being notified that the loop is stopping (for example, using a threading.Condition wait/notify). There is some overhead involved in waking up more often, but since the loop isn't processing anything it isn't using a lot of CPU. So I was thinking something like this in loopingcall.py could eliminate most of that delay:

    def stop(self):
        # Clear the running flag and wake any waiter while holding the lock.
        with self._cond:
            self._running = False
            self._cond.notify_all()

    def _run_loop(self, idle_for_func,
                  initial_delay=None, stop_on_exception=True):
        .....
        if initial_delay:
            # Wait out the initial delay unless stop() was already called.
            with self._cond:
                if self._running:
                    self._cond.wait(initial_delay)
        .....
                # Idle until the next interval, or until stop() notifies us.
                with self._cond:
                    if self._running:
                        self._cond.wait(idle)
        except LoopingCallDone as e:
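
The fragment above leaves out most of _run_loop, so here is a self-contained sketch of the same idea that can be run on its own. It is not the oslo.service code: the ConditionLoop class and its _idle helper are hypothetical, plain threads stand in for greenthreads, and Condition.wait_for() is used in place of a bare wait() so that an early wakeup does not cut the idle period short.

```python
import threading
import time


class ConditionLoop:
    """Hypothetical periodic task whose idle wait can be cut short by stop()."""

    def __init__(self, func, interval, initial_delay=None):
        self.func = func
        self.interval = interval
        self.initial_delay = initial_delay
        self._cond = threading.Condition()
        self._running = True
        self._thread = threading.Thread(target=self._run_loop)

    def start(self):
        self._thread.start()

    def stop(self):
        # Change the flag and notify under the lock so a waiter cannot miss it.
        with self._cond:
            self._running = False
            self._cond.notify_all()

    def wait(self):
        self._thread.join()

    def _idle(self, timeout):
        # Returns when the timeout expires or as soon as stop() has been called.
        with self._cond:
            self._cond.wait_for(lambda: not self._running, timeout)

    def _run_loop(self):
        if self.initial_delay:
            self._idle(self.initial_delay)
        while self._running:
            self.func()
            self._idle(self.interval)
```

With this shape, stop() followed by wait() returns almost immediately even when interval is several minutes, because the idle wait ends on the notification rather than on the timeout.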

Allain Legacy (alegacy)
Changed in oslo.service:
status: New → Confirmed
Allain Legacy (alegacy)
Changed in oslo.service:
assignee: nobody → Allain Legacy (alegacy)
status: Confirmed → In Progress
Allain Legacy (alegacy) wrote :

I have proposed the following change as a possible fix.

https://review.openstack.org/#/c/469859

I am not sure why a comment pointing to it wasn't added automatically, so I figured I would add it here manually.

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.service (master)

Reviewed: https://review.openstack.org/469859
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=ba28d511e08c02803ac834bf563eb30a135c2c6e
Submitter: Zuul
Branch: master

commit ba28d511e08c02803ac834bf563eb30a135c2c6e
Author: Allain Legacy <email address hidden>
Date: Wed May 31 16:18:19 2017 -0400

    Permit aborting loopingcall while sleeping

    Some of the openstack services implement worker tasks that are based on
    the oslo-service LoopingCallBase objects. They do this as a way to have
    a task that runs periodically as a greenthread within a child worker
    process. For example, the neutron-server runs AgentStatusCheckWorker()
    objects as base service workers in its child worker processes.

    When the parent server process handles a SIGTERM signal it attempts to
    stop all services launched on each of the child worker processes (i.e.,
    ProcessLauncher.stop()). That results in a stop() being called on each
    of the underlying base services and then a wait() to ensure that they
    complete before shutdown.

    If any service that is implemented on a LoopingCallBase related object
    is suspended on a greenthread.sleep() the previous call to stop() will
    have no effect and so the wait() will block until the sleep() finishes.
    For tasks that either have a frequent FixedLoopingBase interface or a
    short initial_delay this may not be a problem, but for those with a long
    delay this could mean that the wait() blocks for minutes before the
    process is allowed to shutdown.

    To solve this the LoopingCallBase calls to greenthread.sleep() are being
    replaced with a threading.Event() object's wait() method. This allows a
    caller of stop() to interrupt the sleep and expedite the shutdown.

    Closes-Bug: #1660210

    Change-Id: I5835f9595826df5349e4cc8b1da8529bb960ee04
    Signed-off-by: Allain Legacy <email address hidden>
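
The approach the commit message describes, replacing a plain sleep with a wait on a threading.Event, can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged change: the AbortableLoop class and its _abort attribute are hypothetical names, and plain threads are used instead of greenthreads.

```python
import threading
import time


class AbortableLoop:
    """Hypothetical periodic task that idles on an Event instead of sleeping."""

    def __init__(self, func, interval):
        self.func = func
        self.interval = interval
        self._abort = threading.Event()
        self._thread = threading.Thread(target=self._run_loop)

    def start(self):
        self._abort.clear()
        self._thread.start()

    def stop(self):
        # Setting the event wakes a loop that is blocked in wait().
        self._abort.set()

    def wait(self):
        self._thread.join()

    def _run_loop(self):
        while not self._abort.is_set():
            self.func()
            # Event.wait() returns True if the event was set before the
            # timeout expired, and False if the timeout elapsed.
            if self._abort.wait(self.interval):
                break
```

Compared with a Condition, an Event needs no explicit locking in stop(), and a stop() issued before the loop starts idling is still seen because the event stays set.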

Changed in oslo.service:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.service 1.28.1

This issue was fixed in the openstack/oslo.service 1.28.1 release.
