[SRU] Race condition in SIGTERM signal handler

Bug #1524907 reported by Victor Stinner
This bug affects 2 people
Affects                       Status        Importance  Assigned to
Ubuntu Cloud Archive          Fix Released  Undecided   Unassigned
  Liberty                     Fix Released  High        Edward Hope-Morley
oslo.service                  Fix Released  Undecided   Unassigned
python-oslo.service (Ubuntu)  Fix Released  Undecided   Unassigned
  Wily                        Won't Fix     Undecided   Unassigned
  Xenial                      Fix Released  Undecided   Unassigned
  Yakkety                     Fix Released  Undecided   Unassigned

Bug Description

[Impact]

 * See bug description. We are seeing this in a Liberty production
   environment and (at least) nova-conductor services are failing to
   restart properly.

 * This fix just missed the version of python-oslo.service we have in the
   Liberty UCA, so we are queueing it up for backport.

[Test Case]

 * Start a service that has a high number of workers, check that all
   are up, then do a service stop (or killall -s SIGTERM nova-conductor)
   and check that all workers/processes are stopped.
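
A minimal stand-in for this check (a sketch that uses background sleep
processes in place of real nova-conductor workers; the real verification
must of course be done against the service itself):

```shell
#!/usr/bin/env bash
# Simulate "many workers" with background sleeps, SIGTERM them all,
# then verify every process is gone -- the same before/after count
# check you would run against a real service's workers with pgrep.
sleep 60 & sleep 60 & sleep 60 &
count_before=$(jobs -p | wc -l | tr -d ' ')
kill -TERM $(jobs -p)
wait 2>/dev/null
count_after=$(jobs -rp | wc -l | tr -d ' ')
echo "$count_before $count_after"   # expected: "3 0"
```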

[Regression Potential]

 * none

If the process launcher gets a SIGTERM signal, it calls _sigterm() to
handle it. That function calls the SignalHandler() singleton to get the
SignalHandler instance, and the singleton acquires a lock to ensure
that only one instance exists.

The problem arises when the process launcher gets a second SIGTERM while
the singleton lock (called 'singleton_lock') is held: _sigterm() is
called again (a reentrant call!) and we hit a deadlock. If eventlet
is used, eventlet fails with an assertion error: "Cannot switch to
MAINLOOP from MAINLOOP".
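
A minimal sketch of the failure mode (a simplified stand-in, not
oslo.service code): the handler path takes a plain, non-reentrant lock,
so a second signal re-entering the handler can never acquire it again.
With a blocking acquire that is a deadlock; the sketch uses a
non-blocking acquire so the failure is observable instead of hanging:

```python
import threading

# Stand-in for oslo's 'singleton_lock': a plain, NON-reentrant lock.
singleton_lock = threading.Lock()

def sigterm_handler(depth=0):
    # The real code blocks here; acquire(blocking=False) makes the
    # would-be deadlock visible as a failed acquire instead of a hang.
    got_lock = singleton_lock.acquire(blocking=False)
    try:
        if got_lock and depth == 0:
            # Simulate a second SIGTERM arriving while the lock is held.
            return got_lock, sigterm_handler(depth=1)
        return got_lock
    finally:
        if got_lock:
            singleton_lock.release()

outer, inner = sigterm_handler()
print(outer, inner)  # True False: the reentrant acquire fails
```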

The bug can occur with both the SIGTERM and SIGHUP signals.

I saw this issue with OpenStack services managed by systemd with a wrong configuration: SIGTERM was sent to every process in the cgroup instead of only to the "main" process ("Main PID" in systemd). When the process launcher gets a SIGTERM, it sends a new SIGTERM to each child process, so if systemd has already sent a first SIGTERM to the children, they receive two SIGTERMs in quick succession.

For OpenStack services managed by systemd, the service file must contain "KillMode=process" to only send SIGTERM to the main process ("Main PID").
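
An illustrative unit-file fragment (the path and ExecStart line are
placeholders, not taken from any shipped package):

```ini
# /etc/systemd/system/nova-conductor.service (excerpt, illustrative)
[Service]
ExecStart=/usr/bin/nova-conductor
# Send SIGTERM only to the main process ("Main PID"); the oslo.service
# launcher then forwards the signal to its workers itself, exactly once.
KillMode=process
```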

Revision history for this message
Victor Stinner (vstinner) wrote :

I have a fix, I will send it tomorrow.

Revision history for this message
Victor Stinner (vstinner) wrote :

The bug is a regression introduced by the commit e3983f23b5ce8a0f606c0b62e279acec6634f208:
---
commit e3983f23b5ce8a0f606c0b62e279acec6634f208
Author: Marian Horban <email address hidden>
Date: Wed Jul 22 20:07:48 2015 -0400

    Added class SignalHandler

    Singleton class Signal Handler is responsible for interacting
    with signal module: allow to have several handlers assigned
    to distinct signal. Added correct handling SIGINT signal.

    Closes-Bug: #1477102
    Change-Id: I4bc4c98645e9afeb18635576d50a442fb7ce8425

---

The first release including this change is oslo.service 0.6.0.

To be clear: OpenStack Kilo is not impacted (oslo.service was not used in Kilo); only OpenStack Liberty and Mitaka are.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Changed in oslo.service:
status: New → Fix Committed
Revision history for this message
Victor Stinner (vstinner) wrote :

I forgot to give an example of traceback:

[req-4c7d45e7-8220-42cd-9ecd-903566198453 - - - - -] Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 377, in _child_wait_for_exit_or_signal
    launcher.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 204, in wait
    self.services.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 625, in wait
    service.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 591, in wait
    self._done.wait()
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
    self.wait(sleep_time)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
    presult = self.do_poll(seconds)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
    return self.poll.poll(seconds)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
    handler(signo, frame)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 355, in _sigterm
    SignalHandler().clear()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 116, in __call__
    with lockutils.lock('singleton_lock', semaphores=cls._semaphores):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 195, in lock
    int_lock = internal_lock(name, semaphores=semaphores)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 160, in internal_lock
    return semaphores.get(name)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 109, in get
    sem = threading.Semaphore()
  File "/usr/lib64/python2.7/threading.py", line 423, in Semaphore
    return _Semaphore(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 439, in __init__
    self.__cond = Condition(Lock())
  File "/usr/lib64/python2.7/threading.py", line 252, in Condition
    return _Condition(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 260, in __init__
    _Verbose.__init__(self, verbose)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
    handler(signo, frame)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 355, in _sigterm
    SignalHandler().clear()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 116, in __call__
    with lockutils.lock('singleton_lock', semaphores=cls._semaphores):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 195, in lock
    int_lock = internal_lock(name, semaphores=semaphores)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/locku...


Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/oslo.service 1.2.0

This issue was fixed in the openstack/oslo.service 1.2.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/oslo.service 0.9.1

This issue was fixed in the openstack/oslo.service 0.9.1 release.

Changed in oslo.service:
status: Fix Committed → Fix Released
Changed in cloud-archive:
status: New → Fix Released
Louis Bouchard (louis)
Changed in python-oslo.service (Ubuntu Wily):
status: New → Won't Fix
Changed in python-oslo.service (Ubuntu Xenial):
status: New → Fix Released
Changed in python-oslo.service (Ubuntu Yakkety):
status: New → Fix Released
description: updated
summary: - Race condition in SIGTERM signal handler
+ [SRU] Race condition in SIGTERM signal handler
tags: added: sts sts-sru
Revision history for this message
Edward Hope-Morley (hopem) wrote :
Revision history for this message
James Page (james-page) wrote : Please test proposed package

Hello Victor, or anyone else affected,

Accepted python-oslo.service into liberty-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:liberty-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-liberty-needed to verification-liberty-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-liberty-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-liberty-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just tested as described and lgtm.

tags: added: verification-liberty-done
removed: verification-liberty-needed
Revision history for this message
Ryan Beisner (1chb1n) wrote : Update Released

The verification of the Stable Release Update for python-oslo.service has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

This bug was fixed in the package python-oslo.service - 0.9.0-1~cloud1
---------------

 python-oslo.service (0.9.0-1~cloud1) trusty-liberty; urgency=medium
 .
   * Backport fix for race condition in SIGTERM signal handler (LP: #1524907)
     - d/p/fix-race-condition-on-handling-signals.patch
   * Also including unit test monkeypath fix
     - d/p/doing-monkey_patch-for-unittest.patch

Louis Bouchard (louis)
tags: removed: sts-sru
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.service 0.9.1

This issue was fixed in the openstack/oslo.service 0.9.1 release.
