[SRU] Race condition in SIGTERM signal handler

Bug #1524907 reported by Victor Stinner
This bug affects 2 people
Affects                       Status        Importance  Assigned to          Milestone
Ubuntu Cloud Archive          Fix Released  Undecided   Unassigned
  Liberty                     Fix Released  High        Edward Hope-Morley
oslo.service                  Fix Released  Undecided   Unassigned
python-oslo.service (Ubuntu)  Fix Released  Undecided   Unassigned
  Wily                        Won't Fix     Undecided   Unassigned
  Xenial                      Fix Released  Undecided   Unassigned
  Yakkety                     Fix Released  Undecided   Unassigned

Bug Description

[Impact]

 * See bug description. We are seeing this in a Liberty production
   environment and (at least) nova-conductor services are failing to
   restart properly.

 * This fix just missed the version of python-oslo.service we have in the
   Liberty UCA, so it is being queued up for backport.

[Test Case]

 * Start a service that has a high number of workers, check that all
   workers are up, then do a service stop (or killall -s SIGTERM
   nova-conductor) and check that all workers/processes are stopped, for
   example:
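
   A possible sequence of commands, assuming a systemd-managed nova-conductor
   (adjust the service-management commands for your init system):

     sudo systemctl start nova-conductor      # start the service
     pgrep -cf nova-conductor                 # count running workers/processes
     sudo killall -s SIGTERM nova-conductor   # or: sudo systemctl stop nova-conductor
     pgrep -cf nova-conductor                 # should drop to 0 once all workers exit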

[Regression Potential]

 * none

If the process launcher gets a SIGTERM signal, it calls _sigterm() to
handle it. This function calls SignalHandler() to get the singleton
instance of SignalHandler, and the singleton acquires a lock (called
'singleton_lock') to ensure that only one instance exists.

The problem arises when the process launcher gets a second SIGTERM while
'singleton_lock' is held: _sigterm() is called again (a reentrant call!)
and deadlocks on the lock. If eventlet is used, eventlet instead fails
with an assertion error: "Cannot switch to MAINLOOP from MAINLOOP".

The bug can occur with both SIGTERM and SIGHUP signals.
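
A minimal, self-contained sketch of the reentrancy deadlock under plain
CPython 3 (no eventlet). The names, timings and helper functions below are
hypothetical and only stand in for SignalHandler() and 'singleton_lock'; the
exact behaviour can depend on signal-delivery timing:

    import os
    import signal
    import threading
    import time

    _singleton_lock = threading.Lock()    # non-reentrant, like 'singleton_lock'

    def get_signal_handler():
        # Stands in for SignalHandler(): acquire the singleton lock, then return.
        with _singleton_lock:
            time.sleep(2)                  # widen the race window for the demo
            return object()

    def _sigterm(signo, frame):
        print("SIGTERM received, entering handler")
        get_signal_handler()
        print("handler finished")          # never printed once the handler is re-entered

    def _send_second_sigterm():
        # Deliver a second SIGTERM to the main thread while it still holds the lock.
        time.sleep(0.5)
        signal.pthread_kill(threading.main_thread().ident, signal.SIGTERM)

    if __name__ == "__main__":
        signal.signal(signal.SIGTERM, _sigterm)
        threading.Thread(target=_send_second_sigterm, daemon=True).start()
        os.kill(os.getpid(), signal.SIGTERM)   # first SIGTERM
        # The second SIGTERM re-enters _sigterm() inside time.sleep() and blocks
        # on _singleton_lock, which the same thread already holds: deadlock.
        print("never reached; stop the demo with SIGKILL")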

I saw this issue with OpenStack services managed by systemd with a misconfiguration: SIGTERM was sent to all processes in the cgroup, instead of only to the "main" process ("Main PID" in systemd). When the process launcher gets a SIGTERM, it sends a new SIGTERM to each of its child processes. If systemd has already sent a first SIGTERM to the child processes, they now receive two SIGTERMs in quick succession.

For OpenStack services managed by systemd, the service file must contain "KillMode=process" to only send SIGTERM to the main process ("Main PID").
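
For illustration, a fragment of such a unit file; the ExecStart path and the
other settings are placeholders, only the KillMode=process line is the point:

    [Service]
    # ExecStart path is illustrative only
    ExecStart=/usr/bin/nova-conductor
    KillMode=process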

Revision history for this message
Victor Stinner (vstinner) wrote :

I have a fix, I will send it tomorrow.

Revision history for this message
Victor Stinner (vstinner) wrote :

The bug is a regression introduced by the commit e3983f23b5ce8a0f606c0b62e279acec6634f208:
---
commit e3983f23b5ce8a0f606c0b62e279acec6634f208
Author: Marian Horban <email address hidden>
Date: Wed Jul 22 20:07:48 2015 -0400

    Added class SignalHandler

    Singleton class Signal Handler is responsible for interacting
    with signal module: allow to have several handlers assigned
    to distinct signal. Added correct handling SIGINT signal.

    Closes-Bug: #1477102
    Change-Id: I4bc4c98645e9afeb18635576d50a442fb7ce8425

---

The first release including this change is oslo.service 0.6.0.

To be clear: OpenStack Kilo is not impacted (oslo.service was not used in Kilo), only OpenStack Liberty and Mitaka.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Changed in oslo.service:
status: New → Fix Committed
Revision history for this message
Victor Stinner (vstinner) wrote :

I forgot to give an example of traceback:

[req-4c7d45e7-8220-42cd-9ecd-903566198453 - - - - -] Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 377, in _child_wait_for_exit_or_signal
    launcher.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 204, in wait
    self.services.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 625, in wait
    service.wait()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 591, in wait
    self._done.wait()
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
    self.wait(sleep_time)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
    presult = self.do_poll(seconds)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
    return self.poll.poll(seconds)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
    handler(signo, frame)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 355, in _sigterm
    SignalHandler().clear()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 116, in __call__
    with lockutils.lock('singleton_lock', semaphores=cls._semaphores):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 195, in lock
    int_lock = internal_lock(name, semaphores=semaphores)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 160, in internal_lock
    return semaphores.get(name)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 109, in get
    sem = threading.Semaphore()
  File "/usr/lib64/python2.7/threading.py", line 423, in Semaphore
    return _Semaphore(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 439, in __init__
    self.__cond = Condition(Lock())
  File "/usr/lib64/python2.7/threading.py", line 252, in Condition
    return _Condition(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 260, in __init__
    _Verbose.__init__(self, verbose)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
    handler(signo, frame)
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 355, in _sigterm
    SignalHandler().clear()
  File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 116, in __call__
    with lockutils.lock('singleton_lock', semaphores=cls._semaphores):
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 195, in lock
    int_lock = internal_lock(name, semaphores=semaphores)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/locku...


Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/oslo.service 1.2.0

This issue was fixed in the openstack/oslo.service 1.2.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/oslo.service 0.9.1

This issue was fixed in the openstack/oslo.service 0.9.1 release.

Changed in oslo.service:
status: Fix Committed → Fix Released
Changed in cloud-archive:
status: New → Fix Released
Louis Bouchard (louis)
Changed in python-oslo.service (Ubuntu Wily):
status: New → Won't Fix
Changed in python-oslo.service (Ubuntu Xenial):
status: New → Fix Released
Changed in python-oslo.service (Ubuntu Yakkety):
status: New → Fix Released
description: updated
summary: - Race condition in SIGTERM signal handler
+ [SRU] Race condition in SIGTERM signal handler
tags: added: sts sts-sru
Revision history for this message
Edward Hope-Morley (hopem) wrote :
Revision history for this message
James Page (james-page) wrote : Please test proposed package

Hello Victor, or anyone else affected,

Accepted python-oslo.service into liberty-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:liberty-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-liberty-needed to verification-liberty-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-liberty-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-liberty-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just tested as described and lgtm.

tags: added: verification-liberty-done
removed: verification-liberty-needed
Revision history for this message
Ryan Beisner (1chb1n) wrote : Update Released

The verification of the Stable Release Update for python-oslo.service has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

This bug was fixed in the package python-oslo.service - 0.9.0-1~cloud1
---------------

 python-oslo.service (0.9.0-1~cloud1) trusty-liberty; urgency=medium
 .
   * Backport fix for race condition in SIGTERM signal handler (LP: #1524907)
     - d/p/fix-race-condition-on-handling-signals.patch
   * Also including unit test monkeypatch fix
     - d/p/doing-monkey_patch-for-unittest.patch

Louis Bouchard (louis)
tags: removed: sts-sru
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.service 0.9.1

This issue was fixed in the openstack/oslo.service 0.9.1 release.
