[SRU] Race condition in SIGTERM signal handler
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned |
Liberty | Fix Released | High | Edward Hope-Morley |
oslo.service | Fix Released | Undecided | Unassigned |
python-oslo.service (Ubuntu) | Fix Released | Undecided | Unassigned |
Wily | Won't Fix | Undecided | Unassigned |
Xenial | Fix Released | Undecided | Unassigned |
Yakkety | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* See the original bug description below. We are seeing this in a Liberty
production environment, where (at least) nova-conductor services are failing
to restart properly.
* This fix just missed the version of python-oslo.service we have in the
Liberty UCA, so it is being queued up for backport.
[Test Case]
* Start a service that has a high number of workers and check that all
are up. Then stop the service (or run killall -s SIGTERM nova-conductor)
and check that all workers/processes are stopped.
[Regression Potential]
* none
If the process launcher gets a SIGTERM signal, it calls _sigterm() to
handle it. This function calls the SignalHandler() singleton to get the
SignalHandler instance. The singleton acquires a lock to ensure
that only one instance is ever created.
The problem arises when the process launcher gets a second SIGTERM while
the singleton lock (called 'singleton_lock') is held. _sigterm() is
called again (a reentrant call!), and this second call waits for a lock
that the interrupted first call cannot release until the handler returns,
so we enter a deadlock. If eventlet is used, eventlet fails on an
assertion error: "Cannot switch to MAINLOOP from MAINLOOP".
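The reentrant call described above can be reproduced in a minimal sketch. This is not the oslo.service code: the second signal is simulated by calling the handler directly from inside the locked region, the `simulate_second_signal` parameter and the `log` list are hypothetical, and a timeout is used on the lock so that the sketch reports the problem instead of hanging forever.

```python
import threading

# Stand-in for the non-reentrant lock guarding the singleton
# (named 'singleton_lock' in the report).
singleton_lock = threading.Lock()
log = []


def _sigterm(simulate_second_signal=False):
    """Hypothetical handler: takes the singleton lock, as described above."""
    # Assumption: a timeout replaces the blocking acquire so the sketch
    # terminates. With a plain acquire(), the nested call would never return.
    if not singleton_lock.acquire(timeout=0.1):
        log.append("would deadlock")
        return
    try:
        log.append("lock acquired")
        if simulate_second_signal:
            # A second SIGTERM arrives while the lock is held: the handler
            # is re-entered on the same thread, on top of the first call.
            _sigterm()
    finally:
        singleton_lock.release()


_sigterm(simulate_second_signal=True)
print(log)
```

The nested call cannot get the lock because the outer call, which holds it, is suspended underneath it on the same stack. A reentrant lock (threading.RLock) would let the nested call through, but whether that is the right fix depends on what the handler does while holding the lock.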
The bug can occur with both SIGTERM and SIGHUP signals.
I saw this issue with OpenStack services managed by systemd with a wrong configuration: SIGTERM is sent to all processes in the cgroup, instead of only to the "main" process ("Main PID" in systemd). When the process launcher gets a SIGTERM, it sends a new SIGTERM signal to each child process. If systemd has already sent a first SIGTERM to the child processes, they receive two SIGTERM signals in quick succession.
For OpenStack services managed by systemd, the service file must contain "KillMode=process" to only send SIGTERM to the main process ("Main PID").
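To make the double delivery concrete, here is a small counting model of the two systemd settings. It is only a sketch under simplifying assumptions: no real signals are sent, the function name `sigterms_received` and the process names are hypothetical, and only the "control-group" and "process" values of KillMode are modelled.

```python
def sigterms_received(kill_mode, n_children=3):
    """Count the SIGTERMs each process sees in a simplified model."""
    counts = {"main": 0}
    for i in range(n_children):
        counts[f"child{i}"] = 0

    # Step 1: systemd delivers SIGTERM according to KillMode.
    if kill_mode == "control-group":
        targets = list(counts)      # every process in the cgroup
    elif kill_mode == "process":
        targets = ["main"]          # only the "Main PID"
    else:
        raise ValueError(f"kill mode not modelled: {kill_mode}")
    for name in targets:
        counts[name] += 1

    # Step 2: the process launcher forwards a SIGTERM to each child.
    for name in counts:
        if name != "main":
            counts[name] += 1
    return counts


print(sigterms_received("control-group", n_children=2))
print(sigterms_received("process", n_children=2))
```

With "control-group" every child ends up with two SIGTERMs, which is the situation that triggers the reentrant handler call. With "process" each child receives only the one forwarded by the launcher.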
Changed in oslo.service:
status: Fix Committed → Fix Released
Changed in cloud-archive:
status: New → Fix Released
Changed in python-oslo.service (Ubuntu Wily):
status: New → Won't Fix
Changed in python-oslo.service (Ubuntu Xenial):
status: New → Fix Released
Changed in python-oslo.service (Ubuntu Yakkety):
status: New → Fix Released
description: updated
summary:
- Race condition in SIGTERM signal handler
+ [SRU] Race condition in SIGTERM signal handler
tags: added: sts sts-sru
tags: removed: sts-sru
I have a fix, I will send it tomorrow.