Service children uselessly restart on signals

Bug #1794708 reported by Thomas Herve on 2018-09-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo.service
Undecided
Mohammed Naser

Bug Description

While debugging some service behavior, I found something weird with SIGUP and restart.

If you look here: https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L577

there is a self.launcher.restart() in the children when it received a SIGHUP signal.

But, if you look in the parent: https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L679

When a SIGHUP signal is received, it calls reset on all of them, and then os.kill them.

In practice, it means your service stop, start, and then stop again. It doesn't seem to be the intended behavior.

Herve Beraud (herveberaud) wrote :

If you look at this line https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L668 you can see that a break are realised on SIGHUP so we don't make another shutdown.

The commit 286a6ea8 replace the SIGHUP by a SIGTERM.

$ git show 286a6ea8
commit 286a6ea80dd4419137d8c35c7db6a0f687027418
Author: Marian Horban <email address hidden>
Date: Thu Sep 17 10:54:02 2015 -0400

    Termination children on SIGHUP added

    Reloading of application can be performed by sending HUP signal to
    master process. But it is not useful right now because there should
    be implemented many lines of code to support real reloading for each
    of config options, and of course it is not implemented.

    The easiest way to achieve reloading and applying configuration is
    gracefully stop children and then respawn them.

    When master process receives HUP signal it reloads it own
    configuration, sends TERM signal to children to shutdown them
    gracefully and respawns new children with updated config.

    There is no impact on the user because sockets are still listening
    in master process. So new requests will be put in queue.

    Change-Id: I3e7264a1efcbf66a9afc69d8ed20f600c985c296

diff --git a/oslo_service/service.py b/oslo_service/service.py
index 27cbed9..64b5f21 100644
--- a/oslo_service/service.py
+++ b/oslo_service/service.py
@@ -524,7 +524,7 @@ class ProcessLauncher(object):
                     service.reset()

                 for pid in self.children:
- os.kill(pid, signal.SIGHUP)
+ os.kill(pid, signal.SIGTERM)

                 self.running = True
                 self.sigcaught = None
diff --git a/oslo_service/tests/test_service.py b/oslo_service/tests/test_service.py
index 909a14c..db16297 100644
--- a/oslo_service/tests/test_service.py
+++ b/oslo_service/tests/test_service.py
@@ -215,15 +215,16 @@ class ServiceLauncherTest(ServiceTestBase):
         start_workers = self._spawn()

         os.kill(self.pid, signal.SIGHUP)
+
+ def cond():
+ workers = self._get_workers()
+ return (len(workers) == len(start_workers) and
+ not set(start_workers).intersection(workers))
+
         # Wait at most 5 seconds to respawn a worker
- cond = lambda: start_workers == self._get_workers()
- timeout = 5
+ timeout = 10
         self._wait(cond, timeout)
-
- # Make sure worker pids match
- end_workers = self._get_workers()
- LOG.info('workers: %r' % end_workers)
- self.assertEqual(start_workers, end_workers)
+ self.assertTrue(cond())

 class ServiceRestartTest(ServiceTestBase):

Herve Beraud (herveberaud) wrote :

This comment try to describe the "wait" method main changes timeline concerning this bug.

1 - The original version of the code, it's use SIGTERM only:
https://github.com/openstack/oslo.service/commit/ad08fa70#diff-49ecd82bf041c1fd494a0bc4a96782deR261

2 - Then we manage respawn with a dedicated function and adding SIGHUP for children and keep SIGTERM at the end of the method (SIGHUP and SIGTERM used together):
https://github.com/openstack/oslo.service/commit/1b11dd4b#diff-49ecd82bf041c1fd494a0bc4a96782deR332

3 - Introduce eventlet, keep SIGHUP and SIGTERM together:
https://github.com/openstack/oslo.service/commit/3c389e9b

4 - Manage SIGTERM externaly from the 'wait' method:
https://github.com/openstack/oslo.service/commit/5f29c3637e0a82571af51b7bb98df0df6eccc716#diff-49ecd82bf041c1fd494a0bc4a96782deR410

5 - Prefer to use SIGTERM instead SIGHUP but continue to use SIGTERM by calling the 'stop' method:
https://github.com/openstack/oslo.service/commit/286a6ea8

I hope these elements can help to find the right way to handle these signals and help to find the right behavior to implement.

Herve Beraud (herveberaud) wrote :
Victor Stinner (victor-stinner) wrote :

FYI this bug report comes from https://bugs.launchpad.net/tripleo/+bug/1789680 if I understood correctly.

Fix proposed to branch: master
Review: https://review.openstack.org/641793

Changed in oslo.service:
assignee: nobody → Dan Smith (danms)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/641907

Changed in oslo.service:
assignee: Dan Smith (danms) → Mohammed Naser (mnaser)

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/641793
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers