Service children uselessly restart on signals

Bug #1794708 reported by Thomas Herve on 2018-09-27
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo.service
Undecided
Eric Fried

Bug Description

While debugging some service behavior, I found something weird with SIGUP and restart.

If you look here: https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L577

there is a self.launcher.restart() in the children when it received a SIGHUP signal.

But, if you look in the parent: https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L679

When a SIGHUP signal is received, it calls reset on all of them, and then os.kill them.

In practice, it means your service stop, start, and then stop again. It doesn't seem to be the intended behavior.

Herve Beraud (herveberaud) wrote :

If you look at this line https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L668 you can see that a break are realised on SIGHUP so we don't make another shutdown.

The commit 286a6ea8 replace the SIGHUP by a SIGTERM.

$ git show 286a6ea8
commit 286a6ea80dd4419137d8c35c7db6a0f687027418
Author: Marian Horban <email address hidden>
Date: Thu Sep 17 10:54:02 2015 -0400

    Termination children on SIGHUP added

    Reloading of application can be performed by sending HUP signal to
    master process. But it is not useful right now because there should
    be implemented many lines of code to support real reloading for each
    of config options, and of course it is not implemented.

    The easiest way to achieve reloading and applying configuration is
    gracefully stop children and then respawn them.

    When master process receives HUP signal it reloads it own
    configuration, sends TERM signal to children to shutdown them
    gracefully and respawns new children with updated config.

    There is no impact on the user because sockets are still listening
    in master process. So new requests will be put in queue.

    Change-Id: I3e7264a1efcbf66a9afc69d8ed20f600c985c296

diff --git a/oslo_service/service.py b/oslo_service/service.py
index 27cbed9..64b5f21 100644
--- a/oslo_service/service.py
+++ b/oslo_service/service.py
@@ -524,7 +524,7 @@ class ProcessLauncher(object):
                     service.reset()

                 for pid in self.children:
- os.kill(pid, signal.SIGHUP)
+ os.kill(pid, signal.SIGTERM)

                 self.running = True
                 self.sigcaught = None
diff --git a/oslo_service/tests/test_service.py b/oslo_service/tests/test_service.py
index 909a14c..db16297 100644
--- a/oslo_service/tests/test_service.py
+++ b/oslo_service/tests/test_service.py
@@ -215,15 +215,16 @@ class ServiceLauncherTest(ServiceTestBase):
         start_workers = self._spawn()

         os.kill(self.pid, signal.SIGHUP)
+
+ def cond():
+ workers = self._get_workers()
+ return (len(workers) == len(start_workers) and
+ not set(start_workers).intersection(workers))
+
         # Wait at most 5 seconds to respawn a worker
- cond = lambda: start_workers == self._get_workers()
- timeout = 5
+ timeout = 10
         self._wait(cond, timeout)
-
- # Make sure worker pids match
- end_workers = self._get_workers()
- LOG.info('workers: %r' % end_workers)
- self.assertEqual(start_workers, end_workers)
+ self.assertTrue(cond())

 class ServiceRestartTest(ServiceTestBase):

Herve Beraud (herveberaud) wrote :

This comment try to describe the "wait" method main changes timeline concerning this bug.

1 - The original version of the code, it's use SIGTERM only:
https://github.com/openstack/oslo.service/commit/ad08fa70#diff-49ecd82bf041c1fd494a0bc4a96782deR261

2 - Then we manage respawn with a dedicated function and adding SIGHUP for children and keep SIGTERM at the end of the method (SIGHUP and SIGTERM used together):
https://github.com/openstack/oslo.service/commit/1b11dd4b#diff-49ecd82bf041c1fd494a0bc4a96782deR332

3 - Introduce eventlet, keep SIGHUP and SIGTERM together:
https://github.com/openstack/oslo.service/commit/3c389e9b

4 - Manage SIGTERM externaly from the 'wait' method:
https://github.com/openstack/oslo.service/commit/5f29c3637e0a82571af51b7bb98df0df6eccc716#diff-49ecd82bf041c1fd494a0bc4a96782deR410

5 - Prefer to use SIGTERM instead SIGHUP but continue to use SIGTERM by calling the 'stop' method:
https://github.com/openstack/oslo.service/commit/286a6ea8

I hope these elements can help to find the right way to handle these signals and help to find the right behavior to implement.

Herve Beraud (herveberaud) wrote :
Victor Stinner (victor-stinner) wrote :

FYI this bug report comes from https://bugs.launchpad.net/tripleo/+bug/1789680 if I understood correctly.

Fix proposed to branch: master
Review: https://review.openstack.org/641793

Changed in oslo.service:
assignee: nobody → Dan Smith (danms)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/641907

Changed in oslo.service:
assignee: Dan Smith (danms) → Mohammed Naser (mnaser)

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/641793
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.

Changed in oslo.service:
assignee: Mohammed Naser (mnaser) → Ben Nemec (bnemec)
Changed in oslo.service:
assignee: Ben Nemec (bnemec) → Eric Fried (efried)

Reviewed: https://review.opendev.org/641907
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=e7dd2916893157854ca0fa5f4215d536151abbb3
Submitter: Zuul
Branch: master

commit e7dd2916893157854ca0fa5f4215d536151abbb3
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:09:49 2019 -0500

    restart: don't stop process on sighup when mutating

    It seems that the code for handling SIGHUP currently calls stop()
    on the service, then calls reset(), then calls start() on it again.

    This is effectively a full service restart, which breaks the whole
    point behind using SIGHUP for hot and quick reloads. It also breaks
    our downstream projects in a few ways where they lose RPC on reload
    due to the fact that they don't expect to have stop() called on a
    reset().

    This patch removes the stop and start when the restart_method is
    set to 'mutate' because in that case we should just be signaling
    the service to check for changes in its mutable config options.
    It also changes the signal sent to children in that case to
    SIGHUP, since SIGTERM will cause unnecessary restarts of child
    processes.

    The previous behavior is maintained for the 'reload' restart_method
    since that does a complete reload of the service config, which is
    not safe to do without restarting the service completely.

    Change-Id: I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
    Closes-Bug: #1794708
    Co-Authored-By: Ben Nemec <email address hidden>

Changed in oslo.service:
status: In Progress → Fix Released

This issue was fixed in the openstack/oslo.service 1.40.1 release.

Reviewed: https://review.opendev.org/679465
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=a7621c8c46378fb456f306dcf5f8cc7836b110d2
Submitter: Zuul
Branch: master

commit a7621c8c46378fb456f306dcf5f8cc7836b110d2
Author: Eric Fried <email address hidden>
Date: Fri Aug 30 09:53:05 2019 -0500

    Reno for SIGHUP fix

    Adds a release note for the fix at
    I86a34c22d41d87a9cce2d4ac6d95562d05823ecf

    Change-Id: I4682950ac12f763737489c510246d54aed80b80f
    Related-Bug: #1794708

Reviewed: https://review.opendev.org/680505
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=abfec6a48a01c80b861ebe7e151b8df23ff6b398
Submitter: Zuul
Branch: stable/stein

commit abfec6a48a01c80b861ebe7e151b8df23ff6b398
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:09:49 2019 -0500

    restart: don't stop process on sighup when mutating

    It seems that the code for handling SIGHUP currently calls stop()
    on the service, then calls reset(), then calls start() on it again.

    This is effectively a full service restart, which breaks the whole
    point behind using SIGHUP for hot and quick reloads. It also breaks
    our downstream projects in a few ways where they lose RPC on reload
    due to the fact that they don't expect to have stop() called on a
    reset().

    This patch removes the stop and start when the restart_method is
    set to 'mutate' because in that case we should just be signaling
    the service to check for changes in its mutable config options.
    It also changes the signal sent to children in that case to
    SIGHUP, since SIGTERM will cause unnecessary restarts of child
    processes.

    The previous behavior is maintained for the 'reload' restart_method
    since that does a complete reload of the service config, which is
    not safe to do without restarting the service completely.

    Change-Id: I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
    Closes-Bug: #1794708
    Co-Authored-By: Ben Nemec <email address hidden>
    (cherry picked from commit e7dd2916893157854ca0fa5f4215d536151abbb3)

tags: added: in-stable-stein

This issue was fixed in the openstack/oslo.service 1.38.1 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers