Bug #1794708 “Service children uselessly restart on signals” : Bugs : oslo.service

Revision history for this message

Herve Beraud (herveberaud) wrote on 2018-09-27:

#1

If you look at this line https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L668 you can see that a break are realised on SIGHUP so we don't make another shutdown.

The commit 286a6ea8 replace the SIGHUP by a SIGTERM.

$ git show 286a6ea8
commit 286a6ea80dd4419137d8c35c7db6a0f687027418
Author: Marian Horban <email address hidden>
Date: Thu Sep 17 10:54:02 2015 -0400

Termination children on SIGHUP added

    Reloading of application can be performed by sending HUP signal to
    master process. But it is not useful right now because there should
    be implemented many lines of code to support real reloading for each
    of config options, and of course it is not implemented.

The easiest way to achieve reloading and applying configuration is
gracefully stop children and then respawn them.

    When master process receives HUP signal it reloads it own
    configuration, sends TERM signal to children to shutdown them
    gracefully and respawns new children with updated config.

There is no impact on the user because sockets are still listening
in master process. So new requests will be put in queue.

Change-Id: I3e7264a1efcbf66a9afc69d8ed20f600c985c296

diff --git a/oslo_service/service.py b/oslo_service/service.py
index 27cbed9..64b5f21 100644
--- a/oslo_service/service.py
+++ b/oslo_service/service.py
@@ -524,7 +524,7 @@ class ProcessLauncher(object):
service.reset()

for pid in self.children:
- os.kill(pid, signal.SIGHUP)
+ os.kill(pid, signal.SIGTERM)

                 self.running = True
                 self.sigcaught = None
diff --git a/oslo_service/tests/test_service.py b/oslo_service/tests/test_service.py
index 909a14c..db16297 100644
--- a/oslo_service/tests/test_service.py
+++ b/oslo_service/tests/test_service.py
@@ -215,15 +215,16 @@ class ServiceLauncherTest(ServiceTestBase):
         start_workers = self._spawn()

         os.kill(self.pid, signal.SIGHUP)
+
+ def cond():
+ workers = self._get_workers()
+ return (len(workers) == len(start_workers) and
+ not set(start_workers).intersection(workers))
+
         # Wait at most 5 seconds to respawn a worker
- cond = lambda: start_workers == self._get_workers()
- timeout = 5
+ timeout = 10
         self._wait(cond, timeout)
-
- # Make sure worker pids match
- end_workers = self._get_workers()
- LOG.info('workers: %r' % end_workers)
- self.assertEqual(start_workers, end_workers)
+ self.assertTrue(cond())

class ServiceRestartTest(ServiceTestBase):

If you look at this line https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L668 you can see that a break are realised on SIGHUP so we don't make another shutdown.

The commit 286a6ea8 replace the SIGHUP by a SIGTERM.

$ git show 286a6ea8
commit 286a6ea80dd4419137d8c35c7db6a0f687027418
Author: Marian Horban <mhorban@mirantis.com>
Date:   Thu Sep 17 10:54:02 2015 -0400

Termination children on SIGHUP added
    
    Reloading of application can be performed by sending HUP signal to
    master process. But it is not useful right now because there should
    be implemented many lines of code to support real reloading for each
    of config options, and of course it is not implemented.
    
    The easiest way to achieve reloading and applying configuration is
    gracefully stop children and then respawn them.
    
    When master process receives HUP signal it reloads it own
    configuration, sends TERM signal to children to shutdown them
    gracefully and respawns new children with updated config.
    
    There is no impact on the user because sockets are still listening
    in master process. So new requests will be put in queue.
    
    Change-Id: I3e7264a1efcbf66a9afc69d8ed20f600c985c296

diff --git a/oslo_service/service.py b/oslo_service/service.py
index 27cbed9..64b5f21 100644
--- a/oslo_service/service.py
+++ b/oslo_service/service.py
@@ -524,7 +524,7 @@ class ProcessLauncher(object):
                     service.reset()
 
                 for pid in self.children:
-                    os.kill(pid, signal.SIGHUP)
+                    os.kill(pid, signal.SIGTERM)
 
                 self.running = True
                 self.sigcaught = None
diff --git a/oslo_service/tests/test_service.py b/oslo_service/tests/test_service.py
index 909a14c..db16297 100644
--- a/oslo_service/tests/test_service.py
+++ b/oslo_service/tests/test_service.py
@@ -215,15 +215,16 @@ class ServiceLauncherTest(ServiceTestBase):
         start_workers = self._spawn()
 
         os.kill(self.pid, signal.SIGHUP)
+
+        def cond():
+            workers = self._get_workers()
+            return (len(workers) == len(start_workers) and
+                    not set(start_workers).intersection(workers))
+
         # Wait at most 5 seconds to respawn a worker
-        cond = lambda: start_workers == self._get_workers()
-        timeout = 5
+        timeout = 10
         self._wait(cond, timeout)
-
-        # Make sure worker pids match
-        end_workers = self._get_workers()
-        LOG.info('workers: %r' % end_workers)
-        self.assertEqual(start_workers, end_workers)
+        self.assertTrue(cond())
 
 
 class ServiceRestartTest(ServiceTestBase):

Revision history for this message

Herve Beraud (herveberaud) wrote on 2018-09-27:

#2

This comment try to describe the "wait" method main changes timeline concerning this bug.

1 - The original version of the code, it's use SIGTERM only:
https://github.com/openstack/oslo.service/commit/ad08fa70#diff-49ecd82bf041c1fd494a0bc4a96782deR261

2 - Then we manage respawn with a dedicated function and adding SIGHUP for children and keep SIGTERM at the end of the method (SIGHUP and SIGTERM used together):
https://github.com/openstack/oslo.service/commit/1b11dd4b#diff-49ecd82bf041c1fd494a0bc4a96782deR332

3 - Introduce eventlet, keep SIGHUP and SIGTERM together:
https://github.com/openstack/oslo.service/commit/3c389e9b

4 - Manage SIGTERM externaly from the 'wait' method:
https://github.com/openstack/oslo.service/commit/5f29c3637e0a82571af51b7bb98df0df6eccc716#diff-49ecd82bf041c1fd494a0bc4a96782deR410

5 - Prefer to use SIGTERM instead SIGHUP but continue to use SIGTERM by calling the 'stop' method:
https://github.com/openstack/oslo.service/commit/286a6ea8

I hope these elements can help to find the right way to handle these signals and help to find the right behavior to implement.

Revision history for this message

Herve Beraud (herveberaud) wrote on 2018-09-27:

#3

Possibly linked to this problem https://bugs.launchpad.net/oslo.service/+bug/1793830 ?

Revision history for this message

Victor Stinner (vstinner) wrote on 2018-09-28:

#4

FYI this bug report comes from https://bugs.launchpad.net/tripleo/+bug/1789680 if I understood correctly.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-07: Fix proposed to oslo.service (master)

#5

Fix proposed to branch: master
Review: https://review.openstack.org/641793

Changed in oslo.service:
assignee:	nobody → Dan Smith (danms)
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-08:

#6

Fix proposed to branch: master
Review: https://review.openstack.org/641907

Changed in oslo.service:
assignee:	Dan Smith (danms) → Mohammed Naser (mnaser)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-08: Change abandoned on oslo.service (master)

#7

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/641793
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.

OpenStack Infra (hudson-openstack) on 2019-07-16

Changed in oslo.service:
assignee:	Mohammed Naser (mnaser) → Ben Nemec (bnemec)

OpenStack Infra (hudson-openstack) on 2019-08-29

Changed in oslo.service:
assignee:	Ben Nemec (bnemec) → Eric Fried (efried)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-30: Fix merged to oslo.service (master)

#8

Reviewed: https://review.opendev.org/641907
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=e7dd2916893157854ca0fa5f4215d536151abbb3
Submitter: Zuul
Branch: master

commit e7dd2916893157854ca0fa5f4215d536151abbb3
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:09:49 2019 -0500

restart: don't stop process on sighup when mutating

It seems that the code for handling SIGHUP currently calls stop()
on the service, then calls reset(), then calls start() on it again.

    This is effectively a full service restart, which breaks the whole
    point behind using SIGHUP for hot and quick reloads. It also breaks
    our downstream projects in a few ways where they lose RPC on reload
    due to the fact that they don't expect to have stop() called on a
    reset().

    This patch removes the stop and start when the restart_method is
    set to 'mutate' because in that case we should just be signaling
    the service to check for changes in its mutable config options.
    It also changes the signal sent to children in that case to
    SIGHUP, since SIGTERM will cause unnecessary restarts of child
    processes.

    The previous behavior is maintained for the 'reload' restart_method
    since that does a complete reload of the service config, which is
    not safe to do without restarting the service completely.

    Change-Id: I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
    Closes-Bug: #1794708
    Co-Authored-By: Ben Nemec <email address hidden>

Changed in oslo.service:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-30: Related fix proposed to oslo.service (master)

#9

Related fix proposed to branch: master
Review: https://review.opendev.org/679465

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-03: Fix included in openstack/oslo.service 1.40.1

#10

This issue was fixed in the openstack/oslo.service 1.40.1 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-05: Fix proposed to oslo.service (stable/stein)

#11

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/680505

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-06: Related fix merged to oslo.service (master)

#12

Reviewed: https://review.opendev.org/679465
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=a7621c8c46378fb456f306dcf5f8cc7836b110d2
Submitter: Zuul
Branch: master

commit a7621c8c46378fb456f306dcf5f8cc7836b110d2
Author: Eric Fried <email address hidden>
Date: Fri Aug 30 09:53:05 2019 -0500

Reno for SIGHUP fix

Adds a release note for the fix at
I86a34c22d41d87a9cce2d4ac6d95562d05823ecf

Change-Id: I4682950ac12f763737489c510246d54aed80b80f
Related-Bug: #1794708

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-23: Fix merged to oslo.service (stable/stein)

#13

Reviewed: https://review.opendev.org/680505
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=abfec6a48a01c80b861ebe7e151b8df23ff6b398
Submitter: Zuul
Branch: stable/stein

commit abfec6a48a01c80b861ebe7e151b8df23ff6b398
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:09:49 2019 -0500

restart: don't stop process on sighup when mutating

It seems that the code for handling SIGHUP currently calls stop()
on the service, then calls reset(), then calls start() on it again.

    This is effectively a full service restart, which breaks the whole
    point behind using SIGHUP for hot and quick reloads. It also breaks
    our downstream projects in a few ways where they lose RPC on reload
    due to the fact that they don't expect to have stop() called on a
    reset().

    This patch removes the stop and start when the restart_method is
    set to 'mutate' because in that case we should just be signaling
    the service to check for changes in its mutable config options.
    It also changes the signal sent to children in that case to
    SIGHUP, since SIGTERM will cause unnecessary restarts of child
    processes.

    The previous behavior is maintained for the 'reload' restart_method
    since that does a complete reload of the service config, which is
    not safe to do without restarting the service completely.

    Change-Id: I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
    Closes-Bug: #1794708
    Co-Authored-By: Ben Nemec <email address hidden>
    (cherry picked from commit e7dd2916893157854ca0fa5f4215d536151abbb3)

tags:

added: in-stable-stein

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-01-09: Fix included in openstack/oslo.service 1.38.1

#14

This issue was fixed in the openstack/oslo.service 1.38.1 release.

oslo.service

Service children uselessly restart on signals

Bug Description

Other bug subscribers

Remote bug watches