Signal handling expansion and repairs

Bug #1204123 reported by Bill Erickson
30
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenSRF
Fix Released
Undecided
Jason Stephenson

Bug Description

This ticket is a combination and extension of bug #1203791, bug #193473, and bug #1195018. These all have to do with repairing or expanding opensrf signal handling (for C and Perl). The code for some of these relies on the code for others and they all pretty much collide with each other. Instead of managing them all separately, I intend to create a single branch responsible for all of the proposed signal handling improvements.

Proposed features:

* Listener receives SIGTERM causes router de-register and graceful child shutdown (bug #1203791)
* Listener receives SIGINT or SIGQUIT causes router de-register and non-graceful child shutdown (bug #1203791)
* Listener receives SIGUSR1 causes router de-registration (bug #1193473)
* Listener receives SIGUSR2 causes router registration
* Listener receives SIGHUP causes reload of max/min/idle child settings from settings server
* General signal handling repairs and consistency (e.g. recovering default handling in child processes, bug #1195018)

Will push a single branch with the existing code soon; new code to follow.

Tags: pullrequest
Revision history for this message
Bill Erickson (berick) wrote :

Code in progress pushed to

working/user/berick/lp1204123-signal-expansion

TODO:

* SIGHUP reloads max/min/idle children settings
* SIGUSR2 causes router (re)register

Revision history for this message
Bill Erickson (berick) wrote :

SIGUSR2 handling pushed for Perl and C.

Also included, in anticipation of HUP expansion, is a new force-reload API parameter for opensrf.settings calls. It tells opensrf.settings to re-read opensrf.xml from the file system instead of using the locally cached file.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a commit to support sending signals to listener processes directly from osrf_ctl.sh. From the commit:

Using the following new options to osrf_ctl.sh, the script can now send signals to any/all OpenSRF Listener processes:

-a signal / signal_all
-k <signal> [value passed to 'kill -s']
-s <service>

With an action of 'signal' and service provided by -s, the specified signal is sent only to the listener process of the service.

With an action of 'signal_all', the specified signal is sent to all listener processes.

Revision history for this message
Bill Erickson (berick) wrote :

While researching C HUP control, I realized that things would be easier of C behaved more like Perl in regard to listener process management. To that end, I've pushed a commit the working branch to support per-service control for OpenSRF C. From the commit:

C opensrf no longer mantains a single master process. Instead, like Perl, each Listener process writes its own PID file and can be managed individually. Related to this, much code was cleaned up in osrf_system.c.

osrf_ctl.sh now has the ability to stop/start individual services for Perl and C, using the new -s <service> option. Examples:

osrf_ctl.sh -l -a restart_c -s opensrf.math
osrf_ctl.sh -l -a restart_perl -s opensrf.settings

Revision history for this message
Mike Rylander (mrylander) wrote :

As a refinement to the above awesomeness, I'd like to propose:

 -a graceful_shutdown -s <service>
 -a graceful_shutdown_perl
 -a graceful_shutdown_c
 -a graceful_shutdown_all

 -a deregister -s <service>
 -a deregister_perl
 -a deregister_c
 -a deregister_all

 -a reregister -s <service>
 -a reregister_perl
 -a reregister_c
 -a reregister_all

as convenience options to "-a" that use "-k <value> -a signal" appropriately. I think not having to remember signal mapping would be a big help in usability.

Revision history for this message
Mike Rylander (mrylander) wrote :

And, I suppose, separate "stop" variants or modes beyond graceful_shutdown are warranted. Perhaps "immediate_shutdown" (aka, SIGKILL), "fast_shutdown" (aka, SIGINT/SIGQUIT) and "graceful_shutdown" (aka, SIGTERM).

If we want the least amount of default behavioral change I suppose the "stop" action should probably map to "immediate_shutdown", but I'd lobby for the other end of the spectrum, personally. Maybe wait 30 seconds on a "graceful_shutdown", another 30 seconds on a "fast_shutdown" if the listener hasn't gone away yet, and follow that with an "immediate_shutdown" if we're still waiting for the listener process to go away.

Thoughts?

Revision history for this message
Galen Charlton (gmc) wrote : Re: [Bug 1204123] Re: Signal handling expansion and repairs

> as convenience options to "-a" that use "-k <value> -a signal"
> appropriately. I think not having to remember signal mapping would be a
> big help in usability.

+1

Revision history for this message
Bill Erickson (berick) wrote :

+1 to all of this. I would also be in favor of having the default "stop" action perform: graceful -> wait -> fast -> wait -> kill_it_with_fire, maybe with a timeout option (defaulting to 30, as suggested).

Revision history for this message
Galen Charlton (gmc) wrote :

> +1 to all of this. I would also be in favor of having the default
> "stop" action perform: graceful -> wait -> fast -> wait ->
> kill_it_with_fire, maybe with a timeout option (defaulting to 30, as
> suggested).

+1

Revision history for this message
Bill Erickson (berick) wrote :

It occurred to me that adding all piles of new features would be much easier to do in Perl than shell scripts. I've pushed a commit to give opensrf-perl.pl all the power of osrf_ctl.sh and then some. We get all of our signal / shutdown controls and no longer have to differentiate between Perl, C, etc. services. Everything is just a service.

There is still lots of work to do to ease the transition away from osrf_ctl (docs, etc.), but I did give osrf_ctl power to continue using stop_all/start_all/restart_all so as not to yank it out from under everyone.

No Python bits have been added to opensrf-perl.pl just yet, but that should be minor.

More testing is needed. See a quick summary here:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=commitdiff;h=e934cf520b7a3bd8711f17564dec83534ac2b4bd

Revision history for this message
Bill Erickson (berick) wrote :

* I've pushed a few more fixes and doc improvements for the perl-based osrf_control script.

* I've pushed SIGHUP handling for opensrf-c. It works the same as Perl, except that as far as logging goes, it only updates the log level, not the log files / destinations.

I'm scratching this one from the list for now => Listener receives SIGHUP causes reload of max/min/idle child settings from settings server. For various reasons, it's more complicated and I'd like to get the code we have out the door before adding that.

The final TODO is updating the OpenSRF README to reflect the move to osrf_control. Working on that now.

Revision history for this message
Bill Erickson (berick) wrote :

Update INSTALL/README and sample opensrf.xml to reference osrf_control instead of osrf_ctl.sh. I think the code is ready for non-me testing.

tags: added: pullrequest
Revision history for this message
Bill Erickson (berick) wrote :

Added a --diagnostic command to osrf_control which lists info about running (vs. configured) processes and warns of various discrepancies.

FWIW, I've been running this branch on my test systems since the beginning. I don't make heavy use of signals (except the default shutdown signals), but otherwise everything seems kosher.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I'm setting a new development VM. Seems like the perfect time to test something like this.

I should be done by the end of next week if anyone is waiting on this.

Changed in opensrf:
status: New → In Progress
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Bill Erickson (berick) wrote :

Thanks Jason!

FYI, just pushed a minor cleanup commit.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Not a complaint or report of a problem, just a point for discussion.

When I have my server configured to require --localhost with osrf_control, and I run a restart command without --loclahost, I get the following:

 osrf_control --restart --service opensrf.settings
* sending TERM signal to pid=7631 opensrf.settings
* opensrf.settings is not configured to run on jasondev.mvlcstaff.org

So, the service stops, but fails to restart of course.

I did the above on purpose to see what would happen, because I'm sure someone will eventually do it by accident. I wonder if we should spend the time to protect the user from his/herself by not stopping a service unless it is configured to start with the current host name or localhost (if --localhost is provided)?

Just a question, and I'm perfectly happy with the answer being no, but thought I'd throw it out there.

Revision history for this message
Bill Erickson (berick) wrote :

I think that's a good idea. To restate, any time a user requests an action on a service, we should first see if that service is configured to run on the requested (intentionally or otherwise) host. If not, we perform no actions and warn the user.

I can add this..

Revision history for this message
Jason Etheridge (phasefx) wrote :

I think protection is a great idea, and just provide a --force option to bypass it. Or maybe do like git config and set the host vs localhost params just once and then forget them.

Revision history for this message
Jason Etheridge (phasefx) wrote :

That makes even more sense hearing it again. Forget this --force stuff.

Revision history for this message
Bill Erickson (berick) wrote :

And... pushed.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Great stuff, Bill! I like the message:

opensrf@jasondev:~/OpenSRF$ osrf_control --restart --service opensrf.math
* No services are configured to run on jasondev.mvlcstaff.org
* Perhaps you meant to use --localhost?

Give me a few days to kick this around some more and I'll be happy to sign off and commit to master, unless someone else beats me to it.

Revision history for this message
Bill Erickson (berick) wrote :

Pushed a fix for makefile warning:

autoreconf -i ==> src/Makefile.am:72: whitespace following trailing backslash

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Checked the last commit to remove the warning this morning. Everything works for me.

Committed to master.

Thanks, Bill!

Changed in opensrf:
status: In Progress → Fix Committed
Galen Charlton (gmc)
Changed in opensrf:
milestone: none → 2.3.0-beta
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.