osrf_ctl.sh Doesn't check if process is actually running.

Bug #741088 reported by Joseph
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenSRF
Fix Released
Medium
Dan Scott

Bug Description

Version: All livecds

To reproduce place a false pid in any pid file under the folder /openils/var/run/opensrf, such as "open-ils.acq.pid" before running opensrf_ctl.sh, the service won't be started because opensrf_ctl.sh assumes these files to be correct.

Tags: bitesize
Joseph (joehms22)
tags: added: bitesize
Revision history for this message
Joseph (joehms22) wrote :

In this new version PIDs are checked against the ones listed in the file, if they don't match the file is discarded and a new process started.

For Perl and Python, if PID files indicate that they did not shut down properly last time, stop is called and then the processes are restarted.

A new command was added that clears all PIDs from the PID folder, for cleaning up in case something terrible happens.

Revision history for this message
Anoop Atre (anoop-atre) wrote :

Joseph
This is quite useful, thanks! I'd already addressed this in the startup script I'd put together a while back, it should be floating around on the dev/dig mailing list. I might modify the one I use with some improvements you've provided here.

One question I have is that I'd expect a start call to just check if there are pid files and if it finds one respond saying something like "Perl processes is not running but a pid file exists from an incomplete shutdown. You may run this script with clear_pid to clear stray pids if the shutdown was expected". I don't think it's wise to automatically remove the file since it 'might' help in diagnosing the crash.

Also there is a typo, clear_all should be clear_pid or vice-versa.

Cheers

Revision history for this message
Joseph (joehms22) wrote : Re: [Bug 741088] Re: osrf_ctl.sh Doesn't check if process is actually running.

Anoop,
Thanks for that catch. I do like that idea about not removing them, that
should be very helpful.

I wasn't able to find the script in any either of the two subversion
repositories so I hope this was the right launchpad location to submit it
to.

Cheers,
  Joe

--
Public Key: [0xF8462E1593141C16]<http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xF8462E1593141C16>

Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system
which can be mass-produced by unskilled labor.
- NASA

On Wed, Mar 23, 2011 at 3:22 PM, Anoop Atre <email address hidden>wrote:

> Joseph
> This is quite useful, thanks! I'd already addressed this in the startup
> script I'd put together a while back, it should be floating around on the
> dev/dig mailing list. I might modify the one I use with some improvements
> you've provided here.
>
> One question I have is that I'd expect a start call to just check if
> there are pid files and if it finds one respond saying something like
> "Perl processes is not running but a pid file exists from an incomplete
> shutdown. You may run this script with clear_pid to clear stray pids if
> the shutdown was expected". I don't think it's wise to automatically
> remove the file since it 'might' help in diagnosing the crash.
>
> Also there is a typo, clear_all should be clear_pid or vice-versa.
>
> Cheers
>
> ** Attachment added: "osrf_ctl.sh.typo_fix"
>
> https://bugs.launchpad.net/evergreen/+bug/741088/+attachment/1934504/+files/osrf_ctl.sh.typo_fix
>
> --
> You received this bug notification because you are a direct subscriber
> of the bug.
> https://bugs.launchpad.net/bugs/741088
>
> Title:
> osrf_ctl.sh Doesn't check if process is actually running.
>
> Status in Evergreen - Open ILS:
> New
>
> Bug description:
> Version: All livecds
>
> To reproduce place a false pid in any pid file under the folder
> /openils/var/run/opensrf, such as "open-ils.acq.pid" before running
> opensrf_ctl.sh, the service won't be started because opensrf_ctl.sh
> assumes these files to be correct.
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/evergreen/+bug/741088/+subscribe
>

Revision history for this message
Dan Scott (denials) wrote :

Hi Joe:

First, thanks a ton for your effort on this!

For some housekeeping, please read up on our procedures and
conventions for contributing to Evergreen
(http://evergreen-ils.org/dokuwiki/doku.php?id=contributing) - one of
the legal housekeeping parts that we require for submissions is a
"Developer's Certificate of Origin (DCO)" to confirm that the code
you're submitting is indeed yours to submit.

As for the location of the file in the SVN repo, it is a file that is
generated via autoconf, thus you'll find it in the "bin" directory of
the OpenSRF repository with a ".in" extension: see
http://svn.open-ils.org/trac/OpenSRF/browser/trunk/bin

Now that you know where it is, would you be willing to resubmit your
version as a patch, and to attach a DCO?

Many thanks!

Revision history for this message
Joseph (joehms22) wrote :

Dan,

I most certainly am willing to do that, thanks very much for your help today; I'll try to keep from spamming you all too much more in the future. I incorporated Anoop's suggestion and patch, along with adding some output that should have been made during the removal of the files (you can now tell that it removed files, and which they were).

Basic overview of the feature or change:
    Added support for checking if .pid files were actually correct upon starting OpenSRF. These files could have been corrupted, or the processes they linked to already killed by a user or a system shutdown.

An explanation of the existing code – it's structure and purpose – if applicable:
    The existing code starts up OpenSRF, the new code just adds some warnings and checks.

An explanation of why this code needs to change, if applicable:
   The old code needed to be changed because new users (like me) mess things up, this patch should provide some robustness for the software.

Analysis of what this change will effect in the existing code base:
    One file is changed in the opensrf svn repository, bin/osrf_ctl.sh.in

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Signed-off-by: Joseph Lewis <email address hidden>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iQEcBAEBAgAGBQJNioZnAAoJED67L3ZNVUNm7cIH/Rhpwt2HDABXmRcuGAMoaWwc
Stlc9Y+rjugzIxcwxIXux3UsNSHoB4xQm17XwN5RVi5dc2LMydtqyk8XJTNmHDrs
OmDIW8NIPz6vylGZL0yEf06MjYA+bre3TJbZku74s9kgJxk91lEIYrFOJTTWwaoC
0kkWjQEpLC3LkaYpFptUv+/LEmWeT7RWg3iMiQqqfrVAeRt48z3Iye3V6l9B8lwV
gautfnWGrNB4sQcuHZB9SPlRWlZ8KVcFB9UmeVMUrZz6BtWykY8K2UXGrQ0lj+EL
ddOZgSXkgSjSDDxTAyO1Q1EB1+aTkl3DG8n5NtHxATVIzHFEcLiyyHMnhHNVFto=
=MCWf
-----END PGP SIGNATURE-----

Revision history for this message
Dan Scott (denials) wrote :

Thanks again, Joseph. I've looked over your patch, and there are a few bits left to clean up - but it's very close.

1) Please don't change "prefix=@prefix@" to "prefix=/openils"; this breaks autoconf's ability to specify a different --prefix option to configure and have it affect the osrf_ctl.sh script that is produced from osrf_ctl.sh.in. Same for exec_prefix and OSRF_CONFIG.

2) Please don't mix changes to whitespace (removing trailing spaces at the end of lines, or converting tabs to spaces or vice versa) with actual code changes. You're welcome to submit a separate patch with the whitespace changes, ideally after the functional patch. We want to be able to focus on what changed in a given line without having to determine if it was just a whitespace change.

3) A very minor nit - I'd prefer to see PID in all caps in comments and user-visible text.

4) The spacing on the clear_pid action at the bottom of the patch is off; all the other options are prefixed by a tab. Again, minor.

Functionally, the patch works; I've given it a test run with no visible problems. If you can address these minor points, I think we're golden!

Revision history for this message
Joseph (joehms22) wrote :

Thanks again, I think this one should work.

- Joe

Revision history for this message
Dan Scott (denials) wrote :

So, after a bit more testing of the new patch, I ran into a situation that will probably be relatively common; I ran "start_all" before I had dnsmasq running, and therefore the Python services failed to start. After starting up dnsmasq and running "start_python" a second time, I hit the message:

"""
Python processes may not be running but a PID file exists possibly from an
incomplete shutdown. You may run this script with clear_pid to clear stray
PIDs if the shutdown was expected.
"""

Of course, if I followed this advice, I would wipe out the PIDs for _all_ of the running services, and that would confuse a subsequent "stop_all" horribly.

So, a few more thoughts on how to make this new functionality a bit safer:

1) Instead of writing opensrf-perl.id / opensrf-python.pid immediately after "start_perl" or "start_python" is issued, perhaps kick off the Perl or Python processes and check their return values; if you get a bad return value (like Python's in the case of the missing dnsmasq / no "nameserver 127.0.0.1" in /etc/resolv.conf), then don't write the dummy Perl or Python PID.

2) Instead of making clear_pid immediately wipe all PID files, perhaps have it iterate through the PID files, check for a process with a matching PID, and only remove the PID file if there is no matching process?

Sorry for not anticipating this error situation in my previous response. Are you game for another kick at the patch?

Revision history for this message
Joseph (joehms22) wrote :

We don't want to wipe all of the PIDs to fix a certain PID, there is a new function (I left the other as it may be useful sometimes) that "smart" cleans the PID files, by checking if a process with that PID is running or not, if the system went down and restarted, new processes may have been assigned those PIDs, thus making the original clear all useful.

Python and Perl PID files are now only created _when_ (at least one of) their respective processes start correctly.

There is a new constraint on PID files though, all PIDs within the files must be separated by either a space, tab, or newline.

Cheers,
   Joe

Revision history for this message
Dan Scott (denials) wrote :
Download full text (3.6 KiB)

Thanks for the new patch! Unfortunately, I ran into a small technical
problem in testing the new patch on Fedora 14 (see way below).

First, though, a consideration with some fresh eyes on usability:

I started up all of the services, then tried running "start_perl"
again to see what it had to say, and got this output:

[opensrf@dbs ~]$ osrf_ctl.sh -l -a start_perl
Starting OpenSRF Perl
Perl processes may not be running but a PID file exists possibly from an
incomplete shutdown. You may run this script with smart_clear to clear stray
PIDs if the shutdown was expected.
* starting all services for localhost
* opensrf.settings is already running
* open-ils.acq is already running
* open-ils.booking is already running
* open-ils.cat is already running
* open-ils.supercat is already running
* open-ils.search is already running
* open-ils.circ is already running
* open-ils.actor is already running
* open-ils.storage is already running
* open-ils.penalty is already running
* open-ils.collections is already running
* open-ils.ingest is already running
* open-ils.reporter is already running
* open-ils.permacrud is already running
* open-ils.trigger is already running
* open-ils.fielder is already running
* open-ils.vandelay is already running
* open-ils.serial is already running

I'm not 100% sure about the usability of this for others, as (if I
didn't know about the changes to the code), it's not clear what the
"already running" messages mean:

  * It could mean that the start_perl command went ahead, checked the
service to see if the PID file corresponds to a running process, and
would have started the service if it needed to be restarted
  * Or it could mean that you should basically ignore those messages
and go ahead and run "smart_clear" to get rid of the bogus PID files,
and then run "start_perl" again to start the services that had had
bogus files. (Also, note that we're not explicitly telling people to
rerun "start_perl" after running "smart_clear").

Maybe the first result is the better user experience in a DWIMish way
(DWIM = "Do What I Mean"). In which case the messages to the user
might be:

"""
Perl processes are either already running, or were not correctly shut down.
Now clearing any stale PID files and restarting Perl processes.
* opensrf.settings is already running
* starting service pid=10722 open-ils.storage
* removing stale PID file and starting service pid=22422 open-ils.fielder
...
"""

Where:
  * "opensrf.settings" had a PID file and a matching process, so no
need to start it
  * "open-ils.storage" did not have a PID file, so it needed to be started
  * "open-ils.fielder" had a PID file but did not have a matching
process, so the stale PID file needed to be removed and the process
needed to be started

Eh? That seems to be the cleanest, most user-friendly approach, if we
can pull it off. What do you think?

Now, as to the actual technical problem - when I tried running "smart_clear":

"""
[opensrf@dbs ~]$ osrf_ctl.sh -l -a smart_clear
Smart clearing PID files...
/openils/bin/osrf_ctl.sh: line 262: [: too many arguments
Removed /openils/var/run/opensrf/open-ils.serial.pid
/openils/bin/osrf_ctl.sh: line 262: [: too many arguments
R...

Read more...

Revision history for this message
Joseph (joehms22) wrote :
Download full text (5.2 KiB)

I can see how that would be confusing, I do think the better option is
calling smart-clear from within Perl/Python, like what is done in the other
processes.

Oh, I see what is going on there, I think it was matching super-high
processes too :) Perl had a PID of 1111, and process y had a PID of 21111,
I don't think I saw this because the system I tested on never got enough
processes.

Do you want me to fix those and resubmit the patch or do you have one handy?

--
Public Key: [0xF8462E1593141C16]<http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xF8462E1593141C16>

Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system
which can be mass-produced by unskilled labor.
- NASA

On Sat, Mar 26, 2011 at 2:47 PM, Dan Scott <email address hidden> wrote:

> Thanks for the new patch! Unfortunately, I ran into a small technical
> problem in testing the new patch on Fedora 14 (see way below).
>
> First, though, a consideration with some fresh eyes on usability:
>
> I started up all of the services, then tried running "start_perl"
> again to see what it had to say, and got this output:
>
> [opensrf@dbs ~]$ osrf_ctl.sh -l -a start_perl
> Starting OpenSRF Perl
> Perl processes may not be running but a PID file exists possibly from an
> incomplete shutdown. You may run this script with smart_clear to clear
> stray
> PIDs if the shutdown was expected.
> * starting all services for localhost
> * opensrf.settings is already running
> * open-ils.acq is already running
> * open-ils.booking is already running
> * open-ils.cat is already running
> * open-ils.supercat is already running
> * open-ils.search is already running
> * open-ils.circ is already running
> * open-ils.actor is already running
> * open-ils.storage is already running
> * open-ils.penalty is already running
> * open-ils.collections is already running
> * open-ils.ingest is already running
> * open-ils.reporter is already running
> * open-ils.permacrud is already running
> * open-ils.trigger is already running
> * open-ils.fielder is already running
> * open-ils.vandelay is already running
> * open-ils.serial is already running
>
> I'm not 100% sure about the usability of this for others, as (if I
> didn't know about the changes to the code), it's not clear what the
> "already running" messages mean:
>
> * It could mean that the start_perl command went ahead, checked the
> service to see if the PID file corresponds to a running process, and
> would have started the service if it needed to be restarted
> * Or it could mean that you should basically ignore those messages
> and go ahead and run "smart_clear" to get rid of the bogus PID files,
> and then run "start_perl" again to start the services that had had
> bogus files. (Also, note that we're not explicitly telling people to
> rerun "start_perl" after running "smart_clear").
>
> Maybe the first result is the better user experience in a DWIMish way
> (DWIM = "Do What I Mean"). In which case the messages to the user
> might be:
>
> """
> Perl processes are either already running, or were not correctly shut down.
> Now clearing any stale PID files and restarting Perl processes.
> * opensrf.settings is already running
> * start...

Read more...

Revision history for this message
Joseph (joehms22) wrote :
Revision history for this message
Dan Scott (denials) wrote :

This is great, Joseph! It passed all of my tests, and I think it's a nice contribution to the OpenSRF project. I have committed your patch to OpenSRF trunk in r2215 and backported it to the rel_2_0 branch as well.

Changed in evergreen:
assignee: nobody → Dan Scott (denials)
importance: Undecided → Medium
status: New → Fix Committed
affects: evergreen → opensrf
Dan Scott (denials)
Changed in opensrf:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.