init: respawn should try again after a while?

Bug #252997 reported by Scott James Remnant (Canonical)
This bug affects 6 people
Affects            Status      Importance   Assigned to   Milestone
upstart            Triaged     Wishlist     Unassigned
upstart (Fedora)   Won't Fix   Medium

Bug Description

Right now, "respawn limit" is a hard limit and will result in the job being stopped if exceeded.

Instead we should have something more of a rate limit, where we keep trying.

Changed in upstart:
importance: Undecided → Wishlist
status: New → Confirmed
Jeff Oliver (jeffrey-oliver) wrote :

Personally, I think upstart should have both. In my environment, if an application/service is acting up (constantly restarting), at some point I'd like to give up on it.

Additionally, I'd like to add a "resting" time between start attempts. This is independent of the number of retries.

For instance:

Job 1 should retry every 10 seconds for 15 attempts, then give up.
and
Job 2 should retry every 30 seconds forever.

I believe the current implementation counts the number of respawns over an interval of time. If the limit is exceeded, the job is considered a runaway and is stopped.
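
A minimal sketch of the two policies described above (a resting time between starts, plus an optional cap on attempts), written in Python for illustration. The class and field names are hypothetical; nothing here is an upstart stanza or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryPolicy:
    """Hypothetical policy: rest `delay` seconds between starts and
    give up after `max_attempts` starts (None means retry forever)."""
    delay: float
    max_attempts: Optional[int] = None

    def next_start(self, attempts_so_far: int, now: float) -> Optional[float]:
        """Return the time of the next start attempt, or None to give up."""
        if self.max_attempts is not None and attempts_so_far >= self.max_attempts:
            return None
        return now + self.delay


# "Job 1 should retry every 10 seconds for 15 attempts, then give up."
job1 = RetryPolicy(delay=10, max_attempts=15)
# "Job 2 should retry every 30 seconds forever."
job2 = RetryPolicy(delay=30)
```

The delay and the attempt cap are kept as separate fields because, as noted above, the resting time is independent of the number of retries.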

Angus Salkeld (asalkeld) wrote :

I am curious about what kind of process / daemon you would do this for.

My thinking is ...

For an important daemon:
-restart it X times within a certain time
-don't count "normal exits" so you can reload configuration.
-if it still fails within this time frame you really need to try something new.
 So we need to be able to catch this as an event in a recovery script.
 Currently I use "start on stopping my-daemon RESULT=failed".

For something like getty:
-respawn forever (no limit). I assume that this is the case you are talking about for the rate limiting.

Jeff: How is the delay going to help? Surely you would want to restart your application ASAP.

If you still need it, can't you do something like this in your job?

start script
  # If a pid file is left over from the previous run, rest before restarting.
  if [ -e /var/run/my.pid ]
  then
    sleep 10
    rm -f /var/run/my.pid
  fi
  exec /usr/sbin/my-daemon
end script

Jeff Oliver (jeffrey-oliver) wrote :

Well, there is a question in the "Answers" tab, where someone has some sort of service that connects to another server. It sounds like they need a resting time between the service completing its transaction and being allowed to start a new connection. They complain that they cannot use the sleep, but perhaps they're putting it in the wrong place. They mention post-stop, you put it in start.

My personal requirement is a little bit off the wall. I'm trying to get upstart running on my embedded Linux box. The services run on this particular box are not ordinary. Furthermore, I have situations where 3rd-party companies install and run other services on this box as well. Everyone is used to how the box operates currently. There is a service watchdog process running on the box at the moment, which I'm hoping upstart can simply replace.

And finally, the programming practices of the various software suppliers for the embedded server I work on are very amateurish. I can't say for sure that everyone knows what pid files are and what they are for. Each application developer is in control of their own job file. Having a stanza in the job file for causing a delay would accommodate poor programming practices in the service itself.

In any case, I can post my changes. Take it or leave it. If it gets merged into the main line, fine, if not, fine. I'm here trying to improve it as best as I can. My usage case is very unique and I'm trying to generalize it as best as I can.

Jeff Oliver (jeffrey-oliver) wrote :

There is one other reason I can think of for why you'd want to either try starting a job at some later time, or have a delay between restarts.

I have a situation where my server startup depends on some other service running on a different piece of hardware. If for some reason my server attempts to start up before the other server is ready, then I need to try again some time later. Starting up again immediately does not make sense because I already know it's not going to respond yet. So instead, I'll wait a few seconds before trying again.

Changed in upstart:
milestone: none → 0.10
Scott James Remnant (Canonical) (canonical-scott) wrote :

Also worth thinking about - the respawn limit shouldn't be a dead end - once <interval> elapses, it should be able to respawn another <limit> times.
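
One way to read this is as a counter that re-arms once the interval has elapsed, rather than stopping the job for good. The Python sketch below illustrates that idea only; the class name and the fixed-window approach are assumptions and do not reflect upstart's actual implementation.

```python
class RespawnWindow:
    """Hypothetical limiter: allow at most `limit` respawns per
    `interval` seconds, then allow another `limit` once the interval ends."""

    def __init__(self, limit: int, interval: float):
        self.limit = limit
        self.interval = interval
        self.window_start = None  # start time of the current window
        self.count = 0            # respawns granted in the current window

    def allow(self, now: float) -> bool:
        """Return True if a respawn at time `now` is permitted."""
        # Open a fresh window on first use or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False  # over the limit for now, but not a dead end
        self.count += 1
        return True
```

With limit=3 and interval=5, three respawns in the first few seconds are allowed, a fourth is refused, and a respawn at second 5 is allowed again.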

summary: - respawn should try again after a while?
+ init: respawn should try again after a while?
Changed in upstart:
milestone: 0.10 → none
status: Confirmed → Triaged
Laurent Glayal (spglegle) wrote :

Same problem, we have processes depending on some other processes on the same or on remote servers. We have N servers controlling/controlled by M servers which are controlled by telco hardware. If one out of M servers goes down (hardware or some processes due to administrative task or telco alarm or anything else) we have to manually restart some processes on M servers.

We ported some daemons from one server/distribution using the old init system to a new server using upstart (0.3.x), assuming plain old Unix init behaviour when respawning daemons. We had to work around upstart's respawn behaviour using scripting in order to emulate the old init behaviour... quite dirty and not responsive....

Generally speaking, we encountered this 'bug', but shouldn't upstart provide all the old init behaviour regarding managed processes (maybe this is the one that is missing, I don't know)?

Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 252997] Re: init: respawn should try again after a while?

On Mon, 2009-07-20 at 08:51 +0000, Laurent Glayal wrote:

> Same problem, we have processes depending on some other processes on the
> same or on remote servers. We have N servers controlling/controlled by M
> servers which are controlled by telco hardware. If one out of M servers
> goes down (hardware or some processes due to administrative task or
> telco alarm or anything else) we have to manually restart some processes
> on M servers.
>
> We ported some daemons from one server/distribution using old init
> system to new server using upstart (0.3.x), assuming plain old unix init
> behaviour when respawning daemons. We had to work around upstart
> respawn behaviour using scripting in order to emulate old init
> behaviour... quite dirty and not responsive....
>
> Generally speaking, we encountered this 'bug', but shouldn't upstart
> provide all old init behaviour regarding managed processes (maybe this
> one is the one missing, don't know) ?
>
Upstart provides the exact same behaviour as the UNIX System V init
daemon used on Linux previously (sysvinit).

In sysvinit, processes that respawn too fast are disabled.

Scott
--
Scott James Remnant
<email address hidden>

Laurent Glayal (spglegle) wrote :

At least on FC2, FC8 and RHES3, it seems processes are disabled for 5 minutes; after that, a new respawn cycle starts.

test:2345:respawn:date >>/tmp/deblg.txt 2>&1

results in

Mon Jul 20 13:22:58 CEST 2009
... 8 more run
Mon Jul 20 13:22:58 CEST 2009
Mon Jul 20 13:27:59 CEST 2009
... 8 more run
Mon Jul 20 13:27:59 CEST 2009
Mon Jul 20 13:33:00 CEST 2009
... 8 more run
Mon Jul 20 13:33:00 CEST 2009
...etc
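
The pattern in this log (a burst of 10 runs, a pause of about 5 minutes, then another burst) can be reproduced with a small simulation. This is a sketch in Python, not sysvinit's code: the constant names and values are assumptions chosen to match the log above, and the job is assumed to exit immediately, as `date` does.

```python
MAX_SPAWN = 10    # assumed: respawns tolerated before the job is disabled
TEST_TIME = 120   # assumed: window (seconds) in which respawns are counted
SLEEP_TIME = 300  # assumed: how long (seconds) the job stays disabled


def simulate(duration: float, run_time: float = 0.0) -> list:
    """Return the start times of a job that exits `run_time` seconds
    after each start, over `duration` seconds of simulated time."""
    starts = []
    t = 0.0
    window_start = 0.0
    count = 0
    while t <= duration:
        # A slow enough respawn rate starts a fresh counting window.
        if t - window_start >= TEST_TIME:
            window_start, count = t, 0
        # Too many respawns in the window: disable, then begin a new cycle.
        if count >= MAX_SPAWN:
            t += SLEEP_TIME
            window_start, count = t, 0
            continue
        starts.append(t)
        count += 1
        t += run_time
    return starts


starts = simulate(duration=600)
```

Under these assumptions the job is not stopped permanently: each disabled period is followed by a new respawn cycle, which is the behaviour the log shows.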

In , Laurent (laurent-redhat-bugs) wrote :

Description of problem:

Upstart does not restart daemons when they terminate too rapidly. Daemons should be respawned after a while.
In previous distribution releases (Fedora/RH), the init process started a new respawn cycle every 5 minutes. Upstart does not.

Version-Release number of selected component (if applicable):
0.3.X releases of upstart, but newer releases seem to have the same limitation.
See:
https://bugs.launchpad.net/upstart/+bug/252997

How reproducible:

With the previous inittab-based init process,
the following line added to inittab
test:2345:respawn:date >>/tmp/deblg.txt 2>&1

produced the following results:
Mon Jul 20 13:22:58 CEST 2009
... 8 more run
Mon Jul 20 13:22:58 CEST 2009
Mon Jul 20 13:27:59 CEST 2009
... 8 more run
Mon Jul 20 13:27:59 CEST 2009
Mon Jul 20 13:33:00 CEST 2009
... 8 more run
Mon Jul 20 13:33:00 CEST 2009
...etc

The same 'date >>/tmp/deblg.txt 2>&1' added into a file
in /etc/event.d/ results in the command executed 10 times.
No more respawn cycle appears every 5 minutes.

This missing respawn cycle makes migration of some processes
from init to upstart more difficult.

Changed in upstart (Fedora):
status: Unknown → Confirmed
Laurent Glayal (spglegle) wrote :

Attached is a patch inspired by the branch https://code.launchpad.net/~jeffrey-oliver/upstart/devel that adds a new 'DELAYING' state. When the keywords 'respawn forever' are used in a configuration file, the respawn cycle is repeated every 5 minutes. This patch applies to upstart 0.3.11.

Please give me comments about any missing parts in this patch.

In , Laurent (laurent-redhat-bugs) wrote :

Created attachment 367816
add 'respawn forever' support

In , Laurent (laurent-redhat-bugs) wrote :

Created attachment 367817
keep state for processes in new state DELAYING used by patch for 'respawn forever' behaviour

In , Laurent (laurent-redhat-bugs) wrote :

A patch has been proposed on Launchpad to provide 'respawn forever' behaviour. Attached are that patch and a modified Fedora patch that keeps job status across reexec.

The patch is inspired by the branch https://code.launchpad.net/~jeffrey-oliver/upstart/devel and adds a new 'DELAYING' state. When the keywords 'respawn forever' are used in a configuration file, the respawn cycle is repeated every 5 minutes. This patch applies to upstart 0.3.11.

In , Bug (bug-redhat-bugs) wrote :

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

In , Fedora (fedora-redhat-bugs) wrote :

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

In , Bug (bug-redhat-bugs) wrote :

This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 13 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

The process we are following is described here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

In , Bug (bug-redhat-bugs) wrote :

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Andri Möll (moll) wrote :

Any news on this since 2009?

Changed in upstart (Fedora):
importance: Unknown → Medium
status: Confirmed → Won't Fix