init: respawn should try again after a while?

Reported by Scott James Remnant (Canonical) on 2008-07-29
This bug affects 6 people
Affects: upstart (Importance: Wishlist, Assigned to: Unassigned)
Affects: upstart (Fedora) (Status: Confirmed, Importance: Unknown)

Bug Description

Right now, "respawn limit" is a hard limit and will result in the job being stopped if exceeded.

Instead we should have something more like a rate limit, where we keep trying.
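For context, the behaviour being discussed comes from the "respawn limit" stanza, which takes a count and an interval in seconds. A minimal job file using it might look like this (names and numbers are illustrative):

```
# example.conf (illustrative): stop the job permanently if it
# respawns more than 10 times within 5 seconds.
description "example service"

respawn
respawn limit 10 5

exec /usr/sbin/my-daemon
```

Once the limit is hit, the job is stopped and stays stopped; that hard stop is what this report asks to soften into a rate limit.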

Changed in upstart:
importance: Undecided → Wishlist
status: New → Confirmed
Jeff Oliver (jeffrey-oliver) wrote :

Personally, I think upstart should have both. In my environment, if an application/service is acting up (constantly restarting), at some point I'd like to give up on it.

Additionally, I'd like to add a "resting" time between start attempts. This is independent of the number of retries.

For instance:

Job 1 should retry every 10 seconds for 15 attempts, then give up.
and
Job 2 should retry every 30 seconds forever.
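The two policies above might be expressed in job files like this. Note that the "respawn delay" and "respawn forever" stanzas are hypothetical syntax sketched for illustration; they are not stanzas Upstart actually supports as of this report:

```
# Hypothetical syntax only -- sketching the two requested policies.

# Job 1: retry every 10 seconds, give up after 15 attempts.
respawn
respawn delay 10
respawn limit 15

# Job 2: retry every 30 seconds, forever.
respawn
respawn delay 30
respawn forever
```

The key point is that the delay between attempts and the number of attempts are independent knobs.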

I believe the current implementation counts the number of respawns over an interval of time. If this is exceeded, the job is considered a runaway and is stopped.

Angus Salkeld (asalkeld) wrote :

I am curious about what kind of process / daemon you would do this for.

My thinking is ...

For an important daemon:
- restart it X times within a certain time
- don't count "normal exits", so you can reload configuration
- if it still fails within this time frame, you really need to try something new, so we need to be able to catch this as an event in a recovery script. Currently I use "start on stopping my-daemon RESULT=failed".

For something like getty:
- respawn forever (no limit). I assume this is the case you are talking about for the rate limiting.
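The recovery hook mentioned above might look like the following separate job, which fires when the daemon stops after exhausting its respawn limit. The job name, pid file path, and recovery actions are all illustrative:

```
# recover-my-daemon.conf (illustrative sketch): runs when
# my-daemon stops because it failed, e.g. after exhausting
# its respawn limit.
start on stopping my-daemon RESULT=failed

script
  logger -t recover "my-daemon gave up respawning, attempting recovery"
  # Illustrative recovery actions: clear stale state, then ask
  # init to start the job again once it has fully stopped.
  rm -f /var/run/my-daemon.pid
  start my-daemon || true
end script
```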

Jeff: what is the delay going to help with? Surely you would want to restart your application ASAP.

If you still need it, can't you do something like this in your job?

start script
  # Give a previous instance time to go away, then clear its stale pid file.
  if [ -e /var/run/my.pid ]
  then
    sleep 10
    rm -f /var/run/my.pid
  fi
  exec /usr/sbin/my-daemon
end script

Jeff Oliver (jeffrey-oliver) wrote :

Well, there is a question in the "Answers" tab where someone has some sort of service that connects to another server. It sounds like they need a resting time between the service completing its transaction and being allowed to start a new connection. They complain that they cannot use sleep, but perhaps they're putting it in the wrong place: they mention post-stop, but you would put it in the start script.

My personal requirement is a little bit off the wall. I'm trying to get upstart onto my embedded Linux box. The services run on this particular box are not ordinary. Furthermore, I have situations where 3rd-party companies install and run other services on this box as well. Everyone is adjusted to how the box operates currently. There is a service-watchdog process running on the box today that I'm hoping upstart can simply replace.

And finally, the programming practices of the various software suppliers for the embedded server I work on are very amateurish. I can't say for sure that everyone knows what pid files are and what they are for. Each application developer is in control of their own job file. Having a stanza in the job file to introduce a delay would compensate for poor programming practices in the service itself.

In any case, I can post my changes. Take it or leave it: if it gets merged into the mainline, fine; if not, fine. I'm here trying to improve things as best I can. My use case is quite unusual, and I'm trying to generalize it as best I can.

Jeff Oliver (jeffrey-oliver) wrote :

There is one other reason I can think of why you'd want either to try starting a job again at some later time, or to have a delay between restarts.

I have a situation where my server's startup depends on some other service running on a different piece of hardware. If for some reason my server attempts to start up before the other server is ready, then I need to try again some time later. Starting up again immediately does not make sense, because I already know it's not going to respond yet. So instead, I'll wait a few seconds before trying again.
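In the meantime, this kind of "wait for a dependency" policy can be approximated with a retry loop in the job's own script section. Below is a minimal POSIX sh sketch; the retry helper, the probe, and the timings are all illustrative, and in a real job the probe would test the remote service (for example with nc -z host port):

```shell
#!/bin/sh
# Retry a command with a fixed resting time between attempts,
# giving up after a maximum number of tries.

retry() {
  max=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      return 1          # dependency never came up; give up
    fi
    n=$((n + 1))
    sleep "$delay"      # resting time between attempts
  done
  return 0
}

# Demo: a stand-in probe that fails twice, then succeeds,
# simulating a remote server that takes a while to come up.
attempts=0
probe() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

retry 5 0 probe && echo "ready after $attempts attempts"
# prints: ready after 3 attempts
```

In an Upstart job this loop would sit in the start script before the exec line, so the daemon is only launched once its dependency answers.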

Changed in upstart:
milestone: none → 0.10

Also worth thinking about: the respawn limit shouldn't be a dead end. Once <interval> elapses, the job should be able to respawn another <limit> times.

summary: - respawn should try again after a while?
+ init: respawn should try again after a while?
Changed in upstart:
milestone: 0.10 → none
status: Confirmed → Triaged
Laurent Glayal (spglegle) wrote :

Same problem: we have processes depending on other processes on the same or on remote servers. We have N servers controlling/controlled by M servers, which are in turn controlled by telco hardware. If one of the M servers goes down (hardware, or some processes, due to an administrative task, a telco alarm, or anything else), we have to manually restart some processes on the M servers.

We ported some daemons from one server/distribution using the old init system to a new server using upstart (0.3.x), assuming plain old Unix init behaviour when respawning daemons. We had to work around upstart's respawn behaviour with scripting in order to emulate the old init behaviour... quite dirty and not responsive.

Generally speaking, we encountered this 'bug', but shouldn't upstart provide all of the old init behaviour regarding managed processes (maybe this is the one missing piece, I don't know)?

On Mon, 2009-07-20 at 08:51 +0000, Laurent Glayal wrote:

Upstart provides the exact same behaviour as the UNIX System V init daemon previously used on Linux (sysvinit).

In sysvinit, processes that respawn too fast are disabled.

Scott
--
Scott James Remnant
<email address hidden>

Laurent Glayal (spglegle) wrote :

At least on FC2, FC8, and RHES3, it seems processes are disabled for 5 minutes; after that, a new respawn cycle begins.

test:2345:respawn:date >>/tmp/deblg.txt 2>&1

results in

Mon Jul 20 13:22:58 CEST 2009
... 8 more run
Mon Jul 20 13:22:58 CEST 2009
Mon Jul 20 13:27:59 CEST 2009
... 8 more run
Mon Jul 20 13:27:59 CEST 2009
Mon Jul 20 13:33:00 CEST 2009
... 8 more run
Mon Jul 20 13:33:00 CEST 2009
...etc

Changed in upstart (Fedora):
status: Unknown → Confirmed
Laurent Glayal (spglegle) wrote :

Attached is a patch, inspired by the branch https://code.launchpad.net/~jeffrey-oliver/upstart/devel, adding a new 'DELAYING' state. When the keywords 'respawn forever' are used in a configuration file, the respawn cycle is repeated every 5 minutes. This patch applies to upstart 0.3.11.

Please give me comments about any missing parts in this patch.


Andri Möll (moll) wrote :

Any news on this since 2009?
