init: failure to respawn finishes the start event, even if other jobs reference it later

Bug #388873 reported by Slawomir Czarko
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
upstart
Fix Released
High
Unassigned
0.3
Fix Released
High
Unassigned
upstart (Fedora)
Fix Released
Medium

Bug Description

I'm running upstart 0.3.9

After changing date to previous day and using stop/start commands to restart some upstart jobs I get a segfault in init task. Below is the backtrace from gdb. I can reproduce this pretty much 100% of the time and the backtrace looks always the same (only some pointer values are different).

(gdb) bt full
#0 0x00a9b416 in __kernel_vsyscall ()
No symbol table info available.
#1 0x00c318ff in sigprocmask () from /lib/libc.so.6
No symbol table info available.
#2 0x0804a4d5 in crash_handler (signum=11) at main.c:400
 act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {0 <repeats 32 times>}}, sa_flags = 0, sa_restorer = 0x4}
 limit = {rlim_cur = 4294967295, rlim_max = 4294967295}
 mask = {__val = {2147482623, 4294967294, 4294967295 <repeats 30 times>}}
 err = <value optimized out>
 loglevel = <value optimized out>
 mask = {__val = {0, 0, 11121686, 115, 582, 3217000600, 123, 0, 0, 0, 3087986688, 15, 3087987304, 128, 1211428, 1748533395, 54641668, 3, 12627208, 13109375, 142491099, 142490795, 0,
    0, 1, 1797, 3087987408, 3087986688, 134515571, 12663656, 13521706, 0}}
 oldmask = {__val = {14, 4294967295, 0, 12611584, 134618112, 3217001336, 3217001336, 4, 13455331, 2, 134598413, 4, 134615746, 1, 11121664, 17, 51, 3217000744, 11121664, 17, 51,
    3228041216, 123, 123, 14073844, 8, 3217000744, 3217000600, 2, 0, 3217000780, 0}}
 __FUNCTION__ = "crash_handler"
#3 <signal handler called>
No symbol table info available.
#4 nih_str_array_append (array=0xbfbf8a48, parent=0x0, len=0xbfbf8a44, args=0x41) at string.c:534
 c_len = 5
 o_len = <value optimized out>
 arg = (char * const *) 0x5
 __FUNCTION__ = "nih_str_array_append"
#5 0x0804bc61 in job_run_process (job=0x87e7f30, process=PROCESS_MAIN) at job.c:1368
 statbuf = {st_dev = 59308975770263300, __pad1 = 61930, st_ino = 134616700, st_mode = 11, st_nlink = 0, st_uid = 13894081, st_gid = 142479368, st_rdev = 21617315872, __pad2 = 35248,
  st_size = -1077966368, st_blksize = 0, st_blocks = 13808947, st_atim = {tv_sec = 13890026, tv_nsec = 134616700}, st_mtim = {tv_sec = 11, tv_nsec = 14073844}, st_ctim = {
    tv_sec = 142508104, tv_nsec = 142507824}, __unused4 = 3217001048, __unused5 = 12760163}
 proc = (JobProcess *) 0x87e8560
 argv = (char **) 0x887cea8
 script = <value optimized out>
 argc = 5
 error = <value optimized out>
 fds = {142479392, 134616700}
 __FUNCTION__ = "job_run_process"
#6 0x0804c53c in job_change_state (job=0x87e7f30, state=JOB_SPAWNED) at job.c:856
 old_state = <value optimized out>
 __FUNCTION__ = "job_change_state"
#7 0x0804cff2 in job_child_reaper (data=0x0, pid=6764, killed=0, status=0) at job.c:1729
 job = <value optimized out>
 process = PROCESS_PRE_START
 failed = 0
 stop = 0
 state = 1
 __FUNCTION__ = "job_child_reaper"
#8 0x0805959e in nih_child_poll () at child.c:158
 iter = (NihList *) 0x87e1600
 _iter = (NihList *) 0x87e15d0
 pid = 6764
 killed = 0
 status = 0
 info = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {6764, 0 <repeats 28 times>}, _kill = {si_pid = 6764, si_uid = 0}, _timer = {si_tid = 6764, si_overrun = 0,
      si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 6764, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 6764, si_uid = 0, si_status = 0,
      si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x1a6c}, _sigpoll = {si_band = 6764, si_fd = 0}}}
#9 0x0805d09d in nih_main_loop () at main.c:606
 timeout = {tv_sec = 4, tv_usec = 992000}
---Type <return> to continue, or q <return> to quit---
 readfds = {fds_bits = {8, 0 <repeats 31 times>}}
 writefds = {fds_bits = {32, 0 <repeats 31 times>}}
 buf = ""
 ret = <value optimized out>
 next_timer = (NihTimer *) 0x0
 exceptfds = {fds_bits = {0 <repeats 32 times>}}
 nfds = 7
#10 0x08049e42 in main (argc=5, argv=0x887cea8) at main.c:302
 args = <value optimized out>
 ret = 5

Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :
Download full text (4.3 KiB)

Description of problem:
After changing date to previous day and using stop/start commands to restart some upstart jobs I get a segfault in init task. Below is the backtrace from gdb. I can reproduce this pretty much 100% of the time and the backtrace looks always the same (only some pointer values are different).

Version-Release number of selected component (if applicable):

upstart-0.3.9-22.fc9.i386

How reproducible:

100%

Steps to Reproduce:
1. Change date to previous day
2. stop <upstart jobs>
3. emit event which will trigger starting of <upstart jobs>

Actual results:
Message from syslogd@sonostation-usb26-eth at Mar 27 10:52:22 ...
 kernel:init[1]: segfault at c1 ip 08057d32 sp bf83b980 error 4 in init[8048000+20000]

Expected results:
No segfault, tasks started.

Additional info:

(gdb) bt full
#0 0x00a9b416 in __kernel_vsyscall ()
No symbol table info available.
#1 0x00c318ff in sigprocmask () from /lib/libc.so.6
No symbol table info available.
#2 0x0804a4d5 in crash_handler (signum=11) at main.c:400
 act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {0 <repeats 32 times>}}, sa_flags = 0, sa_restorer = 0x4}
 limit = {rlim_cur = 4294967295, rlim_max = 4294967295}
 mask = {__val = {2147482623, 4294967294, 4294967295 <repeats 30 times>}}
 err = <value optimized out>
 loglevel = <value optimized out>
 mask = {__val = {0, 0, 11121686, 115, 582, 3217000600, 123, 0, 0, 0, 3087986688, 15, 3087987304, 128, 1211428, 1748533395, 54641668, 3, 12627208, 13109375, 142491099, 142490795, 0,
    0, 1, 1797, 3087987408, 3087986688, 134515571, 12663656, 13521706, 0}}
 oldmask = {__val = {14, 4294967295, 0, 12611584, 134618112, 3217001336, 3217001336, 4, 13455331, 2, 134598413, 4, 134615746, 1, 11121664, 17, 51, 3217000744, 11121664, 17, 51,
    3228041216, 123, 123, 14073844, 8, 3217000744, 3217000600, 2, 0, 3217000780, 0}}
 __FUNCTION__ = "crash_handler"
#3 <signal handler called>
No symbol table info available.
#4 nih_str_array_append (array=0xbfbf8a48, parent=0x0, len=0xbfbf8a44, args=0x41) at string.c:534
 c_len = 5
 o_len = <value optimized out>
 arg = (char * const *) 0x5
 __FUNCTION__ = "nih_str_array_append"
#5 0x0804bc61 in job_run_process (job=0x87e7f30, process=PROCESS_MAIN) at job.c:1368
 statbuf = {st_dev = 59308975770263300, __pad1 = 61930, st_ino = 134616700, st_mode = 11, st_nlink = 0, st_uid = 13894081, st_gid = 142479368, st_rdev = 21617315872, __pad2 = 35248,
  st_size = -1077966368, st_blksize = 0, st_blocks = 13808947, st_atim = {tv_sec = 13890026, tv_nsec = 134616700}, st_mtim = {tv_sec = 11, tv_nsec = 14073844}, st_ctim = {
    tv_sec = 142508104, tv_nsec = 142507824}, __unused4 = 3217001048, __unused5 = 12760163}
 proc = (JobProcess *) 0x87e8560
 argv = (char **) 0x887cea8
 script = <value optimized out>
 argc = 5
 error = <value optimized out>
 fds = {142479392, 134616700}
 __FUNCTION__ = "job_run_process"
#6 0x0804c53c in job_change_state (job=0x87e7f30, state=JOB_SPAWNED) at job.c:856
 old_state = <value optimized out>
 __FUNCTION__ = "job_change_state"
#7 0x0804cff2 in job_child_reaper (data=0x0, pid=6764, killed=0, status=0) at job.c:1729
 job = <value optimized out>
 process...

Read more...

Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :

I tried upstart version from Fedora 11 and the problem is still present.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Could you confirm whether it is initctl or init segfaulting here? This looks a lot like the init daemon segfaults.

Could you supply the exact sequence of commands that you run, including any appropriate upstart jobs

If you have a core file, could you gzip that and attach it to the bug

Thanks

Changed in upstart:
importance: Undecided → High
status: New → Incomplete
summary: - initctl segfault after changing date
+ init: segfault after changing date
Revision history for this message
In , Casey (casey-redhat-bugs) wrote :

I think this may be fixed upstream. Checking now.

Changed in upstart (Fedora):
status: Unknown → Confirmed
Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote : Re: init: segfault after changing date

It is init segfaulting. In the distribution I use (Fedora 9) init is part of upstart RPM.

Unfortunately I can't provide right now the exact sequence of commands plus the upstart jobs since that would involve quite a big amount of my employers code.

The general idea is that I've got about 20 tasks which are started by upstart when the main task (watchdog) starts (I'm using "start on" stanza). While the tasks are running I change the date to previous day using "date" command. Then I stop all these tasks (including watchdog) and when they all have stopped I start the watchdog task again. At this point init daemon segfaults when starting one of the 20 tasks (it happens randomly with a different task each time). The same problem happens (but a bit less often, around 50% of the time) when I change the date after all the tasks were stopped.

Is there anything else I can provide to help with fixing this bug?

I'm attaching three core files from different crashes.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Which version of Upstart are those core files against? I've tried opening them against the binaries from fc9 (upstart-0.3.9-22-fc9.i386.rpm) but haven't had any luck

It _looks_ like the crash is due to bad data in the job->cause event information

#4 nih_str_array_append (array=0xbfbf8a48, parent=0x0, len=0xbfbf8a44, args=0x41) at string.c:534
 c_len = 5
 o_len = <value optimized out>
 arg = (char * const *) 0x5
 __FUNCTION__ = "nih_str_array_append"

I'm not sure why args would be 0x41 or why arg would 0x5 there - that looks wrong to me. I need to examine the data to figure it out.

Unfortunately I can't seem to replicate the issue here, I'm not convinced it is related to the date change though

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

This could be simply a "start 20 tasks with the same event" bug ;)

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

(Though a test here with 50 is fine)

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

The core files are again upstart-0.3.9-22.fc9.i386

I opened them like this:

gdb /sbin/init core.6764

I had to install upstart-debuginfo-0.3.9-22.fc9.i386 first to see anything useful.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Right - I'm obviously trying to read them on Ubuntu with the RPM and debug info unpacked alongside - not getting anywhere with that

I may have to try and install fc9 in a VM

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Aha, found a gdb incantation that worked:

wing-commander /tmp% gdb ./upstart/usr/lib/debug/sbin/init.debug ./upstart-segfault/core.10726

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The job's cause (event that caused it to start/stop) is invalid, this causes the segfault

(gdb) p job->cause
$26 = (EventEmission *) 0x9a0eb48

(gdb) p *job->cause
$17 = {event = {entry = {prev = 0x9a0eb10, next = 0x9a0dd18},
    name = 0x8 <Address 0x8 out of bounds>, args = 0x9a0eb00,
    env = 0x9a0eb58}, id = 161540952, progress = 134517312, jobs = -1,
  failed = 1918989427}

(gdb) p *job
$18 = {entry = {prev = 0x9a0ed50, next = 0x9a1a508}, id = 22,
  name = 0x9a18848 "ezono-cyclades-t_ltl", description = 0x0, author = 0x0,
  version = 0x0, replacement = 0x0, replacement_for = 0x0, instance_of = 0x0,
  goal = JOB_START, state = JOB_SPAWNED, cause = 0x9a0eb48, blocked = 0x0,
  failed = 0, failed_process = 4294967295, exit_status = 0, start_events = {
    prev = 0x9a18f10, next = 0x9a18f10}, stop_events = {prev = 0x9a19208,
    next = 0x9a18fe8}, emits = {prev = 0x9a18e14, next = 0x9a18e14},
  process = 0x9a18ed8, normalexit = 0x9a192d8, normalexit_len = 7,
  kill_timeout = 30, kill_timer = 0x0, instance = 0, service = 1, respawn = 1,
  respawn_limit = 2, respawn_interval = 30, respawn_count = 2,
  respawn_time = 1239344371, daemon = 0, pid_file = 0x0, pid_binary = 0x0,
  pid_timeout = 10, pid_timer = 0x0, console = CONSOLE_OUTPUT, env = 0x0,
  umask = 18, nice = 0, limits = {0x0 <repeats 15 times>}, chroot = 0x0,
  chdir = 0x9a19050 "/tmp"}

There are two events in the list:

(gdb) p *((Event *)events->next)
$24 = {entry = {prev = 0x9a0e1f0, next = 0x9a0ec98},
  name = 0x9a0d938 "stopped", args = 0x9a0e400, env = 0x0}
(gdb) p *((Event *)events->next->next)
$25 = {entry = {prev = 0x9a0dc68, next = 0x9a0d9a0},
  name = 0x9a0e2e8 "restart-bgr", args = 0x0, env = 0x0}

(gdb) p *((EventEmission *)events->next)
$27 = {event = {entry = {prev = 0x9a0e1f0, next = 0x9a0ec98},
    name = 0x9a0d938 "stopped", args = 0x9a0e400, env = 0x0}, id = 242,
  progress = EVENT_HANDLING, jobs = 1, failed = 0}
(gdb) p *((EventEmission *)events->next->next)
$28 = {event = {entry = {prev = 0x9a0dc68, next = 0x9a0d9a0},
    name = 0x9a0e2e8 "restart-bgr", args = 0x0, env = 0x0}, id = 245,
  progress = EVENT_HANDLING, jobs = 1, failed = 0}

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

restart-bgr event is emitted by one of the job files after "watchdog" task stopped if auto-restart option is configured.
The problem happens also with auto-restart disabled.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The odd thing is that this looks like job->cause has been freed, this is odd because job->cause is only ever set inside job_change_cause() which takes care to increment and decrement its ->jobs reference count.

The cause address is greater than the "restart-bgr" and "stopped" events, this implies (but doesn't guarantee) that it was emitted later. It's certainly not in the events list anymore.

I'll take a look through the code to see whether this event is attached to any other jobs.

It shouldn't be possible to free the event emission while it's jobs count is greater than 0. It's possible, for example, that I'm incorrectly freeing an event emission elsewhere (e.g. from job->blocked)

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

I've marked this bug Private for now, since my debugging may reveal details that you don't wish revealed (plus your core file contains them)

If you're ok for the information to be revealed, please feel free to mark it un-private.

Feel free to delete the core file attachment if you're worried about that, but not worried about my debugging traces

visibility: public → private
Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

Does the core file contain anything else than the state of the init daemon? If not then this can be marked un-private.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :
Download full text (5.3 KiB)

First up, here's a dump of the jobs that you have registered. I've put the names followed by the goal and state in brackets. Then looked at the "cause" and "blocked" parameters, NULL means there wasn't one.

Please shout if any of this system state looks unusual to you:

rc jobs, all stopped:
  rcS (stop/waiting) NULL NULL
  rcS-sulogin (stop/waiting) NULL NULL
  rc0 (stop/waiting) NULL NULL
  rc1 (stop/waiting) NULL NULL
  rc2 (stop/waiting) NULL NULL
  rc3 (stop/waiting) NULL NULL
  rc4 (stop/waiting) NULL NULL
  rc5 (stop/waiting) NULL NULL
  rc6 (stop/waiting) NULL NULL

This is what I'd expect, the system has booted so there's no runlevel change going on right now.

tty jobs, both running:
  tty1 (start/running) NULL NULL
  tty2 (start/running) NULL NULL

Again, what we'd expect.

Now, a bunch of system services, some running, some not:
  autofs (start/running) NULL NULL
  acpid (start/running) NULL NULL
  haldaemon (start/running) NULL NULL
  messagebus (start/running) NULL NULL
  netplugd (start/running) NULL NULL
  prefdm (start/running) NULL NULL
  rsyslog (start/running) NULL NULL

  logd (stop/waiting) NULL NULL
  microcode_ctl (stop/waiting) NULL NULL
  netfs (stop/waiting) NULL NULL
  network (stop/waiting) NULL NULL
  serial (stop/waiting) NULL NULL
  sshd (stop/waiting) NULL NULL
  sulogin (stop/waiting) NULL NULL

Now your custom bits:

some jobs are stopped:
  ezono-cyclades-calibration (stop/waiting) NULL NULL
  ezono-cyclades-pre-start (stop/waiting) NULL NULL

  ezono-cyclades-t_ane (stop/waiting) NULL NULL
  ezono-cyclades-t_ccs (stop/waiting) NULL NULL
  ezono-cyclades-t_emg (stop/waiting) NULL NULL
  ezono-cyclades-t_mco (stop/waiting) NULL NULL
  ezono-cyclades-t_mse (stop/waiting) NULL NULL
  ezono-cyclades-t_par (stop/waiting) NULL NULL
  ezono-cyclades-t_pro (stop/waiting) NULL NULL
  ezono-cyclades-t_ssm (stop/waiting) NULL NULL
  ezono-cyclades-t_upg (stop/waiting) NULL NULL
  ezono-cyclades-t_xss (stop/waiting) NULL NULL

These might be "due to be started", but not yet reached, that's ok - if Upstart hasn't reached them they're not relevant yet.

some are running:
  ezono-cyclades-ac-adapter (start/running) NULL NULL
  ezono-cyclades-list-tasks (start/running) NULL NULL
  ezono-cyclades-stop-software-control (start/running) NULL NULL
  ezono-cyclades-suspend-control (start/running) NULL NULL
  ezono-cyclades-xss-control (start/running) NULL NULL
  ezono-malta-fpga-control (start/running) NULL NULL

  ezono-cyclades-t_bbd (start/running) NULL NULL
  ezono-cyclades-t_cal (start/running) NULL NULL
  ezono-cyclades-t_dso (start/running) NULL NULL
  ezono-cyclades-t_fcl (start/running) NULL NULL
  ezono-cyclades-t_fem (start/running) NULL NULL
  ezono-cyclades-t_emg-control (start/running) NULL NULL
  ezono-cyclades-t_mco-control (start/running) NULL NULL
  ezono-cyclades-t_prd (start/running) NULL NULL
  ezono-cyclades-t_scp (start/running) NULL NULL
  ezono-cyclades-t_smr (start/running) NULL NULL
  ezono-cyclades-t_ssc (start/running) NULL NULL
  ezono-cyclades-t_upg-control (start/running) NULL NUL...

Read more...

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

(For the record, I'm looking at core.10726)

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Slawomir: The core file contains the complete state of the init daemon, which includes the parsed result of your job files and their names as well as their pids and suchlike

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The start conditions for the other jobs are all:

  start on started ezono-cyclades-t_smr

The state shows this has started.

Thus we can theorise that when ezono-cyclades-t_smr started, its event was created, was attached to these jobs, and they begun starting - but for whatever reason, the event was discarded prematurely

Could you confirm that these four jobs all have this start condition, and no other jobs have that?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

I see you have rsyslogd running

Could you run "initctl log-priority debug", make sure that the output appears in your logs "initctl version" is a good test for that, and then try doing your date change again.

("initctl list" output before would be nice too)

Thanks

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

Yes, all the jobs are defined as services.
The four jobs which have invalid pointer to their cause all use the same stanza "start on started ezono-cyclades-t_smr" so I guess this event should be still in the list.

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

I'm attaching /var/log/messages, output of "initctl list" before SW was started and the new core file

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Thanks for all your information so far.

Out of interest, at which point in the log did you change the date?

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

The log only contains what happened when SW was being started. Do you want to see also what happens when date is being changed and SW is stopped?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Yes, please - it's helpful to see the whole picture

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

Here it is.
In /var/log/messages I marked when date was changed (although it's clear because syslog timestamps change) by "Changing date to previous day" and when upstart finished stopping my software again ("SW stopped").

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

I got permission to attach the upstart job files. I zipped the whole /etc/event.d/ directory. You can mark this bug un-private.

visibility: private → public
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :
Download full text (7.0 KiB)

Thanks very much for all your help so far.

From the logs, this looks like the key moment:

May 17 15:13:32 sonostation-usb26-eth init: Handling started event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm respawning too fast, stopped
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm respawning too fast, stopped
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm goal changed from start to stop
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm state changed from starting to waiting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending stopped event

This starting event is "started ezono-cyclades-t_smr", which we've established is the one that is unexpectedly freed.

Two things strike me about this:

 - "respawning too fast" - this wrongly uses the system time, so is one of the very few things to be broken by spooling the clock back a day

 - there's two of them printed, that's odd...

The next sequence of log message is everything else starting from this event:

May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-stop-software-control goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-stop-software-control state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_dso goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_dso state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssc goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssc state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_fem goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_fem state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_cal goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_cal state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_prd goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_prd state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_scp goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_scp state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth ...

Read more...

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Thank you again for all your help.

This bug is indeed caused by the first process failing to respawn.

It takes a reference to the emitted event, and attempts to start; this fails due to hitting the respawn limit, so it stops again and puts back the reference to the emitted event. As the first process to try and do so, this means the emitted event once again has zero references - and so is marked EVENT_FINISHED.

Even though further processes take a reference the event, it's already too late - the event is scheduled for deletion and Upstart does that at the next opportunity.

Changed in upstart:
status: Incomplete → Triaged
summary: - init: segfault after changing date
+ init: failure to respawn finishes the start event, even if other jobs
+ reference it later
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Will need a new upload of 0.3 for this

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Here's a test case for this

        TEST_FEATURE ("with respawn limit and second job");
        event = (Event *)job2->stop_events.next;
        nih_list_add (&job2->start_events, &event->entry);

        em = event_emit ("wibble", NULL, NULL);
        em->progress = EVENT_HANDLING;

        TEST_ALLOC_FAIL {
                em->jobs = 0;

                job1->goal = JOB_STOP;
                job1->state = JOB_WAITING;
                job1->cause = NULL;

                job2->goal = JOB_STOP;
                job2->state = JOB_WAITING;
                job2->cause = NULL;

                /* job2 hashes first */
                job2->respawn_limit = 10;
                job2->respawn_interval = 1000;
                job2->respawn_time = time (NULL);
                job2->respawn_count = 10;

                TEST_DIVERT_STDERR (output) {
   job_handle_event (em);
                }
                rewind (output);

  TEST_EQ (em->jobs, 1);
                TEST_EQ (em->progress, EVENT_HANDLING);

                TEST_EQ (job1->goal, JOB_STOP);
  TEST_EQ (job1->state, JOB_WAITING);
                TEST_EQ_P (job1->cause, NULL);

                TEST_EQ (job2->goal, JOB_START);
  TEST_EQ (job2->state, JOB_STARTING);
                TEST_EQ_P (job2->cause, em);

                TEST_FILE_EQ (output, "test: foo respawning too fast, stopped\n$
                TEST_FILE_END (output);
                TEST_FILE_RESET (output);
 }

 nih_list_free (&em->event.entry);

 nih_list_free (&job2->entry);

Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :

The bug has been triaged upstream. It is targeted for 0.3.11

Revision history for this message
In , Casey (casey-redhat-bugs) wrote :

Yes. Scott seems to have the issue worked out. He should be coming back with a fix soon.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

If you could try and apply the job.c change in this patch, that should hopefully fix the crasher http://bazaar.launchpad.net/~scott/upstart/0.3/revision/706

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

The fix solves the crash. Thanks a lot for such a quick response.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

This bug has been fixed in Upstart 0.3.11

0.3.11 2009-06-19 "For Friday, June 19th 2009, I'm Jon Masters"

        * Fixed crash when the first job started by an event hits the
          respawn limit and frees the event, even though other jobs are
          also started by the same event. (Bug: #388873)

        * Added temporary support for "telinit u" until we have true re-exec
          support. This will be replaced by an initctl command in future.
          (Bug: #388742)

        * Corrected formatting of initctl(8) manpage. (Bug: #388745)

Revision history for this message
Slawomir Czarko (lauchpad-net-sklep) wrote :

The fix works also with 0.3.9

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

I've gone through the Upstart 0.5 and trunk code, and am confident that they are not affected by this bug.

Changed in upstart:
status: Triaged → Fix Released
Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :

Any idea when will the patch be included in Fedora?

Revision history for this message
In , Petr (petr-redhat-bugs) wrote :

I've prepared rpm with 0.3.11 update with merged patches from fedora and already included patches in upstream. I'm testing it right now.

If Casey agree I'm ready to commit it to devel branch.

Rpms are available here:

http://plautrba.fedorapeople.org/upstart/upstart-0.3.11-1.fc12.src.rpm
http://koji.fedoraproject.org/koji/taskinfo?taskID=1429470

Revision history for this message
In , Casey (casey-redhat-bugs) wrote :

Sure, this is worth having. Go ahead and commit it.

We're likely expecting a 0.3.12 soon (Scott said he'd take the state transfer patch as soon as I get test cases written), but how soon is uncertain.

Revision history for this message
In , Bug (bug-redhat-bugs) wrote :

Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Changed in upstart (Fedora):
status: Confirmed → Won't Fix
Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :

The problem is still present in Fedora 11.

Revision history for this message
In , Slawomir (slawomir-redhat-bugs) wrote :

It looks like the patch was applied in the development branch of Fedora. Any idea when will this be released as an update to Fedora 11?

Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

upstart-0.3.11-1.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/upstart-0.3.11-1.fc11

Changed in upstart (Fedora):
status: Won't Fix → In Progress
Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

upstart-0.3.11-1.fc11 has been pushed to the Fedora 11 testing repository. If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update upstart'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-7922

Changed in upstart (Fedora):
status: In Progress → Fix Committed
Revision history for this message
In , Fedora (fedora-redhat-bugs) wrote :

upstart-0.3.11-1.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.

Changed in upstart (Fedora):
status: Fix Committed → Fix Released
Changed in upstart (Fedora):
importance: Unknown → Medium
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.