init: failure to respawn finishes the start event, even if other jobs reference it later

Reported by Slawomir Czarko on 2009-06-18
Affects           Status        Importance  Assigned to
upstart           Fix Released  High        Unassigned
upstart 0.3       Fix Released  High        Unassigned
upstart (Fedora)  Fix Released  Unknown

Bug Description

I'm running upstart 0.3.9

After changing the date to the previous day and using stop/start commands to restart some upstart jobs, I get a segfault in the init task. Below is the backtrace from gdb. I can reproduce this pretty much 100% of the time, and the backtrace always looks the same (only some pointer values are different).

(gdb) bt full
#0 0x00a9b416 in __kernel_vsyscall ()
No symbol table info available.
#1 0x00c318ff in sigprocmask () from /lib/libc.so.6
No symbol table info available.
#2 0x0804a4d5 in crash_handler (signum=11) at main.c:400
 act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {0 <repeats 32 times>}}, sa_flags = 0, sa_restorer = 0x4}
 limit = {rlim_cur = 4294967295, rlim_max = 4294967295}
 mask = {__val = {2147482623, 4294967294, 4294967295 <repeats 30 times>}}
 err = <value optimized out>
 loglevel = <value optimized out>
 mask = {__val = {0, 0, 11121686, 115, 582, 3217000600, 123, 0, 0, 0, 3087986688, 15, 3087987304, 128, 1211428, 1748533395, 54641668, 3, 12627208, 13109375, 142491099, 142490795, 0,
    0, 1, 1797, 3087987408, 3087986688, 134515571, 12663656, 13521706, 0}}
 oldmask = {__val = {14, 4294967295, 0, 12611584, 134618112, 3217001336, 3217001336, 4, 13455331, 2, 134598413, 4, 134615746, 1, 11121664, 17, 51, 3217000744, 11121664, 17, 51,
    3228041216, 123, 123, 14073844, 8, 3217000744, 3217000600, 2, 0, 3217000780, 0}}
 __FUNCTION__ = "crash_handler"
#3 <signal handler called>
No symbol table info available.
#4 nih_str_array_append (array=0xbfbf8a48, parent=0x0, len=0xbfbf8a44, args=0x41) at string.c:534
 c_len = 5
 o_len = <value optimized out>
 arg = (char * const *) 0x5
 __FUNCTION__ = "nih_str_array_append"
#5 0x0804bc61 in job_run_process (job=0x87e7f30, process=PROCESS_MAIN) at job.c:1368
 statbuf = {st_dev = 59308975770263300, __pad1 = 61930, st_ino = 134616700, st_mode = 11, st_nlink = 0, st_uid = 13894081, st_gid = 142479368, st_rdev = 21617315872, __pad2 = 35248,
  st_size = -1077966368, st_blksize = 0, st_blocks = 13808947, st_atim = {tv_sec = 13890026, tv_nsec = 134616700}, st_mtim = {tv_sec = 11, tv_nsec = 14073844}, st_ctim = {
    tv_sec = 142508104, tv_nsec = 142507824}, __unused4 = 3217001048, __unused5 = 12760163}
 proc = (JobProcess *) 0x87e8560
 argv = (char **) 0x887cea8
 script = <value optimized out>
 argc = 5
 error = <value optimized out>
 fds = {142479392, 134616700}
 __FUNCTION__ = "job_run_process"
#6 0x0804c53c in job_change_state (job=0x87e7f30, state=JOB_SPAWNED) at job.c:856
 old_state = <value optimized out>
 __FUNCTION__ = "job_change_state"
#7 0x0804cff2 in job_child_reaper (data=0x0, pid=6764, killed=0, status=0) at job.c:1729
 job = <value optimized out>
 process = PROCESS_PRE_START
 failed = 0
 stop = 0
 state = 1
 __FUNCTION__ = "job_child_reaper"
#8 0x0805959e in nih_child_poll () at child.c:158
 iter = (NihList *) 0x87e1600
 _iter = (NihList *) 0x87e15d0
 pid = 6764
 killed = 0
 status = 0
 info = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {6764, 0 <repeats 28 times>}, _kill = {si_pid = 6764, si_uid = 0}, _timer = {si_tid = 6764, si_overrun = 0,
      si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 6764, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 6764, si_uid = 0, si_status = 0,
      si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x1a6c}, _sigpoll = {si_band = 6764, si_fd = 0}}}
#9 0x0805d09d in nih_main_loop () at main.c:606
 timeout = {tv_sec = 4, tv_usec = 992000}
 readfds = {fds_bits = {8, 0 <repeats 31 times>}}
 writefds = {fds_bits = {32, 0 <repeats 31 times>}}
 buf = ""
 ret = <value optimized out>
 next_timer = (NihTimer *) 0x0
 exceptfds = {fds_bits = {0 <repeats 32 times>}}
 nfds = 7
#10 0x08049e42 in main (argc=5, argv=0x887cea8) at main.c:302
 args = <value optimized out>
 ret = 5

Could you confirm whether it is initctl or init segfaulting here? It looks a lot like the init daemon itself is segfaulting.

Could you supply the exact sequence of commands that you run, including any appropriate upstart jobs?

If you have a core file, could you gzip it and attach it to the bug?

Thanks

Changed in upstart:
importance: Undecided → High
status: New → Incomplete
summary: - initctl segfault after changing date
+ init: segfault after changing date
Changed in upstart (Fedora):
status: Unknown → Confirmed

It is init segfaulting. In the distribution I use (Fedora 9), init is part of the upstart RPM.

Unfortunately I can't provide the exact sequence of commands plus the upstart jobs right now, since that would involve quite a large amount of my employer's code.

The general idea is that I've got about 20 tasks which are started by upstart when the main task (a watchdog) starts (I'm using the "start on" stanza). While the tasks are running, I change the date to the previous day using the "date" command. Then I stop all of these tasks (including the watchdog), and once they have all stopped I start the watchdog task again. At this point the init daemon segfaults while starting one of the 20 tasks (it happens randomly, with a different task each time). The same problem happens, though a bit less often (around 50% of the time), when I change the date after all the tasks have stopped.
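[Editor's note: to make the shape of this setup concrete, here is a hypothetical reconstruction in upstart 0.3 job-file syntax. The names and paths are invented; the reporter's actual job files are only attached later in the thread.]

# /etc/event.d/watchdog  (hypothetical sketch)
description "main watchdog task"
service
respawn
exec /opt/sw/bin/watchdog

# /etc/event.d/task01  (hypothetical; task02..task20 along the same lines)
start on started watchdog
service
respawn
exec /opt/sw/bin/task01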

Is there anything else I can provide to help with fixing this bug?

I'm attaching three core files from different crashes.

Which version of Upstart are those core files against? I've tried opening them against the binaries from fc9 (upstart-0.3.9-22.fc9.i386.rpm) but haven't had any luck.

It _looks_ like the crash is due to bad data in the job->cause event information:

#4 nih_str_array_append (array=0xbfbf8a48, parent=0x0, len=0xbfbf8a44, args=0x41) at string.c:534
 c_len = 5
 o_len = <value optimized out>
 arg = (char * const *) 0x5
 __FUNCTION__ = "nih_str_array_append"

I'm not sure why args would be 0x41, or why arg would be 0x5 there - that looks wrong to me. I need to examine the data to figure it out.

Unfortunately I can't seem to replicate the issue here; I'm not convinced it is related to the date change, though.

This could simply be a "start 20 tasks with the same event" bug ;)

(Though a test here with 50 is fine)

The core files are again from upstart-0.3.9-22.fc9.i386.

I opened them like this:

gdb /sbin/init core.6764

I had to install upstart-debuginfo-0.3.9-22.fc9.i386 first to see anything useful.

Right - I'm obviously trying to read them on Ubuntu with the RPM and debug info unpacked alongside - not getting anywhere with that

I may have to try and install fc9 in a VM

Aha, found a gdb incantation that worked:

wing-commander /tmp% gdb ./upstart/usr/lib/debug/sbin/init.debug ./upstart-segfault/core.10726
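[Editor's note: an alternative that may also work, assuming the stripped binary and its debuginfo are unpacked from the RPMs side by side and gdb can match them via the .gnu_debuglink/build-id, is to set the debug directory before loading the files (paths here are illustrative):

(gdb) set debug-file-directory ./upstart/usr/lib/debug
(gdb) file ./upstart/sbin/init
(gdb) core-file ./upstart-segfault/core.10726

Setting debug-file-directory first matters, because gdb resolves the separate debug file at the moment the executable is loaded.]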

The job's cause (the event that caused it to start/stop) is invalid; this is what causes the segfault.

(gdb) p job->cause
$26 = (EventEmission *) 0x9a0eb48

(gdb) p *job->cause
$17 = {event = {entry = {prev = 0x9a0eb10, next = 0x9a0dd18},
    name = 0x8 <Address 0x8 out of bounds>, args = 0x9a0eb00,
    env = 0x9a0eb58}, id = 161540952, progress = 134517312, jobs = -1,
  failed = 1918989427}

(gdb) p *job
$18 = {entry = {prev = 0x9a0ed50, next = 0x9a1a508}, id = 22,
  name = 0x9a18848 "ezono-cyclades-t_ltl", description = 0x0, author = 0x0,
  version = 0x0, replacement = 0x0, replacement_for = 0x0, instance_of = 0x0,
  goal = JOB_START, state = JOB_SPAWNED, cause = 0x9a0eb48, blocked = 0x0,
  failed = 0, failed_process = 4294967295, exit_status = 0, start_events = {
    prev = 0x9a18f10, next = 0x9a18f10}, stop_events = {prev = 0x9a19208,
    next = 0x9a18fe8}, emits = {prev = 0x9a18e14, next = 0x9a18e14},
  process = 0x9a18ed8, normalexit = 0x9a192d8, normalexit_len = 7,
  kill_timeout = 30, kill_timer = 0x0, instance = 0, service = 1, respawn = 1,
  respawn_limit = 2, respawn_interval = 30, respawn_count = 2,
  respawn_time = 1239344371, daemon = 0, pid_file = 0x0, pid_binary = 0x0,
  pid_timeout = 10, pid_timer = 0x0, console = CONSOLE_OUTPUT, env = 0x0,
  umask = 18, nice = 0, limits = {0x0 <repeats 15 times>}, chroot = 0x0,
  chdir = 0x9a19050 "/tmp"}

There are two events in the list:

(gdb) p *((Event *)events->next)
$24 = {entry = {prev = 0x9a0e1f0, next = 0x9a0ec98},
  name = 0x9a0d938 "stopped", args = 0x9a0e400, env = 0x0}
(gdb) p *((Event *)events->next->next)
$25 = {entry = {prev = 0x9a0dc68, next = 0x9a0d9a0},
  name = 0x9a0e2e8 "restart-bgr", args = 0x0, env = 0x0}

(gdb) p *((EventEmission *)events->next)
$27 = {event = {entry = {prev = 0x9a0e1f0, next = 0x9a0ec98},
    name = 0x9a0d938 "stopped", args = 0x9a0e400, env = 0x0}, id = 242,
  progress = EVENT_HANDLING, jobs = 1, failed = 0}
(gdb) p *((EventEmission *)events->next->next)
$28 = {event = {entry = {prev = 0x9a0dc68, next = 0x9a0d9a0},
    name = 0x9a0e2e8 "restart-bgr", args = 0x0, env = 0x0}, id = 245,
  progress = EVENT_HANDLING, jobs = 1, failed = 0}

The restart-bgr event is emitted by one of the job files after the "watchdog" task has stopped, if the auto-restart option is configured.
The problem also happens with auto-restart disabled.

The odd thing is that job->cause looks like it has been freed; that shouldn't happen, because job->cause is only ever set inside job_change_cause(), which takes care to increment and decrement the emission's ->jobs reference count.
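[Editor's note: the invariant being relied on here is plain reference counting. A minimal sketch of the pattern job_change_cause() is described as implementing, with stand-in types reconstructed from the gdb dumps above - not upstart's actual job.c:

typedef struct event_emission {
        int jobs;               /* jobs currently using this emission */
        /* ... event name, args, progress ... */
} EventEmission;

typedef struct job {
        EventEmission *cause;   /* event that caused the goal change */
        /* ... */
} Job;

static void
job_change_cause (Job *job, EventEmission *emission)
{
        if (job->cause == emission)
                return;

        /* Put back the reference held on the old cause ... */
        if (job->cause)
                job->cause->jobs--;

        job->cause = emission;

        /* ... and take a reference on the new one. */
        if (job->cause)
                job->cause->jobs++;
}

As long as every set goes through this function, ->jobs should never be wrong, which is why a freed emission with live references is surprising.]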

The cause address is greater than those of the "restart-bgr" and "stopped" events, which implies (but doesn't guarantee) that it was emitted later. It's certainly not in the events list anymore.

I'll take a look through the code to see whether this event is attached to any other jobs.

It shouldn't be possible to free the event emission while its jobs count is greater than 0. It's possible, for example, that I'm incorrectly freeing an event emission elsewhere (e.g. from job->blocked).

I've marked this bug Private for now, since my debugging may reveal details that you don't wish revealed (plus your core file contains them)

If you're ok for the information to be revealed, please feel free to mark it un-private.

Feel free to delete the core file attachment if you're worried about that, but not worried about my debugging traces

visibility: public → private

Does the core file contain anything other than the state of the init daemon? If not, then this can be marked un-private.


First up, here's a dump of the jobs that you have registered. I've put the names, followed by the goal and state in brackets, and then looked at the "cause" and "blocked" pointers; NULL means there wasn't one.

Please shout if any of this system state looks unusual to you:

rc jobs, all stopped:
  rcS (stop/waiting) NULL NULL
  rcS-sulogin (stop/waiting) NULL NULL
  rc0 (stop/waiting) NULL NULL
  rc1 (stop/waiting) NULL NULL
  rc2 (stop/waiting) NULL NULL
  rc3 (stop/waiting) NULL NULL
  rc4 (stop/waiting) NULL NULL
  rc5 (stop/waiting) NULL NULL
  rc6 (stop/waiting) NULL NULL

This is what I'd expect; the system has booted, so there's no runlevel change going on right now.

tty jobs, both running:
  tty1 (start/running) NULL NULL
  tty2 (start/running) NULL NULL

Again, what we'd expect.

Now, a bunch of system services, some running, some not:
  autofs (start/running) NULL NULL
  acpid (start/running) NULL NULL
  haldaemon (start/running) NULL NULL
  messagebus (start/running) NULL NULL
  netplugd (start/running) NULL NULL
  prefdm (start/running) NULL NULL
  rsyslog (start/running) NULL NULL

  logd (stop/waiting) NULL NULL
  microcode_ctl (stop/waiting) NULL NULL
  netfs (stop/waiting) NULL NULL
  network (stop/waiting) NULL NULL
  serial (stop/waiting) NULL NULL
  sshd (stop/waiting) NULL NULL
  sulogin (stop/waiting) NULL NULL

Now your custom bits:

some jobs are stopped:
  ezono-cyclades-calibration (stop/waiting) NULL NULL
  ezono-cyclades-pre-start (stop/waiting) NULL NULL

  ezono-cyclades-t_ane (stop/waiting) NULL NULL
  ezono-cyclades-t_ccs (stop/waiting) NULL NULL
  ezono-cyclades-t_emg (stop/waiting) NULL NULL
  ezono-cyclades-t_mco (stop/waiting) NULL NULL
  ezono-cyclades-t_mse (stop/waiting) NULL NULL
  ezono-cyclades-t_par (stop/waiting) NULL NULL
  ezono-cyclades-t_pro (stop/waiting) NULL NULL
  ezono-cyclades-t_ssm (stop/waiting) NULL NULL
  ezono-cyclades-t_upg (stop/waiting) NULL NULL
  ezono-cyclades-t_xss (stop/waiting) NULL NULL

These might be "due to be started" but not yet reached; that's ok - if Upstart hasn't reached them, they're not relevant yet.

some are running:
  ezono-cyclades-ac-adapter (start/running) NULL NULL
  ezono-cyclades-list-tasks (start/running) NULL NULL
  ezono-cyclades-stop-software-control (start/running) NULL NULL
  ezono-cyclades-suspend-control (start/running) NULL NULL
  ezono-cyclades-xss-control (start/running) NULL NULL
  ezono-malta-fpga-control (start/running) NULL NULL

  ezono-cyclades-t_bbd (start/running) NULL NULL
  ezono-cyclades-t_cal (start/running) NULL NULL
  ezono-cyclades-t_dso (start/running) NULL NULL
  ezono-cyclades-t_fcl (start/running) NULL NULL
  ezono-cyclades-t_fem (start/running) NULL NULL
  ezono-cyclades-t_emg-control (start/running) NULL NULL
  ezono-cyclades-t_mco-control (start/running) NULL NULL
  ezono-cyclades-t_prd (start/running) NULL NULL
  ezono-cyclades-t_scp (start/running) NULL NULL
  ezono-cyclades-t_smr (start/running) NULL NULL
  ezono-cyclades-t_ssc (start/running) NULL NULL
  ezono-cyclades-t_upg-control (start/running) NULL NUL...


(For the record, I'm looking at core.10726)

Slawomir: The core file contains the complete state of the init daemon, which includes the parsed result of your job files and their names, as well as their pids and suchlike.

The start conditions for the other jobs are all:

  start on started ezono-cyclades-t_smr

The state shows this has started.

Thus we can theorise that when ezono-cyclades-t_smr started, its event was created and attached to these jobs, and they began starting - but for whatever reason, the event was discarded prematurely.

Could you confirm that these four jobs all have this start condition, and no other jobs have that?

I see you have rsyslogd running

Could you run "initctl log-priority debug", make sure that the output appears in your logs ("initctl version" is a good test for that), and then try doing your date change again?

("initctl list" output before would be nice too)

Thanks

Yes, all the jobs are defined as services.
The four jobs which have an invalid pointer to their cause all use the same stanza, "start on started ezono-cyclades-t_smr", so I guess this event should still be in the list.

I'm attaching /var/log/messages, the output of "initctl list" from before the SW was started, and the new core file.

Thanks for all your information so far.

Out of interest, at which point in the log did you change the date?

The log only contains what happened while the SW was being started. Do you also want to see what happens when the date is changed and the SW is stopped?

Yes, please - it's helpful to see the whole picture

Here it is.
In /var/log/messages I marked the point where the date was changed with "Changing date to previous day" (although it's clear anyway, because the syslog timestamps change) and the point where upstart finished stopping my software with "SW stopped".

I got permission to attach the upstart job files. I zipped the whole /etc/event.d/ directory. You can mark this bug un-private.

visibility: private → public

Thanks very much for all your help so far.

From the logs, this looks like the key moment:

May 17 15:13:32 sonostation-usb26-eth init: Handling started event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm respawning too fast, stopped
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm respawning too fast, stopped
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm goal changed from start to stop
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssm state changed from starting to waiting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending stopped event

This starting event is "started ezono-cyclades-t_smr", which we've established is the one that is unexpectedly freed.

Two things strike me about this:

 - "respawning too fast" - this wrongly uses the system time, so is one of the very few things to be broken by spooling the clock back a day

 - there's two of them printed, that's odd...
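[Editor's note: to see why winding the clock back matters here, consider a simplified model of a wall-clock-based respawn-rate check. This is an illustration of the failure mode only, not upstart's actual code; the field names mirror the Job fields visible in the core dump above:

#include <stdio.h>
#include <time.h>

typedef struct {
        int    respawn_count, respawn_limit;
        time_t respawn_time, respawn_interval;
} JobModel;                     /* hypothetical stand-in for Job */

static int
respawning_too_fast (JobModel *job)
{
        time_t now = time (NULL);       /* wall clock; jumps with date(1) */

        if (now < job->respawn_time + job->respawn_interval) {
                /* Apparently still inside the rate-limit window. */
                if (++job->respawn_count > job->respawn_limit)
                        return 1;       /* "respawning too fast, stopped" */
        } else {
                job->respawn_time = now;
                job->respawn_count = 1;
        }
        return 0;
}

int
main (void)
{
        /* respawn_time a day "in the future", as after winding the
         * date back; limit/interval match the job dump above */
        JobModel job = { .respawn_limit = 2, .respawn_interval = 30,
                         .respawn_time = time (NULL) + 86400 };

        printf ("%d\n", respawning_too_fast (&job));  /* 0 */
        printf ("%d\n", respawning_too_fast (&job));  /* 0 */
        printf ("%d\n", respawning_too_fast (&job));  /* 1: limit hit */
        return 0;
}

Because "now" is roughly 86400 seconds earlier than the recorded respawn_time, the window test stays true for a whole day and the counter accumulates until the limit trips spuriously.]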

The next sequence of log messages is everything else starting from this event:

May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-stop-software-control goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-stop-software-control state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_dso goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_dso state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssc goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_ssc state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_fem goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_fem state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_cal goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_cal state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_prd goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_prd state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth init: event_emit: Pending starting event
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_scp goal changed from stop to start
May 17 15:13:32 sonostation-usb26-eth init: ezono-cyclades-t_scp state changed from waiting to starting
May 17 15:13:32 sonostation-usb26-eth ...


Thank you again for all your help.

This bug is indeed caused by the first process failing to respawn.

It takes a reference to the emitted event and attempts to start; this fails because the respawn limit has been hit, so the job stops again and puts back its reference to the emitted event. Because it was the first process to do so, the emitted event once again has zero references - and so is marked EVENT_FINISHED.

Even though further processes take a reference to the event, it's already too late - the event has been scheduled for deletion, and Upstart deletes it at the next opportunity.
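[Editor's note: this sequence is easy to model in isolation. The following self-contained sketch uses simplified stand-in types, not upstart's code, to demonstrate the zero-crossing problem just described:

#include <stdio.h>

typedef enum { EVENT_HANDLING, EVENT_FINISHED } Progress;

typedef struct {
        int      jobs;          /* jobs referencing this emission */
        Progress progress;
} EventEmission;

/* A job putting back its reference; marking the emission finished as
 * soon as the count touches zero is the behaviour described above. */
static void
emission_unref (EventEmission *em)
{
        if (--em->jobs == 0)
                em->progress = EVENT_FINISHED;  /* scheduled for deletion */
}

int
main (void)
{
        EventEmission em = { 0, EVENT_HANDLING };

        em.jobs++;              /* first job takes a reference ...      */
        emission_unref (&em);   /* ... hits the respawn limit and puts
                                 * it back: the count is zero again, so
                                 * the emission is marked finished      */

        em.jobs += 3;           /* later jobs take references, too late */

        printf ("jobs=%d finished=%d\n", em.jobs,
                em.progress == EVENT_FINISHED);
        /* prints "jobs=3 finished=1": the emission will still be freed
         * even though three jobs now hold it as their ->cause */
        return 0;
}
]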

Changed in upstart:
status: Incomplete → Triaged
summary: - init: segfault after changing date
+ init: failure to respawn finishes the start event, even if other jobs
+ reference it later

Will need a new upload of 0.3 for this

Here's a test case for this:

        TEST_FEATURE ("with respawn limit and second job");
        event = (Event *)job2->stop_events.next;
        nih_list_add (&job2->start_events, &event->entry);

        em = event_emit ("wibble", NULL, NULL);
        em->progress = EVENT_HANDLING;

        TEST_ALLOC_FAIL {
                em->jobs = 0;

                job1->goal = JOB_STOP;
                job1->state = JOB_WAITING;
                job1->cause = NULL;

                job2->goal = JOB_STOP;
                job2->state = JOB_WAITING;
                job2->cause = NULL;

                /* job2 hashes first */
                job2->respawn_limit = 10;
                job2->respawn_interval = 1000;
                job2->respawn_time = time (NULL);
                job2->respawn_count = 10;

                TEST_DIVERT_STDERR (output) {
                        job_handle_event (em);
                }
                rewind (output);

                TEST_EQ (em->jobs, 1);
                TEST_EQ (em->progress, EVENT_HANDLING);

                TEST_EQ (job1->goal, JOB_STOP);
                TEST_EQ (job1->state, JOB_WAITING);
                TEST_EQ_P (job1->cause, NULL);

                TEST_EQ (job2->goal, JOB_START);
                TEST_EQ (job2->state, JOB_STARTING);
                TEST_EQ_P (job2->cause, em);

                TEST_FILE_EQ (output, "test: foo respawning too fast, stopped\n");
                TEST_FILE_END (output);
                TEST_FILE_RESET (output);
        }

        nih_list_free (&em->event.entry);

        nih_list_free (&job2->entry);

If you could try applying the job.c change in this patch, it should hopefully fix the crasher: http://bazaar.launchpad.net/~scott/upstart/0.3/revision/706
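[Editor's note: for readers who can't chase the link, the change implied by the analysis above is roughly the guard below - a paraphrase reusing the toy types from the earlier sketches, not the literal revision-706 diff:

/* Put back a job's reference to the emission; only finish it once it
 * is no longer being handled, since jobs later in the iteration may
 * still take a reference. */
static void
emission_unref_fixed (EventEmission *em)
{
        em->jobs--;

        if (em->jobs == 0 && em->progress != EVENT_HANDLING)
                em->progress = EVENT_FINISHED;
}
]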

The fix solves the crash. Thanks a lot for such a quick response.

This bug has been fixed in Upstart 0.3.11

0.3.11 2009-06-19 "For Friday, June 19th 2009, I'm Jon Masters"

        * Fixed crash when the first job started by an event hits the
          respawn limit and frees the event, even though other jobs are
          also started by the same event. (Bug: #388873)

        * Added temporary support for "telinit u" until we have true re-exec
          support. This will be replaced by an initctl command in future.
          (Bug: #388742)

        * Corrected formatting of initctl(8) manpage. (Bug: #388745)

The fix also works with 0.3.9.

I've gone through the Upstart 0.5 and trunk code, and am confident that they are not affected by this bug.

Changed in upstart:
status: Triaged → Fix Released
Changed in upstart (Fedora):
status: Confirmed → Won't Fix
Changed in upstart (Fedora):
status: Won't Fix → In Progress
Changed in upstart (Fedora):
status: In Progress → Fix Committed
Changed in upstart (Fedora):
status: Fix Committed → Fix Released