Configuration reload clears event that others jobs may be waiting on

Bug #1124384 reported by Ben Howard on 2013-02-13
84
This bug affects 13 people
Affects Status Importance Assigned to Milestone
cloud-init
High
Unassigned
cloud-init (Ubuntu)
High
Unassigned
Raring
Undecided
Unassigned
Saucy
High
Unassigned
upstart (Ubuntu)
Medium
James Hunt
Raring
Undecided
Dimitri John Ledkov
Saucy
Medium
James Hunt

Bug Description

[Impact]

 * The status of blocked events was not preserved, when upstart
   performed stateful re-execution or configuration reload. Thus jobs
   with complex start/stop conditions (one or more "and" clauses),
   with at least one event emitted before re-exec/reload, may not
   execute when remaining conditions are finally satisfied.

 * Above may prevent certain system to function correctly, and in the
   cases similar to cloud-init instances may even cause failure to
   boot.

 * The fix includes incriminating reference counts on blocked events,
   whilst job configuration is reloaded and fully serialising all
   upstart objects, including blocked events, during stateful
   re-execution.

 * Since previous versions of upstart, do not serialise blocked events
   the upgrade needs special casing. On upgrade upstart will perform
   stateful re-execution, unless runlevel 2 has been already
   reached. Instead upstart, will re-executed at system shutdown. This
   should allow upgrading upstart during early boot of cloud-init
   instances. But do note, that old instance of upstart will still be
   running as init and the running machine will still be affected by
   the bug described here.

[Test Case]

 * Create a sample job /etc/init/foo.conf similar to this:

start on (event1 and event2)
task
exec date

 * Test reload configuration works correctly:

$ sudo status foo
foo stop/waiting
$ sudo initctl emit -n event1
$ sudo initctl reload-configuration
$ sudo initctl emit -n event2
$ sudo tail /var/log/upstart/foo.log

At the end one should see a timestamp appended in the foo.log.

 * Test stateful re-exec works correctly:

$ sudo initctl emit -n event1
$ sudo telinit u
$ sudo initctl emit -n event2
$ sudo tail /var/log/upstart/foo.log

 * Start an ubuntu-cloud image (in lxc or cloud) with apt-get update & upgrade enabled going from upstart version without this fix included to a one that does have it. Cloud-final should finish and boot-finished under /var/lib/cloud/instances/*/boot-finished. Please note this test should be performed in isolation from dbus security update that does partial stateful re-exec at the moment.

[Regression Potential]

 * The bug fix introduced here is fairly large (approx 1.5k line diff)
   but comes with comprehensive set of test-suites to verify the two
   bug fixes as well as all possible combinations of stateful
   re-execution serialisation formats. Majority of code changes are
   for additional [de]serialisation, which follow existing well tested
   code pattern. And changes to reference counting have been carefully
   reviewed and tested by multiple developers.

 * While the bug report indicates a severe problem, it was not noticed
   until recently, as the system must be under heavy race conditions
   to become affected by this bug. Since systems reaching stable state,
   with little or no blocked events left, would not normally be
   affected.

 * Overall regression potential is deemed low.

[Original Bug Report]

Under bug 1080841 we made cloud-init invoke 'initctl reload-configuration' after it wrote a upstart job. This was necessary because inotify is not supported on all filesystems (overlayfs being the one of most current interst).

This seems to be causing upstart some pain, and resulting in cloud-final (and 'rc') not being run.

Easy user-data to reproduce the problem is:

#cloud-config-archive
- content: |
   #cloud-boothook
   #!/bin/sh
   touch /run/cloud-init-upstart-reload # hack, see trunk commit 783
- content: |
   #!/bin/sh
   echo "==== $(date -R): user-script run ===" | tee /run/user-script.log
- content: |
   #upstart-job
   description "a test upstart job"
   start on stopped rc RUNLEVEL=[2345]
   console output
   task
   script
   echo "==== $(date -R): upstart job run ===" | tee /run/upstart-job.log
   end script

You should (and do on quantal) end up with 2 files written to /run.

I've verified that the same behavior is true on quantal. If you change cloud-init to notify upstart about a job immediately after it writes it, then quantal's upstart gets confused also.

Related bugs:
 * bug 1080841: should reload configuration if an upstart job is added
 * bug 1103881: cloud-final is never executed if upstart is upgraded during initialization of the image

Related branches

Ben Howard (utlemming) on 2013-02-13
affects: ubuntu → cloud-init (Ubuntu)
Scott Moser (smoser) wrote :

The issue here is really that /tmp is getting cleaned after the above runs.
Then, the test harness is going looking for the file, but its long gone.

I'm not sure if its specifically a bug or not to rely on runcmd running after /tmp cleaning has run. but it does seem like /tmp is being cleaned dangerously late in the boot process, given that runcmd is happening via cloud-config.conf at
   start on (filesystem and started rsyslog)

At what point in boot is it safe to put files in /tmp (even temporary files) and assume that they will not be deleted by 'mounted-tmp' ?

affects: cloud-init (Ubuntu) → mountall (Ubuntu)
Steve Langasek (vorlon) wrote :

'mounted' events are handled synchronously in mountall, and the 'filesystem' isn't emitted until they're all done. Prelim tests seem to confirm that this is working correctly (and if it wasn't that would be a very bad bug, that I think we would have caught before now).

In a test instance from smoser, I've confirmed that a file touched directly by /etc/init/cloud-config.conf persists, whereas a file touched from within /var/lib/cloud/instance/scripts/runcmd is not created. As I'm not really familiar with the cloud-init architecture I can't effectively debug this further, but it's definitely not mountall causing the issue.

affects: mountall (Ubuntu) → cloud-init (Ubuntu)
Scott Moser (smoser) on 2013-02-14
summary: - cloud-init parses yaml incorrectly
+ reload-configuration can confuse upstart
Scott Moser (smoser) on 2013-02-14
description: updated
Scott Moser (smoser) wrote :
Changed in cloud-init (Ubuntu):
status: New → Confirmed
Changed in upstart (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Scott Moser (smoser) wrote :

just a bit more info, the cloud-config content above results in writing /etc/init/part-002.conf a an upstart job.
The debug logs shown here had that file in place (from a previous boot), and cloud-init overwriting it, then calling (or not calling) reload-configuration.

Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1124384

tags: added: iso-testing
Scott Moser (smoser) wrote :

I mistakingly posted recreate instructions at https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1103881/comments/6 . they're not comletely irrelevant there, but I did intend for that to go here.

Scott Moser (smoser) on 2013-02-27
description: updated
description: updated
Scott Moser (smoser) wrote :

I've verified this is still an issue with the following user-data on upstart at '1.8-0ubuntu1'.
This is re-creatable via the most recent daily image (ubuntu-raring-daily-amd64-server-20130403). I used ami-88a4c1e1.

But you can download and recreate as above. The user-data provided to recreate is below. The 'createmark' is necessary because we've disabled the 'initctl reload' call to lessen the issue. touching that file turns the call back on.

#cloud-config-archive
- filename: createmark
  content: |
   #cloud-boothook
   #!/bin/sh
   touch /run/cloud-init-upstart-reload
- content: |
   #!/bin/sh
   echo "==== $(date -R): user-script run ===" | tee /run/user-script.log
- filename: myjob.conf
  content: |
   #upstart-job
   description "a test upstart job"
   start on stopped rc RUNLEVEL=[2345]
   console output
   task
   script
   echo "==== $(date -R): upstart job run ===" | tee /run/upstart-job.log
   end script
- content: |
   #cloud-config
   password: passw0rd
   chpasswd: { expire: False }
   ssh_pwauth: True

Scott Moser (smoser) wrote :

Ok, another recreate that is possibly simpler to debug. Below, we create 2 jobs, one invokes reload-configuration, the other just writes something to /run. We add those jobs to a pristine raring container, start it, and expect them to run. Only the first will run. I've explicitly ran 'start console' from the first so you can log in, otherwise you wont get a getty, so you can't log in to look around.

cat <<"EOF" | tee job1.conf
description "invoke reload for fun"
start on filesystem
console output
task
script
  read x y < /proc/uptime
  cmd="initctl reload-configuration"
  (
  echo "==== up $x: running $cmd ==="
  initctl reload-configuration
  echo "got $?"
  ) 2>&1 | tee /run/job1.log
  sh -c 'sleep 5; start console' &
end script
EOF

cat <<"EOF" | tee job2.conf
description "do something around rc.local"
start on stopped rc RUNLEVEL=[2345]
console output
task
script
  read x y < /proc/uptime
  echo "==== up $x: touching /run/did-something ===" | tee /run/job2.log
  echo "$x" > /run/did-something
end script
EOF

# precise base
sudo apt-get install lxc
sname="raring-amd64-source"
cname="raring-amd64"

sudo lxc-create -n raring-amd64-source -t ubuntu -- --arch=amd64 raring
sudo lxc-clone -o $sname -n $cname
sudo cp job1.conf job2.conf /var/lib/lxc/$cname/rootfs/etc/init/
sudo lxc-start -n $cname

## now log in as 'ubuntu'. You *should* see a file /run/did-something.

Colin Watson (cjwatson) wrote :

16:54 <cjwatson> jodh: I *think* I may see the general structure of what's going on here
16:54 <cjwatson> jodh: This debugging should make it about as clear to you as it is to me so far, I think
16:54 <cjwatson> <7>init: conf_load_path_with_override: Loading configuration file /etc/init/rc-sysinit.conf
16:54 <cjwatson> ...
16:54 <cjwatson> <7>init: conf_file_destroy: Destroyed unused job rc-sysinit
16:54 <cjwatson> <7>init: event_unblock: name: 'filesystem', new blockers: 3
16:55 <cjwatson> jodh: When we tear down the old job, we end up unreferencing and destroying the event operators it refers to, at least enough to cause them to decrement various event->blockers
16:56 <cjwatson> jodh: So the 'filesystem' event ends up entering the finished state far too early because it's been wrongly unblocked
17:08 <jodh> cjwatson: gotcha - that is indeed subtle. Could be an interesting one to fix too ;)
17:10 <cjwatson> jodh: My feeling is that the nih_free in conf_file_destroy needs to be something more careful involving keeping references
17:10 <cjwatson> jodh: But that also perhaps job_class_reconsider shouldn't replace a job that has events that are blocking some other event? I'm not quite sure, I don't know the structures quite well enough
17:10 <cjwatson> jodh: Is that enough to get you going for now though?
17:11 <jodh> cjwatson: sure is - thanks!!

James Hunt (jamesodhunt) wrote :

I believe the problem here is that -- in the case where a JobClass has no running instances -- a reload is causing the original JobClass to be destroyed. That eventually leads to Event->blockers to reach zero (as a result of the JobClasses EventOperators being freed) causing the event to be destroyed. That behaviour is correct inasmuch as events are being destroyed when they should. What is not correct is that the reload results in all this happening *before* the newly-loaded JobClass gets a chance to increment Event->blockers to avoid the event being freed.

summary: - reload-configuration can confuse upstart
+ Configuration reload clears event that others jobs may be waiting on
Stéphane Graber (stgraber) wrote :

So we've spent a good part of the afternoon going through the two cloud-init bugs with James and came to the conclusion that they are actually the same bug.

Both initctl --reload-configuration and an upstart stateful re-exec cause upstart to reload its configuration, destroy existing jobclass entries and create new ones.

As part of the process of destroy and re-creating job class entries, upstart decrements the reference counter of some related objects, including emitted events.
That has the result that if a job depends on two events, one that has already been emitted and another that hasn't been emitted yet and that the job that emitted the first event is being reloaded, then the record of that event will be dropped, leading to a failure to start the job (as only half the start condition will match).

The part of the code that causes this issue is post-reexec, which means that once we come up with a fix for this, we'll be able to SRU it and have upstart re-exec itself, applying the fix in the process.
That also means that we can't SRU any of upstart's dependencies until this issue is resolved.

James is currently working on testcases for the various scenarios that we know we need to support, so we can have comprehensive regression tests before we attempt to sort this issue. Our current hope is to have a fix for this by the end of the week.

James Hunt (jamesodhunt) wrote :

The attached is a minimal fix for the problem that applies to lp:upstart.

I've tested this in Scotts LXC test scenario with reload-configuration being called mid-boot and the system boots correctly.

We still need to finish writing the tests for this fix - that work will have to be completed at the sprint next week.

Changed in upstart (Ubuntu):
assignee: nobody → James Hunt (jamesodhunt)
tags: added: patch
Dimitri John Ledkov (xnox) wrote :

ppa:xnox/backports has the updated upstart package that has the following fixes:

* reload configuration always works and all events are preserved
* the postinst script was modified to not execute stateful re-exec if runlevel 2 has not been reached yet

The reason for the latter, is that we still loose events that are blocking jobs that are not running yet across the stateful re-exec, because EventJobOperators (aka start_on & stop_on conditions) are not serialised to include progress information w.r.t. their states (e.g. TRUE or FALSE for each condition to tell apart if it has fired or not yet).

The above fixes allow to always run reload-configuration, including early boot.
And safeguard against automatic re-execution if upstart package is upgraded during early boot.

To fix stateful re-execution in early boot, we need to serialise all objects used by upstart which upstream has avoided to do so far. Also it's questionable if stateful re-execution in early boot is a desired requirement.

On Wed, May 01, 2013 at 12:11:19AM -0000, Dmitrijs Ledkovs wrote:
> To fix stateful re-execution in early boot, we need to serialise all
> objects used by upstart which upstream has avoided to do so far. Also
> it's questionable if stateful re-execution in early boot is a desired
> requirement.

No, it isn't. cloud-init needs to be able to run a dist-upgrade of
arbitrary packages, including upstart and core libraries, prior to starting
services in a cloud instance.

Scott Moser (smoser) on 2013-05-15
Changed in cloud-init:
status: New → Confirmed
importance: Undecided → High
description: updated
Dimitri John Ledkov (xnox) wrote :

upstart (1.8-0ubuntu4) saucy; urgency=low

  [ James Hunt ]
  * Cherry pick lp:~jamesodhunt/upstart/serialise-remaining-objects (LP:#1124384):
    - Preserve event blockers during configuration reload.
    - Serialise & deserialise blocking events during stateful
      re-execution.
  * debian/manpages/upstart-events.7: Added ':sys:restarted'.
  * debian/upstart.postinst: Don't re-exec if runlevel is not 2
    yet (to allow cloud-init to dist-upgrade upstart safeily in early boot).

  [ Dmitrijs Ledkovs ]
  * Upload to saucy.
  * Use "serial-tests" harness during build, instead of the automake's
    1.13 default "parallel-tests" harness.
  * Enable parallel build by default.
 -- Dmitrijs Ledkovs <email address hidden> Thu, 06 Jun 2013 16:15:09 +0100

Changed in upstart (Ubuntu Saucy):
status: Confirmed → Fix Released
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu Raring):
status: New → Confirmed
Changed in upstart (Ubuntu Raring):
status: New → Confirmed

Hello Ben, or anyone else affected,

Accepted upstart into raring-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/upstart/1.8-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in upstart (Ubuntu Raring):
status: Confirmed → Fix Committed
tags: added: verification-needed

As a part of the Stable Release Updates quality process a search for Launchpad bug reports using the version of upstart from raring-proposed was performed and bug 1195955 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression remove the "verification-failed" tag from this bug report and tag 1195955 "bot-stop-nagging". Thanks!

tags: added: verification-failed

I confirm the regression in upstart 1.8-0ubuntu1.1 in Raring which also exists in 1.8-0ubuntu7 in Saucy caused by a typo in upstart.postinst:
 83 elif dpkg --compare-version "$UPSTART_VERSION_RUNNING" ge 1.6.1; then

The right option is "--compare-versions"

Changed in upstart (Ubuntu Raring):
assignee: nobody → Dmitrijs Ledkovs (xnox)

Hello Ben, or anyone else affected,

Accepted upstart into raring-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/upstart/1.8-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-failed
Steve Langasek (vorlon) wrote :

Bug #1199778 has been identified as a regression in the new version of upstart in saucy. We are assuming this bug also affects the serialization backport that was uploaded to raring-proposed. As a result, this SRU is being marked verification-failed and the current package has been withdrawn from raring-proposed to avoid other users installing it and hitting this bug while it's being resolved.

tags: added: verification-failed
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.3~bzr826-0ubuntu1

---------------
cloud-init (0.7.3~bzr826-0ubuntu1) saucy; urgency=low

  * New upstream snapshot.
    * Fix omnibus support (LP: #1182265)
    * invoke 'initctl reload-configuration' on upstart jobs again
      (LP: #1124384)
  * Remove unowned files in /etc/apt/apt.conf.d/ after purge. (Closes #674237)
 -- Scott Moser <email address hidden> Wed, 10 Jul 2013 13:35:59 -0400

Changed in cloud-init (Ubuntu Saucy):
status: Confirmed → Fix Released
Scott Moser (smoser) wrote :

this was fixed in cloud-init runk at revno 824. It will be released in 0.7.3

Steve Langasek (vorlon) wrote :

Landing this fix in raring was blocked by bug #1199778, which we ran out of time to resolve before the 13.10 release. My work-in-progress branch is available at lp:~vorlon/ubuntu/raring/upstart/lp.1199778 in case anyone else wants to pick this up for 13.04. Otherwise, this bug is 'wontfix'.

Changed in upstart (Ubuntu Raring):
status: Fix Committed → Won't Fix
Rolf Leggewie (r0lf) wrote :

raring has seen the end of its life and is no longer receiving any updates. Marking the raring task for this ticket as "Won't Fix".

Changed in cloud-init (Ubuntu Raring):
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers