init: using 'and' operators can cause hangs

Reported by Steve Langasek on 2009-10-09
This bug affects 10 people
Affects               Importance  Assigned to
upstart               High        Unassigned
  Declined for 0.1 by Scott James Remnant (Canonical)
  Declined for 0.2 by Scott James Remnant (Canonical)
  Declined for 0.3 by Scott James Remnant (Canonical)
  Declined for 0.5 by Scott James Remnant (Canonical)
  Declined for 0.6 by Scott James Remnant (Canonical)
  Declined for Trunk by Scott James Remnant (Canonical)
mountall (Ubuntu)     High        Steve Langasek
  Karmic              High        Steve Langasek
  Lucid               High        Steve Langasek
nfs-utils (Ubuntu)    High        Steve Langasek
  Karmic              High        Steve Langasek
  Lucid               High        Steve Langasek
upstart (Ubuntu)      Medium      Unassigned
  Karmic              Undecided   Unassigned
  Lucid               Medium      Unassigned

Bug Description

Event operators are reset each time they become TRUE, with the blocking state being transferred to the actual instance that is started. This means that combining 'and' and 'or' operators in one condition leads to undesirable behaviour.

For example, in /etc/init/quest.conf:

  start on gandalf and (bilbo or thorin)

When gandalf arrives, he'll block waiting for bilbo or thorin. If bilbo then arrives, the operator tree is complete and the quest can start.

If thorin then arrives, he'll block waiting for gandalf. Unfortunately gandalf has already gone, so he has little choice but to sit down and start singing about gold.
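
The sequence above can be reproduced with a small toy model. This is only a sketch of the behaviour as described in this report, not Upstart's actual implementation; the names `Match`, `Op`, `Job`, `held` and `reset` are made up for illustration.

```python
# Toy model of the operator tree described above; not Upstart's real code.

class Match:
    """Leaf: becomes true when its event arrives and holds (blocks) that event."""

    def __init__(self, name):
        self.name = name
        self.event = None

    def handle(self, event):
        if event == self.name:
            self.event = event

    def value(self):
        return self.event is not None

    def held(self):
        return [self.event] if self.event is not None else []

    def reset(self):
        self.event = None


class Op:
    """Inner node combining two sub-trees with 'and' or 'or'."""

    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right

    def handle(self, event):
        self.left.handle(event)
        self.right.handle(event)

    def value(self):
        if self.kind == "and":
            return self.left.value() and self.right.value()
        return self.left.value() or self.right.value()

    def held(self):
        return self.left.held() + self.right.held()

    def reset(self):
        self.left.reset()
        self.right.reset()


class Job:
    def __init__(self, start_on):
        self.start_on = start_on
        self.running = False
        self.blocking = []  # events handed over to the started instance

    def emit(self, event):
        self.start_on.handle(event)
        if self.start_on.value():
            # Tree is TRUE: move the blocked events to the instance, then reset.
            self.blocking = self.start_on.held()
            self.start_on.reset()
            self.running = True


# start on gandalf and (bilbo or thorin)
quest = Job(Op("and", Match("gandalf"), Op("or", Match("bilbo"), Match("thorin"))))
for name in ("gandalf", "bilbo", "thorin"):
    quest.emit(name)

print(quest.blocking)         # ['gandalf', 'bilbo']
print(quest.start_on.held())  # ['thorin'], still waiting for another gandalf
```

In this model thorin stays held in the reset tree because the 'and' node needs a fresh gandalf before it can become TRUE again.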

A short-term fix is not to combine event operators this way, and instead separate them out into separate jobs. For example if we had an /etc/init/quest/member.conf with:

  start on bilbo or thorin

Then /etc/init/quest.conf would have:

  start on gandalf and started quest/member

In this case when gandalf arrives, he'd still block on a quest member. Bilbo then arrives, "starting" the quest/member job and thus also starting the quest.

If thorin then arrives, the quest/member job has already been started so he doesn't block waiting for it to start.
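
A rough sketch of why the split works, again as a toy model rather than Upstart's real code: each job uses a single operator type, and a job that starts emits a "started" event for the next one. The `Job` class, its fields and the `emit` helper are hypothetical.

```python
# Toy model of the two-job workaround; not Upstart's real code.

class Job:
    """Job whose 'start on' uses one operator type only ('and' or 'or')."""

    def __init__(self, name, kind, events):
        self.name, self.kind = name, kind
        self.wanted = list(events)
        self.held = []       # events currently blocked in the operator tree
        self.running = False

    def emit(self, event):
        """Feed one event; return a 'started' event name if the job starts now."""
        if event not in self.wanted or event in self.held:
            return None
        self.held.append(event)
        if self.kind == "and" and len(self.held) < len(self.wanted):
            return None      # still waiting for the other events
        self.held = []       # condition is TRUE: the tree is reset
        if self.running:
            return None      # already started, so nothing is left waiting
        self.running = True
        return "started " + self.name


member = Job("quest/member", "or", ["bilbo", "thorin"])
quest = Job("quest", "and", ["gandalf", "started quest/member"])


def emit(event):
    """Deliver an event to both jobs, following any 'started' events."""
    queue = [event]
    while queue:
        current = queue.pop(0)
        for job in (member, quest):
            started = job.emit(current)
            if started:
                queue.append(started)


for name in ("gandalf", "bilbo", "thorin"):
    emit(name)

print(quest.running)  # True
print(member.held)    # [], thorin is not left blocked
```

Because the 'or' job completes on every member event, nothing is ever left waiting on a condition that needs gandalf to arrive again.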

The longer term fix is included in the move to "while", which means that we'd remember that we had gandalf so when thorin arrived he'd know the quest was already started - and he could either start a new quest or catch up with the existing one.

Steve Langasek (vorlon) wrote :
Changed in mountall (Ubuntu):
importance: Undecided → High
milestone: none → ubuntu-9.10
tags: added: ubuntu-boot
Steve Langasek (vorlon) wrote :

This code turned out to be the culprit:

  } else if (mount_parent && (mount_parent->tag == TAG_LOCAL)
      && strcmp (mount_parent->mountpoint, "/")) {
   mnt->tag = TAG_LOCAL;
   num_local++;
   nih_debug ("%s is local (inherited)", mnt->mountpoint);
  }

This code was added so that virtual filesystems didn't get marked as virtual if they were waiting on other local filesystems, but instead had the effect that all remote filesystems mounted anywhere except / were also treated as local.
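
To illustrate the effect, here is a minimal sketch of the tagging decision. The branches around the snippet are not shown in this thread, so the ordering, the `Mount` class, the type lists and both function names are assumptions, and the second function is one possible correction rather than the actual mountall patch.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: illustrative type lists, not the ones mountall uses.
REMOTE_TYPES = {"nfs", "nfs4"}
VIRTUAL_TYPES = {"tmpfs", "proc"}


@dataclass
class Mount:
    mountpoint: str
    fstype: str
    parent: Optional["Mount"] = None
    tag: Optional[str] = None


def base_tag(mnt):
    """Tag a mount by its own filesystem type only."""
    if mnt.fstype in REMOTE_TYPES:
        return "remote"
    if mnt.fstype in VIRTUAL_TYPES:
        return "virtual"
    return "local"


def inherits_local(mnt):
    """Condition from the snippet: parent is tagged local and is not '/'."""
    parent = mnt.parent
    return parent is not None and parent.tag == "local" and parent.mountpoint != "/"


def tag_as_reported(mnt):
    """Inheritance applies to every mount, so remote ones turn local."""
    return "local" if inherits_local(mnt) else base_tag(mnt)


def tag_remote_stays_remote(mnt):
    """Possible correction: never let a remote mount inherit 'local'."""
    tag = base_tag(mnt)
    if tag != "remote" and inherits_local(mnt):
        return "local"
    return tag


home = Mount("/home", "ext4", tag="local")
share = Mount("/home/share", "nfs4", parent=home)
scratch = Mount("/home/scratch", "tmpfs", parent=home)

print(tag_as_reported(share))            # local
print(tag_remote_stays_remote(share))    # remote
print(tag_remote_stays_remote(scratch))  # local
```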

Checking now that this is the only code that needs fixing.

Changed in mountall (Ubuntu Karmic):
status: New → In Progress
assignee: nobody → Steve Langasek (vorlon)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 0.2.1

---------------
mountall (0.2.1) karmic; urgency=low

  * Make mountall recognize that *remote* filesystems are still remote, given
    that their parent fs is almost always local. LP: #447654.

 -- Steve Langasek <email address hidden> Fri, 09 Oct 2009 20:46:21 -0700

Changed in mountall (Ubuntu Karmic):
status: In Progress → Fix Released
Steve Langasek (vorlon) wrote :

The uploaded fix solves the problem of mountall not emitting the 'local-filesystems' and 'filesystem' signals until all NFS shares are mounted.

However, it does not fix the problem that nfs4 shares will not automount.

This is a regression vs. mountall 0.1.8, and appears to be a problem with mountall failing to gracefully handle errors from the mount attempts the first time around, and then never retrying even when signalled (-USR1).

My best guess as to the cause of the failure on the first mount attempt is that it happens early enough that starting idmap or gssd fails (possibly because mounting of rpc_pipefs fails?); I haven't been able to catch it happening on the console. Whatever the cause, though, it often leaves behind zombie 'mount' processes.

Changed in mountall (Ubuntu Karmic):
assignee: Steve Langasek (vorlon) → nobody
status: Fix Released → Triaged

On Sat, 2009-10-10 at 00:43 +0000, Steve Langasek wrote:

> This code was added so that virtual filesystems didn't get marked as
> virtual if they were waiting on other local filesystems, but instead had
> the effect that all remote filesystems mounted anywhere except / were
> also treated as local.
>
Yes, I just found a dup of this - bug 447649

Will think on that for a few hours before fixing, as the inheriting
stuff might just be flat out wrong.

Scott
--
Scott James Remnant
<email address hidden>

Judging by where the hang is, it's a bug in the nfs upstart confs. Well, strictly speaking it's an Upstart bug in that the event operator stuff is utterly broken, but for now the configs should be written with that in mind.

The problem can be described by this:

  start on A and (B or C)

When A happens, it is blocked.
Then when B happens, it is also blocked, and the job is started.
The operator tree is cleared and the blocked flags passed to the job itself (the job now blocks the events, not the tree)

Now when C happens, it is blocked.
It's waiting for A to happen again.

So if you have (I'm guessing): start on local-filesystems and (net-device-up or mount) then either net-device-up or mount will block forever.

You have to work around it by only using one type of operator.

Steve Langasek (vorlon) wrote :

"using one type of operator" - well, this is probably doable; the only and+or case we have currently is gssd, which wants local-filesystems and (portmap or mount TYPE=nfs4 OPTIONS=sec=*krb5*), and portmap implies local-filesystems at boot time. It's not 100% correct, but it should be usable. Will test and report back.

Prognosis of fixing the upstart bug?

If you do need multiple operators, you can always use multiple jobs.
One with all the ORs and another with the ANDs including it.

Bug fix? Err, it's a bit of a fundamental design flaw. That'll be for
lynx ;)


Steve Langasek (vorlon) on 2009-10-10
Changed in mountall (Ubuntu Karmic):
assignee: nobody → Steve Langasek (vorlon)
status: Triaged → Fix Released
Changed in nfs-utils (Ubuntu Karmic):
assignee: nobody → Steve Langasek (vorlon)
importance: Undecided → High
milestone: none → ubuntu-9.10
status: New → In Progress

worked around for karmic; opening an upstart task regarding the underlying operator bug.

summary: - NFSv4 automounting completely broken?
+ using 'and' and 'or' operators together in upstart job deps causes hangs
Changed in upstart (Ubuntu Karmic):
status: New → Won't Fix
Changed in upstart (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nfs-utils - 1:1.2.0-2ubuntu6

---------------
nfs-utils (1:1.2.0-2ubuntu6) karmic; urgency=low

  * Drop the gssd upstart job's dependency on "local-filesystems"; at boot
    time this is always implied transitively by the dep on portmap, and using
    a combination of 'or' and 'and' operators in the dependency list seems
    to confuse upstart quite badly, causing kerberized mounts to hang at boot.
    LP: #447654.

 -- Steve Langasek <email address hidden> Sat, 10 Oct 2009 20:12:11 +0000

Changed in nfs-utils (Ubuntu Karmic):
status: In Progress → Fix Released
summary: - using 'and' and 'or' operators together in upstart job deps causes hangs
+ init: using 'and' and 'or' operators together causes hangs
Changed in upstart:
status: New → Triaged
importance: Undecided → High

Adopting as the upstream upstart bug for this problem

description: updated
bluedream (wangjinfajimmy) wrote :

Can anybody help me? I get an error when mounting karmic-beta-fs. What's wrong? The error messages:
init:sreadahead main process terminated with status 1
init: caught abort core dumped
init:procps main process terminated with status 255
...

Steve Langasek (vorlon) wrote :

Per discussion with Scott, this is not going to be fixed for Lucid because it's going to require deep changes to upstart that we can't conceivably manage in an LTS. Therefore, marking 'wontfix'.

Changed in upstart (Ubuntu Lucid):
status: Triaged → Won't Fix
summary: - init: using 'and' and 'or' operators together causes hangs
+ init: using 'and' operators can cause hangs

For sanity's sake, I'm closing the Ubuntu tasks for upstream Upstart bugs. I've experimented with having both, but it is just making bugs hard to find now. I will use the policy whereby bugs on the Ubuntu package cover the Ubuntu packaging or patches only; any bugs in the Upstart code are upstream bugs.

Changed in upstart (Ubuntu):
status: Triaged → Invalid