init: using 'and' operators can cause hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
upstart | | High | Unassigned |
mountall (Ubuntu) | | High | Steve Langasek |
Karmic | | High | Steve Langasek |
Lucid | | High | Steve Langasek |
nfs-utils (Ubuntu) | | High | Steve Langasek |
Karmic | | High | Steve Langasek |
Lucid | | High | Steve Langasek |
upstart (Ubuntu) | | Medium | Unassigned |
Karmic | | Undecided | Unassigned |
Lucid | | Medium | Unassigned |
Bug Description
Event operators are reset each time they become TRUE, with the blocking state being transferred to the actual instance that is started. This means that combining 'and' and 'or' operators in one expression leads to undesirable behaviour.
For example, in /etc/init/
start on gandalf and (bilbo or thorin)
When gandalf arrives, he'll block waiting for bilbo or thorin. If bilbo then arrives, the operator tree is complete and the quest can start.
If thorin then arrives, he'll block waiting for gandalf. Unfortunately gandalf has already gone, so he has little choice but to sit down and start singing about gold.
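The sequence above can be replayed with a toy model. This is only a sketch of the behaviour as described in this report, not Upstart's actual implementation: the `Job` class and its `seen`, `blocked` and `starts` fields are hypothetical names, and the model assumes the whole tree is cleared as soon as it evaluates to TRUE.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy model of a job with 'start on gandalf and (bilbo or thorin)'."""
    seen: set = field(default_factory=set)       # events remembered by the operator tree
    blocked: list = field(default_factory=list)  # events still waiting on the tree
    starts: int = 0                              # how many times the job was started

    def condition_met(self) -> bool:
        # gandalf and (bilbo or thorin)
        return "gandalf" in self.seen and bool(self.seen & {"bilbo", "thorin"})

    def emit(self, event: str) -> None:
        self.seen.add(event)
        self.blocked.append(event)
        if self.condition_met():
            # The tree became TRUE: start the job, hand the blocked events
            # over to the new instance, and reset the tree (assumed behaviour).
            self.starts += 1
            self.seen.clear()
            self.blocked.clear()


job = Job()
for event in ("gandalf", "bilbo", "thorin"):
    job.emit(event)
    print(f"{event:8} starts={job.starts} still blocked={job.blocked}")
```

After the third event the job has started once and thorin is the only event left blocked: the reset tree no longer remembers gandalf, so nothing will release him until gandalf is emitted again.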
A short-term fix is not to combine event operators this way, and instead separate them out into separate jobs. For example if we had an /etc/init/
start on bilbo or thorin
Then /etc/init/
start on gandalf and started quest/member
In this case when gandalf arrives, he'd still block on a quest member. Bilbo then arrives, "starting" the quest/member job and thus also starting the quest.
If thorin then arrives, the quest/member job has already been started so he doesn't block waiting for it to start.
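The two-job workaround can be replayed the same way. Again this is a rough sketch under assumptions, not Upstart code: the `Job` class and `dispatch` helper are hypothetical, each tree is assumed to reset once it becomes TRUE, and an event that satisfies the tree of an already-running job is assumed to be released immediately, as the description above states.

```python
class Job:
    """Toy job: 'events' are the names in its start-on tree, 'condition' evaluates it."""

    def __init__(self, name, events, condition):
        self.name = name
        self.events = set(events)
        self.condition = condition  # callable: set of seen events -> bool
        self.seen = set()           # events currently blocked on this tree
        self.running = False

    def emit(self, event):
        """Feed one event to the tree; return any events emitted as a result."""
        if event not in self.events:
            return []
        self.seen.add(event)
        if not self.condition(self.seen):
            return []               # event stays blocked on the tree
        self.seen.clear()           # tree resets once it is TRUE
        if self.running:
            return []               # already started: nothing left to wait for
        self.running = True
        return [f"started {self.name}"]


def dispatch(jobs, event):
    """Deliver an event, and any follow-up 'started' events, to every job."""
    queue = [event]
    while queue:
        current = queue.pop(0)
        for job in jobs:
            queue.extend(job.emit(current))


# start on bilbo or thorin
member = Job("quest/member", {"bilbo", "thorin"}, lambda seen: bool(seen))
# start on gandalf and started quest/member
quest = Job("quest", {"gandalf", "started quest/member"},
            lambda seen: {"gandalf", "started quest/member"} <= seen)

jobs = [member, quest]
for event in ("gandalf", "bilbo", "thorin"):
    dispatch(jobs, event)
    print(event, {job.name: sorted(job.seen) for job in jobs})
```

In this model gandalf is the only event that ever waits, and only until bilbo starts quest/member. When thorin arrives, the single-operator tree of quest/member is satisfied at once and, because the job is already running, he is not left blocked.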
The longer term fix is included in the move to "while", which means that we'd remember that we had gandalf so when thorin arrived he'd know the quest was already started - and he could either start a new quest or catch up with the existing one.
Steve Langasek (vorlon) wrote : | #1 |
Changed in mountall (Ubuntu): | |
importance: | Undecided → High |
milestone: | none → ubuntu-9.10 |
tags: | added: ubuntu-boot |
Steve Langasek (vorlon) wrote : | #2 |
Changed in mountall (Ubuntu Karmic): | |
status: | New → In Progress |
assignee: | nobody → Steve Langasek (vorlon) |
Launchpad Janitor (janitor) wrote : | #3 |
This bug was fixed in the package mountall - 0.2.1
---------------
mountall (0.2.1) karmic; urgency=low
* Make mountall recognize that *remote* filesystems are still remote, given
that their parent fs is almost always local. LP: #447654.
-- Steve Langasek <email address hidden> Fri, 09 Oct 2009 20:46:21 -0700
Changed in mountall (Ubuntu Karmic): | |
status: | In Progress → Fix Released |
Steve Langasek (vorlon) wrote : | #4 |
The uploaded fix solves the problem of mountall not emitting the 'local-filesystems' and 'filesystem' signals until all NFS shares are mounted.
However, it does not fix the problem that nfs4 shares will not automount.
This is a regression vs. mountall 0.1.8, and appears to be a problem with mountall failing to gracefully handle errors from the mount attempts the first time around, and then never retrying even when signalled (-USR1).
My best guess as to the cause of the failure on the first mount attempt is that it happens early enough that starting idmap or gssd fails (possibly because mounting of rpc_pipefs fails?); I haven't been able to catch it happening on the console. Whatever the cause, though, it often leaves behind zombie 'mount' processes.
Changed in mountall (Ubuntu Karmic): | |
assignee: | Steve Langasek (vorlon) → nobody |
status: | Fix Released → Triaged |
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 447654] Re: NFSv4 automounting completely broken? | #5 |
On Sat, 2009-10-10 at 00:43 +0000, Steve Langasek wrote:
> This code was added so that virtual filesystems didn't get marked as
> virtual if they were waiting on other local filesystems, but instead had
> the effect that all remote filesystems mounted anywhere except / were
> also treated as local.
>
Yes, I just found a dup of this - bug 447649
Will think on that for a few hours before fixing, as the inheriting
stuff might just be flat out wrong.
Scott
--
Scott James Remnant
<email address hidden>
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: NFSv4 automounting completely broken? | #6 |
Judging by where the hang is, it's a bug in the nfs upstart confs. Well, strictly speaking it's an Upstart bug in that the event operator stuff is utterly broken, but for now the configs should be written with that in mind.
The problem can be described by this:
start on A and (B or C)
When A happens, it is blocked.
Then when B happens, it is also blocked, and the job is started.
The operator tree is cleared and the blocked flags passed to the job itself (the job now blocks the events, not the tree)
Now when C happens, it is blocked.
It's waiting for A to happen again.
So if you have (I'm guessing): start on local-filesystems and (net-device-up or mount) then either net-device-up or mount will block forever.
You have to work around it by only using one type of operator
Steve Langasek (vorlon) wrote : | #7 |
"using one type of operator" - well, this is probably doable; the only and+or case we have currently is gssd, which wants local-filesystems and (portmap or mount TYPE=nfs4 OPTIONS=
Prognosis of fixing the upstart bug?
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 447654] Re: NFSv4 automounting completely broken? | #8 |
If you do need multiple operators, you can always use multiple jobs.
One with all the ORs and another with the ANDs including it.
Bug fix? Err, it's a bit of a fundamental design flaw. That'll be for
lynx ;)
On 10 Oct 2009, at 20:20, Steve Langasek
<email address hidden> wrote:
> "using one type of operator" - well, this is probably doable; the only
> and+or case we have currently is gssd, which wants local-filesystems
> and
> (portmap or mount TYPE=nfs4 OPTIONS=
> local-filesystems at boot time. It's not 100% correct, but it
> should be
> usable. Will test and report back.
>
> Prognosis of fixing the upstart bug?
Changed in mountall (Ubuntu Karmic): | |
assignee: | nobody → Steve Langasek (vorlon) |
status: | Triaged → Fix Released |
Changed in nfs-utils (Ubuntu Karmic): | |
assignee: | nobody → Steve Langasek (vorlon) |
importance: | Undecided → High |
milestone: | none → ubuntu-9.10 |
status: | New → In Progress |
Steve Langasek (vorlon) wrote : Re: using 'and' and 'or' operators together in upstart job deps causes hangs | #9 |
worked around for karmic; opening an upstart task regarding the underlying operator bug.
summary: |
- NFSv4 automounting completely broken?
+ using 'and' and 'or' operators together in upstart job deps causes hangs |
Changed in upstart (Ubuntu Karmic): | |
status: | New → Won't Fix |
Changed in upstart (Ubuntu): | |
importance: | Undecided → Medium |
status: | New → Triaged |
Launchpad Janitor (janitor) wrote : | #10 |
This bug was fixed in the package nfs-utils - 1:1.2.0-2ubuntu6
---------------
nfs-utils (1:1.2.0-2ubuntu6) karmic; urgency=low
* Drop the gssd upstart job's dependency on "local-
time this is always implied transitively by the dep on portmap, and using
a combination of 'or' and 'and' operators in the dependency list seems
to confuse upstart quite badly, causing kerberized mounts to hang at boot.
LP: #447654.
-- Steve Langasek <email address hidden> Sat, 10 Oct 2009 20:12:11 +0000
Changed in nfs-utils (Ubuntu Karmic): | |
status: | In Progress → Fix Released |
summary: |
- using 'and' and 'or' operators together in upstart job deps causes hangs
+ init: using 'and' and 'or' operators together causes hangs |
Changed in upstart: | |
status: | New → Triaged |
importance: | Undecided → High |
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: init: using 'and' and 'or' operators together causes hangs | #11 |
Adopting as the upstream upstart bug for this problem
description: | updated |
bluedream (wangjinfajimmy) wrote : | #12 |
Can anybody help me? I get an error when mounting karmic-beta-fs. What's wrong? The error messages:
init:sreadahead main process terminated with status 1
init: caught abort core dumped
init:procps main process terminated with status 255
...
Steve Langasek (vorlon) wrote : | #13 |
Per discussion with Scott, this is not going to be fixed for Lucid because it's going to require deep changes to upstart that we can't conceivably manage in an LTS. Therefore, marking 'wontfix'.
Changed in upstart (Ubuntu Lucid): | |
status: | Triaged → Won't Fix |
summary: |
- init: using 'and' and 'or' operators together causes hangs
+ init: using 'and' operators can cause hangs |
For sanity's sake, I'm closing the Ubuntu tasks for upstream Upstart bugs. I've experimented with having both, but it is just making bugs hard to find now. Will use the policy whereby bugs on the Ubuntu package exist in the Ubuntu packaging or patches only, any bugs in the Upstart code are Upstream bugs.
Changed in upstart (Ubuntu): | |
status: | Triaged → Invalid |
This code turned out to be the culprit:
    } else if (mount_parent && (mount_parent->tag == TAG_LOCAL)
               && strcmp (mount_parent->mountpoint, "/")) {
            mnt->tag = TAG_LOCAL;
            num_local++;
            nih_debug ("%s is local (inherited)", mnt->mountpoint);
    }
This code was added so that virtual filesystems didn't get marked as virtual if they were waiting on other local filesystems, but instead had the effect that all remote filesystems mounted anywhere except / were also treated as local.
Checking now that this is the only code that needs fixing.
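The effect of that branch can be illustrated with a small model. This is a sketch under stated assumptions, not mountall's code: the `Mount` class, the `REMOTE_TYPES` set and both tagging functions are hypothetical, and the model assumes the inheritance check ran before the filesystem type was considered, which the excerpt does not show. The guarded variant is one possible ordering, not necessarily the fix that was uploaded.

```python
from dataclasses import dataclass
from typing import Optional

REMOTE_TYPES = {"nfs", "nfs4"}  # assumption: illustrative subset only


@dataclass
class Mount:
    mountpoint: str
    fstype: str
    parent: Optional["Mount"] = None
    tag: str = "unknown"


def tag_inheriting(mnt: Mount) -> str:
    """Mirror the quoted branch: inherit 'local' from a local parent other than '/'."""
    parent = mnt.parent
    if parent is not None and parent.tag == "local" and parent.mountpoint != "/":
        return "local"  # inherited, regardless of fstype
    return "remote" if mnt.fstype in REMOTE_TYPES else "local"


def tag_guarded(mnt: Mount) -> str:
    """Hypothetical guard: a remote fstype is never overridden by inheritance."""
    if mnt.fstype in REMOTE_TYPES:
        return "remote"
    return tag_inheriting(mnt)


root = Mount("/", "ext4", tag="local")
home = Mount("/home", "ext4", parent=root, tag="local")
share = Mount("/home/share", "nfs4", parent=home)

print("inheriting:", tag_inheriting(share))
print("guarded:   ", tag_guarded(share))
```

In this model an NFSv4 share mounted below a local /home inherits the local tag from its parent, while the guarded variant keeps it remote.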