>> The patch looks good to me, except I wonder whether the call to
>> udev_settle() between the two blkids should be moved up to just
>> after $cryptopen (see attached patch).  Thoughts?
>
> Since there was not already a udev_settle call there, I assumed (but
> did not verify) that cryptsetup operates synchronously, and only
> returns once the device mapping has been set up.

You're probably right, given that nobody appears to be complaining
about seeing "cryptsetup: unknown error setting up device mapping".
But maybe there is a race condition there too and we've just been
lucky.

> We should only add one there if someone confirms that cryptsetup
> *doesn't* block until the device is created.

What is the cost of calling udev_settle when unnecessary?  If it's
already settled, does it return immediately or does it sleep a bit?

> I believe the second udev_settle call that you removed is absolutely
> required /at that point/.

I don't think it's required at that point.  As far as I can tell, the
udev_settle can go anywhere between $cryptopen and the invocation of
blkid after changing NEWROOT, depending on whether $cryptopen
guarantees that the crypt device exists by the time it exits.  If so,
udev_settle can go anywhere but should go inside the LVM2_member 'if'
statement to avoid the overhead for non-LVM systems.  If not,
udev_settle should go immediately after $cryptopen to avoid a race
with the "unknown error setting up device mapping" check.

> Even if cryptsetup synchronously ensures the container device
> mapping has been set up, I don't think it ensures that the events
> for the child device have been handled yet -

If it did, then this bug wouldn't exist.

> or even added to the event queue -

That's a good point.  That would mean that a call to udev_settle
immediately after $cryptopen might be too soon, depending on how
udev_settle behaves when there are no events on the queue.  However,
if there's a race by putting udev_settle immediately after $cryptopen,
there's still a race even if the call to udev_settle is kept lower
down in the LVM2_member 'if' statement.

> so udev_settle could still return before vgchange -a has been run.

Do you mean the vgchange -a in the udev rules?  If so, I agree.

>
> But talking through this, I realize this also means that we still
> need to call vgchange ourselves, because otherwise there's still a
> race condition here even with udev_settle being called a little bit
> later.

If:
  * $cryptopen can return before the LV device device events are put
    on the queue, and
  * udev_settle immediately returns if there are no events on the
    queue
then it's possible that vgchange -a won't be run unless we run it
ourselves in this script.

However, if vgchange itself doesn't guarantee that the LV device events
are placed on the queue before it returns, then we still have a race.

I think the following are reasonable assumptions to make, but I would
love to have confirmation:
  1. when an encrypted device is unlocked, cryptsetup doesn't create
     the unlocked device node and symlink itself -- udev does that
  2. cryptsetup doesn't return until after the unlocked device event
     has been put on the udev queue
  3. cryptsetup might return before the unlocked device event has been
     fully processed by udev (before the unlocked device symlink has
     been created)
  4. vgchange -a doesn't return until after the LV device events have
     been put on the udev queue
  5. vgchange -a might return before the LV device events have been
     fully processed by udev (before the LV symlinks have been
     created)

If the above assumptions are correct, then assumption #3 means that we
need a call to udev_settle immediately after $cryptopen, but that
should be all that we need.  We wouldn't need to call vgchange -a
ourselves because that would be handled by the call to udev_settle:
#2 guarantees that the unlocked device event is on the queue, which
means that udev_settle won't return until after vgchange -a (triggered
by the unlocked device event) returns.  And #4 guarantees that
vgchange -a won't return until the LV device events are on the queue.
Thus, by the time the unlocked device event has been processed, the LV
device events are already on the queue, so udev_settle won't return
until those have been handled as well.

If assumption #1 is incorrect -- if cryptsetup creates the unlocked
device node and symlink itself -- then the effect is equivalent to
assumption #1 being correct and #3 being incorrect (cryptsetup
guarantees that the unlocked device event has been fully handled by
udev before it returns).

If assumption #2 is incorrect -- if cryptsetup might return before the
unlocked device event has been added to the queue -- then the only
thing we can do is check if the device is ready and if not sleep and
try again.

If assumption #3 is incorrect -- if cryptsetup guarantees that the
unlocked device event has been fully handled by udev before it returns
-- then the call to udev_settle can be moved down to avoid overhead on
non-LVM systems.

If assumption #4 is incorrect -- if vgchange -a might return before
the LV device events have been added to the queue -- then the only
thing we can do is check if the devices are ready and if not sleep and
try again.

If assumption #5 is incorrect -- if vgchange -a guarantees that the PV
device events have been fully handled by udev before it returns --
then this bug shouldn't exist.  :)  So I think it's pretty safe to say
that assumption #5 is correct.

I think the following pseudocode is correct even if any of the above
assumptions are incorrect, though it could be made faster if
assumption #3 is incorrect (by delaying a call to udev_settle):

    while true:
        udev_settle
        if encrypted device available:
            break
        if timed out:
            error out: encrypted device not found
        sleep

    while true:
        if $cryptopen succeeds:
            break
        message: incorrect password
        if ++unlock_attempts >= max_attempts:
            error out: too many attempts

    while true:
        udev_settle
        if unlocked device available:
            break
        # should only get here if assumption #2 is incorrect and we
        # lost the race (or there's a bug or severe error)
        if timed out:
            error: unlocked device not found
        sleep

    root device = unlocked device
    root fstype = fstype(root device)

    if root fstype is LVM:
        root device = ${cmdline_root}
        while true:
            if root device available:
                break
            # should only get here if assumption #4 is incorrect and
            # we lost the race (or there's a bug or severe error)
            if timed out:
                error: root device not found
            sleep
            udev_settle
        root fstype = fstype(root device)

>
> I'll prepare an updated patch, and test it.

Thank you for working on this!