>> The patch looks good to me, except I wonder whether the call to >> udev_settle() between the two blkids should be moved up to just >> after $cryptopen (see attached patch). Thoughts? > > Since there was not already a udev_settle call there, I assumed (but > did not verify) that cryptsetup operates synchronously, and only > returns once the device mapping has been set up. You're probably right, given that nobody appears to be complaining about seeing "cryptsetup: unknown error setting up device mapping". But maybe there is a race condition there too and we've just been lucky. > We should only add one there if someone confirms that cryptsetup > *doesn't* block until the device is created. What is the cost of calling udev_settle when unnecessary? If it's already settled, does it return immediately or does it sleep a bit? > I believe the second udev_settle call that you removed is absolutely > required /at that point/. I don't think it's required at that point. As far as I can tell, the udev_settle can go anywhere between $cryptopen and the invocation of blkid after changing NEWROOT, depending on whether $cryptopen guarantees that the crypt device exists by the time it exits. If so, udev_settle can go anywhere but should go inside the LVM2_member 'if' statement to avoid the overhead for non-LVM systems. If not, udev_settle should go immediately after $cryptopen to avoid a race with the "unknown error setting up device mapping" check. > Even if cryptsetup synchronously ensures the container device > mapping has been set up, I don't think it ensures that the events > for the child device have been handled yet - If it did, then this bug wouldn't exist. > or even added to the event queue - That's a good point. That would mean that a call to udev_settle immediately after $cryptopen might be too soon, depending on how udev_settle behaves when there are no events on the queue. However, if there's a race by putting udev_settle immediately after $cryptopen, there's still a race even if the call to udev_settle is kept lower down in the LVM2_member 'if' statement. > so udev_settle could still return before vgchange -a has been run. Do you mean the vgchange -a in the udev rules? If so, I agree. > > But talking through this, I realize this also means that we still > need to call vgchange ourselves, because otherwise there's still a > race condition here even with udev_settle being called a little bit > later. If: * $cryptopen can return before the LV device device events are put on the queue, and * udev_settle immediately returns if there are no events on the queue then it's possible that vgchange -a won't be run unless we run it ourselves in this script. However, if vgchange itself doesn't guarantee that the LV device events are placed on the queue before it returns, then we still have a race. I think the following are reasonable assumptions to make, but I would love to have confirmation: 1. when an encrypted device is unlocked, cryptsetup doesn't create the unlocked device node and symlink itself -- udev does that 2. cryptsetup doesn't return until after the unlocked device event has been put on the udev queue 3. cryptsetup might return before the unlocked device event has been fully processed by udev (before the unlocked device symlink has been created) 4. vgchange -a doesn't return until after the LV device events have been put on the udev queue 5. vgchange -a might return before the LV device events have been fully processed by udev (before the LV symlinks have been created) If the above assumptions are correct, then assumption #3 means that we need a call to udev_settle immediately after $cryptopen, but that should be all that we need. We wouldn't need to call vgchange -a ourselves because that would be handled by the call to udev_settle: #2 guarantees that the unlocked device event is on the queue, which means that udev_settle won't return until after vgchange -a (triggered by the unlocked device event) returns. And #4 guarantees that vgchange -a won't return until the LV device events are on the queue. Thus, by the time the unlocked device event has been processed, the LV device events are already on the queue, so udev_settle won't return until those have been handled as well. If assumption #1 is incorrect -- if cryptsetup creates the unlocked device node and symlink itself -- then the effect is equivalent to assumption #1 being correct and #3 being incorrect (cryptsetup guarantees that the unlocked device event has been fully handled by udev before it returns). If assumption #2 is incorrect -- if cryptsetup might return before the unlocked device event has been added to the queue -- then the only thing we can do is check if the device is ready and if not sleep and try again. If assumption #3 is incorrect -- if cryptsetup guarantees that the unlocked device event has been fully handled by udev before it returns -- then the call to udev_settle can be moved down to avoid overhead on non-LVM systems. If assumption #4 is incorrect -- if vgchange -a might return before the LV device events have been added to the queue -- then the only thing we can do is check if the devices are ready and if not sleep and try again. If assumption #5 is incorrect -- if vgchange -a guarantees that the PV device events have been fully handled by udev before it returns -- then this bug shouldn't exist. :) So I think it's pretty safe to say that assumption #5 is correct. I think the following pseudocode is correct even if any of the above assumptions are incorrect, though it could be made faster if assumption #3 is incorrect (by delaying a call to udev_settle): while true: udev_settle if encrypted device available: break if timed out: error out: encrypted device not found sleep while true: if $cryptopen succeeds: break message: incorrect password if ++unlock_attempts >= max_attempts: error out: too many attempts while true: udev_settle if unlocked device available: break # should only get here if assumption #2 is incorrect and we # lost the race (or there's a bug or severe error) if timed out: error: unlocked device not found sleep root device = unlocked device root fstype = fstype(root device) if root fstype is LVM: root device = ${cmdline_root} while true: if root device available: break # should only get here if assumption #4 is incorrect and # we lost the race (or there's a bug or severe error) if timed out: error: root device not found sleep udev_settle root fstype = fstype(root device) > > I'll prepare an updated patch, and test it. Thank you for working on this!