Comment 7 for bug 1592043

Sean Dague (sdague) wrote :

os-brick 1.4 was released over the weekend, and was the first os-brick to include privsep. Afterwards we got a really odd failure rate in the grenade-multinode jobs (1/3 - 1/2), and it was super non-obvious why. Hemma looks to have figured it out (this is a summary of what I've seen on IRC, to pull it all together).

Remembering the following - https://github.com/openstack-dev/grenade#theory-of-upgrade and https://governance.openstack.org/reference/tags/assert_supports-upgrade.html#requirements - New code must work with N-1 configs. So this is `master` running with `mitaka` configuration.

privsep requires a sudo rule or rootwrap rule (to get to sudo) to allow the privsep daemon to be spawned for volume actions.

During gate testing we have a blanket sudoer rule for the stack user during the run of grenade.sh. It has to do system level modifications broadly to perform the upgrade. This sudoer rule is deleted at the end of the grenade.sh run before Tempest tests are run, so that Tempest tests don't accidentally require root privs on their target environment.

Grenade *also* makes sure that some resources live across the upgrade boundary. This includes a boot from volume guest, which is torn down before testing starts. And this is where things get interesting.

This means there is a volume teardown needed before grenade ends. But there is only one. In single node grenade this happens about 30 seconds from the end of the script, triggers the privsep daemon start, and then we're done. And the 50_stack_sh sudoers file is removed. In multinode, *if* the boot from volume server is on the upgraded node, then the same thing happens. *However*, if it instead ended up on the subnode, which is not upgraded, then the volume teardown is on the old node. No os-brick calls are made on the upgraded node before grenade finishes, so the privsep daemon is never started there. The 50_stack_sh sudoers file is removed, as expected.

And now all volume tests on those nodes fail.
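
To make the ordering concrete, here is a toy model of the sequence above. It is only a sketch: the `Node` class, `run_grenade`, and the "primary"/"subnode" names are made up for illustration and are not grenade or os-brick code. It also assumes the old (non-upgraded) node never needs privsep, and that a privsep daemon, once started, keeps working after the sudoers file is gone.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one node in a grenade-multinode job."""
    upgraded: bool
    blanket_sudo: bool = True          # models the 50_stack_sh sudoers file
    privsep_running: bool = False      # models an already-spawned privsep daemon

    def volume_action(self) -> bool:
        """Return True if an os-brick volume action would succeed here."""
        if not self.upgraded:
            # Assumption: old code with its old config does not need privsep.
            return True
        if self.privsep_running:
            return True
        if self.blanket_sudo:
            # The blanket rule lets the daemon be spawned on first use.
            self.privsep_running = True
            return True
        # New code, old config, no blanket rule: the daemon cannot start.
        return False


def run_grenade(bfv_host: str) -> dict:
    """Model one run; bfv_host is where the boot from volume guest landed."""
    nodes = {"primary": Node(upgraded=True), "subnode": Node(upgraded=False)}
    # The single volume teardown that happens before grenade.sh ends.
    nodes[bfv_host].volume_action()
    # grenade.sh removes the blanket sudoers rule at the end of the run.
    for node in nodes.values():
        node.blanket_sudo = False
    # Tempest volume tests now run against each node.
    return {name: node.volume_action() for name, node in nodes.items()}


if __name__ == "__main__":
    print(run_grenade("primary"))
    print(run_grenade("subnode"))
```

In this model the run only stays green when the guest happened to land on the upgraded node, which would be consistent with an intermittent failure rate that depends on scheduling.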

Which is what should happen. The point is that in production no one is going to put a blanket sudoers rule like that in place. It's just that we needed it for this activity, and because the services run with the same userid as the shell user (which is not root), this fallback rule got used.

The crux of the problem is that os-brick 1.4 and privsep can't be used without a config file change during the upgrade. Which violates our policy, because it breaks rolling upgrades.