os-brick 1.4.0 increases volume setup failure rates

Bug #1592043 reported by Sean Dague
This bug affects 1 person
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  Undecided   Matt Riedemann
os-brick                  Invalid       Critical    Walt Boring
oslo.privsep              Fix Released  Undecided   Unassigned

Bug Description

Since os-brick 1.4.0 was merged into upper-constraints, the multinode grenade jobs have been hitting a failure rate of nearly 1/3 on boot-from-volume scenarios, around volume setup. This is Newton code running with Mitaka configs.

Representative failures are of the following form: http://logs.openstack.org/71/327971/5/gate/gate-grenade-dsvm-neutron-multinode/f2690e3/logs/new/screen-n-cpu.txt.gz?level=WARNING#_2016-06-13_15_22_59_095

The 1/3 failure rate is suspicious; in the past, a rate like this has often hinted at a race condition between parallel API requests.

The failure rate increase can be seen here - http://tinyurl.com/zrq35e8

Sean Dague (sdague)
description: updated
Matt Riedemann (mriedem) wrote :
Changed in os-brick:
status: New → Confirmed
Matt Riedemann (mriedem) wrote :

Before the failures I'm seeing this:

2016-06-13 15:22:59.079 26206 WARNING oslo.privsep.daemon [-] privsep log: sudo: no tty present and no askpass program specified
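
To check other job logs for the same symptom, a small filter along these lines could be used. It is only a sketch: the regular expression assumes the log line layout quoted above (timestamp, PID, level, logger name, context, message), and the name `find_sudo_failures` is made up for illustration.

```python
import re

# Assumed layout, taken from the single line quoted above:
# "<date> <time> <pid> <LEVEL> <logger> [<context>] <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<context>[^\]]*)\] (?P<message>.*)$"
)

SUDO_MARKER = "sudo: no tty present and no askpass program specified"


def find_sudo_failures(lines):
    """Yield (timestamp, pid) for privsep daemon lines reporting the sudo error."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        if (match["logger"] == "oslo.privsep.daemon"
                and SUDO_MARKER in match["message"]):
            yield match["ts"], int(match["pid"])


sample = [
    "2016-06-13 15:22:59.079 26206 WARNING oslo.privsep.daemon [-] "
    "privsep log: sudo: no tty present and no askpass program specified",
]
hits = list(find_sudo_failures(sample))
```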

Matt Riedemann (mriedem) wrote :

It's actually probably this change that's breaking everything:

https://github.com/openstack/os-brick/commit/dbf77fba1061cb4e93b3db5f8117d6ccc689f702

Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/277224/ has several config-related dependencies on devstack/cinder/nova to make things work with privsep. Those aren't in stable/mitaka, which is probably why things are failing in the grenade job.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

This affects the neutron gate significantly. I see a 65-75% failure rate on our 24h-window Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen

Ihar Hrachyshka (ihar-hrachyshka) wrote :
Sean Dague (sdague) wrote :

os-brick 1.4 was released over the weekend, and was the first os-brick to include privsep. Afterwards we got a really odd failure rate in the grenade-multinode jobs (1/3 - 1/2), and it was far from obvious why. hemna looks to have figured it out (this is a summary of what I've seen on IRC, to pull it all together).

Remembering the following - https://github.com/openstack-dev/grenade#theory-of-upgrade and https://governance.openstack.org/reference/tags/assert_supports-upgrade.html#requirements - New code must work with N-1 configs. So this is `master` running with `mitaka` configuration.

privsep requires a sudo rule or rootwrap rule (to get to sudo) to allow the privsep daemon to be spawned for volume actions.

During gate testing we have a blanket sudoer rule for the stack user during the run of grenade.sh. It has to do system level modifications broadly to perform the upgrade. This sudoer rule is deleted at the end of the grenade.sh run before Tempest tests are run, so that Tempest tests don't accidentally require root privs on their target environment.

Grenade *also* makes sure that some resources live across the upgrade boundary. This includes a boot from volume guest, which is torn down before testing starts. And this is where things get interesting.

This means there is a volume teardown needed before grenade ends. But there is only one. In single node grenade this happens about 30 seconds from the end of the script, triggers the privsep daemon start, and then we're done. And the 50_stack_sh sudoers file is removed. In multinode, *if* the boot from volume server is on the upgraded node, then the same thing happens. *However*, if it instead ended up on the subnode, which is not upgraded, then the volume teardown is on the old node. No os-brick calls are made on the upgraded node before grenade finishes. The 50_stack_sh sudoers file is removed, as expected.

And now all volume tests on those nodes fail.
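
The sequence described above can be modelled as a toy simulation. This is a sketch under stated assumptions, not os-brick or grenade code: the `Node` class, `volume_call` and `run_grenade` are hypothetical names, and the model only captures two facts from the description, namely that the privsep daemon is started lazily on the first volume call and that the blanket sudoers rule disappears when grenade.sh finishes.

```python
class Node:
    """Toy model of the upgraded node: new code, old (N-1) config."""

    def __init__(self, name):
        self.name = name
        self.blanket_sudo = True     # grenade's temporary 50_stack_sh rule
        self.daemon_running = False  # privsep daemon is started lazily

    def volume_call(self):
        """Simulate an os-brick call that needs the privsep daemon."""
        if not self.daemon_running:
            if not self.blanket_sudo:
                # The old config has no sudo/rootwrap rule for privsep.
                raise PermissionError(
                    "sudo: no tty present and no askpass program specified")
            self.daemon_running = True
        return "ok"


def run_grenade(bfv_on_upgraded_node):
    """Return the outcome of the first volume test after grenade.sh ends."""
    upgraded = Node("upgraded")
    if bfv_on_upgraded_node:
        # Teardown of the boot-from-volume guest starts the daemon in time.
        upgraded.volume_call()
    upgraded.blanket_sudo = False    # sudoers file removed at end of grenade.sh
    try:
        return upgraded.volume_call()  # first Tempest volume test
    except PermissionError:
        return "fail"
```

In this model `run_grenade(True)` succeeds because the daemon is already running when the sudoers rule is removed, while `run_grenade(False)` fails on the first volume test, which matches the intermittent failures seen only in the multinode job.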

Which is what should happen. The point is that in production no one is going to put a blanket sudoers rule like that in place. It's just that we needed it for this activity, and because the services run as the same user as the shell user (which is not root), this fallback rule could be used.

The crux of the problem is that os-brick 1.4 and privsep can't be used without a config file change during the upgrade, which violates our policy because it breaks rolling upgrades.

Changed in os-brick:
assignee: nobody → Walt Boring (walter-boring)
importance: Undecided → Critical
Changed in os-brick:
status: Confirmed → In Progress
Changed in nova:
assignee: nobody → Angus Lees (gus)
status: New → In Progress
Changed in nova:
assignee: Angus Lees (gus) → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-brick (master)

Change abandoned by Walter A. Boring IV (hemna) (<email address hidden>) on branch: master
Review: https://review.openstack.org/329586
Reason: This patch tried to fix the initial problem of starting the privsep daemon the first time. That has been fixed by the new version of privsep adding an init() capability to set the initial helper_command.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/329769
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4a8f2b0d44ee10dfac2d3d828cd9dc574d5ddbb2
Submitter: Jenkins
Branch: master

commit 4a8f2b0d44ee10dfac2d3d828cd9dc574d5ddbb2
Author: Angus Lees <email address hidden>
Date: Wed Jun 15 15:46:38 2016 +1000

    Initialise oslo.privsep early in main

    Any process using oslo.privsep should now initialise the library before
    first use with things like the rootwrap command to use.

    This should be done near the top of main() in any command that expects
    to make privileged calls via oslo.privsep (eg: nova-compute, and not
    nova-api).

    See I3ea73e16b07a870629e7d69e897f2524d7068ae8 for the corresponding
    change in oslo.privsep.

    Change-Id: I3a52f762deb176fe9201b2a0f0da363057f8aaec
    Depends-On: I52259e2023e277e8fd62be5df4fd7f799e9b36d7
    Closes-Bug: #1592043
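
The pattern the commit message describes can be sketched as follows. This is an illustration only, not the merged nova change: `get_root_helper` and `init_privsep` are hypothetical stand-ins, and the rootwrap config path is an assumed default. In a real service the stand-in would presumably be replaced by the library's own init call (believed to be `oslo_privsep.priv_context.init(root_helper=...)`), which is not imported here so that the sketch runs with the standard library alone.

```python
import shlex


def get_root_helper(rootwrap_config="/etc/nova/rootwrap.conf"):
    """Return the root helper command line as a string (hypothetical helper).

    The default path is an assumption made for this example.
    """
    return "sudo nova-rootwrap %s" % rootwrap_config


def init_privsep(root_helper):
    """Stand-in for the library init call; records the helper for later use."""
    init_privsep.helper = list(root_helper)


def main():
    # Initialise privsep near the top of main(), before any privileged call,
    # so the daemon is later spawned through rootwrap rather than bare sudo.
    init_privsep(root_helper=shlex.split(get_root_helper()))
    # ... service startup would follow here ...
    return init_privsep.helper
```

Calling `main()` records the helper as the list `["sudo", "nova-rootwrap", "/etc/nova/rootwrap.conf"]`; passing a list rather than a string avoids re-parsing the command when the daemon is eventually started.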

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0b3

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

Thierry Carrez (ttx) wrote : Fix included in openstack/cinder 9.0.0.0b3

This issue was fixed in the openstack/cinder 9.0.0.0b3 development milestone.

Sean McGinnis (sean-mcginnis) wrote :
Changed in os-brick:
status: In Progress → Invalid
Ben Nemec (bnemec) wrote :

The oslo.privsep part of this bug was fixed in https://review.openstack.org/#/c/329766/

I'm not sure why that didn't show up as it does appear to have a bug reference.

Changed in oslo.privsep:
status: New → Fix Released