[SRU] race between nova-compute and neutron-ovs-cleanup

Bug #1471022 reported by Edward Hope-Morley on 2015-07-02
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
nova (Ubuntu)
High
Edward Hope-Morley
Trusty
High
Edward Hope-Morley
Utopic
High
Edward Hope-Morley
Vivid
High
Edward Hope-Morley

Bug Description

[Impact]

This issue appears to be a consequence of https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we added a 'wait-for-state running' to the nova-compute upstart so as to ensure that neutron-ovs-cleanup has finished before nova-compute starts.

I have started to spot, however, that on some hosts (metal only) there is now a race between the two whereby nova-compute sometimes fails to start on system boot/reboot with the following in /var/log/upstart/nova-compute.log:

...
libvirt-bin stop/waiting
wait-for-state stop/waiting
neutron-ovs-cleanup start/pre-start, process 3084
start: Job failed to start

If I manually restart nova-compute all is fine. So this looks like a race between nova-compute's wait-for-state and neutron-ovs-cleanup's pre-start -> start/running.

The proposed solution here is add some retry logic to nova-compute upstart job to tolerate neutron-ovs-cleanup not being able to start yet. We, therefore, allow a certain number of retries, every other with an incremented delay, before giving up and allowing nova-compute to start anyway. If ovs-cleanup failed to start after what is a failry liberal retry period, it is assumed to have failed altogether thus making is safe(ish) to start nova-compute.

[Test Case]

In one terminal (as root) do:
service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart

In another do:
sudo tail -F /var/log/upstart/nova-compute.log

Observe the retries occurring

Then do 'sudo service openvswitch-switch start' and observe nova-compute retry and succeed.

[Regression Potential]

If openvswitch-switch does not start within the max retries and intervals nova-compute will start anyway and of ovs-cleanup were at some point to run one would see the behaviour that LP 1420572 was intended to resolve. It does not seem to make sense to wait indefinitely for ovs-cleanup to be up and the coded interval is pretty liberal and should be plenty enough.

affects: nova (Ubuntu) → neutron (Ubuntu)
description: updated
description: updated
Changed in neutron (Ubuntu Trusty):
importance: Undecided → High
Changed in neutron (Ubuntu Utopic):
importance: Undecided → High
Changed in neutron (Ubuntu Vivid):
importance: Undecided → High
Changed in neutron (Ubuntu):
status: New → In Progress
assignee: nobody → Edward Hope-Morley (hopem)
Edward Hope-Morley (hopem) wrote :

So, this is possibly a result of neutron-ovs-cleaunp failing to start at the time nova-compute does the wait-for-state (and implicitly tries to start neutron-ovs-cleanup) due to the fact that openvswitch is not ready to start at that very moment. I am going to attempt to resolve this by making the nova-compute wait-for-state logic more accommodating of the fact that neutron-ovs-cleanup may not be ready to start at the time of the check.

affects: neutron (Ubuntu) → nova (Ubuntu)
Edward Hope-Morley (hopem) wrote :
Download full text (3.3 KiB)

So it does appear that this a result of the openvswitch-switch service taking a while to startup on boot/reboot. I have added some retry logic to nova-compute and it appears that openvswitch is taking around 80s to start:

neutron-ovs-cleanup start/pre-start, process 3055
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 1s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 3s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 3s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 5s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 5s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 7s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 7s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 9s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 9s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 11s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 11s
neutron-ovs-cleanup start/pre-start, process 3055
Attempting to start neutron-ovs-cleanup
start: Job is already running: neutron-ovs-cleanup
Recheck neutron-ovs-cleanup status in 13s
neutron-ovs-cleanup start/running
2015-07-03 17:51:40.660 15690 DEBUG nova.servicegroup.api [-] ServiceGroup driver defined as an instance of db __new__ /usr/lib/python2.7/dist-packages/nova/servicegroup/api.py:65
2015-07-03 17:51:43.053 15690 INFO nova.openstack.common.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2015-07-03 17:51:44.320 15690 DEBUG stevedore.extension [-] found extension EntryPoint.parse('file = nova.image.download.file') _load_plugins /usr/lib/python2.7/dist-packages/stevedore/extension.py:156
2015-07-03 17:51:44.602 15690 DEBUG stevedore.extension [-] found extension EntryPoint.parse('file = nova.image.download.file') _load_plugins /usr/lib/python2.7/dist-packag...

Read more...

Edward Hope-Morley (hopem) wrote :
Edward Hope-Morley (hopem) wrote :
description: updated
Changed in nova (Ubuntu Trusty):
status: New → In Progress
Changed in nova (Ubuntu Utopic):
status: New → In Progress
Changed in nova (Ubuntu Vivid):
status: New → In Progress
Changed in nova (Ubuntu Trusty):
assignee: nobody → Edward Hope-Morley (hopem)
Changed in nova (Ubuntu Utopic):
assignee: nobody → Edward Hope-Morley (hopem)
Changed in nova (Ubuntu Vivid):
assignee: nobody → Edward Hope-Morley (hopem)
description: updated
description: updated

The attachment "trusty.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Edward Hope-Morley (hopem) wrote :

To clarify, the reason I am not using wait-for-state (http://pastebin.ubuntu.com/11820672/) with WAIT_FOREVER=Y is primarily because it will attempt to start the WAIT_FOR service once only then go into an infinite wait loop which could be a problem in nova upgrade scenarios where we would be waiting for neutron-ovs-cleanup to start when it may not, itself, be waiting to start perhaps because it (or it's dependencies) might never start resulting in nova-compute never starting. So what I am implementing here is a repeated retry with increasing interval but with a finite number of attempts so that nova-compute will always eventually start but not after giving neutron-ovs-cleanup (if installed) plenty of time to start.

Edward Hope-Morley (hopem) wrote :
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:12.0.0~b1-0ubuntu2

---------------
nova (2:12.0.0~b1-0ubuntu2) wily; urgency=medium

  * d/nova-compute.upstart.in: Fix (another) race between nova-compute and
    neutron-ovs-cleanup (LP: #1471022).

 -- Edward Hope-Morley <email address hidden> Mon, 06 Jul 2015 11:53:51 +0100

Changed in nova (Ubuntu):
status: In Progress → Fix Released
Edward Hope-Morley (hopem) wrote :

Hello Edward, or anyone else affected,

Accepted nova into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/1:2014.1.5-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in nova (Ubuntu Utopic):
status: In Progress → Fix Committed
Chris J Arges (arges) wrote :

Hello Edward, or anyone else affected,

Accepted nova into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/1:2014.2.3-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Chris J Arges (arges) wrote :

Hello Edward, or anyone else affected,

Accepted nova into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/1:2015.1.0-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Vivid):
status: In Progress → Fix Committed
Edward Hope-Morley (hopem) wrote :

T/U/V verified *-proposed builds. Deployed Openstack I/J/K with these packages and all seems good.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 1:2014.1.5-0ubuntu1.1

---------------
nova (1:2014.1.5-0ubuntu1.1) trusty; urgency=medium

    [ Edward Hope-Morley ]
    - d/nova-compute.upstart: Fix (another) race between nova-compute
      and neutron-ovs-cleanup (LP: #1471022)

 -- Edward Hope-Morley <email address hidden> Wed, 08 Jul 2015 09:44:18 -0500

Changed in nova (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 1:2014.2.3-0ubuntu1.1

---------------
nova (1:2014.2.3-0ubuntu1.1) utopic; urgency=medium

  * d/nova-compute.upstart: Fix (another) race between nova-compute
    and neutron-ovs-cleanup (LP: #1471022).

 -- Edward Hope-Morley <email address hidden> Mon, 06 Jul 2015 12:06:00 +0100

Changed in nova (Ubuntu Utopic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 1:2015.1.0-0ubuntu1.1

---------------
nova (1:2015.1.0-0ubuntu1.1) vivid; urgency=medium

  * d/nova-compute.upstart.in: Fix (another) race between nova-compute
    and neutron-ovs-cleanup (LP: #1471022).

 -- Edward Hope-Morley <email address hidden> Fri, 03 Jul 2015 18:57:59 +0100

Changed in nova (Ubuntu Vivid):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers