[SRU] race between nova-compute and neutron-ovs-cleanup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
nova (Ubuntu) | Fix Released | High | Edward Hope-Morley |
Trusty | Fix Released | High | Edward Hope-Morley |
Utopic | Fix Released | High | Edward Hope-Morley |
Vivid | Fix Released | High | Edward Hope-Morley |
Bug Description
[Impact]
This issue appears to be a consequence of https:/
I have started to spot, however, that on some hosts (metal only) there is now a race between the two whereby nova-compute sometimes fails to start on system boot/reboot with the following in /var/log/
...
libvirt-bin stop/waiting
wait-for-state stop/waiting
neutron-ovs-cleanup start/pre-start, process 3084
start: Job failed to start
If I manually restart nova-compute all is fine. So this looks like a race between nova-compute's wait-for-state and neutron-
The proposed solution here is to add some retry logic to the nova-compute upstart job to tolerate neutron-ovs-cleanup not being able to start yet. We therefore allow a certain number of retries, incrementing the delay on every other retry, before giving up and allowing nova-compute to start anyway. If ovs-cleanup has failed to start after what is a fairly liberal retry period, it is assumed to have failed altogether, thus making it safe(ish) to start nova-compute.
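The retry idea can be sketched in plain shell roughly as follows. This is a minimal illustration, not the actual upstart job change: the function name `retry_with_backoff`, its arguments, and the one-second increment are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of the retry idea; NOT the actual nova-compute upstart job.
# Usage: retry_with_backoff MAX_RETRIES INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_RETRIES attempts have failed.
# The delay (in seconds) grows by one after every other failed attempt.
retry_with_backoff() {
    max_retries=$1
    delay=$2
    shift 2
    attempt=0
    while ! "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        # Increment the delay on every other retry (assumed step: 1 second).
        if [ $((attempt % 2)) -eq 0 ]; then
            delay=$((delay + 1))
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
    done
    return 0
}
```

In an upstart pre-start script, the wrapped command would presumably be the existing wait-for-state call for neutron-ovs-cleanup, with a failure of the whole loop ignored so that nova-compute starts anyway; the exact invocation and retry limits are not shown in this report.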
[Test Case]
In one terminal (as root) do:
service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart
In another do:
sudo tail -F /var/log/
Observe the retries occurring.
Then run 'sudo service openvswitch-switch start' and observe nova-compute retry and succeed.
[Regression Potential]
If openvswitch-switch does not start within the maximum number of retries and intervals, nova-compute will start anyway, and if ovs-cleanup were to run at some later point one would see the behaviour that LP 1420572 was intended to resolve. It does not seem to make sense to wait indefinitely for ovs-cleanup to be up, and the coded interval is fairly liberal and should be more than sufficient.
affects: | nova (Ubuntu) → neutron (Ubuntu) |
description: | updated |
Changed in neutron (Ubuntu Trusty): | |
importance: | Undecided → High |
Changed in neutron (Ubuntu Utopic): | |
importance: | Undecided → High |
Changed in neutron (Ubuntu Vivid): | |
importance: | Undecided → High |
Changed in neutron (Ubuntu): | |
status: | New → In Progress |
assignee: | nobody → Edward Hope-Morley (hopem) |
Changed in nova (Ubuntu Trusty): | |
status: | New → In Progress |
Changed in nova (Ubuntu Utopic): | |
status: | New → In Progress |
Changed in nova (Ubuntu Vivid): | |
status: | New → In Progress |
Changed in nova (Ubuntu Trusty): | |
assignee: | nobody → Edward Hope-Morley (hopem) |
Changed in nova (Ubuntu Utopic): | |
assignee: | nobody → Edward Hope-Morley (hopem) |
Changed in nova (Ubuntu Vivid): | |
assignee: | nobody → Edward Hope-Morley (hopem) |
tags: | added: verification-done removed: verification-needed |
So, this is possibly a result of neutron-ovs-cleanup failing to start at the time nova-compute does the wait-for-state (and implicitly tries to start neutron-ovs-cleanup), because openvswitch is not ready to start at that very moment. I am going to attempt to resolve this by making the nova-compute wait-for-state logic more accommodating of the fact that neutron-ovs-cleanup may not be ready to start at the time of the check.