Race to delete shared subnet in Tempest neutron full jobs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | High | Eugene Nikanorov | 7.0.0
tempest | Fix Released | Undecided | Salvatore Orlando |
Bug Description
This seems to show up in several different tests, basically anything using neutron. I noticed it here:
That's on a stable/icehouse change, but logstash shows this on master mostly.
I see this in the neutron server logs:
This query shows 82 hits in 10 days:
message:"delete failed \(client error\)\: Unable to complete operation on subnet" AND message:"One or more ports have an IP allocation from this subnet" AND tags:"screen-
Logstash doesn't show this in the gate queue, but it does show up in the uncategorized bugs list, which is in the gate queue.
Matt Riedemann (mriedem) wrote : | #2 |
Interesting, if I take out the 'Exception dispatching event' from the query, there are no hits, so maybe the stable/icehouse backport will clean this up for good?
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → Low |
Joe Gordon (jogo) wrote : | #3 |
11 hits in 24 hours; marking as critical.
Changed in nova: | |
importance: | Low → Critical |
milestone: | none → juno-rc1 |
Changed in neutron: | |
importance: | Undecided → Critical |
This looks like a duplicate of bug #1283599.
Neutron barfs at the fact that a subnet or network cannot be deleted because there are non-service ports in use on the logical segment. This failure mode may be the manifestation of multiple errors. In the case of bug 1283599, my triaging revealed that the error lay in the Nova caching logic.
I'll drill down into the recent failures to see if I can provide further input.
Changed in neutron: | |
milestone: | none → juno-rc1 |
Alex Xu (xuhj) wrote : | #6 |
At the compute manager, instance 6ce21bd9-
2014-09-20 16:56:09.439 DEBUG nova.compute.
Checking the log on the nova-api side, the instance was deleted there because it was deleted before the compute manager assigned a host to the instance.
2014-09-20 16:56:08.963 INFO nova.ec2.
2014-09-20 16:56:09.022 DEBUG nova.api.ec2 [req-4093d683-
2014-09-20 16:56:09.023 DEBUG nova.api.ec2 [req-4093d683-
2014-09-20 16:56:09.213 DEBUG nova.api.ec2.cloud [req-4093d683-
2014-09-20 16:56:09.214 DEBUG nova.compute.api [req-4093d683-
2014-09-20 16:56:09.304 DEBUG nova.quota [req-4093d683-
2014-09-20 16:56:09.393 DEBUG nova.quota [req-4093d683-
2014-09-20 16:56:09.432 DEBUG nova.api.
So the compute manager tries to clean up the network: https:/
But no port is found by the cleanup function:
2014-09-20 16:56:09.459 DEBUG neutronclient.
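To make the race in comment #6 concrete, here is a minimal sketch (a hypothetical helper, not the actual nova code path) of why the cleanup finds nothing to delete: the DELETE arrives while the async network allocation is still in flight, so listing ports by device_id returns an empty list, and the port created moments later leaks. neutronclient's list_ports and delete_port calls are real; everything else is illustrative:

    from neutronclient.v2_0 import client as neutron_client

    def cleanup_instance_ports(neutron, instance_id):
        # at this point the async allocation has not created the port
        # yet, so the filtered listing comes back empty
        ports = neutron.list_ports(device_id=instance_id)["ports"]
        for port in ports:
            neutron.delete_port(port["id"])   # never reached in the race

    # neutron = neutron_client.Client(username=..., password=..., ...)
    # cleanup_instance_ports(neutron, "6ce21bd9-...")
    # later the allocation finishes and creates a port that now has no
    # owner left to delete it; the subnet delete then fails with a 409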
Alex Xu (xuhj) wrote : | #7 |
The above analysis is based on this case: http://
Alex Xu (xuhj) wrote : | #8 |
Patch based on this analysis: https:/
Matt Riedemann (mriedem) wrote : | #9 |
Alex Xu (xuhj) wrote : | #10 |
Apart from the case fixed by https:/
The analysis is based on:
http://
In this case subnet a404dc60-
Port 375d726d-
On the nova-api side, the port is attached at:
2014-09-21 05:27:06.453 AUDIT nova.api.
and detached at:
2014-09-21 05:27:10.937 AUDIT nova.api.
But the detach failed:
2014-09-21 05:27:11.047 ERROR oslo.messaging.
2014-09-21 05:27:11.047 4710 TRACE oslo.messaging.
....
So the reason is that the detach happened before the attach finished.
The failed test is:
    def test_hotplug_
        """
        1. create a new network, with no gateway (to prevent overwriting VM's
        2. connect VM to new network
        3. set static ip and bring new nic up
        4. check VM can ping new network dhcp port
        """
        self._hotplug_
and 'self._
So the cleanup may have happened before the interface attach finished.
The cleanup is set up at 'self._
But this cleanup failed, as the log shows:
2014-09-21 05:27:11.047 ERROR oslo.messaging.
And another cleanup is waiting for the interface to be deleted;
we can also see this cleanup finally time out:
http://
Alex Xu (xuhj) wrote : | #11 |
Hmm... I was missing something in the second case. There is code that waits for attach_interface to finish.
Alex Xu (xuhj) wrote : | #12 |
For the second case:
The network info cache is updated and there is port info for 375d726d-
2014-09-21 05:27:06.976 DEBUG nova.network.
And another network info cache update comes from _heal_instance_
2014-09-21 05:27:06.999 4710 DEBUG nova.network.
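A minimal, self-contained sketch (plain Python, not nova code) of the last-writer-wins race comment #12 describes: the attach path and the periodic heal task each rewrite the whole info cache, so a heal that read Neutron before the attach can erase the new port:

    import threading
    import time

    info_cache = {"ports": []}           # shared instance info cache
    lock = threading.Lock()              # protects the dict, not staleness

    def attach_interface():
        ports = ["375d726d"]             # state read after the attach
        time.sleep(0.01)
        with lock:
            info_cache["ports"] = ports  # writes the new port

    def heal_info_cache():
        ports = []                       # state read before the attach
        time.sleep(0.02)                 # periodic task writes last
        with lock:
            info_cache["ports"] = ports  # stale write erases the port

    t1 = threading.Thread(target=attach_interface)
    t2 = threading.Thread(target=heal_info_cache)
    t1.start(); t2.start(); t1.join(); t2.join()
    print(info_cache)                    # {'ports': []} -- port "vanished"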
Matt Riedemann (mriedem) wrote : | #13 |
Looks like bug 1326183 is related.
Michael Still (mikal) wrote : | #14 |
Is anyone working on the nova side of this bug?
Alex Xu (xuhj) wrote : | #15 |
@Michael, I'm working on it. I fixed one case with patch https:/
I haven't found any more cases that can trigger this bug for now. Hopefully those two patches fix it.
There is also another error message in the log:
http://
I think that is an EC2 API bug, and it won't trigger this bug. We can file another bug for it.
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 4adc8376f722e48
Author: He Jie Xu <email address hidden>
Date: Sun Sep 21 14:07:29 2014 +0800
Sync network_info if instance not found before _build_resources yield
If the instance is deleted before the _build_resources yield, there is
no waiting for the network_info sync to finish.
So there is a race case: when the exception catching for
'InstanceNo
'_cleanup_
create port yet. After '_cleanup_
'_allocate_
isn't any cleanup for the new port anymore.
Change-Id: I5027a1aaeb9a21
Related-Bug: #1357055
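A hedged sketch of the idea in this commit (simplified, with hypothetical callables rather than nova's real methods): before cleaning up on failure, wait for the in-flight async network allocation so that any port it creates exists by the time cleanup looks for it:

    import eventlet

    def build_instance(instance, allocate_networks, build_steps,
                       cleanup_networks):
        # networks are allocated asynchronously, as in nova's async
        # network allocation path
        allocation = eventlet.spawn(allocate_networks, instance)
        try:
            build_steps(instance)    # may raise, e.g. instance deleted
        except Exception:
            try:
                # key point: wait for the allocation greenthread first,
                # otherwise cleanup races the port creation
                allocation.wait()
            except Exception:
                pass                 # allocation failed: nothing to clean
            cleanup_networks(instance)
            raise
        return allocation.wait()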
Michael Still (mikal) wrote : | #17 |
Alex thanks for those patches. The second one is now approved and awaiting merge.
Changed in nova: | |
assignee: | nobody → Alex Xu (xuhj) |
Sean Dague (sdague) wrote : | #18 |
Both of the fixes provided in comment #15 have merged, so marking as fix committed. If we think there remains another outstanding fix needed, let's reopen.
Changed in nova: | |
status: | Confirmed → Fix Committed |
Kyle Mestery (mestery) wrote : | #19 |
I'm assuming this bug can be closed on the Neutron side as well, but I'll wait a bit to verify the nova patches have fixed things before marking it as Closed and removing it from RC1.
Strangely enough, the trace does not show up in logstash.
Mohammad Banikazemi (mb-s) wrote : | #22 |
I think this is relevant too: http://
I did do a 'recheck bug 1357055' after noticing the failure.
Changed in neutron: | |
assignee: | nobody → Mohammad Banikazemi (mb-s) |
Mohammad Banikazemi (mb-s) wrote : | #23 |
Have been looking at the Neutron service logs posted above for comments #20 and #22. In both cases, the port that exists when the subnet is being deleted (and therefore makes the subnet deletion fail) never gets deleted. That is, Neutron never receives any request to delete the port. The port in question gets created, gets bound correctly, shows up in get ports, but never gets deleted. This is similar to the observation posted in comment #6 above.
Kyle Mestery (mestery) wrote : | #24 |
Removing Mohammad from this. He has reached out to me and indicated he will not have time to work on this in the coming week or so.
Changed in neutron: | |
assignee: | Mohammad Banikazemi (mb-s) → nobody |
Changed in neutron: | |
milestone: | juno-rc1 → kilo-1 |
Alex Xu (xuhj) wrote : | #25 |
Yeah, one more case.
I didn't find the root cause, but I'll put my analysis here.
The analysis is based on this:
http://
The subnet is deleted at: http://
ServerRescueNeg
server id: b2c534bf-
rescue server id: d281aee9-
The problem is that there isn't any log showing those two servers were deleted, so the ports attached to those servers won't be deleted.
But I don't know why those two servers weren't deleted.
Anyone has idea?
Alex Xu (xuhj) wrote : | #26 |
I added tempest as an affected project because I think there should be a log showing tempest trying to clean up the servers created by the test.
I can't find that those two servers were deleted:
server id: b2c534bf-
rescue server id: d281aee9-
Changed in nova: | |
status: | Fix Committed → Fix Released |
Changed in neutron: | |
assignee: | nobody → Salvatore Orlando (salvatore-orlando) |
tags: | added: juno-rc-potential |
Changed in tempest: | |
assignee: | nobody → Salvatore Orlando (salvatore-orlando) |
I came to the same conclusions as Alex: the servers are not deleted, hence the error.
However, the logging which Alex is asking for is already there.
Indeed, here are the delete operations on teardown for a failing test:
salvatore@
2014-10-07 17:49:04.444 25908 INFO tempest.
2014-10-07 17:49:04.444 25908 DEBUG tempest.
2014-10-07 17:52:21.452 25908 INFO tempest.
2014-10-07 17:52:21.452 25908 DEBUG tempest.
2014-10-07 17:52:21.513 25908 INFO tempest.
2014-10-07 17:52:21.514 25908 DEBUG tempest.
No DELETE server command is issued.
For a successful test, by contrast, the two servers are deleted:
2014-09-26 11:48:05.532 7755 INFO tempest.
2014-09-26 11:50:06.733 7755 INFO tempest.
2014-09-26 11:50:06.882 7755 INFO tempest.
2014-09-26 11:50:09.531 7755 INFO tempest.
2014-09-26 11:50:09.583 7755 INFO tempest.
no longer affects: | neutron |
tags: | removed: juno-rc-potential |
Related fix proposed to branch: master
Review: https:/
Alex Xu (xuhj) wrote : | #29 |
As Salvatore and I reached the same conclusion, this isn't a neutron or nova bug; it should be a tempest bug. I removed nova from the affected projects.
no longer affects: | nova |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 1422a9ac6c40df7
Author: Salvatore <email address hidden>
Date: Wed Oct 8 15:53:25 2014 +0200
Log identifiers of resources being cleaned up
On teardown compute API tests clean up resources such as servers
and images. This patch simply logs the identifiers of these
resources in order to help debug failures which might originate
from problems during cleanup.
Change-Id: Ied2924bba78ec8
Related-Bug: #1357055
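A minimal sketch of the kind of teardown logging this patch adds (method names here are illustrative, not tempest's exact code):

    import logging

    LOG = logging.getLogger(__name__)

    def clear_servers(servers_client, servers):
        for server in servers:
            LOG.info("Deleting server %s during teardown", server["id"])
            try:
                servers_client.delete_server(server["id"])
            except Exception:
                # log and keep going so one failed delete does not hide
                # the rest of the cleanup
                LOG.exception("Failed to delete server %s", server["id"])

With identifiers in the teardown log, a run like the failing one Salvatore greps above would immediately show whether the DELETE calls were issued at all.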
Changed in tempest: | |
status: | New → Confirmed |
Silvan Kaiser (2-silvan) wrote : | #31 |
This issue still affects people; looking at elastic-recheck at 2014-12-12 13:14 UTC:
Bug 1357055 - Race to delete shared subnet in Tempest neutron full jobs
23 fails in 24 hrs / 79 fails in 10 days
Any new ideas on this one?
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit c17faaad2b1030b
Author: Eugene Nikanorov <email address hidden>
Date: Fri Jan 30 00:58:42 2015 +0300
Log tenant ports if subnet could not be deleted
Change-Id: Icd64cda9de2bb1
Related-Bug: #1357055
Matt Riedemann (mriedem) wrote : | #34 |
Since https:/
190 hits in 10 days across the check and gate queues; however, not all of those are failed jobs. When this shows up, only about 26% of the jobs are failures.
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
assignee: | nobody → Eugene Nikanorov (enikanorov) |
status: | New → In Progress |
Changed in neutron: | |
importance: | Undecided → Wishlist |
Eugene Nikanorov (enikanorov) wrote : | #36 |
The issue appears to be a simple race between DHCP port creation and subnet deletion.
The DHCP port gets created after the 'auto-delete' ports are cleared, but before other allocations are checked.
It doesn't happen too often, but it is annoying.
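A schematic, self-contained sketch (not the actual ML2 code) of that window: auto-delete DHCP ports are removed first, and a DHCP port recreated in the gap trips the in-use check:

    ports = [{"id": "p1", "device_owner": "network:dhcp", "subnet": "s1"}]

    def delete_auto_delete_ports(subnet_id):
        ports[:] = [p for p in ports
                    if not (p["subnet"] == subnet_id
                            and p["device_owner"] == "network:dhcp")]

    def subnet_has_allocations(subnet_id):
        return any(p["subnet"] == subnet_id for p in ports)

    def delete_subnet(subnet_id):
        delete_auto_delete_ports(subnet_id)
        # race window: the DHCP agent can recreate its port right here
        ports.append({"id": "p2", "device_owner": "network:dhcp",
                      "subnet": subnet_id})
        if subnet_has_allocations(subnet_id):
            raise RuntimeError("409: One or more ports have an IP "
                               "allocation from this subnet")

    delete_subnet("s1")   # raises, mirroring the gate failure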
Changed in neutron: | |
importance: | Wishlist → High |
Fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit d5aa1831ac95c16
Author: Eugene Nikanorov <email address hidden>
Date: Wed Apr 8 02:23:22 2015 +0400
Add logging to dangling port to ml2 delete_subnet
This is useful when troubleshooting test failures.
Also, in db_base_plugin_v2, log only port_id instead of
full allocation object.
Change-Id: I3d77318aee7083
Related-Bug: #1357055
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 1642bca4d9c4fee
Author: Eugene Nikanorov <email address hidden>
Date: Thu Apr 9 01:16:18 2015 +0300
Handle race condition on subnet-delete
This fix targets a quite rare race condition between
port creation and subnet deletion. This usually happens
during API tests that do things quickly.
The DHCP port is created after delete_subnet checks for
DHCP ports, but before it checks for IPAllocations on the subnet.
The solution is to apply retrying logic, which is really necessary
as we can't fetch new IPAllocations with the same query and within
the active transaction in mysql because of REPEATABLE READ
transaction isolation.
Change-Id: Ib9da018e654cde
Closes-Bug: 1357055
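A hedged sketch of the retrying idea described above (the real fix retries inside the plugin; the exception name below is a stand-in): each attempt runs its check in a fresh transaction, because under REPEATABLE READ re-querying inside the same open transaction would never see IPAllocations committed after the transaction began:

    import time

    class SubnetInUse(Exception):
        """Stand-in for neutron's 409 'subnet in use' error."""

    def delete_subnet_with_retry(delete_subnet, subnet_id,
                                 attempts=3, delay=0.5):
        for attempt in range(attempts):
            try:
                # each call opens a new transaction and sees fresh rows
                return delete_subnet(subnet_id)
            except SubnetInUse:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)   # let the racing DHCP port settle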
Changed in neutron: | |
status: | In Progress → Fix Committed |
Eugene Nikanorov (enikanorov) wrote : | #40 |
The fix was not correct in the essential query, so it produced exactly the same behavior with PostgreSQL.
Changed in neutron: | |
status: | Fix Committed → In Progress |
Fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 0109578a8ec07f7
Author: Eugene Nikanorov <email address hidden>
Date: Sat Apr 18 15:31:44 2015 +0400
Fix incorrect query for user ip allocations
Previously, the query fetched an IPAllocation object while incorrectly
relying on its port attribute being join-loaded when it
really was not.
Incorrect query produced by the previous code:

    SELECT ipallocations.
    FROM ipallocations, ports
    WHERE ipallocations.
    AND ports.device_owner NOT IN (:device_owner_1)

The query then may have produced results that don't satisfy
the condition intended by the code.

Query produced by the fixed code:

    SELECT ipallocations.
    FROM ipallocations JOIN ports ON ports.id = ipallocations.
    WHERE ipallocations.
    AND ports.device_owner NOT IN (:device_owner_1)
Change-Id: I34682df784e30e
Closes-Bug: #1357055
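A minimal SQLAlchemy sketch of the difference (model and constant names mirror neutron's schema but the snippet is illustrative, not neutron's code): filtering on a second table without joining it yields the implicit cross join shown above, so the device_owner filter can be satisfied by any port, not the allocation's own:

    from sqlalchemy import Column, ForeignKey, String
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Port(Base):
        __tablename__ = "ports"
        id = Column(String, primary_key=True)
        device_owner = Column(String)

    class IPAllocation(Base):
        __tablename__ = "ipallocations"
        port_id = Column(String, ForeignKey("ports.id"), primary_key=True)
        subnet_id = Column(String, primary_key=True)

    AUTO_DELETE_OWNERS = ["network:dhcp"]

    def user_allocations_wrong(session: Session, subnet_id: str):
        # emits "FROM ipallocations, ports": a cartesian product
        return (session.query(IPAllocation)
                .filter(IPAllocation.subnet_id == subnet_id,
                        Port.device_owner.notin_(AUTO_DELETE_OWNERS))
                .all())

    def user_allocations_fixed(session: Session, subnet_id: str):
        # the explicit JOIN ties each allocation to its own port
        return (session.query(IPAllocation)
                .join(Port, Port.id == IPAllocation.port_id)
                .filter(IPAllocation.subnet_id == subnet_id,
                        Port.device_owner.notin_(AUTO_DELETE_OWNERS))
                .all())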
Changed in neutron: | |
status: | In Progress → Fix Committed |
Fix proposed to branch: neutron-pecan
Review: https:/
Changed in neutron: | |
milestone: | none → liberty-1 |
status: | Fix Committed → Fix Released |
Changed in neutron: | |
milestone: | liberty-1 → 7.0.0 |
Ken'ichi Ohmichi (oomichi) wrote : | #44 |
Alex added Tempest to this bug report to get more logging for the investigation, and that was done in https:/
So this bug report should be closed now.
Changed in tempest: | |
status: | Confirmed → Fix Released |
e-r query here: https://review.openstack.org/#/c/114359/
If this is mostly just neutron full jobs, this might make sense: those jobs run with tenant isolation, so there may be multiple tests running at the same time allocating ports on the same subnet. Then, when one test completes and tries to tear down, it wants to delete the subnet, which still has one or more ports on it, so we fail.
Seems like we could just do a simple check in the nova neutronv2 API before trying to delete the subnet to see if there are any ports still assigned, or just handle the 409 and move on.
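A hedged sketch of the "handle the 409 and move on" option (neutronclient's delete_subnet call and Conflict exception are real; the wrapper is illustrative, not nova's actual neutronv2 code):

    from neutronclient.common import exceptions as n_exc

    def delete_subnet_best_effort(neutron, subnet_id):
        try:
            neutron.delete_subnet(subnet_id)
        except n_exc.Conflict:
            # HTTP 409: one or more ports still hold an allocation on
            # this subnet; move on instead of failing the whole teardown
            pass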