Neutron confused about overlapping subnet creation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | High | Kevin Benton | newton-rc1
Bug Description
If I use heat to create a set of networks, each with its own subnet, sometimes I will get an error from Neutron that there is an overlap.
There is an example heat template attached. In my environment, this will fail about 1 or 2 times in 10.
The heat template does:
for X in seq 0 10
create network X
assign subnet 172.16.X.0/24 to network X
One of the stacks will get an error on one of the subnets, indicating:
Requested Subnet With Cidr: 172.16.7.0/24 For Network: 73f99807-… overlaps with another subnet.
This is in turn coming from _validate_subnet_cidr().
What is happening is that the 'new_subnet_cidr' passed in is matching against itself.
Since I have 'allow_overlapping_ips' enabled, the check reduces to:
subnet_list = network.subnets
but, in the failure case, this subnet is already there. In the 'no failure case' it is not yet there.
I have 3 heat API servers and 72 heat workers (across the 3 servers). The API servers are round-robin load balanced.
I just (2016-06-24) reproduced this against the master/head from github (so midway between mitaka and newton).
I'm not sure if there is some missing locking, but for sure this is a race condition.
I have ipam_driver unset
If I comment out the exception in _validate_subnet_cidr, the creates succeed.
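(For illustration, a minimal sketch of the overlap check as described above; simplified names, not the actual Neutron code, with 'network' standing in for any object exposing the stored subnets:)

    import netaddr

    def validate_subnet_cidr(network, new_subnet_cidr):
        # With allow_overlapping_ips enabled, the candidate CIDR is only
        # compared against subnets of the same network.
        new_ips = netaddr.IPSet([new_subnet_cidr])
        subnet_list = network.subnets
        for subnet in subnet_list:
            # In the failure case the just-committed subnet is already in
            # subnet_list, so the new CIDR "overlaps" with itself.
            if new_ips & netaddr.IPSet([subnet.cidr]):
                raise ValueError(
                    'Requested subnet with cidr: %s for network: %s '
                    'overlaps with another subnet' %
                    (new_subnet_cidr, network.id))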
Don Bowman (donbowman) wrote : | #1 |
Don Bowman (donbowman) wrote : | #2 |
tags: added: l3-ipam-dhcp needs-attention
Changed in neutron:
importance: Undecided → High
Carl Baldwin (carl-baldwin) wrote : | #3 |
@Don, what leads you to believe that the new subnet is matching with itself? From what I can tell, this is impossible. Take a look at _save_subnet [1] from where _validate_subnet_cidr is called: the new subnet isn't persisted until after the validation passes, so it shouldn't be able to match against itself.
The only thing that I can think of is that the API is being called twice for the same cidr. Is there any way that could happen?
Changed in neutron:
status: New → Incomplete
Don Bowman (donbowman) wrote : | #4 |
Um, I guess I'm not 100% sure.
This tenant has no networks when I call the heat stack-create, and as you can see from the template it has no duplicates, so it must be matching against itself.
Either on one call, or 2 calls that are hitting different api servers @ the same instant. All the api calls here are from heat. I guess some tracing might help.
It's definitely a race condition. E.g. if I make a stack w/ 10 networks, each w/ 1 subnet, it will fail ~1 in 10 (not always the same subnet). If I do it w/ 100 networks, each w/ 1 subnet, it will fail ~90% of the time.
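(A rough sanity check on those numbers, assuming each subnet create hits the race independently with probability p: a 1-in-10 stack failure rate over 10 subnets implies p ≈ 0.01; a 100-subnet stack then fails with probability 1 - (1 - 0.01)^100 ≈ 63%, or ≈ 87% at p = 0.02, consistent with the ~90% observed.)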
Don Bowman (donbowman) wrote : | #5 |
https:/
has the logs. Files are:
test.yaml --> run this with 'openstack stack create -t test.yaml fubar'
stack-output.txt --> the error that comes out (the overlap)
neutron-
Before I ran this test i dropped my neutron database and recreated, so it was empty. This was a single stack create, so there was for sure no overlapping subnet.
The neutron logs have debug+verbose on. I built neutron from github immediately before the test.
Don Bowman (donbowman) wrote : | #6 |
In the logs you can also see this error:
DBAPIError exception wrapped from (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
which indicates a deadlock + unwind/restart.
Don Bowman (donbowman) wrote : | #7 |
https:/
has 3 files:
nubo-[1-
these are the API logs via nginx, much easier to read than the neutron debug output
Don Bowman (donbowman) wrote : | #8 |
- logs (neutron debug, nginx access, heat stack + output_ (177.4 KiB, application/x-tar)
Attached are the output logs
Don Bowman (donbowman) wrote : | #9 |
- logs (neutron debug, nginx access, heat stack + output_ (177.4 KiB, application/x-tar)
Attached are the output logs
Miguel Lavalle (minsel) wrote : | #10 |
I have executed this script http://… and could not reproduce the failure.
Don Bowman (donbowman) wrote : | #11 |
What about with > 1 controller node and round-robin load balanced?
That is where the error occurs for me.
Don Bowman (donbowman) wrote : | #12 |
I have 3 API servers for heat, for neutron, for nova (and glance, cinder, horizon). All are round-robin load balanced via nginx front-ends.
If i delete 2 of the 3, then the error does not occur.
Don Bowman (donbowman) wrote : | #13 |
also, i think that script will be dramatically slower than letting heat drive the API with all its workers.
Miguel Lavalle (minsel) wrote : | #14 |
I don't have that setup ready. In the meantime, I found this in regards to the tracebacks in the log file: https:/
Don Bowman (donbowman) wrote : | #15 |
Yes i was looking @ that one too. I do have the fix in question.
If i switch to python 3.5 for the test, i find that the deadlock is occurring due to quota updates.
I managed to reproduce with a python script instead of heat. It takes a few more tries to get it to happen, but it did happen on the 4th time i tried.
import sys
from neutronclient.v2_0 import client

# Auth arguments were elided in the original; the keystone parameters
# below are placeholders.
neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone:5000/v2.0')
ids = []
sids = []
for i in range(1, 254):
    network = {'name': 'pnet-%s' % i, 'admin_state_up': True}
    n = neutron.create_network({'network': network})
    id = n['network']['id']
    ids.append(id)
    # ranges reconstructed from the generator script at the end of this report
    srange = '172.16.%u.9' % i
    erange = '172.16.%u.254' % i
    cidr = '172.16.%u.0/24' % i
    name = "subnet-%u" % i
    subnet = {'enable_dhcp': True, 'network_id': id,
              'allocation_pools': [{'start': srange, 'end': erange}],
              'ip_version': 4, 'cidr': cidr, 'name': name}
    s = neutron.create_subnet({'subnet': subnet})
    sid = s['subnet']['id']
    sids.append(sid)
for i in sids:
    print("delete subnet %s" % i)  # %s: the id is a UUID string
    neutron.delete_subnet(i)
for i in ids:
    neutron.delete_network(i)
Don Bowman (donbowman) wrote : | #16 |
https:/
reproduces it using just python. Not every run fails, but at least one run in 5 that I made did.
i have 3 api servers online, round-robin load balanced from nginx.
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #17 |
Fix proposed to branch: master
Review: https:/
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Incomplete → In Progress
Kevin Benton (kevinbenton) wrote : | #18 |
Changed in neutron:
milestone: none → newton-2
Don Bowman (donbowman) wrote : | #19 |
Tested the patch. It does not resolve the issue; same problem:
| stack_status_reason | Resource CREATE failed: BadRequest: resources.sn125: Invalid input for operation: Requested subnet with cidr: 172.16.125.0/24 for network: 6d2b3077-… overlaps with another subnet. |
comes out, even though I am clearly not creating that twice. I double checked that I applied the patch to all 3 API servers.
Kevin Benton (kevinbenton) wrote : | #20 |
Can you try one more time with the latest version of that patch? I made an adjustment to make sure it works with the latest changes to the engine management.
Also, I downloaded the logs from comment #9 and the neutron server logs don't contain any of the failed API requests. I can't see any failed "POST //v2.0/subnets.json" requests in them.
Did you attach the logs from the right neutron servers?
Don Bowman (donbowman) wrote : | #21 |
Don Bowman (donbowman) wrote : | #22 |
See #21; the logs are there for sure, don't know what happened before.
the subnets.json calls are in the nginx* logs, and i do see the good/bad ones in there now.
You can also see the
InternalError: (1305, u'SAVEPOINT sa_savepoint_1 does not exist')
which is referenced in https:/
def is_retriable(e):
    if _is_nested_instance(e, (db_exc.DBDeadlock, exc.StaleDataError,
                               db_exc.DBDuplicateEntry, db_exc.RetryRequest)):
        return True
    # looking for savepoints mangled by deadlocks. see bug/1590298 for details.
    return _is_nested_instance(e, db_exc.DBError) and '1305' in str(e)
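(For context, a hedged sketch of how a checker like this plugs into oslo.db's retry decorator; the decorator and option names are from oslo_db.api, but the wrapped function below is made up:)

    from oslo_db import api as oslo_db_api

    # Retry DB-mutating work when is_retriable() says the failure is
    # transient (deadlock, stale data, duplicate entry, mangled savepoint).
    @oslo_db_api.wrap_db_retry(max_retries=10, retry_interval=0.1,
                               inc_retry_interval=True,
                               exception_checker=is_retriable)
    def create_subnet_db(context, subnet):
        ...  # the actual DB insert/validation work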
Kevin Benton (kevinbenton) wrote : | #23 |
@Don,
Something is still missing from those logs. I grepped them and there are no failed subnet requests in the Neutron server logs.
grep POST * | grep -v '1.1" 201'
That should return the failed subnet requests in neutron server logs but I only see the NGINX logs.
There are also no log messages about overlapping subnets. There should be one like this:
2016-07-08 14:55:52.027 11332 INFO neutron.
Is it possible for you to setup nginx to log which backend it's communicating with and do another run so we know which one is supposed to be failing?
Note that the savepoint errors don't indicate a complete failure, they just indicate that a request had to be retried due to a deadlock so the server they occur on may not be related to the one generating that overlap message.
Don Bowman (donbowman) wrote : | #24 |
n1-neutron.log
shows the failure, and the specific backend (the backends are n1...n3).
Is there something more i need to enable for logging? I do see the overlapping subnet in the log.
Don Bowman (donbowman) wrote : | #25 |
Don Bowman (donbowman) wrote : | #26 |
Sigh, I think I reuploaded the same log files in #21.
#25 is a different size, and for sure I see that error @ n1-neutron.log.
Kevin Benton (kevinbenton) wrote : | #27 |
Thanks, and this run is with the second revision of https:/ ?
Don Bowman (donbowman) wrote : | #28 |
Don Bowman (donbowman) wrote : | #29 |
Yes, that was with the second revision.
Another set of logs in #28 also has a tcpdump on port 9696 on each API node (and a text dump of it) to show the raw protocol.
Kevin Benton (kevinbenton) wrote : | #30 |
I think I've tracked down the issue to agent-related queries causing DBDeadlocks that happen after the subnet creation (in this particular case updating the agent load value). This causes the API layer to use its retry mechanism even though the subnet has been persisted to the DB, which leads to the overlapping error.
Please try again with the latest version of https:/
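(A minimal sketch of the failure mode described in #30; simplified pseudocode with made-up names, not Neutron's actual control flow:)

    class DBDeadlock(Exception):
        """Stand-in for oslo_db's DBDeadlock."""

    def create_subnet_api(plugin, request, max_retries=10):
        # Sketch of the API-level retry loop.
        for attempt in range(max_retries):
            try:
                subnet = plugin.create_subnet(request)  # commits the subnet row
                plugin.update_agent_load()  # agent-related query can deadlock
                                            # *after* the subnet is persisted
                return subnet
            except DBDeadlock:
                if attempt == max_retries - 1:
                    raise
                # The retry re-runs create_subnet from scratch; the row from
                # the first attempt is already committed, so the overlap check
                # now matches the "new" CIDR against itself -> BadRequest.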
Don Bowman (donbowman) wrote : | #31 |
- logs-2016-07-10.tar.gz (692.4 KiB, application/x-tar)
Well... i tried it.
I get the same error out. I did not get that log message to fire. Attached the logs.
Kevin Benton (kevinbenton) wrote : | #32 |
Is it possible to just do another run and capture the logs? It seems like there is some missing data (maybe a bug in oslo logging or in the logging collector?).
I see in the PCAP that an HTTP 500 was returned by neutron for request 'req-2530b6c6-
(Sorry about all of these requests but I haven't been able to reproduce this behavior locally with 3 Neutron API servers behind nginx)
Don Bowman (donbowman) wrote : | #33 |
will do.
FWIW i have 3 nginx as well, not sure if that matters(?)
My architecture is:
1 VIP (keepalived DR-mode) points to port 443 on 3 nginx servers
each nginx server has neutron1...3 in its list (and all other OpenStack services)
so when i do a curl https:/
I am using mysql+galera as 1+1+1 as well, and any or all of them will hit one of the mysql backends.
Don Bowman (donbowman) wrote : | #34 |
Don Bowman (donbowman) wrote : | #35 |
In #34 is another run w/ the logs, w/ change from #30.
The subnet w/ the error this time is: ed204a09-…
grep -l ed204a09-
n1-neutron.log
n2-neutron-
n3-neutron.log
n1-neutron.pcap.txt
n3-neutron.pcap.txt
So the n2 nginx dealt two requests for this subnet: one to n1, one to n3.
The n1 req (from pcap):
010.100.
X-Forwarded-For: 172.17.0.5
X-Real-IP: 172.17.0.5
X-Forwarded-Host: neutron.
X-Forwarded-Server: neutron.
X-Forwarded-Proto: https
Host: neutron.
Connection: Keep-Alive
Content-Length: 241
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-
Content-Type: application/json
X-Auth-Token: 1100c1044472418
{"subnet": {"enable_dhcp": true, "network_id": "ed204a09-
and the n3 req:
{"network": {"status": "ACTIVE", "subnets": [], "availability_
010.100.
{"network": {"status": "ACTIVE", "subnets": [], "availability_
010.100.
{"networks": [{"id": "ed204a09-
Resource CREATE failed: BadRequest: resources.sn253: Invalid input for operation: Requested subnet with cidr: 172.16.253.0/24 for network: b1ef80c2-… overlaps with another subnet.
This one also failed, I guess @ the same time.
Kevin Benton (kevinbenton) wrote : | #36 |
There is a big gap in the n2 neutron logs that is hiding the original requests from several retries. The two statements below are 24 seconds apart.
2016-07-11 02:47:41.528 248 ERROR oslo_db.
2016-07-11 02:48:05.043 242 DEBUG neutron.
Can you try using the log_dir and log_file options for oslo log in case the stderr printing is being rate limited by something? I'm having difficulty finding a single failed request ID that can be tracked from its initial request all of the way to a response because there is so much lost log data.
An example is '172.16.170.0/24'. You can see in the pcap that it got an overlap error but there is nothing in the logs about receiving a create request for that payload or even that it generated an overlap error for that CIDR.
Don Bowman (donbowman) wrote : | #37 |
- logs-2016-07-11.tar.gz (62.4 MiB, application/x-tar)
attached.
I changed:
log_dir = /var/log/neutron
log_file = neutron.log
use_stderr = false
and restarted the 3 neutron api.
Note the error reported to heat was different:
Resource CREATE failed: DBDeadlock: resources.sn0: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
but there is still the overlaps failure:
Validation for CIDR: 172.16.41.0/24 failed - overlaps with subnet 0bce614f-
In the pcap, I captured more than just port 9696 (so you can see the mysql traffic too). I decoded it to text, but it is likely easier viewed in wireshark with 'follow' on port 3306 vs 9696.
Don Bowman (donbowman) wrote : | #38 |
PS: the logs were rate limited by journald.
http://
RateLimitInterval=30s
RateLimitBurst=1000
can be overridden. I'll know for next time!
Kevin Benton (kevinbenton) wrote : | #39 |
Thanks! Can you do me a favor and run one more time with the latest version of https:/
Now that we have full logs I probably don't need the pcaps anymore if that makes it easier for you.
Don Bowman (donbowman) wrote : | #40 |
Kevin Benton (kevinbenton) wrote : | #41 |
Excellent. Try the latest version of https:/
The error is in the quota committing logic.
Don Bowman (donbowman) wrote : | #42 |
- logs-2016-07-11-2.tar.gz (461.5 KiB, application/x-tar)
Well, new behaviour.
No error, stack create has been going for 10min.
It created 255 networks, and 255 subnets, as expected, no error.
But it doesn't finish.
Attached logs after 15 min of sitting.
Don Bowman (donbowman) wrote : | #43 |
- logs-2016-07-10-3.tar.gz (1.3 MiB, application/x-tar)
Additionally, the delete gave an error; attached.
Resource DELETE failed: DBError: resources.n237: This Connection is closed
Don Bowman (donbowman) wrote : | #44 |
- logs-2016-07-11-4.tar.gz (1.7 MiB, application/x-tar)
And a different failure, this time a 500 on 'cf651054-…
DBError: This Connection is closed
Kevin Benton (kevinbenton) wrote : | #45 |
I'm not sure that error is coming from Neutron. Can you check the heat engine logs to see if you can get the full traceback?
Kevin Benton (kevinbenton) wrote : | #46 |
Whoops. nevermind. looking into it.
Kevin Benton (kevinbenton) wrote : | #47 |
Ok. Give the latest version of the patch another try. I think that resource closed error might be masking another exception though...
Don Bowman (donbowman) wrote : | #48 |
- logs-2016-07-11-5.tar.gz (253.6 KiB, application/x-tar)
Attached. I reapplied the patch from scratch to confirm I did it properly.
Kevin Benton (kevinbenton) wrote : | #49 |
Found another query that needed to ensure the session was there. Retry with https:/
Don Bowman (donbowman) wrote : | #50 |
- logs-2016-07-12.tar.gz (303.2 KiB, application/x-tar)
Logs attached; it gets a NetworkNotFound error.
Resource CREATE failed: NetworkNotFound
Neutron server returns request_ids: ['req-577218f9-
Kevin Benton (kevinbenton) wrote : | #51 |
Give it another shot with https:/
(BTW, thanks for helping with this. We have quite a few more contentious DB operations and compare-and-swap logic for better consistency, but that results in more deadlock risk, especially in galera, so we need to get these cases shaken out.)
Don Bowman (donbowman) wrote : | #52 |
- logs-2016-07-13.tar.gz (3.4 MiB, application/x-tar)
Here is a set of logs. Same as in #42, it appears to have worked correctly, extremely slowly, but heat doesn't think it is done.
It seems to take ~4-5s/network on the agent (n4 is the dhcp+l3 agent in these logs, it does not run api), so for 255 networks it's *hugely* slow.
I see a lot of errors around the deadlock mechanism, but I don't see un-retried errors (but I didn't look that hard).
I will run again.
Don Bowman (donbowman) wrote : | #53 |
Forgot to mention: pre the patch in #51, I let it run overnight and heat never found that neutron finished.
Don Bowman (donbowman) wrote : | #54 |
- logs-2016-07-13-1.tar.gz (5.9 MiB, application/x-tar)
attached is the stack delete of the above stack create.
This caused a failure from neutron:
Resource DELETE failed: DBError: resources.n119: This Connection is closed
Kevin Benton (kevinbenton) wrote : | #55 |
The "This Connection is closed" error is coming from the heat engine. That one doesn't seem to have anything to do with Neutron.
Kevin Benton (kevinbenton) wrote : | #56 |
The slowness is definitely a result of lots of retries. I think we can look at that as a separate bug though. First I want to get it where everything is correct then worry about performance.
You should get a significant improvement just from using a single-writer deployment for Galera. Galera manifests failures as latent deadlocks on commits when using multiple writers so they are very expensive to recover from. With a single writer, constraint violations and staledata errors are recognized locally before it comes time to commit.
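(A sketch of the difference, with assumed names; QuotaUsage stands in for the quota usage model. Under multi-writer Galera the conflict only surfaces as a certification failure when the transaction commits, so the retry has to span the whole transaction:)

    from oslo_db import exception as db_exc

    def bump_quota_usage(session_factory, tenant_id, delta, max_retries=10):
        for attempt in range(max_retries):
            session = session_factory()
            try:
                with session.begin():
                    usage = (session.query(QuotaUsage)  # assumed ORM model
                             .filter_by(tenant_id=tenant_id)
                             .with_for_update().one())
                    usage.in_use += delta
                return  # single writer: conflicts block on the row lock above
            except db_exc.DBDeadlock:
                # multi-writer: the deadlock is raised here, at COMMIT,
                # forcing a full (expensive) redo of the transaction
                if attempt == max_retries - 1:
                    raise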
Don Bowman (donbowman) wrote : | #57 |
The primary source of the slowness is on the agent. The database is done, all the entries are there, but now the agent is applying them one by one, which involves netns create, ip route, ip mtu, ovs flush etc., each as a subshell.
I'll take a look @ the heat logs, I let it run all day and it did not discover that it was complete (even w/o that error).
Don Bowman (donbowman) wrote : | #58 |
on the heat delete, I usually get errors like:
Resource DELETE failed: DBError: resources.n149: This Connection is closed
2016-07-13 23:02:14.582 1562 DEBUG heat.common.
i'll dig more
Don Bowman (donbowman) wrote : | #59 |
- logs-16-07-14-03:44:04.tar.gz (26.5 MiB, application/x-tar)
OK I have switched galera to a single writer using MaxScale.
it generally works on most creates, but there are some failures. Attached is an example
Resource CREATE failed: InternalServerError: …
Neutron server returns request_ids: ['req-07fc7999-
the neutron backend is not responding in time and nginx is timing it out (perhaps a different sort of deadlock?)
Don Bowman (donbowman) wrote : | #60 |
whoops, forgot the neutron logs, see attached
Don Bowman (donbowman) wrote : | #61 |
Don Bowman (donbowman) wrote : | #62 |
- logs-16-07-14-04:33:04.tar.gz (1.7 MiB, application/x-tar)
And one more.
Here the l3-agent (n4) gets
RemoteError: Remote error: OperationalError (pymysql.
this seems to occur after a bunch of
2016-07-14 04:31:43.687 3995 DEBUG neutronclient.
[332B blob data]
[44B blob data]
[23B blob data]
[42B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
http_log_resp /usr/local/
[198B blob data]
[44B blob data]
[23B blob data]
[42B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
_handle_… has occurred in heat.
2016-07-14 04:31:43.707 3995 INFO heat.engine.
2016-07-14 04:31:43.707 3995 ERROR heat.engine. … (repeated truncated traceback lines)
Don Bowman (donbowman) wrote : | #63 |
OK, I bumped my mysql connection count to 4K per server, and ran some more tests.
Generally a single heat stack will complete correctly.
If 2 are started @ the same time, it will generally fail.
If 3, all will fail in an exception from neutronclient around subnet creation, as in
logs-16-…
- - -] Task create from Subnet "sn1" Stack "t31" [dbd77b02-
[332B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
http_log_resp /usr/local/
[198B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
_handle_
2016-07-14 13:29:50.827 3982 DEBUG neutronclient.
2016-07-14 13:29:50.828 3988 INFO heat.engine.
2016-07-14 13:29:50.828 3988 ERROR heat.engine. … (repeated truncated traceback lines)
Don Bowman (donbowman) wrote : | #64 |
Don Bowman (donbowman) wrote : | #65 |
Re the python script in #15: formerly, running a couple of those would cause a failure.
4 have been running all day w/ the patch w/o issue.
I'm not sure why heat is different. Faster? Interaction w/ RPC? Something else?
Changed in neutron:
milestone: newton-2 → newton-3
Don Bowman (donbowman) wrote : | #66 |
- logs-16-07-16-16.tar.gz (16.8 MiB, application/x-tar)
OK after 3 days, one of the python scripts [4 running in parallel on different clients] hit the same error as heat.
Traceback (most recent call last):
File "repo.py", line 30, in <module>
neutron.delete_subnet(i)
File "/cephfs/
return self.delete(
File "/cephfs/
headers=
File "/cephfs/
headers=
File "/cephfs/
self.
File "/cephfs/
exception_
File "/cephfs/
request_
neutronclient.
Neutron server returns request_ids: ['req-0e561b63-
some logs in attached
Kevin Benton (kevinbenton) wrote : | #67 |
Hey,
Sorry about the delay. Looking into the logs from the last several messages.
To answer your question about heat vs the scripts: heat can be experiencing more failures because heat itself uses mysql to store task states, and when you have multiple workers talking to different writers in a galera cluster, things that should have blocked on a record lock instead manifest as a Deadlock error on commit.
Basically it can be experiencing failures for the same reason Neutron is. Openstack doesn't get any in-tree testing of multi-writer environments so there are places where failures can slip through or retries get a lot more expensive than developers initially anticipated.
Kevin Benton (kevinbenton) wrote : | #68 |
Regarding the agent comment. I looked at the logs and the HTTP responses are very slow (likely because of the DB retries). The agents don't block the API response time so even if the agents were running very slowly you shouldn't experience any API failures when creating subnets and networks.
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #69 |
Related fix proposed to branch: master
Review: https:/
Kevin Benton (kevinbenton) wrote : | #70 |
Ok, the error in number 66 appears to be due to a lost connection to the DB[1]. I've adjusted https:/
Most of the contention seems to be related to the quota engine since this is all under one tenant. I have some separate fixes I'm going to work on for that.
Can you try two things for me? One run with the latest version of 339226 and then another with the same patch and the quotas set to unlimited for that tenant (-1 for networks, subnets, and ports) since that will effectively skip a bunch of the contentious reservation logic.
1. error is "DBConnectionError: (pymysql.
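(For reference, zeroing the quotas for a tenant looked roughly like 'neutron quota-update --tenant-id <tenant-uuid> --network -1 --subnet -1 --port -1' with the neutron CLI of that era; the flag spellings are from memory and may vary by client version.)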
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #71 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit c46edbc7d62585b
Author: Kevin Benton <email address hidden>
Date: Mon Jul 18 22:39:19 2016 -0600
Use db_api.
The quota engine was still using oslo_db wrap_db_retry
which does not automatically take into account deadlocks
that occur inside of nested savepoint transactions.
In one case it didn't matter because it passed in the correct
exception checker but in the one that protected 'set_quota_usage'
it did not use is_retriable so a deadlock inside of the savepoint
would have resulted in a much more expensive retry all of the way
up at the API layer.
This patch just adjusts them to both use the standard neutron
retry_db_errors decorator.
Change-Id: I1e45eb15f14bf3
Related-Bug: #1596075
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #72 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #73 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #74 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 3213dd884e15d1e
Author: Kevin Benton <email address hidden>
Date: Fri Jul 22 09:56:30 2016 -0700
Mark DBConnectionError as retriable
DBConnectionError can occur any time the connection
to the database gets closed (e.g. sql restart or
network connectivity failure). This should be part
of our standard list of exceptions to retry so we
are resilient to intermittent connectivity issues.
The oslo db wrap_db_retry decorator has this option
as well[1] but we can't use it because of our logic to
detect retriable errors in MultipleExceptions types.
Partial-Bug: #1596075
Change-Id: I26423e306173d9
Don Bowman (donbowman) wrote : | #75 |
- logs-16-07-24-20:26:54.tar.gz (1.5 MiB, application/x-tar)
OK, i applied the 3 patches [i did a new pull first] and re-ran w/ the quota still in place.
I got an API timeout on one of the subnets (see attached logs).
2016/07/24 20:20:04 [error] 4876#4876: *266186 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.17.0.3, server: neutron.
I will re-run w/ the quota disabled.
Don Bowman (donbowman) wrote : | #76 |
OK, w/ the quota disabled, I was able to create two of the heat stacks simultaneously.
Don Bowman (donbowman) wrote : | #77 |
- logs-16-07-24-21:49:20.tar.gz (6.8 MiB, application/x-tar)
OK, more tries. With quota set to -1 and creating several heat stacks simultaneously, it will still fail.
See attached logs for
don@nubo-
+------
| ID | Stack Name | Stack Status | Creation Time | Updated Time |
+------
| baae4b94-
| e7a3f4c7-
| f39434de-
| a6ca3844-
| be7738b9-
| 3437f4db-
| 9b690521-
| b0fa4727-
| 02d3c641-
| 7299c3cf-
+------
tags: added: neutron-proactive-backport-potential
Kevin Benton (kevinbenton) wrote : | #78 |
Waiting on getting some work merged into master then I will have you retest.
OpenStack Infra (hudson-openstack) wrote : | #79 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 3ad23f42c7b1fce
Author: Kevin Benton <email address hidden>
Date: Thu Jul 7 00:44:59 2016 -0700
Always start transactions in quota cleanup methods
If the previous action that led to the quota reservation
cancelling was the result of a database connection getting
interrupted, attempting to query without calling session.begin()
will result in a sqlalchemy.
This alters the quota methods that mutate DB state to use a
transaction with the new oslo DB enginefacade decorators that
start a transaction for us.
Partial-Bug: #1596075
Partially-
Change-Id: I3d0539b11795cb
Don Bowman (donbowman) wrote : | #80 |
I'm not clear on whether I'm to test w/ solely this last commit, w/ all of them, or to wait for a bit more?
Changed in neutron:
milestone: newton-3 → newton-rc1
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #81 |
Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https:/
Reason: moving this stuff down into the plugin layer here: I7b8a4a105aabfa
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #82 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #83 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #84 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 09c87425fa028db
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:27:49 2016 -0700
Prepare retry decorator to move to plugin level
Retrying mutating operations at the API layer caused a
couple of problems. First, when components would call
the core plugin using the neutron manager, they would
have to handle the retriable failures on their own or
undo any work they had done so far and allow retriable
failures to be propagated up to the API. Second, retrying
at the API makes composite operations (e.g. auto allocate,
add_router_interface, etc.) have to
consider if exceptions are retriable before raising
fatal exceptions on failures of core plugin calls.
This patch begins the process of moving them down to the
core operations with a new decorator called
'retry_if_session_inactive', which ensures that the
retry logic isn't triggered if there is an ongoing transaction
since retrying inside of a transaction is normally ineffective.
Follow-up patches apply them to various parts of the code-base.
Additionally, the args and kwargs of the method are automatically
deep copied in retries to ensure that any mangling the methods
do to their arguments don't impact their retriability.
Finally, since we are leaving the API decorators in place for now,
the retry logic will not be triggered by another decorator if an
exception has already been retried. This prevents an exponential
explosion of retries on nested retry decorators.
The ultimate goal will be to get rid of the API decorators entirely
so retries are up to each individual plugin.
Partial-Bug: #1596075
Partial-Bug: #1612798
Change-Id: I7b8a4a105aabfa
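(A simplified sketch of what such a decorator can look like, based only on the behavior the commit message describes; the internals here are assumed, not the real implementation, and is_retriable is the checker quoted in #22:)

    import copy
    import functools

    def retry_if_session_inactive(max_retries=10):
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, context, *args, **kwargs):
                if context.session.is_active:
                    # Ongoing transaction: retrying inside it is normally
                    # ineffective, so just run the function once.
                    return f(self, context, *args, **kwargs)
                for attempt in range(max_retries):
                    try:
                        # Deep copy the args so a failed attempt can't
                        # mangle them and spoil the retry.
                        return f(self, context,
                                 *copy.deepcopy(args),
                                 **copy.deepcopy(kwargs))
                    except Exception as e:
                        if attempt == max_retries - 1 or not is_retriable(e):
                            raise
            return wrapper
        return decorator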
OpenStack Infra (hudson-openstack) wrote : | #85 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit acbabaa3db2c12a
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:42:40 2016 -0700
Make l2/l3 operations retriable at plugin level
This adds decorators to the ML2/db_base_plugin_v2 and L3
DB modules to make L2 and L3 resource operations retriable
at the plugin level.
Retrying the L2 operations in particular at this level should
go a long way to improving the reliability of complex operations
that require L2 resources in the face of deadlocks under high
concurrency.
Partial-Bug: #1612798
Partial-Bug: #1596075
Change-Id: I3c9437f7ecdd5e
OpenStack Infra (hudson-openstack) wrote : | #86 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 12420c1585abadc
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:51:58 2016 -0700
Mark quota operations as retriable
This decorates the quota system operations with
the retry decorators. This will help significantly
with the bug this marks as closed since operations
in the quota engine after commit should no longer
trigger retries of the full API operation.
The logic to find the args in the decorator had
to be adjusted to deal with functions already decorated.
This just uses the getargspec function from pecan that
deals with decorated functions.
Partial-Bug: #1612798
Closes-Bug: #1596075
Change-Id: Ib786117dcea08a
Changed in neutron:
status: In Progress → Fix Released
Kevin Benton (kevinbenton) wrote : | #87 |
Hi Don,
Can you test with the latest state of master?
Don Bowman (donbowman) wrote : | #88 |
I have tried; I'm getting an error in the schema which is unlikely to be related to this: u"Unknown column 'trunks_…
I'll debug and try more this week.
Kevin Benton (kevinbenton) wrote : | #89 |
Sounds like you might need a schema migration: "neutron-db-manage --config-file /etc/neutron/neutron.conf upgrade head"
Don Bowman (donbowman) wrote : | #90 |
ok database dropped and created fresh.
Updated from head.
galera configured for single writer.
ran single test (e.g. 1 heat create of 255 networks each w/ subnet) -- Passed
ran stack delete -- Passed
(ran 2 more times, Passed)
ran 3 parallel tests: fail (timeout)
increased timeouts in nginx front-end of API, increased workers.
pass
so it looks good. Thanks for the assistance.
Don Bowman (donbowman) wrote : | #91 |
I might have spoken too soon. I'm getting some errors on deleting (when I delete a large number simultaneously; they work when issued serially).
Digging in on why: it's a deadlock that exceeds the retry limit.
: DBDeadlock: (pymysql.
: wrapped /usr/local/
Deleting subnet cc57f513-
: 2016-09-24 19:06:56.762 228 ERROR oslo_db.api [req-a671d942-
Don Bowman (donbowman) wrote : | #92 |
Perhaps worse: when I retry the stack delete, it works (and shows all deleted), but I end up with leftover networks that did not get deleted.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc1 | #93 |
This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.
Kevin Benton (kevinbenton) wrote : | #94 |
Do you have a sample trace from a request that says the network successfully deleted when it was actually left behind? (all of the logs from the corresponding network delete request id)
Kevin Benton (kevinbenton) wrote : | #95 |
Can you open a separate bug for the leftover networks issue? The original issue for this bug (the overlapping subnets thing) is fixed.
Don Bowman (donbowman) wrote : | #96 |
I'm not 100% sure this is solved, but OK.
It sometimes still fails on create.
It sometimes still fails on delete.
It sometimes passes the delete, but leaves some things behind.
0f68089d-
is a network left behind. The attached logs in logs-16-
Don Bowman (donbowman) wrote : | #97 |
tags: removed: neutron-proactive-backport-potential
Kevin Benton (kevinbenton) wrote : | #98 |
I grepped the logs and there is only one mention of that network UUID, during its creation in n2-neutron.log. I don't see the HTTP request at all for the delete attempt in the neutron logs, heat logs, or nginx logs. Does that log include everything?
Kevin Benton (kevinbenton) wrote : | #99 |
Same when looking for the name itself, 't2_n70': only one mention.
OpenStack Infra (hudson-openstack) wrote : | #100 |
This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.
If I do 254 subnets (using a generation script like below), and then do an 'openstack stack create' using that resultant yaml, there will normally be 1 or more subnets in error.
Some will show:
BadRequest: resources.sn209: Invalid input for operation: Requested subnet with cidr: 172.16.209.0/24 for network: 9cf04cbb-9b32-4798-bdee-93ed28f27827 overlaps with another subnet. Neutron server returns request_ids: ['req-85602ea2-3407-4137-b314-625f0b6dc993']
and some will show:
DBDeadlock: resources.sn28: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
#!/bin/bash
cat <<EOF1
heat_template_version: 2013-05-23
description: >
  Test subnet creation
parameters:
resources:
EOF1
for i in $(seq 0 254)
do
cat << EOF
  n$i:
    type: OS::Neutron::Net
    properties:
      name: n$i
  sn$i:
    type: OS::Neutron::Subnet
    properties:
      name: sn$i
      enable_dhcp: True
      network_id: { get_resource: n$i }
      cidr: 172.16.$i.0/24
      allocation_pools:
      - start: 172.16.$i.9
        end: 172.16.$i.254
EOF
done
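(The script's stdout is the test.yaml used above; generate and launch it as in #5, e.g. './generate.sh > test.yaml', where the script name is hypothetical, followed by 'openstack stack create -t test.yaml fubar'.)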