Bug #1596075 “Neutron confused about overlapping subnet creation...” : Bugs : neutron

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-24:

#1

Example heat template Edit (3.1 KiB, text/plain)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-25:

#2

If I do 254 subnets (using a generation script like below), and then do an 'openstack stack create' using that resultant yaml, there will normally be 1 or more subnets in error.

Some will show:

BadRequest: resources.sn209: Invalid input for operation: Requested subnet with cidr: 172.16.209.0/24 for network: 9cf04cbb-9b32-4798-bdee-93ed28f27827 overlaps with another subnet. Neutron server returns request_ids: ['req-85602ea2-3407-4137-b314-625f0b6dc993']

and some will show:
DBDeadlock: resources.sn28: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')

#!/bin/bash
cat <<EOF1
heat_template_version: 2013-05-23

description: >
Test subnet creation

parameters:

resources:
EOF1

for i in $(seq 0 254)
do
    cat << EOF
  n$i:
    type: OS::Neutron::Net
    properties:
      name: n$i

  sn$i:
    type: OS::Neutron::Subnet
    properties:
      name: sn$i
      enable_dhcp: True
      network_id: { get_resource: n$i }
      cidr: 172.16.$i.0/24
      allocation_pools:
        - start: 172.16.$i.9
          end: 172.16.$i.254

EOF
done

Rossella Sblendido (rossella-o) on 2016-06-28

tags:

added: l3-ipam-dhcp needs-attention

Carl Baldwin (carl-baldwin) on 2016-06-28

Changed in neutron:
importance:	Undecided → High

Revision history for this message

Carl Baldwin (carl-baldwin) wrote on 2016-06-28:

#3

@Don, what leads you to believe that the new subnet is matching with itself? From what I can tell, this is impossible. Take a look at _save_subnet [1] from where _validate_subnet_cidr is called. The subnet DB model isn't created and added to the session until after _validate_subnet_cidr has been called to validate the cidr and finished. So, there is no chance that the DB has seen this subnet and so it can't return it with network.subnets.

The only thing that I can think of is that the API is being called twice for the same cidr. Is there any way that could happen?

[1] https://github.com/openstack/neutron/blob/080bf68895860e83b11a0daec8fb1cd96de4e09c/neutron/db/ipam_backend_mixin.py#L461-L469

Changed in neutron:
status:	New → Incomplete

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-28:

#4

Um, i guess i'm not 100% sure.

this tenant has no networks when i call the heat stack-create. as you can see from the template it has no duplicates. So it must therefore be matching against itself.
Either on one call, or 2 calls that are hitting different api servers @ the same instant. All the api calls here are from heat. I guess some tracing might help.

Its definitely a race condition. E.g. if i make a stack w/ 10 networks, each w/ 1 subnet, it will fail ~1 in 10 (not always the same subnet). If i do it w/ 100 networks, each w/ 1 subnet, it will fail ~90% of the time.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-29:

#5

https://gist.github.com/donbowman/f4f4b6ae721e3d1bb1961eaac31f9e4d

has the logs. Files are:

test.yaml --> run this with 'openstack stack create -t test.yaml fubar'
stack-output.txt --> the error that comes out (the overlap)
neutron-nubo-[1-3].log --> the logs from the 3 api servers which are round-robin served

Before I ran this test i dropped my neutron database and recreated, so it was empty. This was a single stack create, so there was for sure no overlapping subnet.

the neutron logs have debug+verbose on. I build neutron from github immediately before the test

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-29:

#6

In the logs you can also see this error:
DBAPIError exception wrapped from (pymysql.err.InternalError) (1305, u'SAVEPOINT sa_savepoint_1 does not exist') [SQL: u'ROLLBACK TO SAVEPOINT sa_savepoint_1']

which indicates a deadlock + unwind/restart.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-29:

#7

https://gist.github.com/donbowman/68be6387191d567342662b655ea92495

has 3 files:
nubo-[1-3]-a-nginx.log
these are the API logs via nginx, much easier to read than the neutron debug output

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-29:

#8

logs (neutron debug, nginx access, heat stack + output_ Edit (177.4 KiB, application/x-tar)

Attached are the output logs

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-29:

#9

logs (neutron debug, nginx access, heat stack + output_ Edit (177.4 KiB, application/x-tar)

Attached are the output logs

Revision history for this message

Miguel Lavalle (minsel) wrote on 2016-06-30:

#10

I have executed this script http://paste.openstack.org/show/524149/ about 20 times. No errors whatsoever. My system was built from master a couple of days ago. I have one controller node and 4 computes. Could it be that the API is being called twice for the same CIDR?

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-30:

#11

What about with > 1 controller node and round-robin load balanced?
That is where the error occurs for me.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-30:

#12

I have 3 API servers for heat, for neutron, for nova (and glance, cinder, horizon). All are round-robin load balanced via nginx front-ends.
If i delete 2 of the 3, then the error does not occur.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-30:

#13

also, i think that script will be dramatically slower than letting heat drive the API with all its workers.

Revision history for this message

Miguel Lavalle (minsel) wrote on 2016-06-30:

#14

I don't have that setup ready. In the meantime, I found this in regards to the tracebacks in the log file: https://bugs.launchpad.net/neutron/+bug/1590298. Do you have the fix indicated in this bug?

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-30:

#15

Yes i was looking @ that one too. I do have the fix in question.
If i switch to python 3.5 for the test, i find that the deadlock is occurring due to quota updates.

I managed to reproduce with a python script instead of heat. It takes a few more tries to get it to happen, but it did happen on the 4th time i tried.

import sys
from neutronclient.v2_0 import client
neutron = client.Client(username='don',password='xxx',tenant_name='don',auth_url='https://keystone.sandvine.rocks/v2.0')

ids = []
sids = []
for i in range(1,254):
network = {'name': 'pnet-%s' % i, 'admin_state_up': True}

    n = neutron.create_network({'network':network})
    id = n['network']['id']
    ids.append(id)
    srange="172.17.%u.9" % i
    erange="172.17.%u.254" % i
    cidr="172.17.%u.0/24" % i
    name = "subnet-%u" % i
    subnet = {'enable_dhcp': True, 'network_id': id, 'allocation_pools': [{'start': srange, 'end': erange}], 'ip_version': 4, 'cidr': cidr, 'name': name}
    s = neutron.create_subnet({'subnet':subnet})
    sid = s['subnet']['id']
    sids.append(sid)

for i in sids:
print("delete subnet %u" % i)
neutron.delete_subnet({'subnet':{'id': i}})

for i in ids:
neutron.delete_network({'network':{'id': i}})

Revision history for this message

Don Bowman (donbowman) wrote on 2016-06-30:

#16

https://gist.github.com/donbowman/76b9dbb81cd2e01e8abe9a5fbf42e713

reproduces it using just python. Not every run does, but at least one run in 5 i made did.
i have 3 api servers online, round-robin load balanced from nginx.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-07: Fix proposed to neutron (master)

#17

Fix proposed to branch: master
Review: https://review.openstack.org/339226

Changed in neutron:
assignee:	nobody → Kevin Benton (kevinbenton)
status:	Incomplete → In Progress

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-07:

#18

@Don,

Can you please try with the following patch?

https://review.openstack.org/339226

Armando Migliaccio (armando-migliaccio) on 2016-07-08

Changed in neutron:
milestone:	none → newton-2

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-08:

#19

tested patch.
It does not resolve issue. Same problem
| stack_status_reason | Resource CREATE failed: BadRequest: resources.sn125: Invalid input for operation: Requested subnet with cidr: 172.16.125.0/24 for network: 6d2b3077-7353-414e-a98f-79afa7058698 overlaps with |
| | another subnet.

comes out, even though I am clearly not creating that twice. I double checked that I applied the patch to all 3 API servers.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-08:

#20

Can you try one more time with the latest version of that patch? I made an adjustment to make sure it works with the latest changes to the engine management.

Also, I downloaded the logs from comment #9 and the neutron server logs don't contain any of the failed API requests. I can't see any failed "POST //v2.0/subnets.json" calls or even a successful one that mentions the network ID in the stack-output.txt error log. I also don't see the INFO message that is generated here: https://github.com/openstack/neutron/blob/213167552e5c5a3c546cbd0e74e6a1a18be0dcda/neutron/db/ipam_backend_mixin.py#L230-L235

Did you attach the logs from the right neutron servers?

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-08:

#21

logs (neutron debug, nginx access, heat stack + output_ Edit (177.4 KiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-08:

#22

see #21 logs are there for sure, don't know what happened before.

the subnets.json calls are in the nginx* logs, and i do see the good/bad ones in there now.

You can also see the
InternalError: (1305, u'SAVEPOINT sa_savepoint_1 does not exist')

which is referenced in https://bugs.launchpad.net/neutron/+bug/1590298. I do have that fix. E.g. my 'is_retriable' looks like:

def is_retriable(e):
    if _is_nested_instance(e, (db_exc.DBDeadlock, exc.StaleDataError,
                               db_exc.DBDuplicateEntry, db_exc.RetryRequest)):
        return True
    # looking savepoints mangled by deadlocks. see bug/1590298 for details.
    return _is_nested_instance(e, db_exc.DBError) and '1305' in str(e)

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-09:

#23

@Don,

Something is still missing from those logs. I grep them and there are no failed subnet requests in the Neutron server logs.

grep POST * | grep -v '1.1" 201'

That should return the failed subnet requests in neutron server logs but I only see the NGINX logs.

There are also no log messages about overlapping subnets. There should be one like this:

2016-07-08 14:55:52.027 11332 INFO neutron.db.ipam_backend_mixin [req-6d05506a-42c0-44c3-8109-fba412585609 admin -] Validation for CIDR: 10.0.0.0/24 failed - overlaps with subnet 338ab7b1-20ee-4af2-9da1-531029f3dce8 (CIDR: 10.0.0.0/24)

Is it possible for you to setup nginx to log which backend it's communicating with an do another run so we know which one is supposed to be failing?

Note that the savepoint errors don't indicate a complete failure, they just indicate that a request had to be retried due to a deadlock so the server they occur on may not be related to the one generating that overlap message.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-09:

#24

n1-neutron.log:2016-07-08 12:33:38.708 240 INFO neutron.db.ipam_backend_mixin [req-365067cb-f744-476b-adf4-ad9a874daf73 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] Validation for CIDR: 172.16.126.0/24 failed - overlaps with subnet c70ce42f-71cd-456a-89f6-1b49fd5e8f0c (CIDR: 172.16.126.0/24)

shows the failure, and the specific backend (the backend are n1...n3).

Is there something more i need to enable for logging? I do see the overlapping subnet in the log.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-09:

#25

logs-2016-07-08.tar.gz Edit (296.4 KiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-09:

#26

sigh, i think i reuploaded the same log files in #21.
#25 is a different size, and for sure i see that error @n1-neutron.log:2016-07-08 12:33:38.708

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-09:

#27

Thanks, and this run is with the second revision of https://review.openstack.org/#/c/339226/ right?

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-09:

#28

logs-2016-07-09.tar.gz Edit (462.6 KiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-09:

#29

yes that was with the second revision.
another set of logs in #28 also has a tcpdump on port 9696 on each api node (and a text dump of that) to show the raw protocol.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-10:

#30

I think I've tracked down the issue to agent-related queries causing DBDeadlocks that happen after the subnet creation (in this particular case updating the agent load value). This causes the API layer to use its retry mechanism even though the subnet has been persisted to the DB, which leads to the overlapping error.

Please try again with the latest version of https://review.openstack.org/#/c/339226/. It has a crude fix, but it should cause the issue to go away to confirm my suspicions.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-10:

#31

logs-2016-07-10.tar.gz Edit (692.4 KiB, application/x-tar)

Well... i tried it.

I get the same error out. I did not get that log message to fire. Attached the logs.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-10:

#32

Is it possible to just do another run and capture the logs? It seems like there is some missing data (maybe a bug in oslo logging or in the logging collector?).

I see in the PCAP that an HTTP 500 was returned by neutron for request 'req-2530b6c6-7c5b-4b21-92e1-c5dcd034fa15' but there are no Neutron logs that show the traceback and the 500 being returned. Also the requests associated with the overlapping subnets are missing log statements from various components (e.g. one starts immediately at a retry).

(Sorry about all of these requests but I haven't been able to reproduce this behavior locally with 3 Neutron API servers behind nginx)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#33

will do.

FWIW i have 3 nginx as well, not sure if that matters(?)

My architecture is:

1 VIP (keepalived DR-mode) points to port 443 on 3 nginx servers
each nginx server has neutron1...3 in its list (and all other OpenStack services)

so when i do a curl https://neutron.sandvine.rocks/, it will hit one of N nginx, and that will forward to one of N API services.

I am using mysql+galera as 1+1+1 as well, and any or all of them will hit one of the mysql backends.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#34

logs-2016-07-10-1.tar.gz Edit (640.7 KiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#35

In #34 is another run w/ the logs, w/ change from #30.

The subnet w/ the error this time is : ed204a09-8e39-4aa1-8b02-9503c2655953 (172.16.72.0/24) and there is activity for it on n1 and n3 simultaneously:

grep -l ed204a09-8e39-4aa1-8b02-9503c2655953 *log *txt
n1-neutron.log
n2-neutron-nginx.log
n3-neutron.log
n1-neutron.pcap.txt
n3-neutron.pcap.txt

so n2 nginx dealt two requests for this subnet, 1 to n1, one to n3.

The n1 req (from pcap):

010.100.010.061.51330-172.017.000.003.09696: POST //v2.0/subnets.json HTTP/1.1
X-Forwarded-For: 172.17.0.5
X-Real-IP: 172.17.0.5
X-Forwarded-Host: neutron.sandvine.rocks
X-Forwarded-Server: neutron.sandvine.rocks
X-Forwarded-Proto: https
Host: neutron.sandvine.rocks
Connection: Keep-Alive
Content-Length: 241
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-neutronclient
Content-Type: application/json
X-Auth-Token: 1100c1044472418999b542fc41cd8e10

{"subnet": {"enable_dhcp": true, "network_id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "dns_nameservers": [], "allocation_pools": [{"start": "172.16.72.9", "end": "172.16.72.254"}], "ip_version": 4, "cidr": "172.16.72.0/24", "name": "sn72"}}

and the n3 req:

{"network": {"status": "ACTIVE", "subnets": [], "availability_zone_hints": [], "availability_zones": [], "name": "n72", "provider:physical_network": null, "admin_state_up": true, "updated_at": "2016-07-11T02:47:41", "tenant_id": "469300d96f9f406cacda5f001d387e52", "description": "", "tags": [], "ipv6_address_scope": null, "mtu": 8950, "router:external": false, "ipv4_address_scope": null, "vlan_transparent": false, "shared": false, "created_at": "2016-07-11T02:47:41", "provider:network_type": "vxlan", "id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "provider:segmentation_id": 12}}
010.100.010.061.46104-172.017.000.005.09696: GET //v2.0/networks/ed204a09-8e39-4aa1-8b02-9503c2655953.json HTTP/1.1
{"network": {"status": "ACTIVE", "subnets": [], "availability_zone_hints": [], "availability_zones": [], "name": "n72", "provider:physical_network": null, "admin_state_up": true, "updated_at": "2016-07-11T02:47:41", "tenant_id": "469300d96f9f406cacda5f001d387e52", "created_at": "2016-07-11T02:47:41", "tags": [], "ipv6_address_scope": null, "provider:segmentation_id": 12, "mtu": 8950, "router:external": false, "ipv4_address_scope": null, "vlan_transparent": false, "shared": false, "provider:network_type": "vxlan", "id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "description": ""}}
010.100.010.061.46142-172.017.000.005.09696: GET //v2.0/networks.json?fields=id&id=ed204a09-8e39-4aa1-8b02-9503c2655953 HTTP/1.1
{"networks": [{"id": "ed204a09-8e39-4aa1-8b02-9503c2655953"}]}

Resource CREATE failed: BadRequest: resources.sn253: Invalid input for operation: Requested subnet with cidr: 172.16.253.0/24 for |
| | network: b1ef80c2-8236-4a24-a06f-1cfc6b26d2dd overlaps with another subnet.

also failed, i guess @ the same time.

In #34 is another run w/ the logs, w/ change from #30.

The subnet w/ the error this time is : ed204a09-8e39-4aa1-8b02-9503c2655953 (172.16.72.0/24) and there is activity for it on n1 and n3 simultaneously:

grep -l ed204a09-8e39-4aa1-8b02-9503c2655953 *log *txt
n1-neutron.log
n2-neutron-nginx.log
n3-neutron.log
n1-neutron.pcap.txt
n3-neutron.pcap.txt

so n2 nginx dealt two requests for this subnet, 1 to n1, one to n3.

The n1 req (from pcap):

010.100.010.061.51330-172.017.000.003.09696: POST //v2.0/subnets.json HTTP/1.1
X-Forwarded-For: 172.17.0.5
X-Real-IP: 172.17.0.5
X-Forwarded-Host: neutron.sandvine.rocks
X-Forwarded-Server: neutron.sandvine.rocks
X-Forwarded-Proto: https
Host: neutron.sandvine.rocks
Connection: Keep-Alive
Content-Length: 241
Accept-Encoding: gzip, deflate
Accept: application/json
User-Agent: python-neutronclient
Content-Type: application/json
X-Auth-Token: 1100c1044472418999b542fc41cd8e10

{"subnet": {"enable_dhcp": true, "network_id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "dns_nameservers": [], "allocation_pools": [{"start": "172.16.72.9", "end": "172.16.72.254"}], "ip_version": 4, "cidr": "172.16.72.0/24", "name": "sn72"}}

and the n3 req:

{"network": {"status": "ACTIVE", "subnets": [], "availability_zone_hints": [], "availability_zones": [], "name": "n72", "provider:physical_network": null, "admin_state_up": true, "updated_at": "2016-07-11T02:47:41", "tenant_id": "469300d96f9f406cacda5f001d387e52", "description": "", "tags": [], "ipv6_address_scope": null, "mtu": 8950, "router:external": false, "ipv4_address_scope": null, "vlan_transparent": false, "shared": false, "created_at": "2016-07-11T02:47:41", "provider:network_type": "vxlan", "id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "provider:segmentation_id": 12}}
010.100.010.061.46104-172.017.000.005.09696: GET //v2.0/networks/ed204a09-8e39-4aa1-8b02-9503c2655953.json HTTP/1.1
{"network": {"status": "ACTIVE", "subnets": [], "availability_zone_hints": [], "availability_zones": [], "name": "n72", "provider:physical_network": null, "admin_state_up": true, "updated_at": "2016-07-11T02:47:41", "tenant_id": "469300d96f9f406cacda5f001d387e52", "created_at": "2016-07-11T02:47:41", "tags": [], "ipv6_address_scope": null, "provider:segmentation_id": 12, "mtu": 8950, "router:external": false, "ipv4_address_scope": null, "vlan_transparent": false, "shared": false, "provider:network_type": "vxlan", "id": "ed204a09-8e39-4aa1-8b02-9503c2655953", "description": ""}}
010.100.010.061.46142-172.017.000.005.09696: GET //v2.0/networks.json?fields=id&id=ed204a09-8e39-4aa1-8b02-9503c2655953 HTTP/1.1
{"networks": [{"id": "ed204a09-8e39-4aa1-8b02-9503c2655953"}]}

Resource CREATE failed: BadRequest: resources.sn253: Invalid input for operation: Requested subnet with cidr: 172.16.253.0/24 for         |
|                       | network: b1ef80c2-8236-4a24-a06f-1cfc6b26d2dd overlaps with another subnet.

also failed, i guess @ the same time.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-11:

#36

There is a big gap in the n2 neutron logs that is hiding the original requests from several retries. The two statements below are 24 seconds apart.

2016-07-11 02:47:41.528 248 ERROR oslo_db.sqlalchemy.exc_filters File "/usr/local/lib/python2.7/dist-packages/pymysql/cursors.py", line 161, in execute
2016-07-11 02:48:05.043 242 DEBUG neutron.quota.resource [req-9213bd6a-eb41-4c3e-9b03-6ff7daf3b668 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] Quota usage for subnet was recalculated. Used quota:18. count /usr/local/lib/python2.7/dist-packages/neutron/quota/resource.py:295

Can you try using the log_dir and log_file options for oslo log in case the stderr printing is being rate limited by something? I'm having difficulty finding a single failed request ID that can be tracked from its initial request all of the way to a response because there is so much lost log data.

An example is '172.16.170.0/24'. You can see in the pcap that it got an overlap error but there is nothing in the logs about receiving a create request for that payload or even that it generated an overlap error for that CIDR.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#37

logs-2016-07-11.tar.gz Edit (62.4 MiB, application/x-tar)

attached.
I changed:
log_dir = /var/log/neutron
log_file = neutron.log
use_stderr = false

and restarted the 3 neutron api.

Note the error reported to heat was different:
Resource CREATE failed: DBDeadlock: resources.sn0: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')

but there is still the overlaps failure:

Validation for CIDR: 172.16.41.0/24 failed - overlaps with subnet 0bce614f-9ad4-4602-96ec-fa22a3508d7c (CIDR: 172.16.41.0/24)

in the pcap, i made it not just port 9696 (so u can see the mysql too). I decoded to text, but this is likely easier viewed in wireshark with 'follow' on the port 3306 vs 9696

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#38

ps log rate limited by journald
http://manpages.ubuntu.com/manpages/xenial/man5/journald.conf.5.html

RateLimitInterval=30s
RateLimitBurst=1000

can be overridden. I'll know for next time!

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-11:

#39

Thanks! Can you do me a favor and run one more time with the latest version of https://review.openstack.org/339226 . I added some logging in there so we can see the source of the original retriable exception.

Now that we have full logs I probably don't need the pcaps anymore if that makes it easier for you.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-11:

#40

logs-2016-07-11-1.tar.gz Edit (391.5 KiB, application/x-tar)

attached.
patch installed.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-12:

#41

Excellent. try the latest version of https://review.openstack.org/339226

The error is in the quota committing logic.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-12:

#42

logs-2016-07-11-2.tar.gz Edit (461.5 KiB, application/x-tar)

Well, new behaviour.
No error, stack create has been going for 10min.

It created 255 networks, and 255 subnets, as expected, no error.
But it doesn't finish.
Attached logs after 15 min of sitting.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-12:

#43

logs-2016-07-10-3.tar.gz Edit (1.3 MiB, application/x-tar)

additionally, the delete gave an error, attached.

Resource DELETE failed: DBError: resources.n237: This Connection is closed

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-12:

#44

logs-2016-07-11-4.tar.gz Edit (1.7 MiB, application/x-tar)

and a different failure, this time a 500 on
'cf651054-13cd-4ca7-a92b-2b10c810d1b2'

DBError: This Connection is closed

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-12:

#45

I'm not sure that error is coming from Neutron. Can you check the heat engine logs to see if you can get the full traceback?

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-12:

#46

Whoops. nevermind. looking into it.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-12:

#47

Ok. Give the latest version of the patch another try. I think that resource closed error might be masking another exception though...

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-12:

#48

logs-2016-07-11-5.tar.gz Edit (253.6 KiB, application/x-tar)

attached.
i reapplied the patch from scratch to confirm i did it properly.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-12:

#49

Found another query that needed to ensure the session was there. Retry with https://review.openstack.org/#/c/339226/8/

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-12:

#50

logs-2016-07-12.tar.gz Edit (303.2 KiB, application/x-tar)

logs attached. gets a networknotfound error.
Resource CREATE failed: NetworkNotFoundClient: resources.n244: Network 09f06d5a-516d-4efa-b049-0858b2160064 could not be found.
Neutron server returns request_ids: ['req-577218f9-b371-4ad0-9f83-7227430dc0e6']

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-13:

#51

Give it another shot with https://review.openstack.org/#/c/339226/9 . I suspect its an error in the policy engine that we aren't seeing.

(BTW, thanks for helping with this. We have quite a few more db contentious operations and compare and swap logic for better consistency but it results in more deadlock risk, especially in galera, so we need to get these cases shaken out)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-13:

#52

logs-2016-07-13.tar.gz Edit (3.4 MiB, application/x-tar)

here is a set of logs. same as in #42, it appears to have worked correctly, extremely slowly, but heat doesn't think it is done.

it seems to take ~4-5s/network on the agent (n4 is the dhcp+l3 agent in these logs, it does not run api), so for 255 networks its *hugely* slow.

I see a lot of errors around the deadlock mechanism, but i don't see un-retried errors (but i didn't look that hard).

I will run again.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-13:

#53

forgot to mention. pre patch in #51 i let it run overnight, heat never found that neutron finished.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-13:

#54

logs-2016-07-13-1.tar.gz Edit (5.9 MiB, application/x-tar)

attached is the stack delete of the above stack create.
This caused a failure from neutron:
Resource DELETE failed: DBError: resources.n119: This Connection is closed

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-13:

#55

The "This Connection is closed" error is coming from the heat engine. That one doesn't seem to have anything to do with Neutron.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-13:

#56

The slowness is definitely a result of lots of retries. I think we can look at that as a separate bug though. First I want to get it where everything is correct then worry about performance.

You should get a significant improvement just from using a single-writer deployment for Galera. Galera manifests failures as latent deadlocks on commits when using multiple writers so they are very expensive to recover from. With a single writer, constraint violations and staledata errors are recognized locally before it comes time to commit.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-13:

#57

the primary source of the slowness is on the agent. the database is done, all the entries are there, but now the agent is one by one applying them, which involves netns create, ip route, ip mtu, ovs flush etc, each as a subshell.

I'll take a look @ the heat logs, I let it run all day and it did not discover that it was complete (even w/o that error).

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-13:

#58

on the heat delete, I usually get errors like:
Resource DELETE failed: DBError: resources.n149: This Connection is closed

2016-07-13 23:02:14.582 1562 DEBUG heat.common.serializers [req-ad85b455-9fdc-4ecd-affb-642fb9302153 - don - default default] JSON response : {"stack": {"parent": null, "disable_rollback": true, "description": "Test subnet creation\n", "parameters": {"OS::project_id": "469300d96f9f406cacda5f001d387e52", "OS::stack_id": "1daa0af0-3795-4123-a2d5-2575d86bd6d0", "OS::stack_name": "t23"}, "deletion_time": null, "stack_name": "t23", "stack_user_project_id": "469300d96f9f406cacda5f001d387e52", "tags": null, "creation_time": "2016-07-13T12:46:25", "links": [{"href": "https://heat.sandvine.rocks/v1/469300d96f9f406cacda5f001d387e52/stacks/t23/1daa0af0-3795-4123-a2d5-2575d86bd6d0", "rel": "self"}], "capabilities": [], "notification_topics": [], "timeout_mins": null, "stack_status": "DELETE_FAILED", "stack_owner": null, "updated_time": "2016-07-13T23:00:58", "id": "1daa0af0-3795-4123-a2d5-2575d86bd6d0", "stack_status_reason": "Resource DELETE failed: DBError: resources.n149: This Connection is closed", "template_description": "Test subnet creation\n"}} to_json /usr/local/lib/python2.7/dist-packages/heat/common/serializers.py:40

i'll dig more

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#59

logs-16-07-14-03:44:04.tar.gz Edit (26.5 MiB, application/x-tar)

OK I have switched galera to a single writer using MaxScale.

it generally works on most creates, but there are some failures. Attached is an example

Resource CREATE failed: InternalServerError: resources.sn211: Request Failed: internal server error while processing your request.
Neutron server returns request_ids: ['req-07fc7999-dbdb-4c60-a338-7945467b357b']

the neutron backend is not responding in time and nginx is timing it out (perhaps a different sort of deadlock?)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#60

whoops, forgot the neutron logs, see attached

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#61

logs-16-07-14-03:55:20.tar.gz Edit (37.1 MiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#62

logs-16-07-14-04:33:04.tar.gz Edit (1.7 MiB, application/x-tar)

Download full text (4.6 KiB)

And one more.
Here the l3-agent (n4) gets
RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1045, u'failed to create new session') [SQL: u'SELECT 1'], and the api agents went non-responsive towards nginx

this seems to occur after a bunch of

2016-07-14 04:31:43.687 3995 DEBUG neutronclient.client [req-8678bb34-66ce-410a-b764-a30b5e2470f1 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] REQ: curl -i https://neutron.sandvine.rocks//v2.0/networks.json -X POST -H "User-Agent: python-neutronclient" -H "X-Auth-Token: {SHA1}e186ee7429bd4264f0a47d97e1f53f5ea52fc124" -d '{"network": {"shared": false, "name": "t28_n97", "admin_state_up": true}}' http_log_req /usr/local/lib/python2.7/dist-packages/neutronclient/common/utils.py:185
[332B blob data]
[44B blob data]
[23B blob data]
[42B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
http_log_resp /usr/local/lib/python2.7/dist-packages/neutronclient/common/utils.py:194
[198B blob data]
[44B blob data]
[23B blob data]
[42B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
_handle_fault_response /usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py:247

has occurred in heat.

_handle_fault_response /usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py:247
2016-07-14 04:31:43.707 3995 INFO heat.engine.resource [req-8678bb34-66ce-410a-b764-a30b5e2470f1 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] CREATE: Net "n97" Stack "t28" [5f6fde63-931b-4bcc-bf10-66ebdba750e9]
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource Traceback (most recent call last):
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 732, in _action_recorder
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource yield
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 812, in _do_action
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 312, in wrapper
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource step = next(subtask)
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in action_handler_task
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource handler_data = handler(*args)
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/heat/engine/resources/openstack/neutron/net.py", line 155, in handle_create
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource net = self.client().create_network({'network': props})['network']
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 648, in create_network
2016-07-14 04:31:43.707 3995 ERROR heat.engine.resource return self.post(self.networks_path, body=body)
2016-...

OK i bumped my mysql connections count to 4K / server, and ran some more tests.

Generally a single heat stack will complete correctly.
If 2 are started @ the same time, it will generally fail.
if 3, all will fail in an exception from neutronclient around subnet creation, as in
logs-16-07-14-13:31:04.tar.gz and snipped below.

- - -] Task create from Subnet "sn1" Stack "t31" [dbd77b02-d5c1-45e0-96de-7a3d9a954e1b] running step /usr/local/lib/python2.7/dist-packages/heat/engine/scheduler.py:226
[332B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
 http_log_resp /usr/local/lib/python2.7/dist-packages/neutronclient/common/utils.py:194
[198B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
 _handle_fault_response /usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py:247
2016-07-14 13:29:50.827 3982 DEBUG neutronclient.client [req-020e7186-c09b-4897-8b5d-faacb75682b9 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] RESP: 201 {'Content-Length': '558', 'Strict-Transport-Security': 'max-age=15768000', 'Server': 'nubo-1', 'Connection': 'keep-alive', 'Date': 'Thu, 14 Jul 2016 13:29:50 GMT', 'Content-Type': 'application/json', 'X-Openstack-Request-Id': 'req-f1986214-98c0-4690-9d09-215876381373'} {"subnet": {"description": "", "enable_dhcp": true, "network_id": "5542d6e9-7624-4242-bbd2-7743741c021c", "tenant_id": "469300d96f9f406cacda5f001d387e52", "created_at": "2016-07-14T13:29:14", "dns_nameservers": [], "updated_at": "2016-07-14T13:29:14", "gateway_ip": "172.16.186.1", "ipv6_ra_mode": null, "allocation_pools": [{"start": "172.16.186.9", "end": "172.16.186.254"}], "host_routes": [], "ip_version": 4, "ipv6_address_mode": null, "cidr": "172.16.186.0/24", "id": "34d3d980-3429-4054-8f85-7a8b24d2f321", "subnetpool_id": null, "name": "t31_sn186"}} http_log_resp /usr/local/lib/python2.7/dist-packages/neutronclient/common/utils.py:194
2016-07-14 13:29:50.828 3988 INFO heat.engine.resource [req-020e7186-c09b-4897-8b5d-faacb75682b9 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] CREATE: Subnet "sn72" Stack "t31" [dbd77b02-d5c1-45e0-96de-7a3d9a954e1b]
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource Traceback (most recent call last):
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 732, in _action_recorder
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     yield
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 812, in _do_action
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 312, in wrapper
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     step = next(subtask)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in action_handler_task
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     handler_data = handler(*args)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resources/openstack/neutron/subnet.py", line 361, in handle_create
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     subnet = self.client().create_subnet({'subnet': props})['subnet']
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 669, in create_subnet
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     return self.post(self.subnets_path, body=body)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 347, in post
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     headers=headers, params=params)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 282, in do_request
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     self._handle_fault_response(status_code, replybody, resp)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 257, in _handle_fault_response
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     exception_handler_v20(status_code, error_body)
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 84, in exception_handler_v20
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource     request_ids=request_ids)
[87B blob data]
[105B blob data]
[79B blob data]
[103B blob data]
[99B blob data]
[64B blob data]
[64B blob data]
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource 
2016-07-14 13:29:50.828 3988 ERROR heat.engine.resource 
[332B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
 http_log_resp /usr/local/lib/python2.7/dist-packages/neutronclient/common/utils.py:194
[198B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]
 _handle_fault_response /usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py:247
2016-07-14 13:29:50.834 3981 INFO heat.engine.resource [req-020e7186-c09b-4897-8b5d-faacb75682b9 a4aab8472ebb41faa98f89e1dd34858a 469300d96f9f406cacda5f001d387e52 - - -] CREATE: Subnet "sn236" Stack "t31" [dbd77b02-d5c1-45e0-96de-7a3d9a954e1b]
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource Traceback (most recent call last):
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 732, in _action_recorder
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     yield
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 812, in _do_action
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 312, in wrapper
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     step = next(subtask)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in action_handler_task
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     handler_data = handler(*args)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/heat/engine/resources/openstack/neutron/subnet.py", line 361, in handle_create
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     subnet = self.client().create_subnet({'subnet': props})['subnet']
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 669, in create_subnet
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     return self.post(self.subnets_path, body=body)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 347, in post
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     headers=headers, params=params)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 282, in do_request
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     self._handle_fault_response(status_code, replybody, resp)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 257, in _handle_fault_response
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     exception_handler_v20(status_code, error_body)
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource   File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 84, in exception_handler_v20
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource     request_ids=request_ids)
[87B blob data]
[105B blob data]
[79B blob data]
[103B blob data]
[99B blob data]
[64B blob data]
[64B blob data]
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource 
2016-07-14 13:29:50.834 3981 ERROR heat.engine.resource 
[266B blob data]
[49B blob data]
[23B blob data]
[47B blob data]
[43B blob data]
[8B blob data]
[8B blob data]

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#64

logs-16-07-14-13:31:04.tar.gz Edit (1.8 MiB, application/x-tar)

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-14:

#65

the python script in #15, formerly running a couple of those would cause a failure.
4 have been running all day w/ the patch w/o issue.

I'm not sure why heat is different. Faster? interaction w/ RPC? Other?

Armando Migliaccio (armando-migliaccio) on 2016-07-15

Changed in neutron:
milestone:	newton-2 → newton-3

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-16:

#66

logs-16-07-16-16.tar.gz Edit (16.8 MiB, application/x-tar)

OK after 3 days, one of the python scripts [4 running in parallel on different clients] hit the same error as heat.

Traceback (most recent call last):
  File "repo.py", line 30, in <module>
    neutron.delete_network(i)
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 656, in delete_network
    return self.delete(self.network_path % (network))
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 338, in delete
    headers=headers, params=params)
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 319, in retry_request
    headers=headers, params=params)
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 282, in do_request
    self._handle_fault_response(status_code, replybody, resp)
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 257, in _handle_fault_response
    exception_handler_v20(status_code, error_body)
  File "/cephfs/home/don/src/openstack-client/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 84, in exception_handler_v20
    request_ids=request_ids)
neutronclient.common.exceptions.InternalServerError: Request Failed: internal server error while processing your request.
Neutron server returns request_ids: ['req-0e561b63-8667-4e49-8029-0ad0ec0b4ede']

some logs in attached

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-19:

#67

Hey,

Sorry about the delay. Looking into the logs from the last several messages.

To answer your question about heat vs the scripts, heat can be experiencing more failures because heat itself makes use of mysql itself to store task states and when you have multiple workers talking to different writers in a galera cluster, things that should have blocked on a record lock instead manifest as a Deadlock error on commit.

Basically it can be experiencing failures for the same reason Neutron is. Openstack doesn't get any in-tree testing of multi-writer environments so there are places where failures can slip through or retries get a lot more expensive than developers initially anticipated.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-19:

#68

Regarding the agent comment. I looked at the logs and the HTTP responses are very slow (likely because of the DB retries). The agents don't block the API response time so even if the agents were running very slowly you shouldn't experience any API failures when creating subnets and networks.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-19: Related fix proposed to neutron (master)

#69

Related fix proposed to branch: master
Review: https://review.openstack.org/344013

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-07-19:

#70

Ok, the error in number 66 appears to be due to a lost connection to the DB[1]. I've adjusted https://review.openstack.org/#/c/339226/ to also catch those errors so it can try to reconnect.

Most of the contention seems to be related to the quota engine since this is all under one tenant. I have some separate fixes I'm going to work on for that.

Can you try two things for me? One run with the latest version of 339226 and then another with the same patch and the quotas set to unlimited for that tenant (-1 for networks, subnets, and ports) since that will effectively skip a bunch of the contentious reservation logic.

1. error is "DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')"

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-19: Related fix merged to neutron (master)

#71

Reviewed: https://review.openstack.org/344013
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c46edbc7d62585b0181dfe4bb72c1be0620fc937
Submitter: Jenkins
Branch: master

commit c46edbc7d62585b0181dfe4bb72c1be0620fc937
Author: Kevin Benton <email address hidden>
Date: Mon Jul 18 22:39:19 2016 -0600

Use db_api.retry_db_errors in quota engine

    The quota engine was still using oslo_db wrap_db_retry
    which does not automatically take into account deadlocks
    that occur inside of nested savepoint transactions.

    In one case it didn't matter because it passed in the correct
    exception checker but in the one that protected 'set_quota_usage'
    it did not use is_retriable so a deadlock inside of the savepoint
    would have resulted in a much more expensive retry all of the way
    up at the API layer.

This patch just adjusts them to both use the standard neutron
retry_db_errors decorator.

Change-Id: I1e45eb15f14bf35881e5b1dce77733e831e9c6b1
Related-Bug: #1596075

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-23: Fix proposed to neutron (master)

#72

Fix proposed to branch: master
Review: https://review.openstack.org/346288

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-23:

#73

Fix proposed to branch: master
Review: https://review.openstack.org/346289

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-07-24: Fix merged to neutron (master)

#74

Reviewed: https://review.openstack.org/346288
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3213dd884e15d1e2b2209aed3aed82cd0213a677
Submitter: Jenkins
Branch: master

commit 3213dd884e15d1e2b2209aed3aed82cd0213a677
Author: Kevin Benton <email address hidden>
Date: Fri Jul 22 09:56:30 2016 -0700

Mark DBConnectionError as retriable

    DBConnectionError can be thrown if the connection
    to the database gets closed (e.g. sql restart or
    network connectivity failure). This should be part
    of our standard list of exceptions to retry so we
    are resillient to intermittent connectivity issues.

    The olso db wrap db retry decorator has this option
    as well[1] but we can't use it because our logic to
    detect retriable errors in MultipleException types.

1. https://github.com/openstack/oslo.db/blob/a79479088029e4fa51def91cb36bc652356462b6/oslo_db/api.py#L123

Partial-Bug: #1596075
Change-Id: I26423e306173d956667b9d1e40f07f2b5bc8d805

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-24:

#75

logs-16-07-24-20:26:54.tar.gz Edit (1.5 MiB, application/x-tar)

OK, i applied the 3 patches [i did a new pull first] and re-ran w/ the quota still in place.

I got an API timeout on one of the subnets (see attached logs).

2016/07/24 20:20:04 [error] 4876#4876: *266186 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.17.0.3, server: neutron.sandvine.rocks, request: "POST //v2.0/subnets.json HTTP/1.1", upstream: "http://10.100.10.62:9696//v2.0/subnets.json", host: "neutron.sandvine.rocks"

I will re-run w/ the quota disabled.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-24:

#76

OK w/ the quota disabled, I was able to create two of of the heat stacks simultaneously.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-07-24:

#77

logs-16-07-24-21:49:20.tar.gz Edit (6.8 MiB, application/x-tar)

OK, more tries. with quota set to -1, and creating several heat stacks simultaneously, it will still fail.

Ihar Hrachyshka (ihar-hrachyshka) on 2016-07-25

tags:

added: neutron-proactive-backport-potential

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-08-10:

#78

Waiting on getting some work merged into master then I will have you retest.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-08-12:

#79

Reviewed: https://review.openstack.org/339226
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3ad23f42c7b1fce8e34fce651e03614f28740868
Submitter: Jenkins
Branch: master

commit 3ad23f42c7b1fce8e34fce651e03614f28740868
Author: Kevin Benton <email address hidden>
Date: Thu Jul 7 00:44:59 2016 -0700

Always start transactions in quota cleanup methods

    If the previous action that let to the quota reservation
    cancelling was the result of a database connection getting
    interrupted, attempting to query without calling session.begin()
    will result in a sqlalchemy.exc.ResourceClosedError.

    This alters the quota methods that mutate DB state to use a
    transaction with the new oslo DB enginefacade decorators that
    start a transaction for us.

    Partial-Bug: #1596075
    Partially-Implements: blueprint enginefacade-switch
    Change-Id: I3d0539b11795cbcf97e70e1ec39013221a00d6d5

Revision history for this message

Don Bowman (donbowman) wrote on 2016-08-22:

#80

I'm not clear on whether I'm to test w/ solely this last commit, w/ all of them, to wait for a bit more, ?

Armando Migliaccio (armando-migliaccio) on 2016-09-01

Changed in neutron:
milestone:	newton-3 → newton-rc1

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-02: Change abandoned on neutron (master)

#81

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/346289
Reason: moving this stuff down into the plugin layer here: I7b8a4a105aabfa1b5f5dd7a638099007b0933e66

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-08: Fix proposed to neutron (master)

#82

Fix proposed to branch: master
Review: https://review.openstack.org/367179

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-08:

#83

Fix proposed to branch: master
Review: https://review.openstack.org/367180

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-09: Fix merged to neutron (master)

#84

Reviewed: https://review.openstack.org/356530
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=09c87425fa028dbe669e8c215e334297ccbf1c2a
Submitter: Jenkins
Branch: master

commit 09c87425fa028dbe669e8c215e334297ccbf1c2a
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:27:49 2016 -0700

Prepare retry decorator to move to plugin level

    Retrying mutating operations at the API layer caused a
    couple of problems. First, when components would call
    the core plugin using the neutron manager, they would
    have to handle the retriable failures on their own or
    undo any work they had done so far and allow retriable
    failures to be propagated up to the API. Second, retrying
    at the API makes composite operations (e.g. auto allocate,
    add_router_interface, etc) painful because they have to
    consider if exceptions are retriable before raising
    fatal exceptions on failures of core plugin calls.

    This patch begins the process of moving them down to the
    core operations with a new decorator called
    'retry_if_session_inactive', which ensures that the
    retry logic isn't triggered if there is an ongoing transaction
    since retrying inside of a transaction is normally ineffective.
    Follow-up patches apply them to various parts of the code-base.

    Additionally, the args and kwargs of the method are automatically
    deep copied in retries to ensure that any mangling the methods
    do to their arguments don't impact their retriability.

    Finally, since we are leaving the API decorators in place for now,
    the retry logic will not be triggered by another decorator if an
    exception has already been retried. This prevents an exponential
    explosion of retries on nested retry decorators.

The ultimate goal will be to get rid of the API decorators entirely
so retries are up to each individual plugin.

    Partial-Bug: #1596075
    Partial-Bug: #1612798
    Change-Id: I7b8a4a105aabfa1b5f5dd7a638099007b0933e66

Reviewed:  https://review.openstack.org/356530
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=09c87425fa028dbe669e8c215e334297ccbf1c2a
Submitter: Jenkins
Branch:    master

commit 09c87425fa028dbe669e8c215e334297ccbf1c2a
Author: Kevin Benton <kevin@benton.pub>
Date:   Wed Sep 7 18:27:49 2016 -0700

Prepare retry decorator to move to plugin level
    
    Retrying mutating operations at the API layer caused a
    couple of problems. First, when components would call
    the core plugin using the neutron manager, they would
    have to handle the retriable failures on their own or
    undo any work they had done so far and allow retriable
    failures to be propagated up to the API. Second, retrying
    at the API makes composite operations (e.g. auto allocate,
    add_router_interface, etc) painful because they have to
    consider if exceptions are retriable before raising
    fatal exceptions on failures of core plugin calls.
    
    This patch begins the process of moving them down to the
    core operations with a new decorator called
    'retry_if_session_inactive', which ensures that the
    retry logic isn't triggered if there is an ongoing transaction
    since retrying inside of a transaction is normally ineffective.
    Follow-up patches apply them to various parts of the code-base.
    
    Additionally, the args and kwargs of the method are automatically
    deep copied in retries to ensure that any mangling the methods
    do to their arguments don't impact their retriability.
    
    Finally, since we are leaving the API decorators in place for now,
    the retry logic will not be triggered by another decorator if an
    exception has already been retried. This prevents an exponential
    explosion of retries on nested retry decorators.
    
    The ultimate goal will be to get rid of the API decorators entirely
    so retries are up to each individual plugin.
    
    Partial-Bug: #1596075
    Partial-Bug: #1612798
    Change-Id: I7b8a4a105aabfa1b5f5dd7a638099007b0933e66

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-12:

#85

Reviewed: https://review.openstack.org/367179
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=acbabaa3db2c12a41c52662a82e9a900d5929846
Submitter: Jenkins
Branch: master

commit acbabaa3db2c12a41c52662a82e9a900d5929846
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:42:40 2016 -0700

Make l2/l3 operations retriable at plugin level

    This adds decorators to ML2/db_base_plugin_v2 and the L3
    DB modules to make L2 and L3 resource operations retriable
    at the plugin level.

    Retrying the L2 operations in particular at this level should
    go a long way to improving the reliability of complex operations
    that require L2 resources in the face of deadlocks under high
    concurrency.

    Partial-Bug: #1612798
    Partial-Bug: #1596075
    Change-Id: I3c9437f7ecdd5ebb188b622144b7bd7bed74231d

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-13:

#86

Reviewed: https://review.openstack.org/367180
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=12420c1585abadc741440dcd48a5540ef85357db
Submitter: Jenkins
Branch: master

commit 12420c1585abadc741440dcd48a5540ef85357db
Author: Kevin Benton <email address hidden>
Date: Wed Sep 7 18:51:58 2016 -0700

Mark quota operations as retriable

    This decorates the quota system operations with
    the retry decorators. This will help significantly
    with the bug this marks as closed since operations
    in the quota engine after commit should no longer
    trigger retries of the full API operation.

    The logic to find the args in the decorator had
    to be adjusted to deal with functions already decorated.
    This just uses the getargspec function from pecan that
    deals with decorated functions.

    Partial-Bug: #1612798
    Closes-Bug: #1596075
    Change-Id: Ib786117dcea08af75551770ea4c30d460382b829

Changed in neutron:
status:	In Progress → Fix Released

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-09-16:

#87

Hi Don,

Can you test with the latest state of master?

Revision history for this message

Don Bowman (donbowman) wrote on 2016-09-19:

#88

i have tried, i'm getting an error in the schema which is unlikely to be related to this u"Unknown column 'trunks_1.admin_state_up' in 'field list'").
I'll debug and try more this week.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-09-19:

#89

Sounds like you might need a schema migration "neutron-db-manage --config-file /etc/neutron/neutron.conf upgrade head"

Revision history for this message

Don Bowman (donbowman) wrote on 2016-09-24:

#90

ok database dropped and created fresh.
Updated from head.
galera configured for single writer.

ran single test (e.g. 1 heat create of 255 networks each w/ subnet) -- Passed
ran stack delete -- Passed
(ran 2 more times, Passed)

ran 3 parallel tests: fail (timeout)
increased timeouts in nginx front-end of API, increased workers.
pass

so it looks good. Thanks for the assistance.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-09-24:

#91

I might have spoken too soon. I'm getting some errors on deleting (when i delete a large number simultaneously, they work when issued serially).

Digging in on why, its a deadlock that exceeds the retriable limit.

: DBDeadlock: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'INSERT INTO ipallocations (port_id, ip_address, subnet_id, network_id) VALUES (%(port_id)s, %(ip_address)s, %(subnet_id)s, %(network_id)s)'] [parameters: {'network_id': u'5e97298b-f175-493f-afc6-4492adeb3a9d', 'subnet_id': u'be0f3101-ff62-4ac0-901c-fc60bf561dd9', 'ip_address': '172.16.226.9', 'port_id': '0824d42b-3ed7-4dd5-a641-d7627530a638'}]
: wrapped /usr/local/lib/python2.7/dist-packages/neutron/db/api.py:124

Deleting subnet cc57f513-40fe-4824-b9e8-a0b69e3f8e37 delete_subnet /usr/local/lib/python2.7/dist-packages/neutron/plugins/ml2/plugin.py:1062
: 2016-09-24 19:06:56.762 228 ERROR oslo_db.api [req-a671d942-fc21-4237-85b5-657f00c66bed - - - - -] DB exceeded retry limit.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-09-24:

#92

Perhaps worse.
When i retry the stack delete, it works (and shows all deleted). But I end up with networks that are left that did not get deleted.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-09-26: Fix included in openstack/neutron 9.0.0.0rc1

#93

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-09-28:

#94

Do you have a sample trace from a request that says the network successfully deleted when it was actually left behind? (all of the logs from the corresponding network delete request id)

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-09-28:

#95

Can you open a separate bug for the leftover networks issue? The original issue for this bug (the overlapping subnets thing) is fixed.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-10-02:

#96

I'm not 100% sure this is solved, but ok.
It sometimes still fails on create.
it sometimes still fails on delete.
it sometimes passes the delete, but leaves some things behind.

0f68089d-8e94-47de-8cbd-3d023519e935 | t2_n70 |

is a network left behind. The attached logs in logs-16-10-02-16:58:23.tar.gz show both the failures on stack create, and then the success on stack delete that left this.

Revision history for this message

Don Bowman (donbowman) wrote on 2016-10-02:

#97

logs-16-10-02-16:58:23.tar.gz Edit (4.2 MiB, application/x-tar)

Ihar Hrachyshka (ihar-hrachyshka) on 2016-10-07

tags:

removed: neutron-proactive-backport-potential

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-10-13:

#98

I grepped the logs and there is only one mention of that network UUID during it's creation in n2-neutron.log. I don't see the HTTP request at all for the delete attempt in the neutron logs, heat logs, or nginx logs. Does that log include everything?

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2016-10-13:

#99

Same looking for the name itself 't2_n70'. Only one mention.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-10-18:

#100

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

neutron

Neutron confused about overlapping subnet creation

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches