Too many DBDeadlockError and IP address collision during port creating

Bug #1777968 reported by LIU Yulong
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
LIU Yulong

Bug Description

Too many DBDeadlockError and IP address collision during port creating

ENV:
Neutron stable/queens (12.0.1)
CentOS 7 (3.10.0-514.26.2.el7.x86_64)

This is a edge scenario testing after we meet bug:
https://bugs.launchpad.net/neutron/+bug/1777965

We have 3 neutron-server node, every each one enable 8 API worker.

Try create 1000 port in a single network 2.0.0.0/16.

Exception:
IP address collision:
2018-06-20 18:57:31.121 440352 ERROR oslo_db.api NeutronDbObjectDuplicateEntry: Failed to create a duplicate IpamAllocation: for attribute(s) ['PRIMARY'] with value(s) 2.0.8.33-c62094a1-1f21-42a0-bc12-a06db9661463
Deadlock:
2018-06-20 18:53:33.367 440348 ERROR neutron.pecan_wsgi.hooks.translation DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'INSERT INTO ipamallocations (ip_address, status, ipam_subnet_id) VALUES (%(ip_address)s, %(status)s, %(ipam_subnet_id)s)'] [parameters: {'status': u'ALLOCATED', 'ip_address': '2.0.7.138', 'ipam_subnet_id': 'c62094a1-1f21-42a0-bc12-a06db9661463'}] (Background on this error at: http://sqlalche.me/e/2j85)

LOG:
IP address collision 409:
http://paste.openstack.org/show/723983/

Deadlock 500:
http://paste.openstack.org/show/723986/

REQ AND RESP:
request:
2018-06-20 18:47:26.993 440352 DEBUG neutron.api.v2.base [req-fe6df326-a34e-4766-a280-24bf6e4d023f a187bcd0bd814d20aad5b3359868db3d 3ae80176a60f4d989c84c084ca05df1c - default default] Request body: {u'port': {u'network_id': u'ca054e1a-5646-42bd-9ab1-1fd5c6d26c92'}} prepare_request_body /usr/lib/python2.7/site-packages/neutron/api/v2/base.py:690

HTTP 409 response:
2018-06-20 18:57:31.217 440352 INFO neutron.wsgi [req-fe6df326-a34e-4766-a280-24bf6e4d023f a187bcd0bd814d20aad5b3359868db3d 3ae80176a60f4d989c84c084ca05df1c - default default] 10.129.169.147 "POST /v2.0/ports HTTP/1.1" status: 409 len: 0 time: 605.5961649

Exception count in each neutron server:
[root@147 neutron]# grep "duplicate IpamAllocation" server.log.bak4|wc -l
3651
[root@148 neutron]# grep "duplicate IpamAllocation" server.log.bak4|wc -l
3831
[root@149 neutron]# grep "duplicate IpamAllocation" server.log.bak4|wc -l
3145

We found two key point of neutron code related to such error:
In [1], the IP range is a little small, and here only return the last one IP, not a random pick, then the IP address collision chance raise.
In [2], there are fixed value max_retries=10 and retry_interval=0.1. It seems not a good practice. And this wrapper just ingore the [database] config
attribute db_retry_interval, db_max_retry_interval and db_max_retries.

[1] https://github.com/openstack/neutron/blob/master/neutron/ipam/drivers/neutrondb_ipam/driver.py#L170-#L173
[2] https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L71-#L76

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi Liu, are you working on a change for this?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/577739

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress
Revision history for this message
LIU Yulong (dragon889) wrote :

@Brain,
We try to add such change [1] locally, and the port creating result looks better, 987 ports created successfully for 1000 API requests. Without such change, it will have 500+ API fail.
Yes, such change seems not the best way, we need to look deeper in [2], [3] and the [database] config.

[1] https://review.openstack.org/577739
[2] https://github.com/openstack/neutron/blob/master/neutron/ipam/drivers/neutrondb_ipam/driver.py#L170-#L173
[3] https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L71-#L76

Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: LIU Yulong (dragon889) → hujin (hujin)
Changed in neutron:
assignee: hujin (hujin) → LIU Yulong (dragon889)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-lib (master)

Fix proposed to branch: master
Review: https://review.openstack.org/583527

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/577739
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ec5cd0d1b85d56adef836bf1167287993ec35ca6
Submitter: Zuul
Branch: master

commit ec5cd0d1b85d56adef836bf1167287993ec35ca6
Author: LIU Yulong <email address hidden>
Date: Sat Jun 23 04:52:15 2018 +0800

    Reduce IP address collision during port creating

    Try to give it a large random chance during generate IP
    address.

    The DB retry mechanism change moved to neutron-lib:
    I5ad139bdfb3ae125658b36d05f85f139a1b47bee

    Closes-Bug: #1777968
    Change-Id: I2acc9c720b39271bde2a89da4a66958f7aba5b7d

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/584307

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/584308

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/584309

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b3

This issue was fixed in the openstack/neutron 13.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/584307
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1511c5f9dd4de11f666ff151f5cdaacc580f4c08
Submitter: Zuul
Branch: stable/queens

commit 1511c5f9dd4de11f666ff151f5cdaacc580f4c08
Author: LIU Yulong <email address hidden>
Date: Sat Jun 23 04:52:15 2018 +0800

    Reduce IP address collision during port creating

    Try to give it a large random chance during generate IP
    address.

    The DB retry mechanism change moved to neutron-lib:
    I5ad139bdfb3ae125658b36d05f85f139a1b47bee

    Closes-Bug: #1777968
    Change-Id: I2acc9c720b39271bde2a89da4a66958f7aba5b7d
    (cherry picked from commit ec5cd0d1b85d56adef836bf1167287993ec35ca6)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/584309
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3df69ec9f608e87a551f94e4ba6cc35224fbe5fd
Submitter: Zuul
Branch: stable/ocata

commit 3df69ec9f608e87a551f94e4ba6cc35224fbe5fd
Author: LIU Yulong <email address hidden>
Date: Sat Jun 23 04:52:15 2018 +0800

    Reduce IP address collision during port creating

    Try to give it a large random chance during generate IP
    address.

    The DB retry mechanism change moved to neutron-lib:
    I5ad139bdfb3ae125658b36d05f85f139a1b47bee

    Closes-Bug: #1777968
    Change-Id: I2acc9c720b39271bde2a89da4a66958f7aba5b7d
    (cherry picked from commit ec5cd0d1b85d56adef836bf1167287993ec35ca6)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/584308
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=00998f680cd758b3260c873765f335a6379479ca
Submitter: Zuul
Branch: stable/pike

commit 00998f680cd758b3260c873765f335a6379479ca
Author: LIU Yulong <email address hidden>
Date: Sat Jun 23 04:52:15 2018 +0800

    Reduce IP address collision during port creating

    Try to give it a large random chance during generate IP
    address.

    The DB retry mechanism change moved to neutron-lib:
    I5ad139bdfb3ae125658b36d05f85f139a1b47bee

    Closes-Bug: #1777968
    Change-Id: I2acc9c720b39271bde2a89da4a66958f7aba5b7d
    (cherry picked from commit ec5cd0d1b85d56adef836bf1167287993ec35ca6)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-lib (master)

Reviewed: https://review.openstack.org/583527
Committed: https://git.openstack.org/cgit/openstack/neutron-lib/commit/?id=ed62a06a536a39b1236bd4dcadb2dd13cc2c4b4b
Submitter: Zuul
Branch: master

commit ed62a06a536a39b1236bd4dcadb2dd13cc2c4b4b
Author: LIU Yulong <email address hidden>
Date: Mon Jul 16 09:56:47 2018 +0800

    Make the DB retry mechanism more robust

    Currently in the worst case retry interval is:
    [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 10, 10, 10]
    This may cause some neutron DB action fail, and result an
    API 500/40X error, for instance, the IP allocation get a
    large failure chance with such config.

    After this change, the worst case retry interval will look
    something like this:
    [0.5, 1, 2, 4, 8, 10, 10, ... 10]

    So, for the new value 20 and 0.5, in the worst case the total
    time cost will be 165.5s. For the old config, it's 42.7s. It's
    not too radical for the new retry input params. And we get a
    higher rate of API success.

    Change-Id: I5ad139bdfb3ae125658b36d05f85f139a1b47bee
    Closes-Bug: #1777968

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.4

This issue was fixed in the openstack/neutron 12.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron-lib 1.19.0

This issue was fixed in the openstack/neutron-lib 1.19.0 release.

Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :

This bug is still happening and it hurts big Openstack Deployments.
Its easy to reproduce with Parallel on latest devstack with the fixes that were introduced.

for i in {1..300}; do echo port${i};done | parallel -vv -j 50 neutron port-create public --name {}

According to what LIU Yulong say the main issue is the generator of IP addresses:
https://github.com/openstack/neutron/blob/master/neutron/ipam/drivers/neutrondb_ipam/driver.py#L153-L178

We can confirm that also it is visible when it generates IP from large IPv6 class, even with latest patch.

Introduced patch makes the whole ip-generation process very slow due a lot of retries and waits.
We consider to rewrite the gen function to make it Deadlock and Duplicate Error resistant.

Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :

Some statistics taken from my devstack test for 533 created ports.
Allocation is really slow.

>>> numpy.std(data)
38.85653252187421
>>> numpy.percentile(data, 90)
92.16371776
>>> numpy.percentile(data, 99)
174.667329048
>>> numpy.percentile(data, 50)
29.586086
>>> numpy.std(data)
38.85653252187421
>>> numpy.min(data)
1.3005738
>>> numpy.max(data)
275.8769221
>>> len(data)
533
>>>

Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :

After applying my patch requests times are better now. I haven't found any 5xx or 4xx errors during my test also. Every port was created.

https://review.openstack.org/#/c/611341

After applying this patch times in seconds for 300 ports:
>>> numpy.std(data)
6.643350909534087
>>> numpy.percentile(data, 90)
31.71428038
>>> numpy.percentile(data, 99)
37.62866121
>>> numpy.percentile(data, 50)
23.123787049999997
>>> numpy.min(data)
4.2004371
>>> numpy.max(data)
40.4567599
>>>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Maciej Jozefczyk (<email address hidden>) on branch: master
Review: https://review.openstack.org/611341

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.