neutron

[Pluggable IPAM] DB exceeded retry limit (RetryRequest) on create_router call

Bug #1543094 reported by Pavel Bondar on 2016-02-08

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	High	Ryan Tidwell

Bug Description

Observed "DB exceeded retry limit" errors [1] when pluggable ipam is enabled, on the master branch.
Different tests fail on each retest, so it looks like a concurrency issue.
Four 'DB exceeded retry limit' errors are observed in [1].

2016-02-04 11:55:59.944 15476 ERROR oslo_db.api [req-7ad8b69e-a851-4b6c-8c2c-33258c53bb54 admin -] DB exceeded retry limit.
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api Traceback (most recent call last):
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 137, in wrapper
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api return f(*args, **kwargs)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/api/v2/base.py", line 519, in _create
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api obj = do_create(body)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/api/v2/base.py", line 501, in do_create
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api request.context, reservation.reservation_id)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api six.reraise(self.type_, self.value, self.tb)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/api/v2/base.py", line 494, in do_create
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api return obj_creator(request.context, **kwargs)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_hamode_db.py", line 411, in create_router
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api self).create_router(context, router)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 200, in create_router
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api self.delete_router(context, router_db.id)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api six.reraise(self.type_, self.value, self.tb)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 196, in create_router
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api gw_info, router=router_db)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_gwmode_db.py", line 69, in _update_router_gw_info
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api context, router_id, info, router=router)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 429, in _update_router_gw_info
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api ext_ips)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_dvr_db.py", line 185, in _create_gw_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api ext_ips)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 399, in _create_gw_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api new_network_id, ext_ips)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 310, in _create_router_gw_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api context.elevated(), {'port': port_data})
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/plugins/common/utils.py", line 149, in create_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api return core_plugin.create_port(context, {'port': port_data})
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/plugins/ml2/plugin.py", line 1069, in create_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api result, mech_context = self._create_port_db(context, port)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/plugins/ml2/plugin.py", line 1045, in _create_port_db
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api result = super(Ml2Plugin, self).create_port(context, port)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/db_base_plugin_v2.py", line 1193, in create_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api port_id)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 172, in allocate_ips_for_port_and_store
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api revert_on_fail=False)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api six.reraise(self.type_, self.value, self.tb)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 156, in allocate_ips_for_port_and_store
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api ips = self._allocate_ips_for_port(context, port)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 228, in _allocate_ips_for_port
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api return self._ipam_allocate_ips(context, ipam_driver, p, ips)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 135, in _ipam_allocate_ips
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api "external system for %s"), addresses)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api six.reraise(self.type_, self.value, self.tb)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 120, in _ipam_allocate_ips
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api context, ipam_driver, port, ip_list)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 91, in _ipam_allocate_single_ip
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api port, subnet),
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/ipam_pluggable_backend.py", line 80, in _ipam_try_allocate_ip
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api return ipam_subnet.allocate(ip_request)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/ipam/drivers/neutrondb_ipam/driver.py", line 350, in allocate
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api auto_generated)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/ipam/drivers/neutrondb_ipam/driver.py", line 219, in _allocate_specific_ip
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api session, db_range, first_ip=first_ip, last_ip=last_ip)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/ipam/drivers/neutrondb_ipam/db_api.py", line 167, in update_range
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api raise db_exc.RetryRequest(ipam_exc.IPAllocationFailed)
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api RetryRequest
2016-02-04 11:55:59.944 15476 ERROR oslo_db.api
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource [req-7ad8b69e-a851-4b6c-8c2c-33258c53bb54 admin -] create failed
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource Traceback (most recent call last):
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource File "/opt/stack/new/neutron/neutron/api/v2/resource.py", line 83, in resource
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource result = method(request=request, **args)
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource File "/opt/stack/new/neutron/neutron/api/v2/base.py", line 408, in create
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource return self._create(request, body, **kwargs)
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 147, in wrapper
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource ectxt.value = e.inner_exc
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource six.reraise(self.type_, self.value, self.tb)
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource File "<string>", line 2, in reraise
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource TypeError: exceptions must be old-style classes or derived from BaseException, not type
2016-02-04 11:55:59.946 15476 ERROR neutron.api.v2.resource

[1] http://logs.openstack.org/23/181023/54/check/gate-neutron-dsvm-api/054071e/logs/screen-q-svc.txt.gz#_2016-02-04_11_55_59_944

Salvatore Orlando (salvatore-orlando) on 2016-02-08

Changed in neutron:
assignee:	nobody → Salvatore Orlando (salvatore-orlando)

Martin Hickey (martin-hickey) on 2016-02-08

tags:

added: l3-ipam-dhcp

Armando Migliaccio (armando-migliaccio) on 2016-02-08

Changed in neutron:
status:	New → Confirmed
importance:	Undecided → High
milestone:	none → mitaka-3

Carl Baldwin (carl-baldwin) wrote on 2016-02-11:

I'll be happy to review the implementation for this fix but I'll probably need a direct ping from Salvatore when it is ready.

John Belamaric (jbelamaric) wrote on 2016-02-26:

Carl - Pavel can work on this next week, please assign it to him (I do not have permissions to assign it to anyone but myself).

Pavel Bondar (pasha117) on 2016-02-29

Changed in neutron:
assignee:	Salvatore Orlando (salvatore-orlando) → Pavel Bondar (pasha117)

Pavel Bondar (pasha117) wrote on 2016-02-29:

RetryRequest was raised in the 'rollback' block of allocate_ips_for_port_and_store, and since the rollback failed, the original error was lost.
Created a patchset with expanded debug output [2] to uncover the original failure.

[1] https://github.com/openstack/neutron/blob/master/neutron/db/ipam_pluggable_backend.py#L177
[2] https://review.openstack.org/#/c/285936/

OpenStack Infra (hudson-openstack) wrote on 2016-03-03: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/287811

Changed in neutron:
status:	Confirmed → In Progress

Armando Migliaccio (armando-migliaccio) on 2016-03-03

Changed in neutron:
milestone:	mitaka-3 → mitaka-rc1

Pavel Bondar (pasha117) wrote on 2016-03-04:

My recent findings about the compare-and-swap approach used in the internal ipam driver indicate
that it scales worse than the locking approach.
Compare-and-swap workflow in ipam driver:
A. Start transaction.
B. Do various db operations (read/write), can be tens of db queries.
C. Read availability range.
D. Calculate updated availability range and try to update it
E. If count of updated rows is less than expected raise RetryRequest
F. Commit transaction

RetryRequest is handled outside db transaction, so retry has to start from step A.
It means cost of one retry is cost of each operation from A to E, and that is tens of db
queries done on step B.
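
Steps C to E can be sketched as follows. This is only an illustration using the standard-library sqlite3 module, not the neutron driver code: the `availability_ranges` table layout, the `cas_update_range` helper, the integer IP values, and the local `RetryRequest` class are all assumptions made for the example.

```python
import sqlite3


class RetryRequest(Exception):
    """Local stand-in for oslo_db's RetryRequest (assumption for this sketch)."""


def cas_update_range(conn, range_id, old_first, old_last, new_first, new_last):
    # Steps D/E: update the row only if it still holds the values read in
    # step C, then check how many rows were actually changed.
    cur = conn.execute(
        "UPDATE availability_ranges SET first_ip = ?, last_ip = ? "
        "WHERE id = ? AND first_ip = ? AND last_ip = ?",
        (new_first, new_last, range_id, old_first, old_last),
    )
    if cur.rowcount != 1:
        # Someone else changed the range first; the caller has to restart.
        raise RetryRequest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE availability_ranges "
    "(id INTEGER PRIMARY KEY, first_ip INTEGER, last_ip INTEGER)"
)
conn.execute("INSERT INTO availability_ranges VALUES (1, 2, 254)")

# Step C: two workers read the same range (2, 254).
snapshot = conn.execute(
    "SELECT first_ip, last_ip FROM availability_ranges WHERE id = 1"
).fetchone()

# Worker 1 allocates address 2 and shrinks the range.
cas_update_range(conn, 1, snapshot[0], snapshot[1], 3, 254)

# Worker 2 still holds the stale snapshot, so its update matches no rows.
try:
    cas_update_range(conn, 1, snapshot[0], snapshot[1], 3, 254)
    stale_update_succeeded = True
except RetryRequest:
    stale_update_succeeded = False

final_row = conn.execute(
    "SELECT first_ip, last_ip FROM availability_ranges WHERE id = 1"
).fetchone()
```

In the sketch the second worker only repeats one cheap statement, but in the workflow described above the retry is caught outside the transaction, so everything from step A is repeated.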

Why do we need to restart entire transaction?
Because of the transaction isolation level, which often defaults to REPEATABLE READ.
REPEATABLE READ transaction isolation level means that reading the same row again within a transaction returns
the same result.
So if we try to handle retry request without restarting transaction, then on C step
the same availability range would be read, even if another transaction has changed this range.

Locking workflow:
A. Start transaction.
B. Do various db operations (read/write), can be tens of db queries.
C. Lock for update availability range row.
D. Do actual update.
E. Commit transaction.
In this case, if two processes concurrently update the row, one of them blocks on step C.
When lock is released second process continues from step C, without having to restart
entire transaction.

The locking approach has issues with Galera multi-master. In this case each master node
can acquire the lock on step C, but only the first master will be able to commit its transaction.
For the second master a Deadlock will be raised.
Currently Deadlock is handled in the same wrapper which handles RetryRequest, and the entire
transaction is restarted in this case.
So with locking, the entire transaction gets restarted only in the rare case where the same row is updated on different masters,
whereas with compare-and-swap each concurrent update results in an entire transaction restart.

To make compare-and-swap work effectively, it should repeat only steps C and D when RetryRequest is raised.
But in our case, where ipam driver is called inside big transaction it looks hardly achievable.
But still how it can be achieved?
- Use READ_COMMITTED isolation level for all backends.
Not sure if it is something that is safe to do.
- Create new connection/transaction inside of ipam driver that is out of scope of big transaction.
All my experiments in trying to get new connection inside of running big transaction finished with no success.
But doesn't look impossible yet.
- Any other ideas?

Pavel Bondar (pasha117) wrote on 2016-03-04:

I don't see a good way to fix compare-and-swap in the ipam driver so that it is at least as good as the locking approach, so I propose to revert commit d755f7248d324bb4c44b3efc9d200f8eb075066d.

And fix original issue with missed lock by:
https://review.openstack.org/#/c/223123/1/neutron/ipam/drivers/neutrondb_ipam/db_api.py

Carl Baldwin (carl-baldwin) wrote on 2016-03-10:

I looked into this a bit today. This availability range table has been a point of lock contention for a long time. I'm being serious when I suggest that we get rid of it. Let me explain.

The contents of the availability range table are completely derived from the "allocation pools" and the "allocations". It is essentially a set difference between the two. On this basis, some would argue that it never should've been stored as a table in the first place. But, I understand that someone thought that the queries required to compute availability on the fly were expensive and that caching availability in a table was an attractive alternative.

One problem is that availability had to be designed to be compact because it had to work for ipv6 and even some large ipv4 subnets where recording individual IP availability is impractical. So, availability was compressed into ranges. The problem with this is that each allocation needs to adjust the ranges. So, all allocations (and previously all deallocations) on a subnet need to serialize around the ranges recorded for the subnet. This severely limits the number of allocations that can be performed at any given time. It is a bottleneck.

Hence, I suggest that we remove the table and compute availability on the fly. We read the subnet pools into an ipset, read all of the allocations for the subnet, and do the set difference. Then, pick an IP and record it in the IP allocations table.

You might be thinking that we're merely moving the contention to the allocations table. But, this is easily mitigated. Given the algorithm that I described, we'll normally have a whole set of IPs from which to choose. So, instead of everyone choosing the first IP available, we select one randomly from a window of the next available IP addresses. For example, my algorithm might look at the next 16 IP addresses and pick one randomly to try to avoid colliding with someone else looking at the same (or almost the same) window. That should cut down on the number of collisions that need to be retried.

The more contention I see around this table, the more I'm convinced that computing availability on the fly really won't be that bad, even in large subnets. What do you think?
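
A rough sketch of the set-difference idea described in this comment, using only the standard-library `ipaddress` module. The helper names `available_ips` and `pick_from_window` are hypothetical, the example handles IPv4 only, and it expands pools into plain Python sets, which is only practical for small pools; a range-based structure such as the ipset mentioned above would be needed for large or IPv6 subnets.

```python
import ipaddress
import random


def available_ips(pools, allocated):
    """Return the sorted addresses that are in the pools but not allocated."""
    in_pools = set()
    for first, last in pools:
        start = int(ipaddress.ip_address(first))
        end = int(ipaddress.ip_address(last))
        in_pools.update(range(start, end + 1))
    taken = {int(ipaddress.ip_address(ip)) for ip in allocated}
    # Availability is simply "allocation pools" minus "allocations".
    return [ipaddress.ip_address(n) for n in sorted(in_pools - taken)]


def pick_from_window(candidates, window=16, rng=random):
    """Pick randomly among the lowest `window` free addresses."""
    if not candidates:
        raise LookupError("no addresses available")
    return rng.choice(candidates[:window])


pools = [("192.0.2.2", "192.0.2.30")]
allocated = ["192.0.2.2", "192.0.2.3", "192.0.2.5"]

free = available_ips(pools, allocated)
choice = pick_from_window(free)
```

Two callers looking at the same window will usually pick different addresses, so most of them succeed on the first attempt instead of all racing for the lowest free address.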

OpenStack Infra (hudson-openstack) wrote on 2016-03-14:

Fix proposed to branch: master
Review: https://review.openstack.org/292207

Changed in neutron:
assignee:	Pavel Bondar (pasha117) → Ryan Tidwell (ryan-tidwell)

Ryan Tidwell (ryan-tidwell) on 2016-03-14

Changed in neutron:
assignee:	Ryan Tidwell (ryan-tidwell) → nobody

Ryan Tidwell (ryan-tidwell) wrote on 2016-03-14:

I like the idea of computing availability on the fly and not storing the availability. After taking a stab at a solution, the drawback I'm finding is that there are a large number of unit tests (100+) that assume a very deterministic way of allocating IP addresses, and those are just the tests I've found. What Carl is proposing would introduce an algorithm for selecting an IP address to allocate that isn't deterministic at all. I'm not suggesting that tests be the gating factor as to whether we pursue this approach, I'm simply pointing out a substantial amount of churn. As I work through this, I'm also finding it hard to use test failures as a guide for which tests to re-work as this non-deterministic behavior manifests itself in non-deterministic pass/fail of certain tests from run-to-run. We should proceed with an abundance of caution if we take Carl's approach, even if it does yield positive results in the end.

Armando Migliaccio (armando-migliaccio) on 2016-03-15

Changed in neutron:
milestone:	mitaka-rc1 → newton-1

OpenStack Infra (hudson-openstack) wrote on 2016-03-23: Change abandoned on neutron (master)

#10

Change abandoned by Pavel Bondar (<email address hidden>) on branch: master
Review: https://review.openstack.org/287811
Reason: Abandoning since the proposed patch still has issues with concurrency due to the reasons described in previous comments. Need to come up with another solution for #1543094

Pavel Bondar (pasha117) wrote on 2016-03-24:

#11

The idea of having an algorithm for non-deterministic allocations makes sense to me. It looks like it will not be easy to rework all the failing UTs, and we will pay the cost of computing availability ranges on the fly for each allocation. But from my point of view removing the bottleneck might be worth it.

John Belamaric (jbelamaric) wrote on 2016-03-24:

#12

Well, isn't this one of the points of pluggability? We can leave the existing implementation alone, and add a new driver that uses a different strategy. Let the operators decide which they prefer.

Miguel Lavalle (minsel) wrote on 2016-03-24:

#13

Proposed fix by Ryan Tidwell: https://review.openstack.org/#/c/292207/

Changed in neutron:
assignee:	nobody → Ryan Tidwell (ryan-tidwell)

Carl Baldwin (carl-baldwin) wrote on 2016-03-24:

#14

@John - You make an interesting point. I think maybe we should consider starting this as an alternate experimental driver. But, we should weigh this against the difficulties of supporting multiple drivers.

If it is just to avoid having to fix unit tests, then I'd say that it isn't worth it to support two different drivers. Ignoring unit tests and other internal development reasons, is there really a compelling reason to continue to support the current behavior? Why would someone care if they get the next available IP address sequentially or if they get one of the N lowest available IP addresses? If it is just OCD, then I'd say there is no compelling reason. But, maybe you know of a reason.

@Pavel. I agree with you. But, the more I think through it, the more I think that the cost really isn't a big deal. We've found that by simplifying the DB queries we actually reduce the number of queries when there are multiple subnets thus reducing the round trips to the DB.

The worst case is a network with thousands of allocations. If there are this many allocations, they are likely from multiple subnets. So, compare the time it takes to query that many allocations and add them to an IPSet in memory in one shot with the time it takes to make multiple queries to the DB. I really don't think it will be bad. And, if we remove the contention, we won't have to pay for retries nearly as much. With that said, maybe we should develop this as a separate driver and find a deployment with thousands of allocations in a single network to measure how this behaves. Then, when it is ready, we can replace the current driver.

John Belamaric (jbelamaric) wrote on 2016-03-24:

#15

No, I don't have a reason around the IP allocations being perfectly ordered. I just suggested this for the same reason we kept the non-pluggable interface when we built out the pluggable interface - to mitigate risk.

OpenStack Infra (hudson-openstack) wrote on 2016-04-07: Related fix proposed to neutron (master)

#16

Related fix proposed to branch: master
Review: https://review.openstack.org/303085

OpenStack Infra (hudson-openstack) on 2016-04-07

Changed in neutron:
assignee:	Ryan Tidwell (ryan-tidwell) → Kevin Benton (kevinbenton)

OpenStack Infra (hudson-openstack) wrote on 2016-04-08: Fix proposed to neutron (master)

#17

Fix proposed to branch: master
Review: https://review.openstack.org/303603

Changed in neutron:
assignee:	Kevin Benton (kevinbenton) → Ryan Tidwell (ryan-tidwell)

OpenStack Infra (hudson-openstack) wrote on 2016-04-08:

#18

Fix proposed to branch: master
Review: https://review.openstack.org/303638

OpenStack Infra (hudson-openstack) wrote on 2016-04-18: Fix merged to neutron (master)

#19

Reviewed: https://review.openstack.org/303085
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=db817a9e39dbed10383cb2c70c0f95d4b1795aec
Submitter: Jenkins
Branch: master

commit db817a9e39dbed10383cb2c70c0f95d4b1795aec
Author: Kevin Benton <email address hidden>
Date: Tue Apr 5 23:06:00 2016 -0700

Add semaphore to ML2 create_port db operation

    This adds a semaphore scoped to the network ID of a port
    creation in ML2 to ensure that all workers on a single server
    only try to allocate an IP for that network one at a time.

    This will alleviate the deadlock error retry mechanism being
    exceeded due to the related bug. It reduces the number of contenders
    for a single IP allocation from number of workers to number of servers.

    It will unblock the switch to pluggable ipam while the IP allocation
    strategy is being revamped to be less racey.

    Partial-Bug: #1543094
    Change-Id: Ieafdd640777d4654fcd0ebb65ace25c30151c412
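
The per-network serialization described in this commit message can be illustrated roughly as below. This is not the merged ML2 code: it uses threads as stand-ins for API workers, and `network_lock`, `create_port`, and `naive_allocate` are hypothetical names. Real neutron workers are separate processes, so an actual implementation would need an inter-process lock rather than `threading.Lock`.

```python
import threading
from collections import defaultdict

_registry_guard = threading.Lock()
_network_locks = defaultdict(threading.Lock)


def network_lock(network_id):
    """Return the lock dedicated to one network, creating it on first use."""
    with _registry_guard:
        return _network_locks[network_id]


def create_port(network_id, allocate_ip):
    # Only one caller per network runs the allocation at a time;
    # callers working on other networks are not blocked.
    with network_lock(network_id):
        return allocate_ip(network_id)


next_free = {"net-a": 0}
allocated = []


def naive_allocate(network_id):
    # Read-modify-write that would race without the surrounding lock.
    ip = next_free[network_id]
    next_free[network_id] = ip + 1
    return ip


def worker():
    allocated.append(create_port("net-a", naive_allocate))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

As the commit message notes, this only reduces contention within one server; allocations issued from different servers can still collide and rely on the retry mechanism.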

Miguel Lavalle (minsel) wrote on 2016-04-20:

#20

As of today, these are the fixes in review in Gerrit for this bug:

https://review.openstack.org/#/c/292207/
https://review.openstack.org/#/c/303603/

Miguel Lavalle (minsel) wrote on 2016-05-05:

#21

Three related fixes:

https://review.openstack.org/#/c/312771/
https://review.openstack.org/#/c/292207/
https://review.openstack.org/#/c/303603/

OpenStack Infra (hudson-openstack) wrote on 2016-05-11: Fix proposed to neutron (stable/mitaka)

#22

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/314810

OpenStack Infra (hudson-openstack) wrote on 2016-05-18: Fix merged to neutron (master)

#23

Reviewed: https://review.openstack.org/303603
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cbc15d2e1db22fbefaac0a97363589aebf834b24
Submitter: Jenkins
Branch: master

commit cbc15d2e1db22fbefaac0a97363589aebf834b24
Author: Ryan Tidwell <email address hidden>
Date: Fri Apr 8 14:12:01 2016 -0700

Ensure unit tests don't assume an IP address allocation strategy

    These unit tests initially asserted sequential allocation of IP
    addresses, even though they have no need to specifically assert
    that a specific IP was allocated. This made it difficult to
    change out the IP allocation algorithm in the future and made
    these tests fragile and poorly isolated.

    This change breaks the dependency these unit tests have on a
    specific IP allocation strategy and isolates them from any
    changes that may be made to the order in which IP addresses
    are allocated on a subnet.

    Change-Id: Idc879b7f1e6496aa96b4f7ae6c3eaca6079bdcac
    Partial-Bug: #1543094
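
The kind of test change this commit message describes might look like the following sketch. It is not taken from the neutron test suite: `allocate_ip` is a hypothetical stand-in allocator, and the point is only the shape of the assertions, which check that the address belongs to the subnet and is not already in use instead of comparing against one specific address.

```python
import ipaddress
import random
import unittest


def allocate_ip(cidr, allocated):
    """Hypothetical allocator: returns any free host address in the subnet."""
    free = [ip for ip in ipaddress.ip_network(cidr).hosts()
            if ip not in allocated]
    return random.choice(free)


class TestPortAllocation(unittest.TestCase):
    def test_ip_comes_from_subnet(self):
        cidr = "10.0.0.0/28"
        allocated = {ipaddress.ip_address("10.0.0.1")}
        ip = allocate_ip(cidr, allocated)
        # Fragile form that assumes sequential allocation:
        #   self.assertEqual(ipaddress.ip_address("10.0.0.2"), ip)
        # Strategy-independent form:
        self.assertIn(ip, ipaddress.ip_network(cidr))
        self.assertNotIn(ip, allocated)
```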

OpenStack Infra (hudson-openstack) wrote on 2016-05-23: Fix merged to neutron (stable/mitaka)

#24

Reviewed: https://review.openstack.org/314810
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c871a43383e6bd40a358087663267da41b7bf1ec
Submitter: Jenkins
Branch: stable/mitaka

commit c871a43383e6bd40a358087663267da41b7bf1ec
Author: Kevin Benton <email address hidden>
Date: Tue Apr 5 23:06:00 2016 -0700

Add semaphore to ML2 create_port db operation

    This adds a semaphore scoped to the network ID of a port
    creation in ML2 to ensure that all workers on a single server
    only try to allocate an IP for that network one at a time.

    It will unblock the switch to pluggable ipam while the IP allocation
    strategy is being revamped to be less racey.

    Partial-Bug: #1543094
    Change-Id: Ieafdd640777d4654fcd0ebb65ace25c30151c412
    (cherry picked from commit db817a9e39dbed10383cb2c70c0f95d4b1795aec)

tags:

added: in-stable-mitaka

OpenStack Infra (hudson-openstack) on 2016-05-26

Changed in neutron:
assignee:	Ryan Tidwell (ryan-tidwell) → Carl Baldwin (carl-baldwin)

Armando Migliaccio (armando-migliaccio) on 2016-06-03

Changed in neutron:
milestone:	newton-1 → newton-2

OpenStack Infra (hudson-openstack) on 2016-06-09

Changed in neutron:
assignee:	Carl Baldwin (carl-baldwin) → Ryan Tidwell (ryan-tidwell)

OpenStack Infra (hudson-openstack) wrote on 2016-06-09: Fix merged to neutron (master)

#25

Reviewed: https://review.openstack.org/292207
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dcb2a931b5b84fb7aa41f08b37a5148bf6e987bc
Submitter: Jenkins
Branch: master

commit dcb2a931b5b84fb7aa41f08b37a5148bf6e987bc
Author: Ryan Tidwell <email address hidden>
Date: Fri Apr 8 14:21:03 2016 -0700

Compute IPAvailabilityRanges in memory during IP allocation

    This patch computes IP availability in memory without locking on
    IPAvailabilityRanges. IP availability is generated in memory, and
    to avoid contention an IP address is selected by randomly
    selecting from within the first 10 available IP addresses on a
    subnet. Raises IPAddressGenerationFailure if unable to allocate an
    IP address from within the window.

    Change-Id: I52e4485e832cbe6798de6b4afb6a7cfd88db11e2
    Depends-On: I84195b0eb63b7ca6a4e00becbe09e579ff8b718e
    Closes-Bug: #1543094
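
One way to picture the window-based allocation with a failure case is sketched below. This is not the merged patch: it assumes, for illustration, that collisions are detected through a uniqueness constraint on an `ipallocations` table in sqlite3, and `allocate_from_window` and the local `IPAddressGenerationFailure` class are stand-ins defined only for this example.

```python
import random
import sqlite3


class IPAddressGenerationFailure(Exception):
    """Local stand-in for the neutron exception of the same name."""


def allocate_from_window(conn, subnet_id, candidates, window=10, rng=random):
    # Try the first `window` free candidates in random order.
    attempts = list(candidates[:window])
    rng.shuffle(attempts)
    for ip in attempts:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO ipallocations (subnet_id, ip_address) "
                    "VALUES (?, ?)",
                    (subnet_id, ip),
                )
            return ip
        except sqlite3.IntegrityError:
            # Another allocation already holds this address; try the next.
            continue
    raise IPAddressGenerationFailure(subnet_id)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipallocations (subnet_id TEXT, ip_address TEXT, "
    "PRIMARY KEY (subnet_id, ip_address))"
)
# Simulate a concurrent allocation that took 10.0.0.2 after we computed
# our candidate list.
conn.execute("INSERT INTO ipallocations VALUES ('subnet-1', '10.0.0.2')")
conn.commit()

stale_candidates = ["10.0.0.2", "10.0.0.3"]
first = allocate_from_window(conn, "subnet-1", stale_candidates)

try:
    allocate_from_window(conn, "subnet-1", stale_candidates)
    second_failed = False
except IPAddressGenerationFailure:
    second_failed = True
```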

Changed in neutron:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-06-17: Related fix proposed to neutron (master)

#26

Related fix proposed to branch: master
Review: https://review.openstack.org/331238

OpenStack Infra (hudson-openstack) wrote on 2016-06-22: Fix merged to neutron (master)

#27

Reviewed: https://review.openstack.org/303638
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=38a080aca37197ff9c5464a3bc9981e684decee6
Submitter: Jenkins
Branch: master

commit 38a080aca37197ff9c5464a3bc9981e684decee6
Author: Ryan Tidwell <email address hidden>
Date: Fri Apr 8 16:16:25 2016 -0700

Remove IP availability range recalculation logic

    This patch removes unreachable code that rebuilds IP address
    availability data and persists in the database. This is
    ultimately derived data that is now computed in memory and never
    persisted. The code being removed is dead code and does not
    include the contract migration for removal of the
    IPAvailabilityRange models.

    Change-Id: I96cb67396b8e0ebbe7f98353fad1607405944e44
    Partial-Bug: #1543094

Ihar Hrachyshka (ihar-hrachyshka) on 2016-06-30

tags:

added: neutron-proactive-backport-potential

Doug Hellmann (doug-hellmann) wrote on 2016-07-13: Fix included in openstack/neutron 9.0.0.0b2

#28

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2016-08-01: Related fix proposed to neutron (master)

#29

Related fix proposed to branch: master
Review: https://review.openstack.org/349709

Armando Migliaccio (armando-migliaccio) on 2016-09-02

Changed in neutron:
milestone:	newton-2 → newton-rc1
milestone:	newton-rc1 → none

OpenStack Infra (hudson-openstack) wrote on 2016-09-02: Related fix merged to neutron (master)

#30

Reviewed: https://review.openstack.org/331238
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=44de48a442ab21a84af37ad727c34ec88570ad3d
Submitter: Jenkins
Branch: master

commit 44de48a442ab21a84af37ad727c34ec88570ad3d
Author: Dariusz Smigiel <email address hidden>
Date: Fri Jun 17 15:49:51 2016 +0000

Remove workaround for bug/1543094

    Bug which caused DB exceeded retry limit is solved.

    Change-Id: Ie7b7d994d7443a9575fad0b322683c4f3009d139
    Related-Bug: #1543094

OpenStack Infra (hudson-openstack) wrote on 2016-09-14:

#31

Reviewed: https://review.openstack.org/349709
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=774792681de922c4f1c4788894da8c10da47f67c
Submitter: Jenkins
Branch: master

commit 774792681de922c4f1c4788894da8c10da47f67c
Author: Carl Baldwin <email address hidden>
Date: Mon Aug 1 15:04:23 2016 -0600

Remove availability range code and model

    These models are effectively obsolete [1] and should've been removed
    in a previous patch [2] but some of it was left behind.

    [1] https://review.openstack.org/#/c/292207
    [2] https://review.openstack.org/#/c/303638

    Change-Id: Ib381c24f37e787b4912e28d98ec77473c0448c2b
    Related-Bug: #1543094
    Closes-Bug: #1620746

Ihar Hrachyshka (ihar-hrachyshka) on 2016-10-07

tags:

removed: neutron-proactive-backport-potential

sahil rastogi (itmashu) on 2021-05-25

Changed in neutron:
status:	Fix Released → Confirmed
status:	Confirmed → Fix Released
