Partial HA network causes HA router creation to fail (race condition)
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | Invalid | Undecided | Unassigned | |
Bug Description
ENV: stable/mitaka, VXLAN
Neutron API: two neutron-server instances behind an HAProxy VIP.
Log:
[1] http://
Log [1] shows that the HA network's subnet is deleted concurrently while a new HA router create API call arrives. It seems the race condition described in this bug still exists: https:/
"""
Some known exceptions:
...
2. IpAddressGenera
concurrently HA subnet deletion)
...
"""
Strangely, those 3 API calls share the same request-id [req-780b1f6e-
Test scenario:
Just create one HA router for a tenant, and then quickly delete it.
For now, our mitaka ENV uses VXLAN as the tenant network type, so there is a very large range of VNIs and little need to reclaim them. As a local, temporary workaround, we added a new config option that decides whether to delete the HA network each time the last HA router is removed.
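A minimal sketch of that workaround's logic in pure Python. The function name, signature, and the boolean knob are hypothetical, for illustration only; the real patch would register the knob as a config option (e.g. via oslo.config) and hook it into the router-deletion path:

```python
def maybe_delete_ha_network(tenant_id, ha_router_count, delete_fn,
                            delete_ha_network=False):
    """Delete the tenant's HA network only when the knob allows it.

    With VXLAN there is a very large VNI range, so leaving the HA
    network around costs almost nothing and sidesteps the race with a
    concurrent HA router creation.  All names here are hypothetical.
    """
    if ha_router_count > 0:
        return False  # other HA routers still need the network
    if not delete_ha_network:
        return False  # keep the network around (the workaround)
    delete_fn(tenant_id)  # original behavior: drop it eagerly
    return True
```

With the knob off (our local default), the HA network survives the deletion of the last HA router, so a racing `create_ha_port_and_bind` always finds a subnet to allocate from.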
description: updated
tags: added: l3-ha
description: updated
description: updated
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
milestone: none → ocata-1
Changed in neutron:
status: Invalid → New
Changed in neutron:
status: New → Incomplete
Adding a new configuration option is almost never temporary, as deleting config options is rarely backward-compatible.
The race condition, as I understand it, is as follows:
1. Create HA router; worker1 sends 'router_updated' to agent1. create_ha_port_and_bind tries to create the HA port, but there are no more IP addresses available, causing add_ha_port to fail as shown in the first paste.
2. Delete HA router (done by worker2). worker2 will now detect that there are no more HA routers and will delete the HA network for the tenant.
3. agent1 issues a 'sync_router', which triggers auto_schedule_routers.
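The interleaving in those steps can be replayed deterministically with a toy in-memory model. Everything here (the class, the method names, the `RuntimeError` standing in for the IPAM allocation failure) is hypothetical and only illustrates the check-then-act window, not Neutron's actual code:

```python
class FakeDB:
    """Toy model of the per-tenant HA network state (hypothetical)."""

    def __init__(self):
        self.ha_network = True      # tenant's HA network exists
        self.ha_routers = {'r1'}    # routers currently using it

    def delete_router(self, router_id):
        # worker2: delete the router, then check-then-act on the network.
        self.ha_routers.discard(router_id)
        if not self.ha_routers:     # "no more HA routers" ...
            self.ha_network = False  # ... so drop the HA network

    def create_ha_port(self):
        # worker1: allocate an HA port on the (possibly gone) network.
        if not self.ha_network:
            # In Neutron this surfaced as an IP-address-generation
            # failure rather than a clean "network deleted" error.
            raise RuntimeError('HA network gone: no IP addresses available')
        return 'ha-port'

db = FakeDB()
db.delete_router('r1')       # worker2 wins the race
try:
    db.create_ha_port()      # worker1's allocation now fails
    raced = False
except RuntimeError:
    raced = True
```

The window exists because "count the HA routers" and "delete the HA network" are not atomic with respect to a concurrent port allocation on that network.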
Point #3 is a bit weird to me, as it looks like IPAM is detecting a "network deleted during function run" as "no more IP addresses". In addition, this should be caught by [2], forcing a silent retrigger of this issue.
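For context, the retry-and-retrigger mechanism that [2] refers to can be sketched as follows, with hypothetical names; Neutron's actual helper is more selective about which exception types it retries and how it backs off:

```python
import functools


class RetriableError(Exception):
    """Stand-in for the transient DB errors the real decorator catches."""


def retry_on_error(max_attempts=3):
    """Re-run the decorated function when it raises RetriableError.

    This is how a transient race would be silently retriggered instead
    of bubbling a traceback up to the API caller.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except RetriableError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error
        return wrapper
    return deco


attempts = []


@retry_on_error(max_attempts=3)
def flaky_create():
    # Fails once (the race fires), then succeeds on the retry.
    attempts.append(1)
    if len(attempts) < 2:
        raise RetriableError('transient: subnet deleted under us')
    return 'created'
```

If the IPAM failure in the first paste were raised as a retriable error, a wrapper like this would absorb it; the fact that a traceback was logged suggests it was not classified that way.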
Aside from the issue that isn't clear to me, I'd like to point out that the latest stable/mitaka [1] doesn't even trigger auto_schedule_routers on sync_router (not since [3] - perhaps you're missing this backport?) - hence the trace received in the first paste can't be reproduced. For this reason, I'm closing this as Invalid. Liu, feel free to reopen if you disagree with my assessment :)
[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/api/rpc/handlers/l3_rpc.py#L79
[2]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726
[3]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4
(5860fb21e966ab8f1e011654dd477d7af35f7a27 is the latest stable/mitaka hash that github.com provided.)