Partial HA network causes HA router creation to fail (race condition)

Bug #1633306 reported by LIU Yulong
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

ENV: stable/mitaka, VXLAN
Neutron API: two neutron-servers behind an HAProxy VIP.

Log:
[1] http://paste.openstack.org/show/585670/

Log [1] shows that the subnet of the HA network is deleted concurrently while a new HA router create API call arrives. It seems the race condition described in https://bugs.launchpad.net/neutron/+bug/1533440 still exists; that bug's description says:

"""
Some known exceptions:
...
2. IpAddressGenerationFailure: (HA port created failed due to the
   concurrently HA subnet deletion)
...
"""

One very strange behavior: those 3 API calls share the same request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

Test scenario:
Just create one HA router for a tenant, and then quickly delete it.

For now, our mitaka environment uses VXLAN as the tenant network type, so there is a very large range of VNIs and no real pressure to conserve them. As a local, temporary workaround we added a new config option that decides whether to delete the HA network every time the last HA router is removed.
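
A minimal sketch of what such an option could look like with oslo.config; the option name, default, and helper function are illustrative assumptions, not our actual downstream patch:

# Hypothetical illustration only: option name and default are assumptions.
from oslo_config import cfg

ha_opts = [
    cfg.BoolOpt('delete_ha_network_on_last_router',
                default=True,
                help='If False, keep the tenant HA network (and its VNI) '
                     'after the last HA router is deleted, avoiding the '
                     'delete/create race at the cost of one VNI per tenant.'),
]
cfg.CONF.register_opts(ha_opts)

def maybe_delete_ha_network(tenant_has_ha_routers, delete_cb):
    """Tear down the HA network only when it is unused and allowed."""
    if not tenant_has_ha_routers and cfg.CONF.delete_ha_network_on_last_router:
        delete_cb()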

Tags: l3-ha
LIU Yulong (dragon889)
description: updated
tags: added: l3-ha
description: updated
description: updated
John Schwarz (jschwarz)
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
milestone: none → ocata-1
Revision history for this message
John Schwarz (jschwarz) wrote :

Adding a new configuration option is almost never temporary as deleting config options is rarely backward-compatible.

The race condition, as I understand it, is as follows:

1. Create HA router, have worker1 send 'router_updated' to agent1.
2. Delete HA router (done by worker2). worker2 will now detect that there are no more HA routers and will delete the HA network for the tenant.
3. agent1 issues a 'sync_router', which triggers auto_schedule_routers. create_ha_port_and_bind will try to create the HA port but there are no more IP addresses available, causing add_ha_port to fail as specified in the first paste.
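
A runnable toy replay of this timeline; every name below (FakeSubnet, the worker/agent functions) is a hypothetical stand-in for illustration, not Neutron code:

# Toy, deterministic replay of the race described above.
import threading

class IpAddressGenerationFailure(Exception):
    """Stand-in for neutron's exception of the same name."""

class FakeSubnet:
    def __init__(self, pool):
        self.pool = list(pool)
        self.lock = threading.Lock()

    def allocate_ip(self):
        with self.lock:
            if not self.pool:
                # IPAM cannot distinguish "subnet deleted mid-flight"
                # from "address pool exhausted": same failure either way.
                raise IpAddressGenerationFailure('no more IP addresses')
            return self.pool.pop()

ha_subnet = FakeSubnet(['169.254.192.1'])

def worker2_delete_ha_network():
    # Step 2: last HA router gone, tear the tenant HA network down.
    with ha_subnet.lock:
        ha_subnet.pool.clear()

def agent1_sync_router():
    # Step 3: sync_router -> auto_schedule_routers ->
    # create_ha_port_and_bind, which now finds nothing to allocate.
    try:
        print('HA port bound with', ha_subnet.allocate_ip())
    except IpAddressGenerationFailure as exc:
        print('add_ha_port failed:', exc)

t = threading.Thread(target=worker2_delete_ha_network)
t.start()
t.join()              # deletion wins the race in this replay
agent1_sync_router()  # -> add_ha_port failed: no more IP addresses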

Point #3 is a bit weird to me, as it looks like IPAM is reporting a "network deleted during function run" as "no more IP addresses". In addition, this should be caught by [2], which would silently retry and mask this issue.
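
For context, [2] points at a retry wrapper; a minimal sketch of that kind of decorator follows, with an illustrative attempt count, delay, and decorated function (this is not the actual neutron.common.utils code):

# Sketch of a retry-on-exception decorator; parameters are assumptions.
import functools
import time

class IpAddressGenerationFailure(Exception):
    """Stand-in for neutron's exception of the same name."""

def retry_on_failure(exc_types, attempts=3, delay=0.5):
    """Retry the wrapped call up to `attempts` times on exc_types."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exc_types:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)  # the silent retry described above
        return wrapper
    return decorator

@retry_on_failure((IpAddressGenerationFailure,))
def create_ha_port_and_bind():
    # Placeholder body; the real method lives in the L3 HA DB mixin.
    raise IpAddressGenerationFailure('HA subnet gone mid-run')

try:
    create_ha_port_and_bind()
except IpAddressGenerationFailure:
    print('still failing after 3 attempts, giving up')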

Aside from the issue that isn't clear to me, I'd like to point out that the latest stable/mitaka [1] doesn't even trigger auto_schedule_routers on sync_router (not since [3] - perhaps you're missing this backport?) - hence the trace received in the first paste can't be reproduced. For this reason, I'm closing this as Invalid. Liu, feel free to reopen if you disagree with my assessment :)

[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/api/rpc/handlers/l3_rpc.py#L79
[2]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726
[3]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4

(5860fb21e966ab8f1e011654dd477d7af35f7a27 is the latest stable/mitaka hash that github.com provided.)

Changed in neutron:
importance: High → Undecided
status: Confirmed → Invalid
milestone: ocata-1 → none
LIU Yulong (dragon889)
Changed in neutron:
status: Invalid → New
Revision history for this message
LIU Yulong (dragon889) wrote :

Hi John,
Thank you for the reply.
Regarding your point 3: the 'sync_router' patches [1], [2], [3] are all backported to our local mitaka branch. And yes, IpAddressGenerationFailure is now the problem. Maybe we need to verify against master first.

And I think the exception may be raised by the following procedure, which does not involve the L3 agent:
1. create one HA router
2. create the HA network and the HA port
3. delete the HA router
4. delete the HA network and its subnet, with the transaction not yet committed
5. a new HA router create API call arrives while the HA network has no subnets

Or:
1. create HA router1
2. create the HA network, partially, with no subnet yet
3. a new HA router2 create API call arrives
Both orderings leave the same observable state, sketched below.
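
A toy model (hypothetical names throughout) of that shared window: the HA network row is visible while no subnet row is, either because the subnet delete has not committed yet or because the subnet was never created:

# Hypothetical illustration of the "partial HA network" failure mode.
class IpAddressGenerationFailure(Exception):
    pass

ha_networks = {'ha-net-tenant1': {'subnets': []}}  # partial HA network

def create_ha_router(tenant_id):
    net = ha_networks.get('ha-net-' + tenant_id)
    if net is None:
        # Normal path: no HA network at all, so create network + subnet.
        return 'created fresh HA network'
    if not net['subnets']:
        # The race: the network exists, so it is reused, but IP
        # allocation has nothing to draw from - the failure in the paste.
        raise IpAddressGenerationFailure('ha-net-' + tenant_id)
    return 'bound HA port on ' + net['subnets'][0]

try:
    create_ha_router('tenant1')
except IpAddressGenerationFailure as exc:
    print('HA router create failed:', exc)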

Sorry for the inaccurate description; the temporary solution is applied only locally in our cloud deployment.

[1] https://review.openstack.org/#/c/349238/
[2] https://review.openstack.org/#/c/332102/
[3] https://review.openstack.org/#/c/365653/

description: updated
Revision history for this message
LIU Yulong (dragon889) wrote :

Sorry, a supplement: the exception procedure I described in comment #2 applies only to log [2].

Log [1] did indeed involve the L3 agent.

[1] http://paste.openstack.org/show/585669/
[2] http://paste.openstack.org/show/585670/

Revision history for this message
John Schwarz (jschwarz) wrote :

Looking at the log involving the server ([1] - the same one you provided in the first comment and in comment #3), and specifically lines 19 and 21, it's clear that sync_routers() is triggering auto_schedule_routers(). Before [2] removed it, the call from sync_routers() to auto_schedule_routers() was made on line 96 of neutron/api/rpc/handlers/l3_rpc.py, as can be observed from the log:

2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 96, in sync_routers
2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher self.l3plugin.auto_schedule_routers(context, host, router_ids)

In [2], it's evident that line 96 itself was removed. Thus, this can't be reproduced on master or on stable/mitaka, and there is no (upstream) bug to fix. (A rough reconstruction of the removed flow is sketched after the links below.)

[1]: http://paste.openstack.org/show/585669/
[2]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4
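
For readers following along, a rough, hypothetical reconstruction of the old control flow; the plugin is a stub and this is not verbatim from the tree:

# Hypothetical reconstruction, NOT verbatim neutron code: only the call
# that commit [2] removed is of interest here.
class StubL3Plugin:
    def auto_schedule_routers(self, context, host, router_ids):
        print('scheduling %s on %s' % (router_ids, host))

    def list_active_sync_routers_on_active_l3_agent(self, context, host,
                                                    router_ids):
        return []

l3plugin = StubL3Plugin()

def sync_routers(context, host=None, router_ids=None):
    # Old behavior (the line 96 visible in the paste's traceback): the
    # sync RPC also triggered scheduling, which could recreate HA ports
    # while another worker was deleting the tenant HA network.
    l3plugin.auto_schedule_routers(context, host, router_ids)
    # After commit 33650bf1 the call above is simply gone, so a resync
    # can no longer race with the HA network deletion.
    return l3plugin.list_active_sync_routers_on_active_l3_agent(
        context, host, router_ids)

sync_routers(context=None, host='agent1', router_ids=['router-1'])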

Changed in neutron:
status: New → Invalid
Revision history for this message
LIU Yulong (dragon889) wrote :

Thanks John. This bug is now only about the single creation-failure log, i.e. what I described in comment #2.

description: updated
Changed in neutron:
status: Invalid → New
Revision history for this message
John Schwarz (jschwarz) wrote :

I'm still not convinced there is a bug; [1] should make the relevant part of the code simply retry the request when it fails.

Liu, please provide the exact version (GitHub commit? package version?) of the code you're using.

[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726

Changed in neutron:
status: New → Incomplete
Revision history for this message
LIU Yulong (dragon889) wrote :

John, we use stable/mitaka (8.1.2), but all L3-related patches are backported. I mentioned some of the patches in comment #2, but we backported more than those.

Revision history for this message
LIU Yulong (dragon889) wrote :

Temporarily closing this; I will test it on master and reopen if it is encountered again.

Changed in neutron:
status: Incomplete → Invalid