Partial HA network causes HA router creation to fail (race condition)

Bug #1633306 reported by LIU Yulong
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

ENV: stable/mitaka, VXLAN
Neutron API: two neutron-servers behind an HAProxy VIP.

Log:
[1] http://paste.openstack.org/show/585670/

Log [1] shows that the subnet of the HA network is deleted concurrently while a new HA router create API call arrives. It seems the race condition described in https://bugs.launchpad.net/neutron/+bug/1533440 still exists; that bug's description says:

"""
Some known exceptions:
...
2. IpAddressGenerationFailure: (HA port created failed due to the
   concurrently HA subnet deletion)
...
"""

One very strange behavior: those 3 API calls share the same request-id [req-780b1f6e-2b3c-4303-a1de-a5fb4c7ea31e].

Test scenario:
Just create one HA router for a tenant, and then quickly delete it.

For now, our mitaka environment uses VXLAN as the tenant network type, so there is a very large range of VNIs and no real pressure to conserve them. As a local, temporary workaround we added a new config option that decides whether to delete the HA network every time the last HA router is removed.
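
A minimal sketch of what such an option could look like with oslo.config; the option name, default, and helper function are illustrative assumptions, not our actual downstream patch:

# Hypothetical illustration only: option name and default are assumptions.
from oslo_config import cfg

ha_opts = [
    cfg.BoolOpt('delete_ha_network_on_last_router',
                default=True,
                help='If False, keep the tenant HA network (and its VNI) '
                     'after the last HA router is deleted, avoiding the '
                     'delete/create race at the cost of one VNI per tenant.'),
]
cfg.CONF.register_opts(ha_opts)

def maybe_delete_ha_network(tenant_has_ha_routers, delete_cb):
    """Tear down the HA network only when it is unused and allowed."""
    if not tenant_has_ha_routers and cfg.CONF.delete_ha_network_on_last_router:
        delete_cb()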

Tags: l3-ha
LIU Yulong (dragon889)
description: updated
tags: added: l3-ha
description: updated
description: updated
John Schwarz (jschwarz)
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
milestone: none → ocata-1
Revision history for this message
John Schwarz (jschwarz) wrote :

Adding a new configuration option is almost never temporary as deleting config options is rarely backward-compatible.

The race condition, as I understand it, is as follows:

1. Create HA router, have worker1 send 'router_updated' to agent1.
2. Delete HA router (done by worker2). worker2 will now detect that there are no more HA routers and will delete the HA network for the tenant.
3. agent1 issues a 'sync_router', which triggers auto_schedule_routers. create_ha_port_and_bind will try to create the HA port but there are no more IP addresses available, causing add_ha_port to fail as specified in the first paste.
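
A runnable toy replay of this timeline; every name below (FakeSubnet, the worker/agent functions) is a hypothetical stand-in for illustration, not Neutron code:

# Toy, deterministic replay of the race described above.
import threading

class IpAddressGenerationFailure(Exception):
    """Stand-in for neutron's exception of the same name."""

class FakeSubnet:
    def __init__(self, pool):
        self.pool = list(pool)
        self.lock = threading.Lock()

    def allocate_ip(self):
        with self.lock:
            if not self.pool:
                # IPAM cannot distinguish "subnet deleted mid-flight"
                # from "address pool exhausted": same failure either way.
                raise IpAddressGenerationFailure('no more IP addresses')
            return self.pool.pop()

ha_subnet = FakeSubnet(['169.254.192.1'])

def worker2_delete_ha_network():
    # Step 2: last HA router gone, tear the tenant HA network down.
    with ha_subnet.lock:
        ha_subnet.pool.clear()

def agent1_sync_router():
    # Step 3: sync_router -> auto_schedule_routers ->
    # create_ha_port_and_bind, which now finds nothing to allocate.
    try:
        print('HA port bound with', ha_subnet.allocate_ip())
    except IpAddressGenerationFailure as exc:
        print('add_ha_port failed:', exc)

t = threading.Thread(target=worker2_delete_ha_network)
t.start()
t.join()              # deletion wins the race in this replay
agent1_sync_router()  # -> add_ha_port failed: no more IP addresses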

Point #3 is a bit weird to me, as it looks like IPAM is reporting a "network deleted during function run" as "no more IP addresses". In addition, this should be caught by [2], which would silently retry and mask this issue.
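
For context, [2] points at a retry wrapper; a minimal sketch of that kind of decorator follows, with an illustrative attempt count, delay, and decorated function (this is not the actual neutron.common.utils code):

# Sketch of a retry-on-exception decorator; parameters are assumptions.
import functools
import time

class IpAddressGenerationFailure(Exception):
    """Stand-in for neutron's exception of the same name."""

def retry_on_failure(exc_types, attempts=3, delay=0.5):
    """Retry the wrapped call up to `attempts` times on exc_types."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exc_types:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay)  # the silent retry described above
        return wrapper
    return decorator

@retry_on_failure((IpAddressGenerationFailure,))
def create_ha_port_and_bind():
    # Placeholder body; the real method lives in the L3 HA DB mixin.
    raise IpAddressGenerationFailure('HA subnet gone mid-run')

try:
    create_ha_port_and_bind()
except IpAddressGenerationFailure:
    print('still failing after 3 attempts, giving up')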

Aside from the issue that isn't clear to me, I'd like to point out that the latest stable/mitaka [1] doesn't even trigger auto_schedule_routers on sync_router (not since [3] - perhaps you're missing this backport?) - hence the trace received in the first paste can't be reproduced. For this reason, I'm closing this as Invalid. Liu, feel free to reopen if you disagree with my assessment :)

[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/api/rpc/handlers/l3_rpc.py#L79
[2]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726
[3]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4

(5860fb21e966ab8f1e011654dd477d7af35f7a27 is the latest stable/mitaka hash that github.com provided.)

Changed in neutron:
importance: High → Undecided
status: Confirmed → Invalid
milestone: ocata-1 → none
LIU Yulong (dragon889)
Changed in neutron:
status: Invalid → New
Revision history for this message
LIU Yulong (dragon889) wrote :

Hi John,
Thank you for the reply.
Regarding your point 3: the 'sync_router' patches [1], [2], [3] are all backported to our local mitaka branch. And yes, IpAddressGenerationFailure is now the problem. Maybe we need to verify against master first.

And I think the exception may be raised by the following procedure, which does not involve the L3 agent:
1. create one HA router
2. create the HA network and the HA port
3. delete the HA router
4. delete the HA network and its subnet, with the transaction not yet committed
5. a new HA router create API call arrives while the HA network has no subnets

Or:
1. create HA router1
2. create the HA network, partially, with no subnet yet
3. a new HA router2 create API call arrives
Both orderings leave the same observable state, sketched below.
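
A toy model (hypothetical names throughout) of that shared window: the HA network row is visible while no subnet row is, either because the subnet delete has not committed yet or because the subnet was never created:

# Hypothetical illustration of the "partial HA network" failure mode.
class IpAddressGenerationFailure(Exception):
    pass

ha_networks = {'ha-net-tenant1': {'subnets': []}}  # partial HA network

def create_ha_router(tenant_id):
    net = ha_networks.get('ha-net-' + tenant_id)
    if net is None:
        # Normal path: no HA network at all, so create network + subnet.
        return 'created fresh HA network'
    if not net['subnets']:
        # The race: the network exists, so it is reused, but IP
        # allocation has nothing to draw from - the failure in the paste.
        raise IpAddressGenerationFailure('ha-net-' + tenant_id)
    return 'bound HA port on ' + net['subnets'][0]

try:
    create_ha_router('tenant1')
except IpAddressGenerationFailure as exc:
    print('HA router create failed:', exc)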

Sorry for the inaccurate description; the temporary solution is applied only locally in our cloud deployment.

[1] https://review.openstack.org/#/c/349238/
[2] https://review.openstack.org/#/c/332102/
[3] https://review.openstack.org/#/c/365653/

description: updated
Revision history for this message
LIU Yulong (dragon889) wrote :

Sorry, a supplement: the exception procedure I described in comment #2 applies only to log [2].

Log [1] did indeed involve the L3 agent.

[1] http://paste.openstack.org/show/585669/
[2] http://paste.openstack.org/show/585670/

Revision history for this message
John Schwarz (jschwarz) wrote :

Looking at the log involving the server ([1] - the same one you provided in the first comment and in comment #3), and specifically lines 19 and 21, it's clear that sync_routers() is triggering auto_schedule_routers(). Before [2] removed it, the call from sync_routers() to auto_schedule_routers() was made on line 96 of neutron/api/rpc/handlers/l3_rpc.py, as can be observed from the log:

2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 96, in sync_routers
2016-10-09 17:03:52.366 144166 ERROR oslo_messaging.rpc.dispatcher self.l3plugin.auto_schedule_routers(context, host, router_ids)

In [2], it's evident that line 96 itself was removed. Thus, this can't be reproduced on master or on stable/mitaka, and there is no (upstream) bug to fix. (A rough reconstruction of the removed flow is sketched after the links below.)

[1]: http://paste.openstack.org/show/585669/
[2]: https://github.com/openstack/neutron/commit/33650bf1d1994a96eff993af0bfdaa62588f08a4
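
For readers following along, a rough, hypothetical reconstruction of the old control flow; the plugin is a stub and this is not verbatim from the tree:

# Hypothetical reconstruction, NOT verbatim neutron code: only the call
# that commit [2] removed is of interest here.
class StubL3Plugin:
    def auto_schedule_routers(self, context, host, router_ids):
        print('scheduling %s on %s' % (router_ids, host))

    def list_active_sync_routers_on_active_l3_agent(self, context, host,
                                                    router_ids):
        return []

l3plugin = StubL3Plugin()

def sync_routers(context, host=None, router_ids=None):
    # Old behavior (the line 96 visible in the paste's traceback): the
    # sync RPC also triggered scheduling, which could recreate HA ports
    # while another worker was deleting the tenant HA network.
    l3plugin.auto_schedule_routers(context, host, router_ids)
    # After commit 33650bf1 the call above is simply gone, so a resync
    # can no longer race with the HA network deletion.
    return l3plugin.list_active_sync_routers_on_active_l3_agent(
        context, host, router_ids)

sync_routers(context=None, host='agent1', router_ids=['router-1'])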

Changed in neutron:
status: New → Invalid
Revision history for this message
LIU Yulong (dragon889) wrote :

Thanks John. This bug is now only about the single creation-failure log, i.e. what I described in comment #2.

description: updated
Changed in neutron:
status: Invalid → New
Revision history for this message
John Schwarz (jschwarz) wrote :

I'm still not convinced there is a bug; [1] should make the relevant part of the code simply retry the request when it fails.

Liu, please provide the exact version (GitHub commit? package version?) of the code you're using.

[1]: https://github.com/openstack/neutron/blob/5860fb21e966ab8f1e011654dd477d7af35f7a27/neutron/common/utils.py#L726

Changed in neutron:
status: New → Incomplete
Revision history for this message
LIU Yulong (dragon889) wrote :

John, we use stable/mitaka (8.1.2), but all L3-related patches are backported. I mentioned some of the patches in comment #2, but we backported more than those.

Revision history for this message
LIU Yulong (dragon889) wrote :

Temporarily closing this; I will test it on master and reopen if it is encountered again.

Changed in neutron:
status: Incomplete → Invalid