[SRU] Agent is failing to process HA router if initialize() fails

Bug #1662804 reported by venkata anil on 2017-02-08
This bug affects 1 person

Affects               Importance  Assigned to
Ubuntu Cloud Archive  Undecided   Unassigned
  Mitaka              Undecided   Edward Hope-Morley
  Newton              Undecided   Edward Hope-Morley
neutron               High        venkata anil
neutron (Ubuntu)      Undecided   Unassigned
  Xenial              Undecided   Edward Hope-Morley
  Yakkety             Undecided   Edward Hope-Morley

Bug Description

[Impact]

This patch resolves, among other things, a race condition between create and delete router requests when using L3 HA. At the time of backport, the patch is already available from Ocata onwards and has been verified as sufficiently minimal and safe for backport to Newton and Mitaka. Essentially, the error case results from an incorrectly initialised router update action being executed without proper checks, and this patch fixes that.

[Test Case]

 * Deploy OpenStack Mitaka - http://pastebin.ubuntu.com/24637244/ - with neutron-l3-agent configured to provide HA (VRRP) routers.

 * Repeatedly create and delete routers in rapid succession and check that the l3 agent does not go into an infinite error loop, i.e. run http://pastebin.ubuntu.com/24634950/ and run tail -F /var/log/neutron/neutron-l3-agent.log on all l3 agent units. Also check that qrouter- namespaces are not stacking up. For Mitaka I typically hit the error after ~20 create/deletes.
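
As a rough aid for the log check in the test case, the following sketch counts how often each router id shows up in "Failed to process compatible router" errors. It assumes the log lines look like the ones quoted in this report; the function names and the threshold are hypothetical and not part of neutron.

```python
import re
from collections import Counter

# The message text is taken from the trace in this report. The router id
# is assumed to be a 36-character UUID, as in the quoted log lines.
FAILED_RE = re.compile(
    r"Failed to process compatible router '?([0-9a-f-]{36})'?"
)


def count_failed_routers(lines):
    """Return a Counter mapping router id -> number of failed attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def stuck_routers(lines, threshold=3):
    """Return sorted router ids that failed at least `threshold` times.

    The threshold is an arbitrary illustration value: a router that keeps
    failing is a hint that the agent may be stuck in the resync loop.
    """
    counts = count_failed_routers(lines)
    return sorted(rid for rid, n in counts.items() if n >= threshold)
```

To use it, pass an open file handle for /var/log/neutron/neutron-l3-agent.log to stuck_routers() on each l3 agent unit; an empty result suggests no router is failing repeatedly.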

[Regression Potential]

 * I do not envisage any regression potential from this patch.

====

When the HA router initialize() function fails for some reason (e.g. a rabbitmq restart or no ha_port), the keepalived_manager or KeepalivedInstance is not configured. In this case _process_router_if_compatible fails with an exception, and _resync_router(update) then tries to process this router again in a loop. Because initialize() is only attempted once (and failed), every retry of _process_router_if_compatible fails (no keepalived manager or instance) and the router is never configured (see the trace below).

2017-02-06 18:34:18.539 26120 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951', 'ip', '-o', 'link', 'show', 'qr-e63406e1-e7'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:101
2017-02-06 18:34:18.544 26120 DEBUG neutron.agent.linux.utils [-]
Command: ['ip', 'netns', 'exec', u'qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951', 'ip', '-o', 'link', 'show', u'qr-e63406e1-e7']
Exit code: 0
 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:156
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'get_process'
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 359, in call
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 744, in process
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info self._process_internal_ports(agent.pd)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 394, in _process_internal_ports
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info self.internal_network_added(p)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info self._disable_ipv6_addressing_on_interface(interface_name)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info if manager.get_process().active:
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info AttributeError: 'NoneType' object has no attribute 'get_process'
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '114a72fe-02ae-4b87-a2e7-70f962df0951'
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 506, in _process_router_update
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 445, in _process_router_if_compatible
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self._process_updated_router(router)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 459, in _process_updated_router
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent ri.process(self)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 377, in process
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent super(HaRouter, self).process(agent)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 362, in call
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self.logger(e)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 359, in call
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent return func(*args, **kwargs)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 744, in process
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self._process_internal_ports(agent.pd)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 394, in _process_internal_ports
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self.internal_network_added(p)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent self._disable_ipv6_addressing_on_interface(interface_name)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent if manager.get_process().active:
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_process'
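
The loop behind this trace can be illustrated with a small stand-alone model. This is only a sketch of the behaviour described above, not the neutron code: FakeHaRouter, BuggyAgent and run_resync are hypothetical names, and the failure in initialize() is simulated.

```python
class FakeHaRouter:
    """Hypothetical stand-in for the agent's HA router object."""

    def __init__(self, router_id):
        self.router_id = router_id
        # Only a successful initialize() would set this.
        self.keepalived_manager = None

    def initialize(self):
        # Simulates a failure such as a missing ha_port or a rabbitmq restart.
        raise RuntimeError("initialize failed")

    def process(self):
        # Mirrors the failing call in the trace: manager.get_process().active
        return self.keepalived_manager.get_process().active


class BuggyAgent:
    """Registers the router before initialize(), as in the failing flow."""

    def __init__(self):
        self.router_info = {}

    def process_router_if_compatible(self, router_id):
        if router_id not in self.router_info:
            ri = FakeHaRouter(router_id)
            self.router_info[router_id] = ri  # registered too early
            ri.initialize()
        # On retries the router is already known, so initialize() is skipped.
        self.router_info[router_id].process()


def run_resync(agent, router_id, attempts):
    """Retry processing and record the exception type of each attempt."""
    errors = []
    for _ in range(attempts):
        try:
            agent.process_router_if_compatible(router_id)
        except Exception as exc:
            errors.append(type(exc).__name__)
    return errors
```

In this model the first attempt fails inside initialize(), and every later attempt fails with the same AttributeError on None that appears in the trace, because the half-built router stays registered.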

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
tags: added: l3-ha

Fix proposed to branch: master
Review: https://review.openstack.org/431026

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: venkata anil (anil-venkata) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → venkata anil (anil-venkata)

Reviewed: https://review.openstack.org/431026
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e1ed94e389c427f1da56cde43a458832078f073
Submitter: Jenkins
Branch: master

commit 3e1ed94e389c427f1da56cde43a458832078f073
Author: venkata anil <email address hidden>
Date: Wed Feb 8 15:49:47 2017 +0000

    Avoid router ri.process if initialize() fails

    When router_info initialize() fails(with trace) some resources(
    like keepalived process) may not be created. While handling this
    exception, l3 agent calls _process_updated_router instead of
    again calling _process_added_router, which also fails trying to
    access resources which are not created.

    In this change, agent will have new router_info(i.e
    self.router_info[router_id] = ri) only when initialize() succeeds.
    When initialize() fails, as router_info is not part of agent,
    "_process_router_if_compatible" will again call initialize().
    We also cleanup router_info when initialize() fails.

    Closes-bug: #1662804
    Change-Id: I278ac83de57713c93d6e50846d79034d774c5d47
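
The approach in the commit message can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual patch: FakeHaRouter, FixedAgent, init_results and run_resync are hypothetical, and the outcome of each initialize() call is scripted for illustration.

```python
class FakeHaRouter:
    """Hypothetical router whose initialize() outcome is scripted."""

    def __init__(self, router_id, init_ok):
        self.router_id = router_id
        self.init_ok = init_ok
        self.keepalived_manager = None
        self.cleaned_up = False

    def initialize(self):
        if not self.init_ok:
            raise RuntimeError("initialize failed")
        self.keepalived_manager = "keepalived"  # placeholder object

    def delete(self):
        # Stands in for cleaning up partially created resources.
        self.cleaned_up = True

    def process(self):
        if self.keepalived_manager is None:
            raise AttributeError("keepalived manager missing")
        return "processed"


class FixedAgent:
    """Registers a router only after initialize() has succeeded."""

    def __init__(self, init_results):
        self.router_info = {}
        # Scripted outcome of each successive initialize() call.
        self._init_results = iter(init_results)
        self.discarded = []

    def _router_added(self, router_id):
        ri = FakeHaRouter(router_id, next(self._init_results))
        try:
            ri.initialize()
        except Exception:
            ri.delete()  # clean up, and do not register the router
            self.discarded.append(ri)
            raise
        self.router_info[router_id] = ri

    def process_router_if_compatible(self, router_id):
        if router_id not in self.router_info:
            self._router_added(router_id)  # initialize() is retried here
        return self.router_info[router_id].process()


def run_resync(agent, router_id, attempts):
    """Retry processing and record the result or exception type."""
    outcomes = []
    for _ in range(attempts):
        try:
            outcomes.append(agent.process_router_if_compatible(router_id))
        except Exception as exc:
            outcomes.append(type(exc).__name__)
    return outcomes
```

With initialize() scripted to fail twice and then succeed, the retries keep going through the "added" path until the router is fully set up, after which it is processed normally.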

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/452099
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=71c0e8940661fefbe2830258509e6c4afb887783
Submitter: Jenkins
Branch: stable/ocata

commit 71c0e8940661fefbe2830258509e6c4afb887783
Author: venkata anil <email address hidden>
Date: Wed Feb 8 15:49:47 2017 +0000

    Avoid router ri.process if initialize() fails

    When router_info initialize() fails(with trace) some resources(
    like keepalived process) may not be created. While handling this
    exception, l3 agent calls _process_updated_router instead of
    again calling _process_added_router, which also fails trying to
    access resources which are not created.

    In this change, agent will have new router_info(i.e
    self.router_info[router_id] = ri) only when initialize() succeeds.
    When initialize() fails, as router_info is not part of agent,
    "_process_router_if_compatible" will again call initialize().
    We also cleanup router_info when initialize() fails.

    Closes-bug: #1662804
    Change-Id: I278ac83de57713c93d6e50846d79034d774c5d47
    (cherry picked from commit 3e1ed94e389c427f1da56cde43a458832078f073)

tags: added: in-stable-ocata

This issue was fixed in the openstack/neutron 10.0.1 release.

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

Queueing this for SRU since it resolves issues with create/delete ha router race conditions.

Changed in neutron (Ubuntu):
status: New → Fix Released
Edward Hope-Morley (hopem) wrote :
Changed in cloud-archive:
status: New → Fix Released
summary: - Agent is failing to process HA router if initialize() fails
+ [SRU] Agent is failing to process HA router if initialize() fails
description: updated
tags: added: sts sts-sru-needed
Edward Hope-Morley (hopem) wrote :

Broken router, and l3 agent spinning in the loop, fetching router state over and over from neutron-server. I consider it a High impact bug, setting High.

Changed in neutron:
importance: Medium → High

Reviewed: https://review.openstack.org/452100
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b98267f73af5a6c6388a76a73d88e46c90f8a71e
Submitter: Jenkins
Branch: stable/newton

commit b98267f73af5a6c6388a76a73d88e46c90f8a71e
Author: venkata anil <email address hidden>
Date: Wed Feb 8 15:49:47 2017 +0000

    Avoid router ri.process if initialize() fails

    When router_info initialize() fails(with trace) some resources(
    like keepalived process) may not be created. While handling this
    exception, l3 agent calls _process_updated_router instead of
    again calling _process_added_router, which also fails trying to
    access resources which are not created.

    In this change, agent will have new router_info(i.e
    self.router_info[router_id] = ri) only when initialize() succeeds.
    When initialize() fails, as router_info is not part of agent,
    "_process_router_if_compatible" will again call initialize().
    We also cleanup router_info when initialize() fails.

    Closes-bug: #1662804
    Change-Id: I278ac83de57713c93d6e50846d79034d774c5d47
    (cherry picked from commit 3e1ed94e389c427f1da56cde43a458832078f073)
    (cherry picked from commit 71c0e8940661fefbe2830258509e6c4afb887783)

tags: added: in-stable-newton
Edward Hope-Morley (hopem) wrote :

This issue was fixed in the openstack/neutron 9.4.0 release.

Changed in neutron (Ubuntu Xenial):
assignee: nobody → Edward Hope-Morley (hopem)
Changed in neutron (Ubuntu Yakkety):
assignee: nobody → Edward Hope-Morley (hopem)
James Page (james-page) on 2017-06-07
Changed in neutron (Ubuntu Xenial):
status: New → In Progress
James Page (james-page) on 2017-06-07
Changed in neutron (Ubuntu Yakkety):
status: New → In Progress
Edward Hope-Morley (hopem) wrote :

This fix will be released as part of the upcoming Newton PR, which includes neutron 9.4.0 and is tracked in https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1696133. I'll leave this bug open until that PR is released.

Hello venkata, or anyone else affected,

Accepted neutron into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:8.4.0-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed
Łukasz Zemczak (sil2100) wrote :

Hello venkata, or anyone else affected,

Accepted neutron into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:9.4.0-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Yakkety):
status: In Progress → Fix Committed