DVR external port setup fails with KeyError: 'host'

Bug #1712412 reported by Ihar Hrachyshka
This bug affects 1 person
Affects   Status         Importance   Assigned to             Milestone
neutron   Fix Released   High         Swaminathan Vasudevan

Bug Description

http://logs.openstack.org/28/481928/9/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/8fb7c92/logs/subnode-2/screen-q-l3.txt.gz?level=TRACE

Aug 22 07:55:55.697082 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent [-] Failed to process compatible router: 02fc503f-72c9-49bd-a4f8-30305a7efb90: KeyError: 'host'
Aug 22 07:55:55.697264 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent Traceback (most recent call last):
Aug 22 07:55:55.697446 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/agent.py", line 538, in _process_router_update
Aug 22 07:55:55.697665 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
Aug 22 07:55:55.697850 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/agent.py", line 475, in _process_router_if_compatible
Aug 22 07:55:55.698048 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self._process_updated_router(router)
Aug 22 07:55:55.698229 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/agent.py", line 490, in _process_updated_router
Aug 22 07:55:55.698405 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent ri.process()
Aug 22 07:55:55.698581 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py", line 737, in process
Aug 22 07:55:55.698756 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent super(DvrLocalRouter, self).process()
Aug 22 07:55:55.698936 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_router_base.py", line 29, in process
Aug 22 07:55:55.699114 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent super(DvrRouterBase, self).process()
Aug 22 07:55:55.699289 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/ha_router.py", line 436, in process
Aug 22 07:55:55.699469 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent super(HaRouter, self).process()
Aug 22 07:55:55.699654 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/common/utils.py", line 189, in call
Aug 22 07:55:55.699903 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self.logger(e)
Aug 22 07:55:55.700084 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Aug 22 07:55:55.700260 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self.force_reraise()
Aug 22 07:55:55.700457 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Aug 22 07:55:55.700635 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
Aug 22 07:55:55.700816 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/common/utils.py", line 186, in call
Aug 22 07:55:55.700995 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent return func(*args, **kwargs)
Aug 22 07:55:55.701180 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/router_info.py", line 1120, in process
Aug 22 07:55:55.701357 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self.process_external()
Aug 22 07:55:55.701537 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py", line 560, in process_external
Aug 22 07:55:55.701713 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent super(DvrLocalRouter, self).process_external()
Aug 22 07:55:55.701886 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/router_info.py", line 895, in process_external
Aug 22 07:55:55.702070 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self._process_external_gateway(ex_gw_port)
Aug 22 07:55:55.702248 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/router_info.py", line 795, in _process_external_gateway
Aug 22 07:55:55.702422 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent self._handle_router_snat_rules(gw_port, interface_name)
Aug 22 07:55:55.702609 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_edge_router.py", line 184, in _handle_router_snat_rules
Aug 22 07:55:55.702788 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent ex_gw_port, interface_name)
Aug 22 07:55:55.702980 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py", line 514, in _handle_router_snat_rules
Aug 22 07:55:55.703155 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent ext_device_name = self.get_external_device_interface_name(ex_gw_port)
Aug 22 07:55:55.703329 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py", line 444, in get_external_device_interface_name
Aug 22 07:55:55.703503 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent if not self._get_floatingips_bound_to_host(floating_ips):
Aug 22 07:55:55.703683 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent File "/opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py", line 550, in _get_floatingips_bound_to_host
Aug 22 07:55:55.703856 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent if (i['host'] == self.host or
Aug 22 07:55:55.704042 ubuntu-xenial-2-node-rax-ord-10554784-832133 neutron-l3-agent[2539]: ERROR neutron.agent.l3.agent KeyError: 'host'

This is probably the result of https://review.openstack.org/#/c/437986/27/neutron/agent/l3/dvr_local_router.py, which we squeezed in late in Pike.

This is probably the root cause of the SSH connectivity timeouts we observe in the DVR scenario job.

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
tags: added: gate-failure l3-dvr-backlog
Changed in neutron:
milestone: none → pike-3
milestone: pike-3 → pike-rc2
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Is this seen only with multinode, or with a combination of HA and multinode?

Brian Haley (brian-haley) wrote :

Swami, is it enough to change that code from i['host'] to i.get('host') to fix this quickly?
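
A minimal sketch of the difference being asked about, assuming a made-up floating IP dict and host name (neither is taken from the logs): indexing with ['host'] raises KeyError when the key is absent, while get('host') returns None and the comparison simply evaluates to False.

```python
# Hypothetical floating IP record that carries no 'host' key.
fip = {"id": "fip-1", "floating_ip_address": "203.0.113.10"}
agent_host = "compute-1"  # hypothetical agent host name


def bound_by_index(fip, host):
    # Same lookup pattern as the failing line: raises KeyError if 'host' is missing.
    return fip["host"] == host


def bound_by_get(fip, host):
    # dict.get returns None for a missing key, so the comparison is just False.
    return fip.get("host") == host


try:
    bound_by_index(fip, agent_host)
    raised = False
except KeyError:
    raised = True
```

With get(), a floating IP that has no host binding is treated as not bound to this agent instead of aborting router processing.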

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Brian, based on my tests, this call can be removed completely.
The get_external_device_interface_name in dvr_local_router can always fetch the 'fg' device name.
Since the functions are overridden in dvr_edge_router, the external device 'qg' will be passed in automatically.

I have posted a patch upstream to test the failures. Let us see how it goes.
If it passes the tests, then I will push in a real patch.

https://review.openstack.org/#/c/496438/

Apart from this, I found in the logs that nearly 4 floatingips were created without a host binding, which is odd.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/497009

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/497985

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/497009
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=47fbc6157ac4125fa068e64b573433a02c0ce0fc
Submitter: Jenkins
Branch: master

commit 47fbc6157ac4125fa068e64b573433a02c0ce0fc
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Aug 23 21:11:58 2017 -0700

    DVR: _get_floatingips_bound_to_host throws KeyError

    _get_floatingips_bound_to_host function was introduced
    recently in dvr_local_router to retrieve the external
    interface name for centralizing the floatingip.

    This function was throwing a 'KeyError' on fip['host'] and
    is not required for centralized floatingips anymore.

    The get_external_device_interface_name in dvr_local_router
    will try to get the 'fg' interface that is required for
    the bound floating-ips to clear up some of the rules.
    In the case of the centralized unbound floating-ips, the
    'qg' external interface is retrieved from
    get_snat_external_device_interface_name that is defined
    in 'dvr_edge_router' and based on the namespace.

    So _get_floatingips_bound_to_host can be removed from
    get_external_device_interface_name.

    Closes-Bug: 1712412

    Change-Id: I94c0a071df32f572745a2c29942956c3da9f309b

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/497985
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5212c9c563e9470ce9e6abd76bdef22fa652a9b3
Submitter: Jenkins
Branch: stable/pike

commit 5212c9c563e9470ce9e6abd76bdef22fa652a9b3
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Aug 23 21:11:58 2017 -0700

    DVR: _get_floatingips_bound_to_host throws KeyError

    _get_floatingips_bound_to_host function was introduced
    recently in dvr_local_router to retrieve the external
    interface name for centralizing the floatingip.

    This function was throwing a 'KeyError' on fip['host'] and
    is not required for centralized floatingips anymore.

    The get_external_device_interface_name in dvr_local_router
    will try to get the 'fg' interface that is required for
    the bound floating-ips to clear up some of the rules.
    In the case of the centralized unbound floating-ips, the
    'qg' external interface is retrieved from
    get_snat_external_device_interface_name that is defined
    in 'dvr_edge_router' and based on the namespace.

    So _get_floatingips_bound_to_host can be removed from
    get_external_device_interface_name.

    Closes-Bug: 1712412

    Change-Id: I94c0a071df32f572745a2c29942956c3da9f309b
    (cherry picked from commit 47fbc6157ac4125fa068e64b573433a02c0ce0fc)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0rc3

This issue was fixed in the openstack/neutron 11.0.0.0rc3 release candidate.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/499585

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/500077

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/500077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=188f5c29a918e40379671d7d5dda729d3952ab64
Submitter: Jenkins
Branch: stable/pike

commit 188f5c29a918e40379671d7d5dda729d3952ab64
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 31 12:54:47 2017 +0200

    dvr: Don't raise KeyError in _get_floatingips_bound_to_host

    We thought _get_floatingips_bound_to_host is not needed but removing the
    method caused sending garps for fip that doesn't belong to node during
    the full-sync.

    This patch just replaces dict lookup with get() method, so fips are
    filtered based on presence on the host and if host is not set on fip, it
    won't raise a KeyError.

    Note: This patch hasn't been merged in master yet because of KeyError happening in grenade job now.

    Co-Authored-By: Swaminathan Vasudevan <email address hidden>

    Related-bug: #1712412
    Related-bug: #1713927

    Change-Id: I0fbc772d757fb13b788f9df8d6d7d28d288d054a
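
The filtering described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the function name, the sample data and the single 'host' comparison are hypothetical, and the second condition that follows "or" in the original method is not shown in the traceback, so it is left out here.

```python
def floatingips_bound_to_host(floating_ips, host):
    """Return the floating IPs whose 'host' field matches the given host.

    Uses dict.get so that a floating IP without a 'host' key is filtered
    out rather than raising KeyError.
    """
    return [fip for fip in floating_ips if fip.get("host") == host]


# Hypothetical sample data: one fip on this node, one elsewhere, one unbound.
sample_fips = [
    {"id": "fip-a", "host": "node-1"},
    {"id": "fip-b", "host": "node-2"},
    {"id": "fip-c"},  # no host binding at all
]

local_fips = floatingips_bound_to_host(sample_fips, "node-1")
local_ids = [fip["id"] for fip in local_fips]
```

Keeping the filter means work such as sending GARPs can be limited to the floating IPs that actually belong to the node, while unbound entries are skipped quietly.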

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/499585
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=32700292615b528a9095b99f795a290033d982d5
Submitter: Jenkins
Branch: master

commit 32700292615b528a9095b99f795a290033d982d5
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 31 12:54:47 2017 +0200

    dvr: Don't raise KeyError in _get_floatingips_bound_to_host

    We thought _get_floatingips_bound_to_host is not needed but removing the
    method caused sending garps for fip that doesn't belong to node during
    the full-sync.

    This patch just replaces dict lookup with get() method, so fips are
    filtered based on presence on the host and if host is not set on fip, it
    won't raise a KeyError.

    Co-Authored-By: Swaminathan Vasudevan <email address hidden>

    Related-bug: #1712412
    Related-bug: #1713927

    Change-Id: I0fbc772d757fb13b788f9df8d6d7d28d288d054a

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/499725
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=afd1995d91353874d04ac2d12be0162cea27c1d1
Submitter: Jenkins
Branch: master

commit afd1995d91353874d04ac2d12be0162cea27c1d1
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 31 08:50:50 2017 -0700

    DVR: Fix agent to process only floatingips that have a host match

    The agent is not currently checking for the host bound
    before configuring the floatingip. That leads to
    floatingips being configured on multiple hosts.

    This is a partial fix on the agent side to prevent
    configuring a floatingip ip that is not bound to
    this host.

    Related-Bug: #1712412
    Related-Bug: #1713927

    Change-Id: I1bc8c42425f97234f56412a2f109a996d9f896de
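
To illustrate why the missing host check leads to a floating IP being configured on several hosts, here is a small simulation. It is only a sketch: the helper names, host names and data are made up for the example and do not come from the neutron code.

```python
from collections import defaultdict


def configure_all(fips, agent_host):
    # No host check: every agent configures every floating IP it is sent.
    return [fip["id"] for fip in fips]


def configure_bound(fips, agent_host):
    # Host check: configure only floating IPs bound to this agent's host.
    return [fip["id"] for fip in fips if fip.get("host") == agent_host]


def hosts_per_fip(fips, agent_hosts, configure):
    # Map each floating IP id to the set of hosts that end up configuring it.
    placement = defaultdict(set)
    for agent_host in agent_hosts:
        for fip_id in configure(fips, agent_host):
            placement[fip_id].add(agent_host)
    return dict(placement)


# Hypothetical data: two floating IPs, each bound to a different node.
fips = [{"id": "fip-a", "host": "node-1"}, {"id": "fip-b", "host": "node-2"}]
agents = ["node-1", "node-2"]

without_check = hosts_per_fip(fips, agents, configure_all)
with_check = hosts_per_fip(fips, agents, configure_bound)
```

In this toy setup, each floating IP lands on both nodes without the check and on exactly one node with it.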

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/502493

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/502493
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2e6654067af4374443a502de8a42da361c76c6eb
Submitter: Jenkins
Branch: stable/pike

commit 2e6654067af4374443a502de8a42da361c76c6eb
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 31 08:50:50 2017 -0700

    DVR: Fix agent to process only floatingips that have a host match

    The agent is not currently checking for the host bound
    before configuring the floatingip. That leads to
    floatingips being configured on multiple hosts.

    This is a partial fix on the agent side to prevent
    configuring a floatingip ip that is not bound to
    this host.

    Related-Bug: #1712412
    Related-Bug: #1713927

    Change-Id: I1bc8c42425f97234f56412a2f109a996d9f896de
    (cherry picked from commit afd1995d91353874d04ac2d12be0162cea27c1d1)

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b1

This issue was fixed in the openstack/neutron 12.0.0.0b1 development milestone.

tags: removed: neutron-proactive-backport-potential