AttributeError when updating DvrEdgeRouter objects running on network nodes

Bug #1755243 reported by Daniel Gonzalez Nothnagel
This bug affects 5 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Daniel Gonzalez Nothnagel

Bug Description

In a configuration with L3 HA, DVR, and neutron-lbaasv2, updating a router that has a load balancer connected can fail with the following stack trace (line numbers may be slightly outdated):

Failed to process compatible router: 192c77b2-1487-4bc4-af40-26563e959989
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 543, in _process_router_update
    self._process_router_if_compatible(router)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 464, in _process_router_if_compatible
    self._process_updated_router(router)
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 480, in _process_updated_router
    router['id'], router.get(l3_constants.HA_ROUTER_STATE_KEY))
  File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 132, in check_ha_state_for_router
    if ri and current_state != TRANSLATION_MAP[ri.ha_state]:
AttributeError: 'DvrEdgeRouter' object has no attribute 'ha_state'

The issue is that in a deployment with more network nodes than 'max_l3_agents_per_router' (e.g. 6 network nodes and max_l3_agents_per_router = 3), a load balancer may be scheduled on a network node that does not have the HA instance of its router deployed on it. In that case, neutron deploys a DvrEdgeRouter on that network node to serve the LB. Every time neutron updates that router, e.g. to assign a floating IP to the LB, the update fails with the above stack trace, because the HA state check expects to find a DvrEdgeHaRouter on the network node.
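
The node arithmetic behind this can be sketched in a few lines. The node names are hypothetical, and the assumption that the HA router lands on the first three nodes is purely illustrative; this report does not say which nodes the scheduler picks.

```python
# Hypothetical node names; the counts come from the example in the report.
network_nodes = ['net-%d' % i for i in range(1, 7)]  # 6 network nodes
max_l3_agents_per_router = 3

# Assumption for illustration only: the HA router instances end up on the
# first three nodes.
ha_router_hosts = set(network_nodes[:max_l3_agents_per_router])

# A load balancer placed on any of these nodes finds no HA router instance
# there, which is the situation that triggers the DvrEdgeRouter deployment.
nodes_without_ha_router = [n for n in network_nodes
                           if n not in ha_router_hosts]
print(nodes_without_ha_router)
```

Whatever the actual placement, 6 - 3 = 3 network nodes are left without an HA instance of the router in this example.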

To decide whether it has to check the HA state of a router object, neutron evaluates the following condition:

if router.get('ha') and not is_dvr_only_agent

In our case that check is true, because the agent runs in mode 'dvr_snat', and the router is HA. But the actual router object running on the network node is of type DvrEdgeRouter and therefore has no ha_state attribute, causing the update to fail.
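
The mismatch can be reproduced outside neutron with a minimal sketch. The classes below are stand-ins that borrow neutron's class names but none of its code, and `needs_ha_check` is a hypothetical helper that only mirrors the condition quoted above. Treating 'dvr' as the only DVR-only agent mode is a simplifying assumption.

```python
# Stand-in classes: only the presence or absence of ha_state matters here.
class DvrEdgeRouter(object):
    """Plain DVR edge router stand-in without any HA state."""


class DvrEdgeHaRouter(DvrEdgeRouter):
    """HA variant stand-in that exposes ha_state."""
    ha_state = 'master'


def needs_ha_check(router, agent_mode):
    # Hypothetical helper mirroring the quoted condition.
    # Assumption: 'dvr' stands for the DVR-only agent mode.
    is_dvr_only_agent = agent_mode == 'dvr'
    return bool(router.get('ha')) and not is_dvr_only_agent


router = {'id': 'r1', 'ha': True}  # server-side view: the router is HA
ri = DvrEdgeRouter()               # object actually running on this node

decision = needs_ha_check(router, 'dvr_snat')
try:
    ri.ha_state
    error = None
except AttributeError as exc:
    error = str(exc)

print(decision, error)
```

The condition only looks at the router data and the agent mode, so it cannot tell which kind of router object is running locally.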

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/552097

Changed in neutron:
assignee: nobody → Daniel Gonzalez Nothnagel (dgonzalez)
status: New → In Progress
Revision history for this message
Brian Haley (brian-haley) wrote :

So in this case nova-compute is running on this dvr_snat node, right? Otherwise the lbaas instance would not have been scheduled there.

Changed in neutron:
importance: Undecided → High
Revision history for this message
Daniel Gonzalez Nothnagel (dgonzalez) wrote :

Hi Brian
No, nova-compute is not running on the dvr_snat node.
We are using the haproxy namespace_driver. Therefore the LBs are created as haproxy processes in their own network namespaces directly on the network node.

Revision history for this message
Daniel Marks (d3n14l) wrote :

Same here when trying to use neutron-vpn-agent / vpnaas with DVR routers:

2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent [-] Failed to process compatible router: c7c15968-1e35-407c-b455-c28087ed5fd4: AttributeError: 'DvrLocalRouter' object has no attribute 'ha_state'
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-16.0.8/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 567, in _process_router_update
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-16.0.8/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 489, in _process_router_if_compatible
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent self._process_updated_router(router)
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-16.0.8/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 506, in _process_updated_router
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent router['id'], router.get(l3_constants.HA_ROUTER_STATE_KEY))
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-16.0.8/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent return f(*args, **kwargs)
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-16.0.8/lib/python2.7/site-packages/neutron/agent/l3/ha.py", line 95, in check_ha_state_for_router
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent if ri and current_state != TRANSLATION_MAP[ri.ha_state]:
2018-03-12 07:24:36.538 30622 ERROR neutron.agent.l3.agent AttributeError: 'DvrLocalRouter' object has no attribute 'ha_state'

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/552097
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8c2dae659a806fdc20331de4b8a917ec3ae0e6f6
Submitter: Zuul
Branch: master

commit 8c2dae659a806fdc20331de4b8a917ec3ae0e6f6
Author: Daniel Gonzalez <email address hidden>
Date: Mon Mar 12 17:48:54 2018 +0100

    Fix l3-agent crash on routers without ha_state

    l3-agent checks the HA state of routers when a router is updated.
    To ensure that the HA state is only checked on HA routers the following
    check is performed: `if router.get('ha') and not is_dvr_only_agent`.
    This check should ensure that the check is only performed on
    DvrEdgeHaRouter and HaRouter objects.

    Unfortunately, there are cases where we have DvrEdgeRouter objects
    running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
    neutron-lbaas in a landscape with 6 network nodes and
    max_l3_agents_per_router set to 3, it may happen that the loadbalancer
    is placed on a network node that does not have a DvrEdgeHaRouter running
    on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
    network node to serve the loadbalancer, just like it would deploy a
    DvrEdgeRouter on a compute node when deploying a VM.

    Under such circumstances each update to the router will lead to an
    AttributeError, because the DvrEdgeRouter object does not have the
    ha_state attribute.

    This patch circumvents the issue by doing an additional check on the
    router object to ensure that it actually has the ha_state attribute.

    Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
    Closes-Bug: #1755243
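
The additional check described in the commit message could look roughly like the following. This is a sketch under assumptions, not the merged patch: `should_check_ha_state` is a hypothetical helper and the classes are empty stand-ins for the neutron router classes.

```python
class DvrEdgeRouter(object):
    """Stand-in for a router object without HA state."""


class DvrEdgeHaRouter(DvrEdgeRouter):
    """Stand-in for a router object that exposes ha_state."""
    ha_state = 'backup'


def should_check_ha_state(router, ri, is_dvr_only_agent):
    # Extra guard: the local router object must actually carry ha_state.
    has_ha_state = getattr(ri, 'ha_state', None) is not None
    return (bool(router.get('ha')) and not is_dvr_only_agent
            and has_ha_state)


router = {'id': 'r1', 'ha': True}
print(should_check_ha_state(router, DvrEdgeRouter(), False))
print(should_check_ha_state(router, DvrEdgeHaRouter(), False))
```

With such a guard, the HA state comparison is skipped for a DvrEdgeRouter on a 'dvr_snat' agent instead of raising an AttributeError.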

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/557454

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/557457

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/557457
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0ecf8b6b4a0061f76ca2ca2d917c51a5e8dd3949
Submitter: Zuul
Branch: stable/pike

commit 0ecf8b6b4a0061f76ca2ca2d917c51a5e8dd3949
Author: Daniel Gonzalez <email address hidden>
Date: Mon Mar 12 17:48:54 2018 +0100

    Fix l3-agent crash on routers without ha_state

    l3-agent checks the HA state of routers when a router is updated.
    To ensure that the HA state is only checked on HA routers the following
    check is performed: `if router.get('ha') and not is_dvr_only_agent`.
    This check should ensure that the check is only performed on
    DvrEdgeHaRouter and HaRouter objects.

    Unfortunately, there are cases where we have DvrEdgeRouter objects
    running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
    neutron-lbaas in a landscape with 6 network nodes and
    max_l3_agents_per_router set to 3, it may happen that the loadbalancer
    is placed on a network node that does not have a DvrEdgeHaRouter running
    on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
    network node to serve the loadbalancer, just like it would deploy a
    DvrEdgeRouter on a compute node when deploying a VM.

    Under such circumstances each update to the router will lead to an
    AttributeError, because the DvrEdgeRouter object does not have the
    ha_state attribute.

    This patch circumvents the issue by doing an additional check on the
    router object to ensure that it actually has the ha_state attribute.

    Closes-Bug: #1755243
    Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
    (cherry picked from commit 8c2dae659a806fdc20331de4b8a917ec3ae0e6f6)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/557454
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=da141f0859fe5d7e77cd8fbdd84b59e19ffc7d17
Submitter: Zuul
Branch: stable/queens

commit da141f0859fe5d7e77cd8fbdd84b59e19ffc7d17
Author: Daniel Gonzalez <email address hidden>
Date: Mon Mar 12 17:48:54 2018 +0100

    Fix l3-agent crash on routers without ha_state

    l3-agent checks the HA state of routers when a router is updated.
    To ensure that the HA state is only checked on HA routers the following
    check is performed: `if router.get('ha') and not is_dvr_only_agent`.
    This check should ensure that the check is only performed on
    DvrEdgeHaRouter and HaRouter objects.

    Unfortunately, there are cases where we have DvrEdgeRouter objects
    running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
    neutron-lbaas in a landscape with 6 network nodes and
    max_l3_agents_per_router set to 3, it may happen that the loadbalancer
    is placed on a network node that does not have a DvrEdgeHaRouter running
    on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
    network node to serve the loadbalancer, just like it would deploy a
    DvrEdgeRouter on a compute node when deploying a VM.

    Under such circumstances each update to the router will lead to an
    AttributeError, because the DvrEdgeRouter object does not have the
    ha_state attribute.

    This patch circumvents the issue by doing an additional check on the
    router object to ensure that it actually has the ha_state attribute.

    Closes-Bug: #1755243
    Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
    (cherry picked from commit 8c2dae659a806fdc20331de4b8a917ec3ae0e6f6)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/558963
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=922cd0a938ac9dad462c09e3483a22e634d722e9
Submitter: Zuul
Branch: master

commit 922cd0a938ac9dad462c09e3483a22e634d722e9
Author: Brian Haley <email address hidden>
Date: Wed Apr 4 17:19:41 2018 -0400

    Change ha_state property to always return a value

    Right now, ha_state could return any value that is in
    the state file, or even '' if the file is empty. Instead,
    return 'unknown' if it's empty.

    We also need to update the translation map in the HA code
    to deal with this new value to avoid a KeyError.

    Related-bug: #1755243

    Change-Id: I94a39e574cf4ff5facb76df352c14cbaba793e98
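
The idea of this related change can be illustrated with a small sketch. `read_ha_state` is a hypothetical function, not the neutron property, and the mapping values are illustrative. Treating a missing file like an empty one and stripping whitespace are assumptions beyond what the commit message states.

```python
import os
import tempfile

# Illustrative translation map; the 'unknown' entry keeps the lookup from
# raising a KeyError for the new fallback value.
TRANSLATION_MAP = {
    'master': 'active',
    'backup': 'standby',
    'fault': 'standby',
    'unknown': 'unknown',
}


def read_ha_state(path):
    # Hypothetical reader: fall back to 'unknown' for empty content.
    # Assumption: an unreadable file is handled the same way.
    try:
        with open(path) as f:
            state = f.read().strip()
    except (OSError, IOError):
        state = ''
    return state or 'unknown'


# Create an empty state file, read it, then read it again after removal.
fd, path = tempfile.mkstemp()
os.close(fd)
empty_state = read_ha_state(path)
os.remove(path)
missing_state = read_ha_state(path)

print(TRANSLATION_MAP[empty_state], TRANSLATION_MAP[missing_state])
```

Returning a fixed fallback value means callers can always look the state up in the map, provided the map knows that value.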

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.4

This issue was fixed in the openstack/neutron 11.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.2

This issue was fixed in the openstack/neutron 12.0.2 release.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/577380

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/577380
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c874cd89ac2b697fce71307e05af2c390d07173
Submitter: Zuul
Branch: stable/ocata

commit 7c874cd89ac2b697fce71307e05af2c390d07173
Author: Daniel Gonzalez <email address hidden>
Date: Mon Mar 12 17:48:54 2018 +0100

    Fix l3-agent crash on routers without ha_state

    l3-agent checks the HA state of routers when a router is updated.
    To ensure that the HA state is only checked on HA routers the following
    check is performed: `if router.get('ha') and not is_dvr_only_agent`.
    This check should ensure that the check is only performed on
    DvrEdgeHaRouter and HaRouter objects.

    Unfortunately, there are cases where we have DvrEdgeRouter objects
    running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
    neutron-lbaas in a landscape with 6 network nodes and
    max_l3_agents_per_router set to 3, it may happen that the loadbalancer
    is placed on a network node that does not have a DvrEdgeHaRouter running
    on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
    network node to serve the loadbalancer, just like it would deploy a
    DvrEdgeRouter on a compute node when deploying a VM.

    Under such circumstances each update to the router will lead to an
    AttributeError, because the DvrEdgeRouter object does not have the
    ha_state attribute.

    This patch circumvents the issue by doing an additional check on the
    router object to ensure that it actually has the ha_state attribute.

    Conflicts:
     neutron/agent/l3/agent.py
     neutron/tests/functional/agent/l3/test_dvr_router.py

    Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
    Closes-Bug: #1755243
    (cherry picked from commit 8c2dae659a806fdc20331de4b8a917ec3ae0e6f6)

tags: added: in-stable-ocata
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
