[dvr][fast-exit] incorrect policy rules get deleted when a distributed router has ports on multiple tenant networks

Bug #1759956 reported by Dmitrii Shcherbakov on 2018-03-29
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Medium
Unassigned
Pike
Medium
Unassigned
Queens
Medium
Unassigned
neutron
Medium
Dmitrii Shcherbakov
neutron (Ubuntu)
Medium
Unassigned
Artful
Medium
Unassigned
Bionic
Medium
Unassigned

Bug Description

Ubuntu SRU details
------------------
[Impact]
See Original Description below.

[Test Case]
See Original Description below.

[Regression Potential]
Low. All patches have landed upstream in corresponding stable branches.

Original Description
--------------------
TL;DR: ip -4 rule del priority <priority> table <table-id> type unicast will delete the first matching rule it encounters: if there are two rules with the same priority it will just kill the first one it finds.

The original setup is described here:
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1759918

OpenStack Queens from UCA (xenial, GA kernel, deployed via OpenStack charms), 2 external subnets (one routed provider network), 2 tenant subnets all in the same address scope to trigger "fast exit".

2 tenant networks attached (subnets 192.168.100.0/24 and 192.168.200.0/24) to a DVR:

# 2 rules as expected
ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.100.0/24 lookup 16
80000: from 192.168.200.0/24 lookup 16

# remove 192.168.200.0/24 sometimes deletes an incorrect policy rule
openstack router remove subnet pubrouter othertenantsubnet

# ip route del contains the cidr
2018-03-29 20:09:52.946 2083594 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'ne
tns', 'exec', 'fip-d0f008fc-dc45-4237-9ce0-a9e1977735eb', 'ip', '-4', 'route', 'del', '192.168.200.0/24', 'via', '169.254.93.94', 'dev', 'fpr-4f9ca9ef-3'
] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:92

# ip rule delete is not that specific
2018-03-29 20:09:53.195 2083594 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800', 'ip', '-4', 'rule', 'del', 'priority', '80000', 'table', '16', 'type', 'unicast'] create_pr
ocess /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:92

2018-03-29 20:15:59.210 2083594 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800', 'ip', '-4', 'rule', 'show'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:92
2018-03-29 20:15:59.455 2083594 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800', 'ip', '-4', 'rule', 'add', 'from', '192.168.100.0/24', 'priority', '80000', 'table', '16', 'type', 'unicast'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:92

~~~~

ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.100.0/24 lookup 16
80000: from 192.168.200.0/24 lookup 16

# try to delete a rule manually to see what is going on

ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule ; ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip -4 rule del priority 80000 table 16 type unicast ; ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.100.0/24 lookup 16
80000: from 192.168.200.0/24 lookup 16

0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.200.0/24 lookup 16

# ^^ 192.168.100.0/24 rule got deleted instead of 192.168.200.0/24

# add the rule back manually
ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule add from 192.168.100.0/24 priority 80000 table 16 type unicast

# different order now - 192.168.200.0/24 is first
ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.200.0/24 lookup 16
80000: from 192.168.100.0/24 lookup 16

# now 192.168.200.0/24 got deleted because it was first to match

ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule ; ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip -4 rule del priority 80000 table 16 type unicast ; ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.200.0/24 lookup 16
80000: from 192.168.100.0/24 lookup 16

0: from all lookup local
32766: from all lookup main
32767: from all lookup default
80000: from 192.168.100.0/24 lookup 16

Code:

_dvr_internal_network_removed
https://github.com/openstack/neutron/blob/stable/queens/neutron/agent/l3/dvr_local_router.py#L431-L443

_delete_interface_routing_rule_in_router_ns
https://github.com/openstack/neutron/blob/stable/queens/neutron/agent/l3/dvr_local_router.py#L642-L648
        ip_rule = ip_lib.IPRule(namespace=self.ns_name)
        for subnet in router_port['subnets']:
            rtr_port_cidr = subnet['cidr']
            ip_rule.rule.delete(ip=rtr_port_cidr,
                                table=dvr_fip_ns.FIP_RT_TBL,
                                priority=dvr_fip_ns.FAST_PATH_EXIT_PR)

IpRuleCommand
https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_lib.py#L486-L494

        # TODO(Carl) ip ignored in delete, okay in general?

He-he, experience shows that definitely not.

We need to use the most specific rule description to avoid ordering issues.

ip -4 rule del from 192.168.200.0/24 priority 80000 table 16 type unicast

With a fix it looks like this:

2018-03-29 20:58:57.023 192084 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800', 'ip', '-4', 'rule', 'del', 'from', '192.168.200.0/24', 'priority', '80000', 'table', '16', 'type', 'unicast'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:92

Fix proposed to branch: master
Review: https://review.openstack.org/557836

Changed in neutron:
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
status: New → In Progress
tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → Critical
importance: Critical → Medium

Reviewed: https://review.openstack.org/557836
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=81db328b2df08f2b4adcc80104cf05ad8966c019
Submitter: Zuul
Branch: master

commit 81db328b2df08f2b4adcc80104cf05ad8966c019
Author: Dmitrii Shcherbakov <email address hidden>
Date: Thu Mar 29 17:32:01 2018 -0400

    Use cidr during tenant network rule deletion

    If a distributed router has interfaces on multiple tenant networks, with
    'fast exit' functionality policy based rules are created in qrouter
    namespace for every tenant network subnet and 'from <cidr>' is included
    into an 'ip rule' command invocation.

    When a port on a tenant network is deleted 'from <cidr>' part is not
    included and a first rule matching specified parameters gets deleted.

    For example with the following layout

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
    0: from all lookup local
    32766: from all lookup main
    32767: from all lookup default
    80000: from 192.168.100.0/24 lookup 16
    80000: from 192.168.200.0/24 lookup 16

    and neutron l3 agent will use this command

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip -4 rule\
    del priority 80000 table 16 type unicast

    and 192.168.100.0/24 rule will get deleted even if you actually removed
    a port on 192.168.200.0.

    This results in an extra rule present and not cleaned up and the right
    rule removed. It is only recreated if a router is disabled and enabled
    again.

    additional changes:

    1) Floating IP rules are identified by priority only as implemented
    currently - for this reason this change adds fixed_ip to the rule
    removal code. Rule priorities are 32-bit values in iproute2 so,
    in theory, those should be not be used to cover IPv6.

    2) IP protocol information for 'from all' rules is currently
    derived from link-local address IP version. The same approach
    is preserved by using version-specific /0 addresses without
    changing the API provided by ip_lib.

    Change-Id: I0ea6dddd26e17771be223a1fbdf21792c90f3e9c
    Closes-Bug: #1759956

Changed in neutron:
status: In Progress → Fix Released
Dmitrii Shcherbakov (dmitriis) wrote :

Affects pike and queens UCA packages.

Changed in neutron (Ubuntu):
status: New → Confirmed

Reviewed: https://review.openstack.org/559256
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fb9ec1afb6545def3130952008ee7f20dbaafd2c
Submitter: Zuul
Branch: stable/queens

commit fb9ec1afb6545def3130952008ee7f20dbaafd2c
Author: Dmitrii Shcherbakov <email address hidden>
Date: Thu Mar 29 17:32:01 2018 -0400

    Use cidr during tenant network rule deletion

    If a distributed router has interfaces on multiple tenant networks, with
    'fast exit' functionality policy based rules are created in qrouter
    namespace for every tenant network subnet and 'from <cidr>' is included
    into an 'ip rule' command invocation.

    When a port on a tenant network is deleted 'from <cidr>' part is not
    included and a first rule matching specified parameters gets deleted.

    For example with the following layout

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
    0: from all lookup local
    32766: from all lookup main
    32767: from all lookup default
    80000: from 192.168.100.0/24 lookup 16
    80000: from 192.168.200.0/24 lookup 16

    and neutron l3 agent will use this command

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip -4 rule\
    del priority 80000 table 16 type unicast

    and 192.168.100.0/24 rule will get deleted even if you actually removed
    a port on 192.168.200.0.

    This results in an extra rule present and not cleaned up and the right
    rule removed. It is only recreated if a router is disabled and enabled
    again.

    additional changes:

    1) Floating IP rules are identified by priority only as implemented
    currently - for this reason this change adds fixed_ip to the rule
    removal code. Rule priorities are 32-bit values in iproute2 so,
    in theory, those should be not be used to cover IPv6.

    2) IP protocol information for 'from all' rules is currently
    derived from link-local address IP version. The same approach
    is preserved by using version-specific /0 addresses without
    changing the API provided by ip_lib.

    Change-Id: I0ea6dddd26e17771be223a1fbdf21792c90f3e9c
    Closes-Bug: #1759956
    (cherry picked from commit 81db328b2df08f2b4adcc80104cf05ad8966c019)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/559257
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0224dcfea4ffeaac6cbcec7464887e7538e49091
Submitter: Zuul
Branch: stable/pike

commit 0224dcfea4ffeaac6cbcec7464887e7538e49091
Author: Dmitrii Shcherbakov <email address hidden>
Date: Thu Mar 29 17:32:01 2018 -0400

    Use cidr during tenant network rule deletion

    If a distributed router has interfaces on multiple tenant networks, with
    'fast exit' functionality policy based rules are created in qrouter
    namespace for every tenant network subnet and 'from <cidr>' is included
    into an 'ip rule' command invocation.

    When a port on a tenant network is deleted 'from <cidr>' part is not
    included and a first rule matching specified parameters gets deleted.

    For example with the following layout

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip rule
    0: from all lookup local
    32766: from all lookup main
    32767: from all lookup default
    80000: from 192.168.100.0/24 lookup 16
    80000: from 192.168.200.0/24 lookup 16

    and neutron l3 agent will use this command

    ip netns exec qrouter-4f9ca9ef-303b-4082-abbc-e50782d9b800 ip -4 rule\
    del priority 80000 table 16 type unicast

    and 192.168.100.0/24 rule will get deleted even if you actually removed
    a port on 192.168.200.0.

    This results in an extra rule present and not cleaned up and the right
    rule removed. It is only recreated if a router is disabled and enabled
    again.

    additional changes:

    1) Floating IP rules are identified by priority only as implemented
    currently - for this reason this change adds fixed_ip to the rule
    removal code. Rule priorities are 32-bit values in iproute2 so,
    in theory, those should be not be used to cover IPv6.

    2) IP protocol information for 'from all' rules is currently
    derived from link-local address IP version. The same approach
    is preserved by using version-specific /0 addresses without
    changing the API provided by ip_lib.

    Change-Id: I0ea6dddd26e17771be223a1fbdf21792c90f3e9c
    Closes-Bug: #1759956
    (cherry picked from commit 81db328b2df08f2b4adcc80104cf05ad8966c019)

tags: added: in-stable-pike
Changed in neutron (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Changed in neutron (Ubuntu Bionic):
status: Confirmed → Triaged
importance: Undecided → Medium
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:12.0.0-0ubuntu3

---------------
neutron (2:12.0.0-0ubuntu3) bionic; urgency=medium

  * d/p/refresh-router-objects-after-port-binding.patch: Cherry-picked
    from upstream stable/queens branch (LP: #1759971).
  * d/p/use-cidr-during-tenant-network-rule-deletion.patch: Cherry-picked
    from upstream stable/queens branch (LP: #1759956).

 -- Corey Bryant <email address hidden> Mon, 16 Apr 2018 16:06:25 -0400

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Released

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

This issue was fixed in the openstack/neutron 11.0.4 release.

This issue was fixed in the openstack/neutron 12.0.2 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers