Multiple router_centralized_snat interfaces created during Heat deployment

Bug #1709774 reported by Matthew Wynne
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Invalid
Undecided
Unassigned
neutron
Fix Released
High
Swaminathan Vasudevan

Bug Description

While attempting to deploy the attached hot template I ran into a few issues:

1. Multiple router_centralized_snat interfaces are being created.
2. One router_centralized_snat interface is created, but it's Down.

When multiple interfaces are created the stack can't be deleted; I need to manually delete the additional ports that have been created before the stack can be deleted.

I'm using Newton with OVS+DVR.

I should state up front that the `depends_on` entries in the template are more of a last-ditch effort than anything else, and are likely incorrect. However, the problem still exists without them.

Revision history for this message
Matthew Wynne (mwynne) wrote :
Revision history for this message
Matthew Wynne (mwynne) wrote :

Attaching an image of the router from Horizon.

Revision history for this message
Matthew Wynne (mwynne) wrote :

I'm not sure which logs are useful for you guys. Here's the heat-engine log. The Neutron logs didn't seem to have much value in them that I could see. Let me know if you need more logs. I will also upload the updated deploy.yml file I've used to repro this with fewer components, to hopefully make triage easier.

I have also tried creating this template manually, and do not see this issue.

Revision history for this message
Matthew Wynne (mwynne) wrote :
tags: added: l3-dvr-backlog
Revision history for this message
Matthew Wynne (mwynne) wrote :

Additional Info: In one case the snat port became disassociated from the router in the neutron database, and I had to insert the port into the routerports table in order to delete it.

Without that step I received an error that the router did not have any ports connected to that subnet.

Revision history for this message
Matthew Wynne (mwynne) wrote :
Revision history for this message
Kevin Benton (kevinbenton) wrote :

From my initial findings this looks like some kind of server-side race condition in Neutron, not an issue with Heat.

Changed in neutron:
status: New → Triaged
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
importance: Undecided → High
Changed in heat:
status: New → Invalid
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Based on the logs, it seems that three ports have been created for the 2.x subnet for centralized SNAT. Only one port is ACTIVE; the remaining two are DOWN, and it's not clear why.
In this same setup we are only seeing one csnat port for the 3.x subnet.

I suspect either a user error, or that removing the router interface failed to clear the port.
But let me check.

Revision history for this message
Matthew Wynne (mwynne) wrote :

Where are you seeing the 2.x and 3.x subnets in the log? I don't believe those subnets are being used by the template I used/posted.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Your snat.png file has the 2.x and 3.x addresses for the SNAT ports. That's what I was looking at when I commented.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Looking at the neutron server logs you posted, I only see two 'router_centralized_snat' ports that are created and forwarded to the agent.

The first port is 172.16.6.6 and the next one is 172.16.5.12.

There are two router interfaces 172.16.6.1 and 172.16.5.1.

I don't see any additional router_centralized_snat port in the neutron-server log that you have posted.

Am I missing anything here?

Revision history for this message
Matthew Wynne (mwynne) wrote :

I left that stack running, and Horizon shows that there's a router_centralized_snat port on 172.16.5.6 as well as 5.12, but it's "Down". Take a look at this screenshot; hopefully that helps.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

But there is nothing in the neutron log to show that this port was created. How did we (neutron) miss that port creation? (Odd.)

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Hi Matthew,

I've looked at the server logs, and the two interfaces actually belong to different subnets on the same network. The log snippet below shows two independent interfaces for two different subnets on the same network. This matches the Heat template, which defines two subnets on the same 'mgmt_net' network.

2017-08-10 13:44:18.052 24945 DEBUG neutron.db.l3_dvr_db [req-30364b5a-5f17-4238-9422-4c83fec95878 - - - - -] Return the SNAT ports: defaultdict(<type 'list'>, {u'40846103-718c-443c-95c0-4a5f540844b7': [{'status': u'ACTIVE', 'binding:host_id': u'cloudcontrol', 'description': u'', 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'updated_at': '2017-08-10T20:44:15Z', 'device_owner': u'network:router_centralized_snat', 'revision_number': 7, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'ebf5b419-9a50-4024-b613-4d310fc82015', 'ip_address': u'172.16.6.6'}], 'id': u'5543b3a8-a6a9-438f-965a-3ddfdea40412', 'security_groups': [], 'device_id': u'40846103-718c-443c-95c0-4a5f540844b7', 'name': u'', 'admin_state_up': True, 'network_id': u'b11c7c91-aea3-4200-a343-185d5220cc4d', 'tenant_id': u'', 'binding:vif_details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'ovs', 'mac_address': u'fa:16:3e:0d:d6:3f', 'project_id': u'', 'created_at': '2017-08-10T20:43:56Z'}, {'status': u'DOWN', 'binding:host_id': u'', 'description': u'', 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'updated_at': '2017-08-10T20:44:08Z', 'device_owner': u'network:router_centralized_snat', 'revision_number': 4, 'port_security_enabled': False, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'83e49113-194a-4832-9075-0e14ff7aa0e0', 'ip_address': u'172.16.5.12'}], 'id': u'b0a8c228-79cb-40b3-84e9-52dfabab97ee', 'security_groups': [], 'device_id': u'40846103-718c-443c-95c0-4a5f540844b7', 'name': u'', 'admin_state_up': True, 'network_id': u'b11c7c91-aea3-4200-a343-185d5220cc4d', 'tenant_id': u'', 'binding:vif_details': {}, 'binding:vnic_type': u'normal', 'binding:vif_type': u'unbound', 'mac_address': u'fa:16:3e:4d:b1:63', 'project_id': u'', 'created_at': '2017-08-10T20:44:07Z'}]}) _get_snat_sync_interfaces /usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py:549

So to clarify the bug report a bit, the problem arises with a DVR router attached to two subnets on the same neutron network.

When you create the topology manually, does it result in one router interface with two IP addresses, rather than two interfaces with one IP address each?
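
For reference, here is a minimal openstacksdk sketch of the topology Kevin describes (one network, two subnets, a DVR router with an interface on each subnet). The names, CIDRs, cloud profile, and external network are placeholders and are not taken from the attached template; creating a distributed router normally requires admin credentials.

import openstack

conn = openstack.connect(cloud="devstack-admin")  # placeholder clouds.yaml entry

net = conn.network.create_network(name="mgmt_net")
subnet_a = conn.network.create_subnet(
    network_id=net.id, ip_version=4, cidr="172.16.5.0/24", name="mgmt_subnet_a")
subnet_b = conn.network.create_subnet(
    network_id=net.id, ip_version=4, cidr="172.16.6.0/24", name="mgmt_subnet_b")

# csnat ports are only created once the router has an external gateway.
ext_net = conn.network.find_network("public")  # assumed external network name
router = conn.network.create_router(
    name="mgmt_router", distributed=True,
    external_gateway_info={"network_id": ext_net.id})

# Attaching two subnets of the same network to the DVR router is the
# topology that triggers the duplicate csnat ports reported here.
conn.network.add_interface_to_router(router, subnet_id=subnet_a.id)
conn.network.add_interface_to_router(router, subnet_id=subnet_b.id)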

Revision history for this message
Matthew Wynne (mwynne) wrote :

Hey Kevin. Sorry about the confusion there. You are correct. This issue occurs with a DVR router which has interfaces for 2 subnets (1 interface for each subnet) belonging to the same neutron network.

When I manually created this topology, it resulted in 2 interfaces with 1 IP address each.

neutron port-list:

These are the 2 interfaces from manual creation.

| 7d8cbf33-1b73-4887-a65f-b8331f384be4 | | fa:16:3e:26:7f:00 | {"subnet_id": "3dfa2b95-31cb-4a47-94c5-91225d3bdb4d", "ip_address": "172.16.1.12"}
| 33ba571c-97be-4786-86b0-5803c9b95f97 | | fa:16:3e:9b:4c:0f | {"subnet_id": "cd47d896-7cda-45e5-b262-e45b148b3f88", "ip_address": "172.16.2.9"}

And here are the 2 interfaces (each from their own namespace):

287: sg-33ba571c-97: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
    link/ether fa:16:3e:9b:4c:0f brd ff:ff:ff:ff:ff:ff
    inet 172.16.2.9/24 brd 172.16.2.255 scope global sg-33ba571c-97
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe9b:4c0f/64 scope link
       valid_lft forever preferred_lft forever

288: sg-7d8cbf33-1b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
    link/ether fa:16:3e:26:7f:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.12/24 brd 172.16.1.255 scope global sg-7d8cbf33-1b
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe26:7f00/64 scope link
       valid_lft forever preferred_lft forever

^^ This looks correct to me.

If you look at the snat_2 image I uploaded you'll see that after the heat deployment there is a snat interface (172.16.5.6) which is Down. This port shows up in neutron port-list:

| 834bdfa5-ef13-481d-b493-6920a3c21289 | | fa:16:3e:ac:24:ec | {"subnet_id": "83e49113-194a-4832-9075-0e14ff7aa0e0", "ip_address": "172.16.5.6"}

However:

ip -all netns exec ip a | grep "fa:16:3e:ac:24:ec"

Returns nothing, so that interface does not appear to exist.

Hopefully this is helpful. Let me know if I can do more.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

So Matthew, are you saying that when you create the topology manually the duplicate SNAT port is not seen, but when you create it through Heat templates you are seeing this additional port in the 'DOWN' state?

Revision history for this message
Matthew Wynne (mwynne) wrote :

Yeah, Swami, that's correct.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

DEBUG oslo_db.api [None req-4bde7e76-fc09-4ec3-8363-30394de45376 admin admin] Performing DB retry for function neutron.db.l3_dvr_db._add_csnat_on_interface_create {{(pid=28901) wrapper /usr/local/lib/python2.7/dist-packages/oslo_db/api.py:152}}

This retry action is causing the additional port to be created in '_add_csnat_on_interface_create'.
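
A simplified, self-contained sketch of that failure mode (not neutron's actual code): the whole function runs under a DB-retry wrapper, so when the RouterPort update raises a retriable error after the port has already been created, the retry re-runs the function and a second, orphaned csnat port is left behind.

import functools


class RetriableDBError(Exception):
    """Stand-in for the StaleDataError hit by the RouterPort update."""


def retry_on_db_error(func, attempts=2):
    # Simplified stand-in for oslo_db's retry wrapper.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except RetriableDBError:
                if attempt == attempts - 1:
                    raise
    return wrapper


created_ports = []


def create_port(subnet_id):
    port = {"id": "port-%d" % len(created_ports), "subnet_id": subnet_id}
    created_ports.append(port)          # side effect survives the retry
    return port


_fail_once = [True]


def update_routerport_table(port):
    # The first call fails the way the StaleDataError in the logs does.
    if _fail_once[0]:
        _fail_once[0] = False
        raise RetriableDBError("0 rows matched on 'standardattributes'")


@retry_on_db_error
def add_csnat_on_interface_create(subnet_id):
    port = create_port(subnet_id)       # attempt 1 creates a csnat port...
    update_routerport_table(port)       # ...then this raises and triggers a retry
    return port


add_csnat_on_interface_create("subnet-a")
print(len(created_ports))  # 2: one usable port plus one orphaned DOWN port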

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I also see this message in the log when the retry happens:
StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

DEBUG oslo_concurrency.lockutils [-] Lock "event-dispatch" acquired by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: waited 0.000s {{(pid=28901) inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270}}
DEBUG neutron.db.api [None req-4bde7e76-fc09-4ec3-8363-30394de45376 admin admin] Retry wrapper got retriable exception: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched. {{(pid=28901) wrapped /opt/stack/neutron/neutron/db/api.py:128}}
DEBUG oslo_db.api [None req-4bde7e76-fc09-4ec3-8363-30394de45376 admin admin] Performing DB retry for function neutron.db.l3_dvr_db._add_csnat_on_interface_create {{(pid=28901) wrapper /usr/local/lib/python2.7/dist-packages/oslo_db/api.py:152}}

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

stack@ubuntu:~/devstack$ neutron port-list -- --device_owner=network:router_centralized_snat
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+------+-----------+-------------------+-----------------------------------------------------------------------------------+
| id | name | tenant_id | mac_address | fixed_ips |
+--------------------------------------+------+-----------+-------------------+-----------------------------------------------------------------------------------+
| 6df031ec-abc1-45ee-bee9-70c5ea1f1947 | | | fa:16:3e:40:51:f9 | {"subnet_id": "2f8ecc6d-0656-474d-80dd-d5b819db944d", "ip_address": "10.16.6.12"} |
| 94258ba0-46d7-477a-9f32-437241436dc7 | | | fa:16:3e:99:c8:ed | {"subnet_id": "220742b6-4a34-44d5-97cb-654fead798ac", "ip_address": "10.16.5.7"} |
| ab4a9b89-967d-474c-a958-684144ed704e | | | fa:16:3e:cd:ea:63 | {"subnet_id": "2f8ecc6d-0656-474d-80dd-d5b819db944d", "ip_address": "10.16.6.11"} |

After creating the 10.16.6.12 port, it is supposed to update the RouterPort table, but something causes that update to fail, which then triggers the retry shown below.

DEBUG neutron.scheduler.dhcp_agent_scheduler [None req-962cb19b-c845-48ae-8c0a-1b564d4d74a4 admin admin] Network 0c1c7014-ef35-4d03-b5f0-fbff5f793a0f is already hosted by enough agents. {{(pid=30818) _get_dhcp_agents_hosting_network /opt/stack/neutron/neutron/scheduler/dhcp_agent_scheduler.py:243}}
WARNING neutron.db.l3_dvr_db [None req-962cb19b-c845-48ae-8c0a-1b564d4d74a4 admin admin] ### SNAT Port created: {'status': 'DOWN', 'binding:host_id': '', 'description': '', 'allowed_address_pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'updated_at': '2017-08-16T22:05:57Z', 'device_owner': 'network:router_centralized_snat', 'revision_number': 3, 'binding:profile': {}, 'port_security_enabled': False, 'fixed_ips': [{'subnet_id': u'2f8ecc6d-0656-474d-80dd-d5b819db944d', 'ip_address': u'10.16.6.12'}], 'id': '6df031ec-abc1-45ee-bee9-70c5ea1f1947', 'security_groups': [], 'device_id': u'a53c639a-cb21-4143-835a-22813a89c8f6', 'name': '', 'admin_state_up': True, 'network_id': u'0c1c7014-ef35-4d03-b5f0-fbff5f793a0f', 'tenant_id': '', 'binding:vif_details': {}, 'binding:vnic_type': 'normal', 'binding:vif_type': 'unbound', 'mac_address': 'fa:16:3e:40:51:f9', 'project_id': '', 'created_at': '2017-08-16T22:05:57Z'}

After the port is created, a retry is triggered.

DEBUG neutron.db.api [None req-962cb19b-c845-48ae-8c0a-1b564d4d74a4 admin admin] Retry wrapper got retriable exception: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched. {{(pid=30818) wrapped /opt/stack/neutron/neutron/db/api.py:128}}

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/494376

Changed in neutron:
status: Triaged → In Progress
Revision history for this message
Matthew Wynne (mwynne) wrote :

I'm not sure if it's related to this ticket, but I've just run into a case where a csnat port was not created (no network namespace or interface) when Heat created a single router with a single subnet connected to it. I had to delete the interface and re-connect it to my subnet manually to get Neutron to create the csnat port. Perhaps Heat attempting to do many things in parallel is not playing nice with Neutron's network creation?
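
For anyone hitting the same state, a sketch of that manual workaround with openstacksdk (the router and subnet names below are placeholders): detach the subnet from the router and re-attach it so Neutron re-runs the csnat port creation.

import openstack

conn = openstack.connect(cloud="devstack-admin")        # placeholder cloud entry
router = conn.network.find_router("mgmt_router")         # placeholder names
subnet = conn.network.find_subnet("subnet_without_csnat_port")

# Detach and re-attach the subnet so Neutron recreates the missing csnat port.
conn.network.remove_interface_from_router(router, subnet_id=subnet.id)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)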

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/494376
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8c3cb2e15b48f5e2b0c3d22550f00f3c7adc7f33
Submitter: Jenkins
Branch: master

commit 8c3cb2e15b48f5e2b0c3d22550f00f3c7adc7f33
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Aug 16 21:48:08 2017 -0700

    DVR: Multiple csnat ports created when RouterPort table update fails

    When router interfaces are added to DVR router, if the router has
    gateway configured, then the internal csnat ports are created for
    the corresponding router interfaces.
    We have seen recently after the csnat port is created if the
    RouterPort table update fails, there is a DB retry that is happening
    and that retry operation is creating an additional csnat port.
    This additional port is not getting removed automatically when the
    router interfaces are deleted.
    This issue is seen when testing with a simple heat template as
    per the bug report.

    This patch fixes the issue by calling the RouterPort create with
    delete_port_on_error context.

    Change-Id: I916011f2200f02556ebb30bce30e349a8023602c
    Closes-Bug: #1709774
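
In other words, the shape of the fix is to make the csnat port creation self-cleaning: if the RouterPort write fails, the just-created port is deleted before the retry re-runs, so no orphaned DOWN port is left behind. A minimal, runnable sketch of that pattern follows; the helper and the stand-in functions are illustrative only, not neutron's actual API.

import contextlib


@contextlib.contextmanager
def delete_port_on_error(delete_port, port_id):
    # Delete the given port if the wrapped block raises, then re-raise so the
    # outer DB retry still fires, but against a clean state.
    try:
        yield
    except Exception:
        delete_port(port_id)
        raise


# Tiny in-memory stand-ins so the pattern is runnable end to end.
ports = {}


def create_port(subnet_id):
    port_id = "csnat-%d" % len(ports)
    ports[port_id] = {"subnet_id": subnet_id}
    return port_id


def delete_port(port_id):
    ports.pop(port_id, None)


def add_csnat_port(subnet_id, routerport_update):
    port_id = create_port(subnet_id)
    with delete_port_on_error(delete_port, port_id):
        routerport_update(port_id)      # may raise a retriable DB error
    return port_id


def failing_update(port_id):
    raise RuntimeError("0 rows matched, as in the StaleDataError above")


try:
    add_csnat_port("subnet-a", failing_update)
except RuntimeError:
    pass
print(ports)  # {} -> the port created before the failure was cleaned up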

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/506273

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/506273
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=33fb5917248255b759dfe7ff20e53273c9f6170c
Submitter: Zuul
Branch: stable/pike

commit 33fb5917248255b759dfe7ff20e53273c9f6170c
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Aug 16 21:48:08 2017 -0700

    DVR: Multiple csnat ports created when RouterPort table update fails

    When router interfaces are added to DVR router, if the router has
    gateway configured, then the internal csnat ports are created for
    the corresponding router interfaces.
    We have seen recently after the csnat port is created if the
    RouterPort table update fails, there is a DB retry that is happening
    and that retry operation is creating an additional csnat port.
    This additional port is not getting removed automatically when the
    router interfaces are deleted.
    This issue is seen when testing with a simple heat template as
    per the bug report.

    This patch fixes the issue by calling the RouterPort create with
    delete_port_on_error context.

    Change-Id: I916011f2200f02556ebb30bce30e349a8023602c
    Closes-Bug: #1709774
    (cherry picked from commit 8c3cb2e15b48f5e2b0c3d22550f00f3c7adc7f33)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b1

This issue was fixed in the openstack/neutron 12.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.2

This issue was fixed in the openstack/neutron 11.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/527952

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ocata)

Change abandoned by Bernard Cafarelli (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/527952
Reason: Ocata going EOL, no activity for months and failing tests as mentioned

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.opendev.org/686994

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/527952
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2028a66f21af596cd59e2a6f7d9bb05f6d5f7292
Submitter: Zuul
Branch: stable/ocata

commit 2028a66f21af596cd59e2a6f7d9bb05f6d5f7292
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Aug 16 21:48:08 2017 -0700

    DVR: Multiple csnat ports created when RouterPort table update fails

    When router interfaces are added to DVR router, if the router has
    gateway configured, then the internal csnat ports are created for
    the corresponding router interfaces.
    We have seen recently after the csnat port is created if the
    RouterPort table update fails, there is a DB retry that is happening
    and that retry operation is creating an additional csnat port.
    This additional port is not getting removed automatically when the
    router interfaces are deleted.
    This issue is seen when testing with a simple heat template as
    per the bug report.

    This patch fixes the issue by calling the RouterPort create with
    delete_port_on_error context.

    Conflicts:
            neutron/db/l3_dvr_db.py
            neutron/tests/unit/db/test_l3_dvr_db.py

    Change-Id: I916011f2200f02556ebb30bce30e349a8023602c
    Closes-Bug: #1709774
    (cherry picked from commit 8c3cb2e15b48f5e2b0c3d22550f00f3c7adc7f33)
    Signed-off-by: Mikhail Ushanov <email address hidden>

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ocata)

Change abandoned by Mikhail Ushanov (<email address hidden>) on branch: stable/ocata
Review: https://review.opendev.org/686994
Reason: The original patchset was merged, close this change.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
