Retired GRE and VXLAN tunnels persists in neutron db

Bug #1179223 reported by gregmark on 2013-05-12
This bug affects 9 people
Affects: neutron
Importance: High
Assigned to: Romil Gupta

Bug Description

Setup is multi-node, with per-tenant routers and gre or vxlan tunneling,
ovs or ML2, both affected.

SYMPTOM:

VMs are available on the external network for about 1-2 minutes, after which the connection times out and cannot be re-established unless traffic is generated from the VM console. VMs with DHCP interface settings periodically and temporarily come back online after requesting new leases.

When I attempt to ping from the external network, I can trace the traffic all the way to the tap interface on the compute node, where the VM responds to the arp request sent by the tenant router (which is on the separate network node). However, this arp reply never makes it back to the tenant router. It seems to die at the GRE terminus at bridge br-tun.

CAUSE:

* I have three NICs on my network node. The VM traffic goes out the 1st NIC on 192.168.239.99/24 to the other compute nodes, while management traffic goes out the 2nd NIC on 192.168.241.99. The 3rd NIC is external and has no IP.

* I have four GRE endpoints on the VM network, one at the network node (192.168.239.99) and three on compute nodes (192.168.239.{110,114,115}), all with IDs 2-5.

* I have a fifth GRE endpoint with id 1 to 192.168.241.99, the network node's management interface, on each of the compute nodes. This was the first tunnel created when I deployed the network node because that is how I set the remote_ip in the ovs plugin ini. I corrected the setting later, but the 192.168.241.99 endpoint persists:

mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address | id |
+-----------------+----+
| 192.168.239.110 | 3 |
| 192.168.239.114 | 4 |
| 192.168.239.115 | 5 |
| 192.168.239.99 | 2 |
| 192.168.241.99 | 1 | <======== HERE
+-----------------+----+
5 rows in set (0.00 sec)

* Thus, after plugin restarts or reboots, this endpoint is re-created every time.

* The effect is that traffic from the VM has two possible flows from which to make a routing/switching decision. I was unable to determine how this decision is made, but obviously this is not a working configuration. Traffic that originates from the VM always seems to use the correct flow initially, but traffic that originates from the network node is never returned via the right flow unless the connection has been active within the previous 1-2 minutes. In both cases, successful connections time out after 1-2 minutes of inactivity.

SOLUTION:

mysql> delete from ovs_tunnel_endpoints where id = 1;
Query OK, 1 row affected (0.00 sec)

mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address | id |
+-----------------+----+
| 192.168.239.110 | 3 |
| 192.168.239.114 | 4 |
| 192.168.239.115 | 5 |
| 192.168.239.99 | 2 |
+-----------------+----+
4 rows in set (0.00 sec)

* After doing that, I simply restarted the quantum ovs agents on the network and compute nodes. The old GRE tunnel is not re-created. Thereafter, VM network traffic to and from the external network proceeds without incident.

* Should these tables be cleaned up as well, I wonder:

mysql> select * from ovs_network_bindings;
+--------------------------------------+--------------+------------------+-----------------+
| network_id | network_type | physical_network | segmentation_id |
+--------------------------------------+--------------+------------------+-----------------+
| 4e8aacca-8b38-40ac-a628-18cac3168fe6 | gre | NULL | 2 |
| af224f3f-8de6-4e0d-b043-6bcd5cb014c5 | gre | NULL | 1 |
+--------------------------------------+--------------+------------------+-----------------+
2 rows in set (0.00 sec)

mysql> select * from ovs_tunnel_allocations where allocated != 0;
+-----------+-----------+
| tunnel_id | allocated |
+-----------+-----------+
| 1 | 1 |
| 2 | 1 |
+-----------+-----------+
2 rows in set (0.00 sec)
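
The manual cleanup described under SOLUTION can be sketched in Python as a small audit step. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the real ovs_tunnel_endpoints table, the set of live tunnel IPs is supplied by hand by the operator, and the helper names find_stale_endpoints and delete_endpoints are hypothetical, not part of quantum/neutron.

```python
import sqlite3

# Hypothetical stand-in for the quantum database: an in-memory SQLite table
# with the same two columns as ovs_tunnel_endpoints in the report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ovs_tunnel_endpoints (ip_address TEXT PRIMARY KEY, id INTEGER)"
)
conn.executemany(
    "INSERT INTO ovs_tunnel_endpoints VALUES (?, ?)",
    [
        ("192.168.239.110", 3),
        ("192.168.239.114", 4),
        ("192.168.239.115", 5),
        ("192.168.239.99", 2),
        ("192.168.241.99", 1),  # the stale management-network endpoint
    ],
)

# Assumption: the operator knows which tunnel IP each live agent is configured with.
live_ips = {
    "192.168.239.99",
    "192.168.239.110",
    "192.168.239.114",
    "192.168.239.115",
}


def find_stale_endpoints(conn, live_ips):
    """Return the (ip_address, id) rows whose IP is not used by any live agent."""
    rows = conn.execute(
        "SELECT ip_address, id FROM ovs_tunnel_endpoints ORDER BY id"
    ).fetchall()
    return [row for row in rows if row[0] not in live_ips]


def delete_endpoints(conn, stale_rows):
    """Delete the given rows one by one and return how many were removed."""
    removed = 0
    for ip, _ in stale_rows:
        cur = conn.execute(
            "DELETE FROM ovs_tunnel_endpoints WHERE ip_address = ?", (ip,)
        )
        removed += cur.rowcount
    conn.commit()
    return removed
```

Listing the stale rows before deleting them keeps the destructive step explicit; as in the report, the agents would still need a restart afterwards so that the old tunnel is not re-created.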

Jiajun Liu (ljjjustin) on 2013-05-13
summary: - Retired GRE tunnels spersists in quantum db
+ Retired GRE tunnels persists in quantum db
Changed in quantum:
assignee: nobody → Jiajun Liu (ljjjustin)
tags: added: ovs
Jiajun Liu (ljjjustin) on 2013-07-19
Changed in neutron:
assignee: Jiajun Liu (ljjjustin) → nobody

In my case, this is what I have installed at the moment:

openvswitch-switch | 1.4.0-1ubuntu1.5 | http://gb.archive.ubuntu.com/ubuntu/ precise-updates/universe amd64 Packages
openvswitch-switch | 1.4.0-1ubuntu1 | http://gb.archive.ubuntu.com/ubuntu/ precise/universe amd64 Packages
openvswitch | 1.4.0-1ubuntu1 | http://gb.archive.ubuntu.com/ubuntu/ precise/universe Sources
openvswitch | 1.4.0-1ubuntu1.5 | http://gb.archive.ubuntu.com/ubuntu/ precise-updates/universe Sources

I have 2 compute nodes, 1 network node, and 1 controller:

The VMs can ping each other, I can SSH in from outside, and from inside a VM I can ping Google and the BBC. However, I can't run apt-get update from, for example, a VM with the Ubuntu cloud image, and I can't browse the web from a VM instance running Ubuntu desktop.
So I think it is a problem with DNS.

grep dns /var/log/syslog
loki is my network node.
From the network node:
infinity Sep 24 14:32:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:32:43 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:32:43 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:33:03 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a
Sep 24 14:33:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:33:37 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:33:37 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:34:03 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a
Sep 24 14:34:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:34:21 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:34:21 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:35:03 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a
Sep 24 14:35:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:35:05 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:35:05 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:35:51 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:35:51 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:36:03 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a
Sep 24 14:36:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:36:39 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf
Sep 24 14:36:39 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.4 fa:16:3e:cb:f7:cf 50-50-1-4
Sep 24 14:37:03 loki dnsmasq-dhcp[19298]: DHCPREQUEST(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a
Sep 24 14:37:03 loki dnsmasq-dhcp[19298]: DHCPACK(tap689b75b7-f5) 50.50.1.3 fa:16:3e:32:fa:5a 50-50-1-3
Sep 24 14:37:29 loki dnsmasq-dhcp[192...

Mark McClain (markmcclain) wrote :

Is this still a problem?

Changed in neutron:
assignee: nobody → Kyle Mestery (mestery)
status: New → Incomplete
Loic Dachary (dachary) wrote :

I'm experiencing the same problem (leftover GRE tunnels) on Havana.

Definitely still a problem in Grizzly.

On Wed, Dec 4, 2013 at 6:49 AM, Loic Dachary <email address hidden> wrote:

> I'm experiencing the same problem (leftover GRE tunnels) on Havana.

--
\*..+.-
--Greg Chavez
+//..;};

Pengfei Zhang (eaglezpf) on 2014-03-12
Changed in neutron:
assignee: Kyle Mestery (mestery) → Pengfei Zhang (eaglezpf)

I'm using Icehouse on Trusty and it's still an issue.

Miguel Angel Ajo (mangelajo) wrote :

Still an issue in Icehouse, confirmed:

1) Starting with a network node + compute node

MariaDB [neutron]> select * from ml2_vxlan_endpoints;
+----------------+----------+
| ip_address | udp_port |
+----------------+----------+
| 192.168.111.27 | 4789 |
| 192.168.111.8 | 4789 |
+----------------+----------+
2 rows in set (0.00 sec)

2) Adding a 2nd compute node at 192.168.111.29

MariaDB [neutron]> select * from ml2_vxlan_endpoints;
+----------------+----------+
| ip_address | udp_port |
+----------------+----------+
| 192.168.111.27 | 4789 |
| 192.168.111.29 | 4789 |
| 192.168.111.8 | 4789 |
+----------------+----------+
3 rows in set (0.00 sec)

3) Stopping the openvswitch-agent at compute "2"

[root@compute02]# service neutron-openvswitch-agent stop

4) Deleting the compute02 agent from neutron

[root@controller ~(openstack_admin)]# neutron agent-delete f6833f2e-6e98-45f8-ad38-2ab3ca7201b2

5) The endpoint is not gone

MariaDB [neutron]> select * from ml2_vxlan_endpoints;
+----------------+----------+
| ip_address | udp_port |
+----------------+----------+
| 192.168.111.27 | 4789 |
| 192.168.111.29 | 4789 |
| 192.168.111.8 | 4789 |
+----------------+----------+
3 rows in set (0.00 sec)
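
The check in steps 1 to 5 can be expressed as a small comparison between the endpoint table and the agents that still exist. The sketch below is illustrative only: the sample data is hypothetical, the host names node-a and node-b are made up, and it assumes each remaining agent record carries a JSON "configurations" blob with a "tunneling_ip" key. Treat that key as an assumption to verify against your own deployment.

```python
import json

# Endpoint IPs as listed in ml2_vxlan_endpoints at step 5 of the report.
endpoints = ["192.168.111.27", "192.168.111.29", "192.168.111.8"]

# Hypothetical agent records left after "neutron agent-delete" of compute02.
# Assumption: "configurations" is a JSON string containing "tunneling_ip".
agents = [
    {"host": "node-a", "configurations": json.dumps({"tunneling_ip": "192.168.111.27"})},
    {"host": "node-b", "configurations": json.dumps({"tunneling_ip": "192.168.111.8"})},
]


def leftover_endpoints(endpoints, agents):
    """Return endpoint IPs that no remaining agent reports as its tunnel IP."""
    live = set()
    for agent in agents:
        config = json.loads(agent["configurations"])
        ip = config.get("tunneling_ip")
        if ip:
            live.add(ip)
    return [ip for ip in endpoints if ip not in live]
```

With the sample data, the only address not backed by an agent is 192.168.111.29, which is the row that the report shows surviving the agent deletion.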

Changed in neutron:
status: Incomplete → Confirmed
summary: - Retired GRE tunnels persists in quantum db
+ Retired GRE and VXLAN tunnels persists in neutron db
description: updated
Changed in neutron:
importance: Undecided → Medium
Shiv Haris (shh) on 2014-06-25
Changed in neutron:
milestone: none → juno-2
Kyle Mestery (mestery) on 2014-07-22
Changed in neutron:
milestone: juno-2 → juno-3
Romil Gupta (romilg) wrote :

Hi Pengfei Zhang,

This defect looks interesting to me, and I have also run into this issue.
Are you working on this defect? Do you have a patch?
I am interested in working on it. Shall I assign it to myself and work on it?
Please reply.

You just sent this e-mail to me, not Pengfei Zhang. I reported the problem
a long time ago, but I assumed the issue was NOT interesting to anyone
because it was closed. Our cloud has performed well for over a year without
a fix, however.

--Greg

On Fri, Aug 22, 2014 at 9:21 AM, Romil Gupta <email address hidden> wrote:

> Hi Pengfei Zhang,
>
> This defect seems to be interesting to me and I have also faced this issue.
> Are you working on this defect ? Do you have any patch ?
> I am interested in working on it. Shall I assign it to myself and work on
> it?
> Please reply.....

--
\*..+.-
--Greg Chavez
+//..;};

Romil Gupta (romilg) on 2014-08-25
Changed in neutron:
assignee: Pengfei Zhang (eaglezpf) → Romil Gupta (romilg)
Thierry Carrez (ttx) on 2014-09-03
Changed in neutron:
milestone: juno-3 → juno-rc1
Romil Gupta (romilg) wrote :

Hi all,

I have completed working POC code that includes changes to agent_db.py, type_tunnel.py, and ovs_neutron_agent.py (in short, I have introduced a tunnel_delete RPC that deletes the stale tunnels). I will post it for review tomorrow. I have also started looking at the lb agent changes.

Fix proposed to branch: master
Review: https://review.openstack.org/121000

Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery) on 2014-09-17
Changed in neutron:
milestone: juno-rc1 → kilo-1
importance: Medium → High
milestone: kilo-1 → juno-rc1
Romil Gupta (romilg) wrote :

Investigating a new design.

Kyle Mestery (mestery) on 2014-09-27
Changed in neutron:
milestone: juno-rc1 → none
Romil Gupta (romilg) wrote :

Requesting that a core member set the milestone.

Robert Kukura (rkukura) on 2014-10-08
Changed in neutron:
milestone: none → kilo-1
Romil Gupta (romilg) on 2014-10-25
tags: added: ml2
removed: ovs
Romil Gupta (romilg) on 2014-10-25
tags: added: ovs

Fix proposed to branch: master
Review: https://review.openstack.org/136106

Kyle Mestery (mestery) on 2014-12-16
Changed in neutron:
milestone: kilo-1 → kilo-2

Reviewed: https://review.openstack.org/121000
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3db0e619c83892a7aab61807969205253833ff8d
Submitter: Jenkins
Branch: master

commit 3db0e619c83892a7aab61807969205253833ff8d
Author: Romil Gupta <email address hidden>
Date: Thu Sep 11 23:26:57 2014 -0700

    Stale VXLAN & GRE tunnel endpoint deletion from DB

    Description:
    Stale GRE and VXLAN tunnel endpoints persists in neutron db this should be
    deleted from the database. Also, if local_ip of L2 agent changes the
    stale tunnel ports and flows persists on br-tun on other Compute Nodes and
    Network Nodes for that remote ip this should also be removed.

    Implementation

    Plugin changes:
    Added host column in 'ml2_vxlan_endpoints' and 'ml2_gre_endpoints' table.
    Added delete_endpoint method for deleting the stale endpoints from db.
    Modified tunnel_sync() method to accommodate these changes.
    Modified testcases in test_type_vxlan.py
    Modified testcases in test_type_gre.py

    Agent changes:
    Added tunnel_delete rpc for removing stale ports and flows from br-tun.
    tunnel_sync rpc signature upgrade to obtain 'host'.
    Added testcases for TunnelRpcCallbackMixin().

    This patch-set only deals with plugin side changes.

    Partial-Bug: #1179223

    Change-Id: I75c6581fcc9f47a68bde29cbefcaa1a2a082344e
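
The plugin-side idea in the commit message above (a host column plus delete_endpoint, used from tunnel_sync) can be illustrated with a simplified in-memory model. This is a sketch under assumptions, not the merged code: EndpointTable, EndpointConflict, host_for, and ip_for are hypothetical names, the table is a plain dict rather than ml2_gre_endpoints or ml2_vxlan_endpoints, and the real change may differ in detail.

```python
class EndpointConflict(Exception):
    """Raised when a tunnel IP is already registered under a different host."""


class EndpointTable:
    """In-memory stand-in for an endpoints table that has a host column."""

    def __init__(self):
        # ip_address -> host (None models a row created before the host column)
        self._by_ip = {}

    def __contains__(self, ip):
        return ip in self._by_ip

    def add_endpoint(self, ip, host):
        self._by_ip[ip] = host

    def delete_endpoint(self, ip):
        self._by_ip.pop(ip, None)

    def host_for(self, ip):
        return self._by_ip.get(ip)

    def ip_for(self, host):
        for ip, known_host in self._by_ip.items():
            if known_host == host:
                return ip
        return None

    def ips(self):
        return sorted(self._by_ip)


def tunnel_sync(table, tunnel_ip, host):
    """Register (tunnel_ip, host); return the stale IPs removed for that host."""
    removed = []
    ip_known = tunnel_ip in table
    ip_host = table.host_for(tunnel_ip)
    host_ip = table.ip_for(host)

    if ip_known and ip_host is None and host_ip is None:
        # Legacy row without a host: drop it so it is re-added with the host.
        table.delete_endpoint(tunnel_ip)
    elif ip_known and ip_host != host:
        # The same IP is claimed by another host: refuse rather than guess.
        raise EndpointConflict(f"{tunnel_ip} already belongs to {ip_host}")
    elif host_ip is not None and host_ip != tunnel_ip:
        # The host changed its local_ip: the old endpoint is stale.
        table.delete_endpoint(host_ip)
        removed.append(host_ip)

    table.add_endpoint(tunnel_ip, host)
    return removed
```

Applied to the original report, a network node that first syncs with 192.168.241.99 and later with 192.168.239.99 would have the first address returned as stale and dropped from the table, which is the cleanup that previously had to be done by hand in MySQL.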

Shiv Haris (shh) wrote :

Romil, Can you please suggest the next steps to bring this to closure. Thanks.

Romil Gupta (romilg) wrote :

This https://review.openstack.org/136106 will fix this bug completely.

Kyle Mestery (mestery) on 2015-02-04
Changed in neutron:
milestone: kilo-2 → kilo-3

Reviewed: https://review.openstack.org/136106
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=67c4c6d809e4c9e112d9fb848b5bdce9d5cd04ac
Submitter: Jenkins
Branch: master

commit 67c4c6d809e4c9e112d9fb848b5bdce9d5cd04ac
Author: Romil Gupta <email address hidden>
Date: Thu Nov 20 11:32:07 2014 -0800

    Stale VXLAN and GRE tunnel port/flow deletion

    Description:
    Stale GRE and VXLAN tunnel endpoints persists in neutron db this should be
    deleted from the database. Also, if local_ip of L2 agent changes the
    stale tunnel ports and flows persists on br-tun on other Compute Nodes and
    Network Nodes for that remote ip this should also be removed.

    Implementation

    Plugin changes:
    The plugin side changes are covered in following patch-set
    https://review.openstack.org/#/c/121000/.

    Agent changes:
    Added tunnel_delete rpc for removing stale ports and flows from br-tun.
    tunnel_sync rpc signature upgrade to obtain 'host'.
    Added testcases for TunnelRpcCallbackMixin().

    This patch-set agent deals with agent side changes.

    Closes-Bug: #1179223
    Closes-Bug: #1381071
    Closes-Bug: #1276629

    Co-Authored-By: Aman Kumar <email address hidden>
    Co-Authored-By: phanipawan <email address hidden>

    Change-Id: I291992ffde5c3ab7152f0d7462deca2e4ac1ba3f
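
On the agent side, the commit message above describes a tunnel_delete RPC that removes stale ports and flows from br-tun. A minimal model of that behaviour, assuming a hypothetical TunnelBridgeModel class that tracks ports and flows in memory instead of driving Open vSwitch, might look like this:

```python
class TunnelBridgeModel:
    """In-memory model of br-tun on one agent (hypothetical, for illustration)."""

    def __init__(self, local_ip):
        self.local_ip = local_ip
        self.ports = {}  # remote_ip -> ofport
        self.flows = []  # list of (in_port, description) tuples
        self._next_ofport = 1

    def tunnel_update(self, remote_ip):
        """Create a tunnel port and an ingress flow for a new remote endpoint."""
        if remote_ip == self.local_ip or remote_ip in self.ports:
            return
        ofport = self._next_ofport
        self._next_ofport += 1
        self.ports[remote_ip] = ofport
        self.flows.append((ofport, f"from {remote_ip}"))

    def tunnel_delete(self, remote_ip):
        """Remove the port and its flows; return True if anything was removed."""
        ofport = self.ports.pop(remote_ip, None)
        if ofport is None:
            return False
        # Drop every flow that matched on the deleted port. A real agent would
        # issue the equivalent Open vSwitch operations against br-tun instead.
        self.flows = [flow for flow in self.flows if flow[0] != ofport]
        return True
```

Returning False for an unknown remote IP makes the handler safe to call more than once, which matters if the same notification reaches an agent repeatedly.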

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2015-03-19
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in neutron:
milestone: kilo-3 → 2015.1.0