Comment 0 for bug 1179223

gregmark (greg-chavez) wrote: Retired GRE tunnels persist in quantum db

This is Grizzly on Ubuntu 13.04 (1:2013.1-0ubuntu2). The setup is multi-node, with per-tenant routers and GRE tunneling.

SYMPTOM:

VMs are reachable from the external network for about 1-2 minutes, after which the connection times out and cannot be re-established unless traffic is generated from the VM console. VMs with DHCP-configured interfaces periodically and temporarily come back online after requesting new leases.

When I attempt to ping from the external network, I can trace the traffic all the way to the tap interface on the compute node, where the VM responds to the ARP request sent by the tenant router (which is on a separate machine, the network node). However, the ARP reply never makes it back to the tenant router; it appears to die at the GRE terminus on bridge br-tun.
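
For anyone retracing this, roughly the following is enough to watch the reply disappear; the tap and NIC names here are placeholders, not the actual ports from my nodes:

# on the compute node: the VM's ARP reply shows up on its tap interface
tcpdump -nei tapXXXXXXXX-XX arp

# on the network node: nothing comes back over the GRE transport NIC,
# and br-tun's flows are where the reply seems to get lost
tcpdump -nei eth0 'ip proto 47'
ovs-ofctl dump-flows br-tun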

CAUSE:

* I have three NICs on my network node. VM traffic goes out the first NIC on 192.168.239.99/24 to the compute nodes, management traffic goes out the second NIC on 192.168.241.99, and the third NIC is external and has no IP.

* I have four GRE endpoints on the VM network, one at the network node (192.168.239.99) and three on the compute nodes (192.168.239.{110,114,115}), with IDs 2 through 5.

* There is a fifth GRE endpoint, with id 1, pointing at 192.168.241.99, the network node's management interface; each compute node builds a tunnel to it. This was the first tunnel created when I deployed the network node, because that is the address I had originally put in the ovs plugin ini (the setting is sketched at the end of this section). I corrected the setting later, but the 192.168.241.99 endpoint persists:

mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address      | id |
+-----------------+----+
| 192.168.239.110 |  3 |
| 192.168.239.114 |  4 |
| 192.168.239.115 |  5 |
| 192.168.239.99  |  2 |
| 192.168.241.99  |  1 |  <======== HERE
+-----------------+----+
5 rows in set (0.00 sec)

* Thus, after every plugin restart or reboot, the tunnel to this stale endpoint is re-created.

* The effect is that traffic from the VM has two possible flows from which to make a routing/switching decision. I was unable to determine how this decision is made, but this is obviously not a working configuration. Traffic that originates from the VM always seems to take the correct flow initially, but traffic that originates from the network node is never returned via the right flow unless the connection has been active within the previous 1-2 minutes. In both cases, successful connections time out after 1-2 minutes of inactivity.
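
* For reference (as noted above), this is roughly what the corrected tunnel address looks like in the OVS plugin config on the network node, /etc/quantum/plugins/openvswitch/ovs_quantum_plugin.ini in this packaging; the key is local_ip under [OVS], if I have the name right:

[OVS]
enable_tunneling = True
# this was 192.168.241.99 (the management NIC) when I first deployed the
# network node, which is what created endpoint id 1; fixing it here does
# not remove the old row from ovs_tunnel_endpoints
local_ip = 192.168.239.99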

SOLUTION:

mysql> delete from ovs_tunnel_endpoints where id = 1;
Query OK, 1 row affected (0.00 sec)

mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address      | id |
+-----------------+----+
| 192.168.239.110 |  3 |
| 192.168.239.114 |  4 |
| 192.168.239.115 |  5 |
| 192.168.239.99  |  2 |
+-----------------+----+
4 rows in set (0.00 sec)

* After deleting the row, I simply restarted the quantum OVS agents on the network and compute nodes. The old GRE tunnel was not re-created, and VM traffic to and from the external network has since flowed without incident.
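
Concretely, something along these lines on the network node and each compute node; the service name is the one from the Ubuntu packaging, if I recall it correctly:

service quantum-plugin-openvswitch-agent restart

# confirm that no port on br-tun still points at the old management address
ovs-vsctl show | grep 192.168.241.99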

* I wonder whether these tables should be cleaned up as well:

mysql> select * from ovs_network_bindings;
+--------------------------------------+--------------+------------------+-----------------+
| network_id                           | network_type | physical_network | segmentation_id |
+--------------------------------------+--------------+------------------+-----------------+
| 4e8aacca-8b38-40ac-a628-18cac3168fe6 | gre          | NULL             |               2 |
| af224f3f-8de6-4e0d-b043-6bcd5cb014c5 | gre          | NULL             |               1 |
+--------------------------------------+--------------+------------------+-----------------+
2 rows in set (0.00 sec)

mysql> select * from ovs_tunnel_allocations where allocated != 0;
+-----------+-----------+
| tunnel_id | allocated |
+-----------+-----------+
|         1 |         1 |
|         2 |         1 |
+-----------+-----------+
2 rows in set (0.00 sec)
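
The tunnel_id values there appear to line up with the segmentation_id values of the two gre networks rather than with the endpoint ids, so they may not need any cleanup at all. A quick way to check the correspondence (this is just an ad-hoc query, nothing the plugin itself uses):

mysql> select b.network_id, b.segmentation_id, a.allocated
    -> from ovs_network_bindings b
    -> left join ovs_tunnel_allocations a on a.tunnel_id = b.segmentation_id;

If every binding comes back with allocated = 1 and nothing else in ovs_tunnel_allocations is marked allocated, those tables are at least consistent with each other.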