Inconsistent flows with DVR l2pop VxLAN on br-tun

Bug #1794991 reported by Gaëtan Trellu
This bug affects 16 people

Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: LIU Yulong

Bug Description

We are using Neutron (Pike) configured as DVR with l2pop, ARP responder and VXLAN. For a few weeks we have been experiencing unexpected behaviors:

- [1] Some instances are not able to get a DHCP address
- [2] Instances are not able to ping other instances on a different compute node

This is totally random: sometimes it works as expected and sometimes we see the behaviors described above.

After checking the flows between the network and compute nodes, we discovered that behavior [1] is due to missing flows on the compute nodes pointing to the DHCP agents on the network nodes.

Behavior [2] is also related to missing flows: some compute nodes are missing outputs to other compute nodes (vxlan-xxxxxx), which prevents an instance on compute 1 from communicating with an instance on compute 2.

Adding the missing flows for [1] and [2] manually fixes the issues, but if we restart neutron-openvswitch-agent the flows are missing again.

For [1], sometimes simply disabling/enabling the ports related to each DHCP agent on the network nodes solves the problem, and sometimes it does not.

For [2], the only way we found to fix the flows without adding them manually is to remove all instances of a network from the compute node and create a new instance on that network, which sends a notification message to all compute and network nodes. But again, when neutron-openvswitch-agent restarts, the flows vanish.

We cherry-picked these commits but nothing changed:
  - https://review.openstack.org/#/c/600151/
  - https://review.openstack.org/#/c/573785/

Information about our deployment:
  - OS: Ubuntu 16.04.5
  - Deployer: Kolla
  - Docker: 18.06
  - OpenStack: Pike/Rocky

Any ideas ?

summary: - Inconsistent flows with DVR l2pop on br-tun
+ Inconsistent flows with DVR l2pop VxLAN on br-tun
tags: added: l3-dvr-backlog
tags: added: l3-ipam-dhcp
Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Example of an inconsistent flow across compute nodes having an instance on the same network:

cookie=0xf305729cf4d16a1e, table=22, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e"

A correct flow should be:

cookie=0xf305729cf4d16a1e, table=22, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e",output:"vxlan-0a830843",output:"vxlan-0a830844"
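
For reference, here is a minimal sketch of how such a truncated flood entry can be patched by hand until the agent repopulates it (assumptions: the kolla openvswitch_vswitchd container and the port names shown above; ovs-ofctl mod-flows replaces the actions of every flow matching the given fields):

# inspect the current flood entry for this local VLAN
docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun table=22,dl_vlan=8

# rewrite it with the full list of tunnel outputs
docker exec -ti openvswitch_vswitchd ovs-ofctl mod-flows br-tun 'table=22,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e",output:"vxlan-0a830843",output:"vxlan-0a830844"'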

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried with both the native and ovs-ofctl interfaces, same behavior.
Where does Neutron get the flows to add/update/remove from? The ports table?

Revision history for this message
Manuel Rodriguez (manuel.rodriguez) wrote :

Just adding more information about issue [1] that Gaëtan described in the initial comment:

These are the flows listed for a VXLAN network with segmentation ID 97 (0x61 in hexadecimal) on one of our compute nodes:

# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=500977.143s, table=4, n_packets=427, n_bytes=41999, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)

As you can see, there is only one flow, in table 4. There were already two instances of that network running on the compute node, one created a few days ago and another a few minutes ago; the latter never got a DHCP IP because there were no flows to reach the DHCP namespaces running on the three controllers. To fix this we have found two workarounds:

1. Add the flows manually
2. Modify the network so it will trigger a flow addition across computes running instances from that network.

We tried the second option by adding a second subnet to the network, and we could see log events about update_fdb_entries for the DHCP agents. The flows now look like this, even after deleting the second subnet:

# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=501247.247s, table=4, n_packets=436, n_bytes=44143, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
 cookie=0xde6f920d0d405dbc, duration=15.405s, table=20, n_packets=1, n_bytes=42, priority=2,dl_vlan=8,dl_dst=fa:16:3e:4c:67:28 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083e"
 cookie=0xde6f920d0d405dbc, duration=14.697s, table=20, n_packets=2, n_bytes=84, priority=2,dl_vlan=8,dl_dst=fa:16:3e:5b:63:e3 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d"
 cookie=0xde6f920d0d405dbc, duration=14.711s, table=22, n_packets=9, n_bytes=1818, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a83083e"

From the flows above we can identify two things:
- They only handle communication to the DHCP agents; in other words, none of the VXLAN tunnel outputs toward the other computes running instances on that network are added in table 22.
- We are still missing a flow for one of the DHCP agents.

If we restart the instance that didn't get a DHCP lease, it now gets an IP. However, once we restart the neutron-openvswitch-agent on the compute node, we lose the flows from tables 20 and 22. What we don't understand is why it always returns to the same unhealthy state, with only one flow in table 4.
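
For reference, the "modify the network" workaround mentioned above can be approximated like this (a sketch only; it assumes the OpenStack CLI is available, that default-network is the affected network, and that 192.0.2.0/24 is an unused range):

# creating a throwaway subnet spawns a new DHCP port, which triggers update_fdb_entries notifications
openstack subnet create --network default-network --subnet-range 192.0.2.0/24 tmp-l2pop-kick
openstack subnet delete tmp-l2pop-kick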

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Based on your description, when you remove all the instances belonging to a network and recreate an instance, l2pop creates all the required entries and updates the compute host.
This means it does so for the first active port that pops up on the compute host.

Did you see any errors or tracebacks in the openvswitch agent that look fishy, suggesting it did not configure the flows?
Was there any RPC timeout?

How many instances were running on that particular compute, or on that network?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Hey Swaminathan, thanks for answering !

We have 8 instances on this network spread like this:
  - 1 on compute002
  - 1 on compute003
  - 3 on compute005
  - 3 on compute007

The only RPC timeouts we got are these (they are not related to any port of the network/instances with the issue):

2018-10-10 17:43:09.282 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Setting status for 90f27db6-9db2-4954-93de-07876d9e6edb to UP _bind_devices /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:889
2018-10-10 17:43:09.548 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Configuration for devices [u'90f27db6-9db2-4954-93de-07876d9e6edb'] failed!: MessagingTimeout: Timed out waiting for a reply to message ID be464cf0bdee41f0a88577379d2ade86
2018-10-10 17:43:09.559 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Configuration for devices up [u'90f27db6-9db2-4954-93de-07876d9e6edb'] and devices down [] completed.
2018-10-10 17:43:09.559 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Agent rpc_loop - iteration:509186 - ports processed. Elapsed:7.430 rpc_loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2117

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Output from Manuel's script to check the flows per project:

│ Project: 24288331d192493eb08dbfeeea6eb864
│ │
├── Network: dcabfe60-576d-4820-9ad2-4ac0aeb39090 [default-network] - vxlan id dec: 97 [hex: 0x61]
│ │ │
│   │ │
│   ├── Flows for compute compute002
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── Missing flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── There is a flow for compute003 vxlan output vxlan-0a830841 in table 22
│   │   ├── Missing flow for compute005 vxlan output vxlan-0a830843 in table 22
│   │   ├── Missing flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute003
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── There is a flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── There is a flow for compute002 vxlan output vxlan-0a830840 in table 22
│   │   ├── There is a flow for compute005 vxlan output vxlan-0a830843 in table 22
│   │   ├── There is a flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute005
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── Missing flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── Missing flow for compute002 vxlan output vxlan-0a830840 in table 22
│   │   ├── There is a flow for compute003 vxlan output vxlan-0a830841 in table 22
│   │   ├── There is a flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute007
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── There is a flow for contro...


Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Did you check the ovs-vswitchd logs? Did you find anything odd in there, for example VXLAN tunnel interfaces being created and deleted, or duplicate VXLAN tunnel interfaces?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We have a lot of WARN messages in ovs-vswitchd.log like this:

2018-10-11T00:36:57.802Z|47693|tunnel(handler467)|WARN|Dropped 23 log messages in last 59 seconds (most recently, 4 seconds ago) due to excessive rate

2018-10-11T00:36:57.802Z|47694|tunnel(handler467)|WARN|receive tunnel port not found (arp,tun_id=0x5e,tun_src=10.131.8.72,tun_dst=10.131.8.69,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:92:43:23,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.3.18,arp_tpa=192.168.3.18,arp_op=1,arp_sha=fa:16:3e:92:43:23,arp_tha=ff:ff:ff:ff:ff:ff)

2018-10-11T00:37:57.860Z|47697|tunnel(handler467)|WARN|Dropped 24 log messages in last 60 seconds (most recently, 4 seconds ago) due to excessive rate

2018-10-11T00:37:57.860Z|47698|tunnel(handler467)|WARN|receive tunnel port not found (arp,tun_id=0x5e,tun_src=10.131.8.72,tun_dst=10.131.8.69,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:92:43:23,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.3.18,arp_tpa=192.168.3.18,arp_op=1,arp_sha=fa:16:3e:92:43:23,arp_tha=ff:ff:ff:ff:ff:ff)

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I suspect there are tunnel ports that were originally created, then deleted, and new ones created in their place. The flow rules still contain the original tunnel port data but not the new ports created by the l2 agent.
Do you see anything similar to this, where two tunnel interfaces are created back to back?
Or check in your ovs-vswitchd log whether there is an entry for the tunnel port while it still complains that the tunnel port was not found because it was deleted.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Is there an easy way to reproduce this problem with a Pike devstack?
I would be interested in fixing this bug, but I need a way to reproduce it.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

When we spawn an instance on the "broken" network on a compute node that failed, we can see this in ovs-vswitchd.log:

2018-10-16T20:40:49.616Z|94255|bridge|INFO|bridge br-int: added interface qvof5e19b3e-f8 on port 1887
2018-10-16T20:41:05.950Z|94256|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 9 flow_mods in the 6 s starting 10 s ago (8 adds, 1 deletes)
2018-10-16T20:41:08.942Z|00119|tunnel(handler378)|WARN|Dropped 45 log messages in last 61 seconds (most recently, 5 seconds ago) due to excessive rate
2018-10-16T20:41:08.942Z|00120|tunnel(handler378)|WARN|receive tunnel port not found (igmp,tun_id=0xad,tun_src=10.131.8.83,tun_dst=10.131.8.64,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=7,vlan_tci=0x0000,dl_src=fa:16:3e:bc:0b:94,dl_dst=01:00:5e:00:00:16,nw_src=192.168.3.16,nw_dst=224.0.0.22,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=0,igmp_code=0)
2018-10-16T20:41:08.942Z|00167|ofproto_dpif_upcall(handler364)|INFO|Dropped 22 log messages in last 61 seconds (most recently, 6 seconds ago) due to excessive rate
2018-10-16T20:41:08.942Z|00168|ofproto_dpif_upcall(handler364)|INFO|received packet on unassociated datapath port 7

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Is it possible that if a flow fails in Open vSwitch, the whole process stops and then restarts, fails, restarts, fails, etc.?

Is there a way to fix the flows, like dropping the ovsdb and restarting the node?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Swaminathan Vasudevan, we don't really know how to reproduce it; this is why we are kind of desperate. This issue appears on multiple deployments, all with Pike.

It basically appears when we restart the neutron-openvswitch-agent container.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

So you are saying that if we have at least a couple of compute nodes, with 2 or 3 VMs on each, and we repeatedly restart the neutron-openvswitch-agent, we can reproduce it on Pike?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

This is how it happens for us.
Another weird fact: it seems to be related to old networks only; newer networks don't have this issue.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

It just happened without any restart: a few instances on the same network across different compute nodes lost Internet connectivity.

Logs about the MAC address of one of the instances:

2018-10-16 20:48:52.245 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-e043990b-2cff-43bf-a0e6-65b24421ec4b - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0xb26581cc,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=25,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=63937013,cookie=1865276097587823870,duration_nsec=225000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:cc:af:31', 'in_port': 67}),packet_count=122641,priority=2,table_id=25), OFPFlowStats(byte_count=0,cookie=1865276097587823870,duration_nsec=144000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:d7:bd:b7', 'in_port': 201}),packet_count=0,priority=2,table_id=25), OFPFlowStats(byte_count=0,cookie=1865276097587823870,duration_nsec=116000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:2b:3e:10', 'in_port': 68}),packet_count=0,priority=2,table_id=25), OFPFlowStats(byte_count=870,cookie=1865276097587823870,duration_nsec=89000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:2c:6e:6b', 'in_port': 252}),packet_count=15,priority=2,table_id=25), OFPFlowStats(byte_count=236338958,cookie=1865276097587823870,duration_nsec=32000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:a8:60:0b', 'in_port': 307}),pac
ket_count=338352,priority=2,table_id=25), OFPFlowStats(byte_count=368190,cookie=1865276097587823870,duration_nsec=3000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:73:cd:9d', 'in_port': 112}),packet_count=1462,priority=2,table_id=25), OFPFlowStats(byte_count=145838491,cookie=1865276097587823870,duration_nsec=269000000,duration_sec=8876,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:77:81:ec', 'in_port': 57}),packet_count=67025,priority=2,table_id=25), OFPFlowStats(byte_count=581037,cookie=1865276097587823870,duration_nsec=223000000,duration_sec=8876,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:14:e2:2f', 'in_port': 89}),packet_count=9460,pr...

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Yes, I think the problem is that when the first VM on that network is created on a node, the l2pop rules have to be populated, provided the tunnel interfaces have been created properly and the VLAN manager has configured the VLAN mapping; if everything is in place, the l2pop rules will be populated properly.
If any of these are missing, it may error out and not configure the l2pop rules for the network.

Newly created networks may not have this problem, presumably because their tunnel interfaces and VLAN mapping are configured right.

The most important thing we need to figure out is whether there is a race between the tunnel interface creation / VLAN mapping and the l2pop flow add.
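
A quick way to check the two preconditions mentioned above (tunnel interfaces present and the local VLAN mapping configured) on a suspect compute node, assuming the same kolla container layout used earlier in this thread and the example network 0x61:

# list the vxlan tunnel ports the agent has created on br-tun
docker exec -ti openvswitch_vswitchd ovs-vsctl list-ports br-tun | grep vxlan

# confirm the tun_id -> local VLAN mapping exists for the network (table 4)
docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun table=4 | grep tun_id=0x61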

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried restarting neutron-server: some of the missing flows were added back, and after a few minutes they were removed again...

On the compute node we can see the fdb_add and fdb_remove calls.
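
For reference, a simple way to watch those l2pop updates arriving on the compute node (a sketch; it assumes debug logging is enabled on the agent and the default kolla log path):

tail -f /var/log/kolla/neutron/neutron-openvswitch-agent.log | grep -E 'add_fdb_entries|remove_fdb_entries|update_fdb_entries'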

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

I saw an error in neutron-server.log:

2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers [req-61134bd9-1f45-4aad-9148-d64dba96b1d0 - - - - -] Mechanism driver 'l2population' failed in update_port_postcommit: AttributeError: 'Context' object has no attribute 'begin'
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py", line 426, in _call_on_drivers
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers getattr(driver.obj, method_name)(context)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 183, in update_port_postcommit
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers self.update_port_up(context)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 295, in update_port_up
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers if is_first_port or self.agent_restarted(context):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 246, in agent_restarted
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers agent = l2pop_db.get_agent_by_host(port_context, agent_host)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/db.py", line 60, in get_agent_by_host
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers with session.begin(subtransactions=True):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers AttributeError: 'Context' object has no attribute 'begin'
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager [req-61134bd9-1f45-4aad-9148-d64dba96b1d0 - - - - -] Error during notification for neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned--9223372036854067064 port, provisioning_complete: MechanismDriverError
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager Traceback (most recent call last):
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py", line 197, in _notify_loop
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager callback(resource, event, trigger, **kwargs)
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 260, in _port_provisioned
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager self.update_port_status(context, port_id, const.PORT_STATUS_ACTIVE)
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/loc...


Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

One other suggestion would be to test this in your environment (since I am not able to reproduce it locally, it may make sense to test it in your environment):

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L327

Can you comment out this check in rpc.py (#L327, #L328 and #L329) and restart the service?
You may see an ERROR log in the openvswitch agent log that is added for DVR routers, but it should not affect the functionality.

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi - just wondering if you've been able to test with Swami's suggestion of adding some logging to see which code path is taken so we can further triage? Thanks.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Hey Guys,

Sorry, it's hard for us to test this in our production environment.
We'll keep you posted about it!

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi,

I have been trying to find the trigger for this problem for a while, and I finally found something in one of our OpenStack environments.

I had set up a VXLAN tenant network with DHCP enabled, a router and three instances. Two instances are on the same compute node, the other is on a separate compute node.
The OVS flows were good after the creation of these resources, and they stayed that way for a while, but yesterday I finally saw some problematic differences in the flows.
On November 21st, early afternoon, everything was good. On November 22nd, the three controllers had missing or bad flows. Furthermore, the issues were not the same on all three controllers.

Below are the correct flows for each controller and then the problematic flows from the day after for comparison.

Controller001 Nov21 (good)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
 cookie=0x7449e7306e86ff8f, duration=423786.590s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
 cookie=0x7449e7306e86ff8f, duration=423757.339s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x7449e7306e86ff8f, duration=423754.230s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x7449e7306e86ff8f, duration=423107.131s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
 cookie=0x7449e7306e86ff8f, duration=423061.839s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
 cookie=0x7449e7306e86ff8f, duration=438618.539s, table=22, n_packets=46, n_bytes=4140, priority=1,dl_vlan=25 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903",output:"vxlan-0aa04902",output:"vxlan-0aa04922",output:"vxlan-0aa0491b"

Controller001 Nov22 (bad)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
 cookie=0x9d709590904de6cd, duration=77469.794s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
 cookie=0x9d709590904de6cd, duration=77280.104s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x9d709590904de6cd, duration=77279.096s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x9d709590904de6cd, duration=76996.452s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
 cookie=0x9d709590904de6cd, duration=76846.245s, ...

Revision history for this message
Brian Haley (brian-haley) wrote :

Alexandre - thanks for all the information, we've just been swamped with other work, but we will take a look as soon as we have time.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Glad to hear that @Brian!

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Alexandre,
From your logs, it seems this 'unicast-to-tun' rule is missing between the good and the bad dumps:

n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"

Is the port with MAC fa:16:3e:bf:ae:93 still alive and seen on other nodes?
Is this a VM port or a non-VM port?

Was this rule only missing on the controller node and not on the other compute nodes?

If it is only on the controller, was the controller configured for HA?

If it is a VM port, did this VM ever migrate, or did this port at least go down and come up (I mean, was the VM shut down and then restarted, or similar)?
When this event happened, did the neutron server or the neutron agent go down and come up?

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Did you download the full text? The website does not show the full comment. I am asking because I'm not sure what part of my comment you are referring to when you talk about the missing rule for MAC fa:16:3e:bf:ae:93; I don't see it missing.

Most of my text was referring to MAC fa:16:3e:ab:ab:ae, which is missing a table 20 flow on controller003. I also discuss issues with table 22 on all controllers and that the trigger seemed to have been a restart of neutron services.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Alexandre,
Yes, I might have missed the full text; I only noticed the part that shows up in the activity log.

I have read the full text now.

So the problem you mentioned in your log is that restarting the neutron-server and the openvswitch agent would have triggered this issue, based on your statement:

"The next question is what triggered the flows to be refreshed/recreated at 22:24?

After some investigation I found that neutron-server and neutron-openvswitch services were restarted around that time."

The unicast-to-tun flow rules are created when the actual VM pops up on the compute node: when there is a new MAC, that MAC entry is populated on all nodes other than the host node that is hosting the VM.
If the VM is already created and the rule is already in place, then an agent restart can bring back those rules. Also, enabling and disabling those ports would trigger a 'port up'.

The table 22 rule is the default (flood) action, where the traffic from the DVR routers should be sent. As the VXLAN tunnels are added, they should be appended to the list of output ports in the flood-to-tun rule.

This is first populated on a node when the first port on a particular network shows up. It collects all the VXLAN tunnel details for that particular network and adds them on the node.

So there is a possibility of a mismatch happening on a particular node when the VXLAN tunnels are deleted and the agent is restarted without restarting neutron-server.

I have always seen things in sync when the neutron-server is restarted first and the agent restarted later.

Having understood your problem, the only case in which fdb_add_tun will not call add_fdb_flow is when agent_ports does not exist.

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L560

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

In your case we should identify why, on agent restart, it did not pick up the 'unicast-to-tun' rule for the missing MAC (dl_dst=fa:16:3e:ab:ab:ae).

On restart,
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L335
lines L335 to L350 have the logic to rewrite the flow rules if the port is in ACTIVE state.
But if the port is not in ACTIVE state, it might not rewrite those rules and will instead remove them.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This is where the tunnel ports are getting added to the local agent list.

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1544

When the FLOOD_ENTRY is built, the tunnels are taken from this local list.
So somehow either the locally cached list is wiped out by some other function, or else the agent misses it.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This is where the ports (VXLAN ports) are written into the flooding-table flow rule:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_tun.py#L173

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Thanks @Swaminathan for your help on this !!!

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Thanks for the replies.

OK, so it is always better to restart neutron-server before the openvswitch agent. Should we never restart the agent alone?

If I understand correctly, the cause of the unicast-to-tun issue is different from the flood-to-tun issue?

I want to mention that nothing was erased or moved during the test case. I had created three instances on a new network and copied the flows. Then every day I connected to the environment and checked whether my flows stayed the same. Then one day they had changed; that's when I posted my long comment. To my knowledge the ports were always active. It is strange that the flow to MAC fa:16:3e:ab:ab:ae was recreated on controller001 but not on controller002 and controller003. That's another reason to believe the port was active.

As for flood-to-tun, we see a lot of inconsistencies in our different environments. You mention that maybe the locally cached list is wiped out. Is there a way to check?

I am open to any suggestions or tests. It's been a while since I wrote that long comment (November 23rd). When I left it, things were in a really bad state (see the end of my long comment): there was no table 22 flow and no table 20 flows to each instance. I checked again this week and the flows have slowly fixed themselves but still have some problems. Controller002 is missing the table 20 flow to controller003, and its table 22 is missing all tunnels except the one to controller001. Controller003's table 22 is missing all tunnels except the one to controller001. Controller001 is good. The two computes which have instances are missing the table 20 flow to controller003. I couldn't tell you what actions were taken on the cloud since November 23rd, though the tenant network and instances themselves have not been modified in any way. I can keep testing on this network, or create something new (network and instances) and run tests on that.

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Just another update: pretty much all the flows on all servers are back. The only issue is on controller002, which is missing tunnels in table 22 to the two compute servers. All other controllers and computes have correct flows for my test network.
It's unclear why this is happening. Since this cloud infrastructure is used by many people I can't follow all the activity, but what I can say is that no one is trying to fix my flows on purpose.

On a different production OpenStack infrastructure, we had a neutron maintenance yesterday. We had to change ml2_conf.ini to increase the VXLAN ID range, and neutron services were restarted after making the change. I also had a test network on this infrastructure with DHCP, a router and a few instances. I checked the flows for my network and saw abnormalities.
Table 22 disappeared from 3 of my physical hosts, which either have instances or DHCP namespaces on them. On my other host, table 22 exists for my network but it has tunnels to physical hosts that do not have my network, and it is missing tunnels to the other hosts that do have my network. I still have not been able to find the cause of this issue.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Thanks for the information. I will check the code path again to see what might cause table 22 to misbehave.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Guys, do you have any update about this bug?
We are still facing the issue.

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Here is an example of what I described in my previous comment. Our biggest problem is with table 22 as it causes outages for our clients.

Here is an example of flows for a vxlan network on a host. As you can see there is no table 22 at all.

docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x2b4
 cookie=0xfdd60859f2658796, duration=2037831.278s, table=4, n_packets=245, n_bytes=52188, priority=1,tun_id=0x2b4 actions=mod_vlan_vid:181,resubmit(,9)
 cookie=0xfdd60859f2658796, duration=304285.121s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:6d:55:e1 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xfdd60859f2658796, duration=299805.493s, table=20, n_packets=34, n_bytes=6528, priority=2,dl_vlan=181,dl_dst=fa:16:3e:99:cc:4c actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490c"
 cookie=0xfdd60859f2658796, duration=293855.521s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:f2:2f:7d actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xfdd60859f2658796, duration=293234.777s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:38:08:be actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490e"

Then on another host, for the SAME VXLAN network, we have the opposite: there is a table 22, but all the tunnels are incorrect (they don't match the table 20 flows/tunnels).

docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x2b4
 cookie=0xe4add62aa34905c9, duration=2040773.745s, table=4, n_packets=0, n_bytes=0, priority=1,tun_id=0x2b4 actions=mod_vlan_vid:669,resubmit(,9)
 cookie=0xe4add62aa34905c9, duration=318238.768s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:af:9d:f4 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490d"
 cookie=0xe4add62aa34905c9, duration=318227.433s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:4c:c0:93 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490d"
 cookie=0xe4add62aa34905c9, duration=304524.808s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:6d:55:e1 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xe4add62aa34905c9, duration=300045.175s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:99:cc:4c actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490c"
 cookie=0xe4add62aa34905c9, duration=294095.211s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:f2:2f:7d actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xe4add62aa34905c9, duration=2040773.792s, table=22, n_packets=8, n_bytes=720, priority=1,dl_vlan=669 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490f",output:"vxlan-0aa24911",output:"vxlan-0aa24910",output:"vxlan-0aa2490b",output:"vxlan-0aa2490d"

These are outputs from today, from the same environment I talked about on December 19th...

I can also provide similar outputs for other networks on different environments. In another environment, on one host, table 22 only...


Revision history for this message
Bruca Lock (lockshi) wrote :

Hi Swaminathan,

   I did more analysis at the code level and here are my observations:

1 - cleanup_stale_flows will wipe out all flows, and somehow tables 22 and 20 for the compute node don't get recreated:
https://github.com/openstack/neutron/blob/886782c1777d8f104b5c57ffd4eff7680056602c/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2165

My question is: should it systematically call cleanup_stale_flows on restart of nova_ovs? Note that no failed devices are detected during process_network_ports.
I tried forcing provisioning_needed on restart to ensure it does local VLAN provisioning. The side effect of this is that table 22 is maintained with some extra flows, but I still lose table 20 for the compute instance.

2 - I added logs to the flooding-table flow rule https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_tun.py#L173
as per your suggestion and noticed that table 22 for my specific tunnel didn't get populated during process_network_ports.

3 - Looking at _setup_tunnel_port https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1544
I didn't notice any errors related to this function in the code.

4 - For process_network_ports, most of the processing goes into treat_devices_added_or_updated and can take quite a while; in some cases we had to wait around 15 minutes.

Question about cleanup_stale_flows: I just noticed that it simply loops over the bridges and cleans up the flows. I didn't see any logic there that checks for stale flows, unless I'm wrong?
The problem probably lies in the fact that the flooding-table flow rule didn't get installed for our specific vlan_id. Any insight into what could cause that?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Could this be related to Docker (SIGTERM, SIGKILL) ?

Bruca Lock (lockshi)
Changed in neutron:
assignee: nobody → Bruca Lock (lockshi)
Revision history for this message
Bruca Lock (lockshi) wrote :

An interesting observation: when we use the neutron-ovs-cleanup tool on the controller, flows get repopulated quickly after a restart of neutron-ovs-agent. However, another problem then occurs where I'm not able to do ICMP to a remote host. This could be because I may have rebooted the compute node.

So from a quick extrapolation, this could be due either to the amount of resources that need to be repopulated during restart, which somehow times out (which I didn't see in the logs), or to dangling resources that corrupt the repopulation.

Is anyone aware of this situation, or does anyone know whether neutron-ovs-cleanup is still required?

Revision history for this message
Bruca Lock (lockshi) wrote :

Finally found the reason why it's not populating the fdb entries. In the case where there are lots of ports, the gateway port binding gets skipped during restart by the neutron server, as per the following code, which relies on a timer check that is not optimal:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L332

I propose we come up with a more deterministic way of knowing whether the restart completed, either through an RPC confirmation or the iteration counter.

For now, a fix that works is to increase the l2pop agent_boot_time in the neutron server ml2 config from the default value of 180 to, in my case, 1800.
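
For anyone wanting to try the same workaround, this is roughly what the change looks like (a sketch; the exact location of the ml2_conf.ini used by neutron-server depends on the deployment, and neutron-server must be restarted afterwards):

[l2pop]
agent_boot_time = 1800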

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Swaminathan, what is the downside of having too high a value for agent_boot_time?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried it in different environments and so far we are not facing the issue anymore.
The pending questions from Bruca (#41) and me (#42) still stand. :)

Revision history for this message
Brian Haley (brian-haley) wrote :

I wonder how much of this will be covered in the fix(es) for https://bugs.launchpad.net/neutron/+bug/1813703 ? There is a lot of good info there as well and a patch, https://review.openstack.org/#/c/640797/

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

@Brian, this is great news!!!

Changed in neutron:
assignee: Bruca Lock (lockshi) → LIU Yulong (dragon889)
status: New → In Progress
Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)
Revision history for this message
Bruca Lock (lockshi) wrote :

Tested with a very large agent_boot_time and it has no side effects besides the flows being repopulated many times.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/645405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/645406

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/645408

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0rc1

This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649729

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/645408
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62fe7852bbd70a24174853997096c52ee015e269
Submitter: Zuul
Branch: stable/pike

commit 62fe7852bbd70a24174853997096c52ee015e269
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/645405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc49ab550179bdc76d79f48be67886681cb43d4e
Submitter: Zuul
Branch: stable/rocky

commit cc49ab550179bdc76d79f48be67886681cb43d4e
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3

This issue was fixed in the openstack/neutron 13.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/649729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9583dc0549da2b4529a59b5862ba42aebc5ae15f
Submitter: Zuul
Branch: stable/ocata

commit 9583dc0549da2b4529a59b5862ba42aebc5ae15f
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential
Revision history for this message
Hua Zhang (zhhuabj) wrote :

I can still see the following problem, reported in the bug description, with neutron-server=2:14.0.2-0ubuntu1~cloud0, which actually includes all the patches from this bug.

Behavior [2] is also related to missing flows: some compute nodes are missing outputs to other compute nodes (vxlan-xxxxxx), which prevents an instance on compute 1 from communicating with an instance on compute 2.

On one of our production systems we saw the following flow rule (sorry, we don't know how to reproduce this problem yet).

cookie=0x7e47f520bdf5215b, duration=55484.529s, table=22, n_packets=3441, n_bytes=218308, idle_age=794, priority=1,dl_vlan=10 actions=strip_vlan,load:0x40a->NXM_NX_TUN_ID[],output:47

port 47 is port vxlan-0af64525

# readlink -f /tmp/tmp.Z4pKS6ZnFC/vcd41026/ovs/bridges/br-tun/ports/47
/tmp/tmp.Z4pKS6ZnFC/vcd41026/ovs/ports/vxlan-0af64525

vxlan-0af64525 is a tunnel endpoint with remote_ip=10.246.69.37

root@vcd41026:~# ovs-vsctl show| grep 'Interface "vxlan-0af64525"' -A 2
Interface "vxlan-0af64525"
type: vxlan
options: {df_default="true", in_key=flow, local_ip="10.246.69.32", out_key=flow, remote_ip="10.246.69.37"}

where 10.246.69.37 is on vcd41025, but our test VM is running on vcd41021.

So I believe it should be:

cookie=0x7e47f520bdf5215b, duration=55484.529s, table=22, n_packets=3441, n_bytes=218308, idle_age=794, priority=1,dl_vlan=10 actions=strip_vlan,load:0x40a->NXM_NX_TUN_ID[],output:45,output:46,output:47,output:48,output:49,output:50,output:51,output:52

Alvaro Uria (aluria)
tags: added: canonical-bootstack
Revision history for this message
Jörn Volkhausen (jvolkhausen) wrote :

We see the problem in our infrastructure, like zhhuabj does, after updating to the following versions:
Compute Nodes:
  neutron-common/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-l3-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-metadata-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-openvswitch-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
Network Nodes:
  neutron-common/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-dhcp-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-l3-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-lbaas-common/bionic,now 2:12.0.0-0ubuntu1 all [installed,automatic]
  neutron-lbaasv2-agent/bionic,now 2:12.0.0-0ubuntu1 all [installed]
  neutron-metadata-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-metering-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-openvswitch-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]

Here we have an infrastructure with three network and six compute nodes, and GRE tunnels between them.

At the moment we are manipulating the flow entries in br-tun table 22 with a script to output the packets to every compute/network node.

I think that even if the patch is included in these package versions, it does not completely solve the issues described above and seen here.

We will now check the agent_boot_time quick fix mentioned by Bruca Lock in post #41.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
Florian Engelmann (engelmann) wrote :

looks like "agent_boot_time" was removed in Train. Why?

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

It was deprecated in Stein and we removed it in Train: https://review.opendev.org/#/c/641866/

And more info why it was deprecated is in https://review.opendev.org/#/c/640797/

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I saw this problem again in another customer environment: one host lacks the flows to ports 44, 46 and 48, as comment #59 said.

cookie=0x885c49e981aec5b2, duration=32295.028s, table=22, n_packets=225, n_bytes=14724, idle_age=1318, priority=1,dl_vlan=29 actions=strip_vlan,load:0x3f7->NXM_NX_TUN_ID[],output:36,output:37,output:39,output:43,output:45,output:47

$ ls /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/
1 36 37 39 43 44 45 46 47 48
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/1
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/patch-int
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/44
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf0615
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/46
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf060d
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/48
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf0613

After restarting OVS, the flows to 44, 46 and 48 came back:

cookie=0x8e8b842be4291fe7, duration=5223.823s, table=22, n_packets=250, n_bytes=16496, priority=1,dl_vlan=29 actions=strip_vlan,load:0x3f7->NXM_NX_TUN_ID[],output:"vxlan-0abf0616",output:"vxlan-0abf0612",output:"vxlan-0abf060f",output:"vxlan-0abf0611",output:"vxlan-0abf0615",output:"vxlan-0abf0614",output:"vxlan-0abf060d",output:"vxlan-0abf060e",output:"vxlan-0abf0613",output:"vxlan-0abf0617",output:"vxlan-0abf0618",output:"vxlan-0abf0619"

Revision history for this message
Paul Goins (vultaire) wrote :

Could the occurrences in stein and newer be linked to https://bugs.launchpad.net/neutron/+bug/1853613?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Tested on OpenStack Train and Ussuri: I see some inconsistency in the l2pop flows whenever openvswitch is restarted (without restarting the neutron-*-agents).

The workaround is to restart neutron-openvswitch-agent to repopulate the correct l2pop flows.
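
For reference, that workaround amounts to something like the following (the service or container name depends on the deployment):

# systemd-based deployments
systemctl restart neutron-openvswitch-agent

# kolla-ansible deployments
docker restart neutron_openvswitch_agent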
