Comment 1 for bug 1432873

James Denton (james-denton) wrote :

Some additional info...

The Neutron DB and the forwarding database (FDB) on a compute node somehow get out of sync, so that the FDB entry for a router points at one host while Neutron has the router scheduled to another. For example:

On a compute node:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent

fa:16:3e:5d:05:4f is the MAC address of the router's qr interface, and 172.29.243.252 is the VTEP of infra01. Neutron, however, thinks the router is scheduled to infra04:

root@compute003:~# neutron l3-agent-list-hosting-router e29e967c-4db1-4283-b9cf-bb2625198c9f
+--------------------------------------+--------------------------------------------------+----------------+-------+
| id                                   | host                                             | admin_state_up | alive |
+--------------------------------------+--------------------------------------------------+----------------+-------+
| 18e9dbb6-2bab-4a8b-bc89-7da3dcd224a2 | infra04_neutron_agent                            | True           | :-)   |
+--------------------------------------+--------------------------------------------------+----------------+-------+
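
A mismatch like this can be checked for mechanically. Below is a minimal Python sketch that parses `bridge fdb` output and reports entries whose destination differs from the VTEP Neutron expects. The function names are hypothetical, and the mapping from a Neutron host to its VTEP address is assumed to be known to the operator; nothing here is part of Neutron itself.

```python
import re

# Matches VXLAN FDB lines that carry a remote destination, for example:
#   fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent
# Lines without "dst" (such as "... dev vxlan-8 vlan 0") are ignored.
FDB_DST_RE = re.compile(
    r"^(?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})\s+dev\s+(?P<dev>\S+)\s+dst\s+(?P<dst>\S+)",
    re.IGNORECASE,
)


def fdb_destinations(fdb_output, mac):
    """Return (dev, dst) pairs that the `bridge fdb` output lists for `mac`."""
    pairs = []
    for line in fdb_output.splitlines():
        match = FDB_DST_RE.match(line.strip())
        if match and match.group("mac").lower() == mac.lower():
            pairs.append((match.group("dev"), match.group("dst")))
    return pairs


def stale_destinations(fdb_output, mac, expected_vtep):
    """Return entries for `mac` whose dst is not the VTEP Neutron expects.

    `expected_vtep` is an assumption supplied by the caller (the VTEP of
    the host that Neutron reports as hosting the router).
    """
    return [
        (dev, dst)
        for dev, dst in fdb_destinations(fdb_output, mac)
        if dst != expected_vtep
    ]
```

Fed the output from compute003 above with 172.29.242.66 (the infra04 VTEP) as the expected address, `stale_destinations` would flag the vxlan-8 entry that still points at 172.29.243.252.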

When you attempt to unschedule the router from infra04, you'll see the following FDB delete failure in the linuxbridge agent log:

2015-03-17 13:48:05.853 30207 ERROR neutron.agent.linux.utils [req-5d5b8a90-cb10-4acf-9971-a3fa6b996c74 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'del', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.242.66']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such file or directory\n'

172.29.242.66 is the VTEP on infra04. The delete is expected to fail, since the FDB has no entry with that destination. As a result, the stale entry pointing at infra01 is left behind:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent

To work around it, you can reschedule the router to infra01. That results in the following error:

2015-03-17 13:50:33.006 30207 ERROR neutron.agent.linux.utils [req-3a4ae444-40f8-4d3b-ad37-8813b963a5ec None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'add', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.243.252']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: File exists\n'

That is to be expected, as the entry already exists. Then, you can unschedule the router from infra01 and see the FDB entry get properly removed:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
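
The two RTNETLINK failures above mean different things: "No such file or directory" (ENOENT) on a delete says the entry being removed is not there, while "File exists" (EEXIST) on an add says the entry is already present. A small Python sketch of how a log-scanning script might tell them apart follows; the function name and return labels are hypothetical, not anything the linuxbridge agent does.

```python
def classify_fdb_failure(action, stderr):
    """Interpret the stderr of a failed `bridge fdb add` or `bridge fdb del`.

    `action` is "add" or "del". The returned labels are illustrative only.
    """
    if action == "del" and "No such file or directory" in stderr:
        # The agent tried to delete an entry the kernel does not have,
        # which suggests the FDB and the Neutron DB disagree.
        return "entry-missing"
    if action == "add" and "File exists" in stderr:
        # The entry was already present, e.g. left over from an earlier
        # delete that never happened.
        return "entry-exists"
    return "unknown"
```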

Rescheduling to another agent (here infra04, whose VTEP is 172.29.242.66) results in the correct entry being added:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.242.66 self permanent
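
The workaround sequence above can be written down as an ordered command plan. This is only a sketch: it assumes the legacy `neutron` CLI subcommands `l3-agent-router-remove` and `l3-agent-router-add` (agent ID first, then router ID), and `reschedule_plan` and its parameter names are hypothetical. It builds the commands without running them.

```python
def reschedule_plan(router_id, neutron_agent_id, fdb_agent_id, target_agent_id):
    """Return the workaround steps as argument lists, in order.

    neutron_agent_id: L3 agent Neutron believes hosts the router.
    fdb_agent_id:     L3 agent on the host the stale FDB entry points at.
    target_agent_id:  L3 agent the router should finally run on.
    """
    return [
        # 1. Unschedule from the agent Neutron knows about (the FDB delete fails).
        ["neutron", "l3-agent-router-remove", neutron_agent_id, router_id],
        # 2. Schedule to the agent matching the stale entry (the FDB add fails).
        ["neutron", "l3-agent-router-add", fdb_agent_id, router_id],
        # 3. Unschedule again; this time the stale FDB entry is removed.
        ["neutron", "l3-agent-router-remove", fdb_agent_id, router_id],
        # 4. Schedule to the final agent so the correct entry is added.
        ["neutron", "l3-agent-router-add", target_agent_id, router_id],
    ]


if __name__ == "__main__":
    # The infra01 agent ID is not given in this report, so it is a placeholder.
    infra04_agent = "18e9dbb6-2bab-4a8b-bc89-7da3dcd224a2"
    plan = reschedule_plan(
        router_id="e29e967c-4db1-4283-b9cf-bb2625198c9f",
        neutron_agent_id=infra04_agent,
        fdb_agent_id="<infra01-l3-agent-id>",
        target_agent_id=infra04_agent,
    )
    for step in plan:
        print(" ".join(step))
```

Checking `bridge fdb` on the compute node after steps 3 and 4 confirms that the stale entry is gone and the new one is in place.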

We don't know exactly what causes the FDB entry not to be removed in the first place. The result, though, is an inconsistent Neutron DB/FDB state and eventual traffic loss.