The Neutron DB and the forwarding database (FDB) on the compute nodes somehow get out of sync, so that the FDB entry points at one VTEP while Neutron has the router scheduled elsewhere. For example:
On a compute node:
compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent
fa:16:3e:5d:05:4f is the MAC address of the router's qr interface, and 172.29.243.252 is the VTEP of infra01. Neutron, however, thinks the router is scheduled to infra04:
root@compute003:~# neutron l3-agent-list-hosting-router e29e967c-4db1-4283-b9cf-bb2625198c9f
+--------------------------------------+-----------------------+----------------+-------+
| id                                   | host                  | admin_state_up | alive |
+--------------------------------------+-----------------------+----------------+-------+
| 18e9dbb6-2bab-4a8b-bc89-7da3dcd224a2 | infra04_neutron_agent | True           | :-)   |
+--------------------------------------+-----------------------+----------------+-------+
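The mismatch between the FDB output and the agent listing above can be checked mechanically. The following is only a minimal sketch, assuming `bridge fdb` output in the format pasted above; the function names `parse_fdb_dsts` and `stale_vteps` are hypothetical, and the expected VTEP is assumed to be looked up separately from whichever agent Neutron reports as hosting the router.

```python
def parse_fdb_dsts(fdb_output, mac, dev):
    """Return the set of remote VTEP IPs that `bridge fdb` lists for mac on dev."""
    dsts = set()
    for line in fdb_output.splitlines():
        fields = line.split()
        # Expected shape: <mac> dev <dev> [dst <ip>] [self] [permanent] ...
        # Anything else (shell prompts, other MACs) is skipped.
        if len(fields) < 3 or fields[0].lower() != mac.lower():
            continue
        if fields[1] != "dev" or fields[2] != dev:
            continue
        if "dst" in fields:
            idx = fields.index("dst")
            if idx + 1 < len(fields):
                dsts.add(fields[idx + 1])
    return dsts


def stale_vteps(fdb_output, mac, dev, expected_vtep):
    """Return FDB destinations for mac/dev that differ from the expected VTEP."""
    return parse_fdb_dsts(fdb_output, mac, dev) - {expected_vtep}
```

Fed the compute003 output above with 172.29.242.66 (the infra04 VTEP) as the expected destination, `stale_vteps` would report 172.29.243.252 as stale.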
When you attempt to unschedule the router from infra04, you'll see the following FDB delete failure in the linuxbridge agent log:
2015-03-17 13:48:05.853 30207 ERROR neutron.agent.linux.utils [req-5d5b8a90-cb10-4acf-9971-a3fa6b996c74 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'del', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.242.66']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such file or directory\n'
172.29.242.66 is the VTEP on infra04. The delete is expected to fail, since no entry with that destination exists. As a result, the stale entry pointing at infra01 is left behind:
compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent
To work around it, you can reschedule the router to infra01. The agent then logs the following error:
2015-03-17 13:50:33.006 30207 ERROR neutron.agent.linux.utils [req-3a4ae444-40f8-4d3b-ad37-8813b963a5ec None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'add', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.243.252']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: File exists\n'
That is expected, as the entry already exists. You can then unschedule the router from infra01, and the FDB entry is properly removed:
compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
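Both RTNETLINK errors in the logs above are "expected" in the same sense: the specific entry named in the command was already in the state the command asked for. A minimal sketch of that reasoning follows; the function name `requested_state_holds` is hypothetical and this is not the agent's actual error handling. Note that it only speaks about the exact entry named in the command, so it does not detect a stale entry pointing at a different VTEP, which is the real problem here.

```python
def requested_state_holds(action, exit_code, stderr):
    """Return True if the entry named in a `bridge fdb add/del` call is
    already in the requested state, even though the call itself failed."""
    if exit_code == 0:
        return True
    msg = stderr.strip()
    # 'del' of a missing entry: the entry is already absent.
    if action == "del" and msg.endswith("No such file or directory"):
        return True
    # 'add' of an existing entry: the entry is already present.
    if action == "add" and msg.endswith("File exists"):
        return True
    return False
```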
Rescheduling to another agent results in the correct entry being added:
compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.242.66 self permanent
We don't know exactly what prevents the FDB entry from being removed properly in the first place. The result, though, is an inconsistent Neutron DB/FDB state and eventual traffic loss.