Add FDB bridge entry fails if old entry not removed

Bug #1432873 reported by Kevin Stevens
This bug affects 5 people
Affects            Status        Importance  Assigned to   Milestone
OpenStack-Ansible  Invalid       Undecided   Kevin Carter
Juno               Fix Released  Undecided   Kevin Carter
neutron            Fix Released  Undecided   Li Ma

Bug Description

Running on Ubuntu 14.04 with Linuxbridge agent and L2pop with vxlan networks.

In situations where "remove_fdb_entries" messages are lost/never consumed, future "add_fdb_bridge_entry" attempts will fail with the following example error message:
2015-03-16 21:10:08.520 30207 ERROR neutron.agent.linux.utils [req-390ab63a-9d3c-4d0e-b75b-200e9f5b97c6 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'add', 'fa:16:3e:a5:15:35', 'dev', 'vxlan-15', 'dst', '172.30.100.60']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: File exists\n'

In our case, instances were unable to communicate with their Neutron router because vxlan traffic was being forwarded to the wrong vxlan endpoint. This was corrected either by migrating the router to a new agent or by executing a "bridge fdb del" for the fdb entry corresponding to the Neutron router MAC address. Once the entry was deleted, the LB agent added the appropriate fdb entry at the next polling event.
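The manual workaround above comes down to a single "bridge fdb del" call. Below is a minimal Python sketch that only builds the argument list and does not run it. The helper name fdb_del_command is hypothetical (it is not part of Neutron), and the stale VTEP address is a placeholder from the documentation range, since the stale address is not shown in the log above.

```python
def fdb_del_command(mac, dev, dst):
    """Return the argv for deleting one VXLAN FDB entry (does not run it)."""
    return ['bridge', 'fdb', 'del', mac, 'dev', dev, 'dst', dst]


# MAC and device come from the failing 'add' in the log above.
# STALE_VTEP is a placeholder: use the dst that 'bridge fdb' currently
# shows for this MAC, not the dst Neutron is trying to add.
STALE_VTEP = '192.0.2.10'
cmd = fdb_del_command('fa:16:3e:a5:15:35', 'vxlan-15', STALE_VTEP)
print(' '.join(cmd))
```

The dst passed to the delete has to match the entry that is actually installed; otherwise the delete itself is rejected. Running the resulting command requires root.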

If anything is unclear, please let me know.

James Denton (james-denton) wrote :

Some additional info...

The Neutron DB and the forwarding DB somehow get out of sync so that the FDB has one entry and Neutron has another. For example:

On a compute node:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent

fa:16:3e:5d:05:4f is the MAC address of the qr interface of the router. 172.29.243.252 is the vtep of infra01. Neutron, however, thinks the router is scheduled to infra04:

root@compute003:~# neutron l3-agent-list-hosting-router e29e967c-4db1-4283-b9cf-bb2625198c9f
+--------------------------------------+--------------------------------------------------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+--------------------------------------------------+----------------+-------+
| 18e9dbb6-2bab-4a8b-bc89-7da3dcd224a2 | infra04_neutron_agent | True | :-) |
+--------------------------------------+--------------------------------------------------+----------------+-------+

When you attempt to unschedule the router from infra04, you'll see the following fdb delete failure in the linuxbridge agent log:

2015-03-17 13:48:05.853 30207 ERROR neutron.agent.linux.utils [req-5d5b8a90-cb10-4acf-9971-a3fa6b996c74 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'del', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.242.66']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No such file or directory\n'

172.29.242.66 is the vtep on infra04. It is expected that it would fail, considering the entry doesn't exist. As a result, this is still left:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent

To work around it, you can reschedule the router to infra01. That results in the following error:

2015-03-17 13:50:33.006 30207 ERROR neutron.agent.linux.utils [req-3a4ae444-40f8-4d3b-ad37-8813b963a5ec None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'add', 'fa:16:3e:5d:05:4f', 'dev', 'vxlan-8', 'dst', '172.29.243.252']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: File exists\n'

That is to be expected, as the entry already exists. Then, you can unschedule the router from infra01 and see the FDB entry get properly removed:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0

Rescheduling to another agent results in the correct entry being added:

compute003# bridge fdb | grep fa:16:3e:5d:05:4f
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.242.66 self permanent

We don't exactly know what causes the FDB entry to not get removed properly to begin with. The result, though, is an inconsistent Neutron DB/FDB state and eventual traffic loss.
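The mismatch described in this comment can be checked mechanically by comparing the "dst" that "bridge fdb" reports for a MAC with the VTEP of the host Neutron believes the port is on. Below is a minimal Python sketch of that comparison. The function names are hypothetical, and it assumes the output format matches the lines pasted above.

```python
def remote_vteps(fdb_output, mac):
    """Return the 'dst' addresses that 'bridge fdb' output lists for mac."""
    vteps = []
    for line in fdb_output.splitlines():
        fields = line.split()
        # Skip blank lines and entries for other MAC addresses.
        if not fields or fields[0].lower() != mac.lower():
            continue
        if 'dst' in fields:
            idx = fields.index('dst')
            if idx + 1 < len(fields):
                vteps.append(fields[idx + 1])
    return vteps


def is_stale(fdb_output, mac, expected_vtep):
    """True if the FDB has a remote entry for mac that is not expected_vtep."""
    return any(v != expected_vtep for v in remote_vteps(fdb_output, mac))


# Output captured on compute003 while the router was scheduled to infra04.
sample = """\
fa:16:3e:5d:05:4f dev vxlan-8 vlan 0
fa:16:3e:5d:05:4f dev vxlan-8 dst 172.29.243.252 self permanent
"""
print(remote_vteps(sample, 'fa:16:3e:5d:05:4f'))
print(is_stale(sample, 'fa:16:3e:5d:05:4f', '172.29.242.66'))
```

With the output captured above, the check flags the entry as stale: the FDB points at 172.29.243.252 (infra01) while the expected VTEP is 172.29.242.66 (infra04).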

Kevin Carter (kevin-carter) wrote :

We're suggesting that, to potentially improve functionality, we upgrade the Juno code base to 2014.2.2. However, doing so requires a major release that updates all projects to the new tag. This is not something we can address immediately, but we will in the "next" Juno release.

Changed in openstack-ansible:
status: New → Incomplete
status: Incomplete → Triaged
milestone: none → next

Darragh O'Reilly (darragh-oreilly) wrote :

Oh, maybe not. I'm not sure whether the patch for 1367999 will fix this problem.

James Denton (james-denton) wrote :

Since it appears that 'bridge fdb replace' adds an entry if one doesn't exist and replaces it if one does, that may be the way to resolve this bug. That said, is this something that can/will be backported to Juno?
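To make the difference concrete, here is a toy in-memory model in Python of the add/del/replace behaviour seen in the logs in this report. It is only an illustration: the ToyFdb class is hypothetical, it keeps one remote VTEP per (device, MAC) pair, and it is not a description of how the kernel implements the FDB.

```python
class ToyFdb:
    """Toy model of one host's VXLAN FDB: (dev, mac) -> remote VTEP."""

    def __init__(self):
        self.entries = {}

    def add(self, mac, dev, dst):
        # Mirrors 'RTNETLINK answers: File exists' from the agent log.
        if (dev, mac) in self.entries:
            raise FileExistsError('File exists')
        self.entries[(dev, mac)] = dst

    def delete(self, mac, dev, dst):
        # Mirrors 'RTNETLINK answers: No such file or directory':
        # the delete only succeeds if the installed dst matches.
        if self.entries.get((dev, mac)) != dst:
            raise FileNotFoundError('No such file or directory')
        del self.entries[(dev, mac)]

    def replace(self, mac, dev, dst):
        # Add-or-overwrite: succeeds whether or not an entry exists.
        self.entries[(dev, mac)] = dst


MAC, DEV = 'fa:16:3e:5d:05:4f', 'vxlan-8'
fdb = ToyFdb()
fdb.add(MAC, DEV, '172.29.243.252')        # stale entry pointing at infra01
try:
    fdb.add(MAC, DEV, '172.29.242.66')     # what the agent attempts today
except FileExistsError as exc:
    print('add failed:', exc)
fdb.replace(MAC, DEV, '172.29.242.66')     # converges despite the stale entry
print(fdb.entries[(DEV, MAC)])
```

In this model a lost delete no longer blocks later updates, because replace does not depend on what was installed before.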

Darragh O'Reilly (darragh-oreilly) wrote :

Yes, it is a small, simple fix, so it should be easy to backport. Can you test it and confirm that it resolves this bug?

James Denton (james-denton) wrote :

We will be testing it either today or tomorrow and will report back.

James Denton (james-denton) wrote :

So this patch appeared to work at first glance, but it has the unfortunate side effect of breaking the ability to add broadcast entries to the forwarding table:

Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'bridge', 'fdb', 'replace', '00:00:00:00:00:00', 'dev', 'vxlan-19', 'dst', '172.29.243.252']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: Operation not supported\n'

This is repeated for each VTEP.

James Denton (james-denton) wrote :

Actually - hold off on that. It may have been patched incorrectly :O

Li Ma (nick-ma-z) wrote :

Yes, broadcast entries cannot be applied. I'll fix it soon.
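One way to handle both cases is to pick the "bridge fdb" operation per entry type: "replace" for unicast MACs, and "append" for the all-zero flooding entry, where several VTEPs legitimately share the same MAC. The Python sketch below illustrates that idea under those assumptions. The helper names are hypothetical, and this report does not state whether the merged change works this way.

```python
# All-zero MAC used for the broadcast/flooding entries shown in the log above.
FLOODING_MAC = '00:00:00:00:00:00'


def fdb_operation(mac):
    """Pick the 'bridge fdb' verb for this MAC (assumed policy, see lead-in)."""
    # Flood entries need one row per VTEP, so they are appended;
    # unicast entries have a single destination, so they are replaced.
    return 'append' if mac == FLOODING_MAC else 'replace'


def fdb_command(mac, dev, dst):
    """Return the argv for installing one FDB entry (does not run it)."""
    return ['bridge', 'fdb', fdb_operation(mac), mac, 'dev', dev, 'dst', dst]


print(fdb_command('00:00:00:00:00:00', 'vxlan-19', '172.29.243.252'))
print(fdb_command('fa:16:3e:5d:05:4f', 'vxlan-8', '172.29.242.66'))
```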

Changed in neutron:
assignee: nobody → Li Ma (nick-ma-z)

James Denton (james-denton) wrote :

The patch did work as expected when applied correctly.

Li Ma (nick-ma-z) wrote :

Got it. Thanks.

Li Ma (nick-ma-z) wrote :

The bug has been fixed via https://review.openstack.org/165137

Changed in neutron:
status: New → Fix Released
Changed in openstack-ansible:
milestone: next → 10.1.4
Changed in openstack-ansible:
milestone: 10.1.4 → 10.1.5

Darren Birkett (darren-birkett) wrote :

For openstack-ansible, the bump to the latest juno release (which includes the fix referenced in this bug) is here:

https://review.openstack.org/#/c/177388/

Once that merges, the work for openstack-ansible in this bug is complete.

Changed in openstack-ansible:
milestone: 10.1.5 → none
Changed in openstack-ansible:
milestone: none → 10.1.5
Changed in openstack-ansible:
status: Triaged → Fix Committed
assignee: nobody → Kevin Carter (kevin-carter)
status: Fix Committed → Invalid
milestone: 10.1.5 → 9.0.9
milestone: 9.0.9 → none