L2 pop notifications are not reliable

Bug #1860521 reported by Oleg Bondarev
This bug affects 3 people

Bug Description

Problem: nodes/VMs in an L2 segment can lose connectivity (e.g. missing VXLAN tunnels or OVS flows) because of partial RabbitMQ unavailability, RPC message loss, or an agent failing to apply FDB entry updates.

Why: currently the neutron server sends FDB entries to L2 agents one-way (no feedback), so an agent has no way to detect whether all required tunnels/flows have been built. Likewise, the server has no way to detect whether all sent FDB entries were delivered and the required flows applied. If some messages are lost, only an agent restart fixes the resulting issues.

Way to address: a new synchronization mechanism on the L2 agent side, which periodically requests the network topology from the server, compares it with the configuration actually applied on the node, and applies any missing parts.
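A rough sketch of what one pass of such a sync could look like, purely for illustration. Everything here is an assumption rather than existing Neutron code: the `FdbEntry` tuple layout, the `diff_fdb` and `sync_once` names, and the four callbacks that stand in for the RPC request to the server and for reading/writing the node's tunnels and flows. Removing stale entries goes beyond what the proposal asks for, so it is optional.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

# Hypothetical representation of one FDB entry:
# (network_id, remote_tunnel_ip, mac_address)
FdbEntry = Tuple[str, str, str]


def diff_fdb(expected: Set[FdbEntry], applied: Set[FdbEntry]) -> Dict[str, Set[FdbEntry]]:
    """Compare the server's view with what is configured on the node."""
    return {
        "missing": expected - applied,  # known to the server, absent on the node
        "stale": applied - expected,    # present on the node, unknown to the server
    }


def sync_once(
    fetch_expected: Callable[[], Iterable[FdbEntry]],
    read_applied: Callable[[], Iterable[FdbEntry]],
    add_entry: Callable[[FdbEntry], object],
    remove_entry: Optional[Callable[[FdbEntry], object]] = None,
) -> Dict[str, Set[FdbEntry]]:
    """One pass of the periodic sync: fetch, diff, apply the missing parts."""
    delta = diff_fdb(set(fetch_expected()), set(read_applied()))
    for entry in sorted(delta["missing"]):
        add_entry(entry)
    # Stale cleanup is an extra assumption, so it only runs when requested.
    if remove_entry is not None:
        for entry in sorted(delta["stale"]):
            remove_entry(entry)
    return delta


# Toy data: the node lost the update for the second remote port.
server = {
    ("net-1", "10.0.0.2", "fa:16:3e:00:00:01"),
    ("net-1", "10.0.0.3", "fa:16:3e:00:00:02"),
}
node = {("net-1", "10.0.0.2", "fa:16:3e:00:00:01")}

delta = sync_once(lambda: server, lambda: node, node.add)
```

In a real agent the pass would run on a timer, and the interval would trade recovery time against the extra load on the server.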

Option 2: move from RPC fanouts and casts to RPC calls, which guarantee message delivery. Concerns: scalability and increased load on the neutron server.
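To make the difference concrete, here is a toy in-process model of the two styles. It is not oslo.messaging; `LossyBus`, its `drop` parameter and `agent_handler` are made up for the example. The point it tries to show is that a lost cast is invisible to the sender, while a lost call at least surfaces as an error the sender can retry on.

```python
class LossyBus:
    """Toy stand-in for the message bus; drops messages by send index."""

    def __init__(self, handler, drop=()):
        self._handler = handler   # agent-side callback
        self._drop = set(drop)    # indexes of messages that get lost
        self._sent = 0

    def _deliver(self, payload):
        index = self._sent
        self._sent += 1
        if index in self._drop:
            return False, None
        return True, self._handler(payload)

    def cast(self, payload):
        """Fire and forget: the sender learns nothing about delivery."""
        self._deliver(payload)

    def call(self, payload):
        """Wait for a reply: a lost message shows up as an error."""
        delivered, reply = self._deliver(payload)
        if not delivered:
            raise TimeoutError("no reply from agent")
        return reply


applied = []


def agent_handler(entry):
    applied.append(entry)
    return "ok"


bus = LossyBus(agent_handler, drop={0, 1})
bus.cast("fdb-a")        # message 0 is lost; the sender cannot tell

outcome = []
for _ in range(2):       # message 1 is lost, message 2 gets through
    try:
        outcome.append(bus.call("fdb-a"))
        break
    except TimeoutError:
        outcome.append("timeout")
```

The cost is visible even in the toy: every call blocks the sender until the agent answers, which is where the scalability concern comes from.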

Slawek Kaplonski (slaweq) wrote :

Hi Oleg,

I will add this RFE to the agenda of our next drivers meeting: http://eavesdrop.openstack.org/#Neutron_drivers_Meeting - so it would be great if You could join in case there are any additional questions. But the RFE should be discussed even if You are not able to attend the meeting.

tags: added: rfe-triaged
Slawek Kaplonski (slaweq) wrote :

We discussed this RFE at our last drivers team meeting: http://eavesdrop.openstack.org/meetings/neutron_drivers/2020/neutron_drivers.2020-02-14-14.00.log.html#l-57

Summary of that discussion:
* we don't want to go with option 2 from Your proposal - switching to call() methods wouldn't scale, as You also mentioned, and that's not a good solution in our opinion,

* we lean towards some sort of sync mechanism, and that should be explored more in the spec.

So, can You propose a spec with a detailed description of how such a sync mechanism should work to solve this problem? We can then review it and continue the discussion in the spec.

tags: removed: rfe
Slawek Kaplonski (slaweq) wrote :

Hi Oleg,

Are You still going to work on this RFE? Will You propose a spec for it?

Oleg Bondarev (obondarev) wrote :

Hi Slawek,

Not in the near future, I'm afraid.

Slawek Kaplonski (slaweq) wrote :

Thx Oleg for the info. I will mark this RFE as postponed for now. Feel free to reopen it if You want to work on it. We can always discuss it again during a drivers meeting.

tags: added: rfe-postponed
removed: rfe-triaged
Changed in neutron:
status: New → Opinion
importance: Undecided → Wishlist