Implement consistency check and self-healing for SDN-managed fabrics

Bug #1829449 reported by Jacob Anders on 2019-05-17
This bug affects 1 person
Affects: neutron
Importance: Wishlist
Assigned to: Unassigned

Bug Description

When an SDN mechanism driver is used in Neutron (our site uses mlnx_sdn_assist, but the issue is not limited to this driver; we hear about similar problems with at least three other SDN solutions), no consistency checking is applied to the fabric after the initial port configuration. If something goes wrong in the SDN layer after Neutron issues a request to the SDN controller, and the requested configuration is not applied correctly, Neutron has no way of knowing. Ideally such scenarios would not occur, but feedback from operators indicates that they occasionally do, for a variety of reasons. When they do, the user impact is significant, because the Neutron and SDN state has to be reconciled manually, which is generally non-trivial.

If SDN mechanism drivers are not used and standard Open vSwitch-based networking is configured, neutron-openvswitch-agent periodically checks the port configuration and enforces the desired state if needed. Investigating whether and how this could be applied to SDN in the general case would probably be a logical first step.

It would be very valuable for operators of SDN-based clouds to be able to:

Have Neutron poll the SDN controller to check the state of each port
Have Neutron “push” the state of each port to make sure that the SDN state is consistent with the Neutron state
Ensure that each SDN solution supported with OpenStack provides support for those actions

Initially these actions could be triggered manually (or from a monitoring system); later on they would likely become a periodic task, adding self-healing capabilities to SDN-based OpenStack installations.
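
To make the poll and push actions above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `PortState`, `FakeSdnController`, `check_ports` and `sync_ports` are hypothetical names invented for this example and are not part of Neutron, ML2 or any SDN driver. The controller is an in-memory stand-in, and a real driver would talk to its own controller API instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortState:
    """Desired or observed state of one port (fields are illustrative)."""
    port_id: str
    segmentation_id: int
    admin_state_up: bool = True


class FakeSdnController:
    """In-memory stand-in for an SDN controller; not a real driver API."""

    def __init__(self):
        self._ports = {}

    def get_port(self, port_id):
        # Returns None when the controller has no record of the port.
        return self._ports.get(port_id)

    def put_port(self, state):
        self._ports[state.port_id] = state


def check_ports(neutron_ports, sdn):
    """Poll step: return IDs of ports whose SDN state differs from Neutron's."""
    return [p.port_id for p in neutron_ports if sdn.get_port(p.port_id) != p]


def sync_ports(neutron_ports, sdn):
    """Push step: re-apply Neutron's state for every inconsistent port."""
    stale = set(check_ports(neutron_ports, sdn))
    for port in neutron_ports:
        if port.port_id in stale:
            sdn.put_port(port)
    return sorted(stale)


# Hypothetical data: port-b has drifted on the controller side.
neutron_ports = [PortState("port-a", 100), PortState("port-b", 200)]
sdn = FakeSdnController()
sdn.put_port(PortState("port-a", 100))
sdn.put_port(PortState("port-b", 999))

inconsistent = check_ports(neutron_ports, sdn)  # poll only, no changes made
repaired = sync_ports(neutron_ports, sdn)       # push Neutron's view
remaining = check_ports(neutron_ports, sdn)     # poll again after the push
```

Keeping the check separate from the sync would let an operator or a monitoring system run the read-only poll first, and only later wire the push into a periodic task.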

Slawek Kaplonski (slaweq) wrote :

Hi,

Isn't https://review.opendev.org/#/c/565463/ something that already tries to address this problem? Can you check this proposed spec? Thx.

tags: added: rfe

Ryan Tidwell (ryan-tidwell) wrote :

Is part of the concern here that the ML2 mechanism driver interface doesn't allow for any feedback to be consumed from the mechanism it is controlling? Without special vendor extensions, it does seem like third-party drivers may be lacking some of the functionality that ML2+OVS (as an example) has in its RPC API. Are you suggesting something akin to an RPC API for third-party mech drivers and controllers to leverage? Just thinking out loud, I also wonder where we draw the line of demarcation between Neutron and third-party drivers. This sounds like a potential RFE; I think drilling into the details of what this would look like would help me.

Jacob Anders (jacobanders) wrote :

Hi Slawek,

The spec you referenced does make a lot of sense. I'm still going through it and hope to provide more feedback by the end of the week.

Ryan - all valid points / concerns.

I shared this bug with fellow SDN users (for an IRC discussion on Wednesday) as well as my colleagues at Mellanox, and I hope to get some feedback shortly. I will keep you posted.

Thank you,
Jacob

Miguel Lavalle (minsel) on 2019-05-23
Changed in neutron:
importance: Undecided → Wishlist

Miguel Lavalle (minsel) wrote :

Hi Jacob,

Thanks for following up on our conversation with the filing of this RFE. Any feedback on https://review.opendev.org/#/c/565463/? As far as I can tell, it is very similar to what you propose.

tags: added: rfe-confirmed
removed: rfe

Jacob Anders (jacobanders) wrote :

Hi Miguel and All,

I've reviewed the proposal as well as the related spec. I have also had initial discussions with the Scientific SIG, which is where this idea originated. To the best of my knowledge, we're heading in the right direction: we're trying to solve a similar, if not the same, problem and we can definitely join efforts. The consistency check and sync functions look like they should be able to do what we're after.

I hope to run these ideas by my colleagues at Mellanox in the coming week (they were away attending a conference this week), as they have more depth in this topic. I will also have further discussions with the Scientific SIG members. I will provide further updates as information becomes available; please expect to hear from me in about a week's time.

Best Regards,
Jacob

Stig Telfer (stigtelfer) wrote :

Some mechanism for coherency between Neutron and SDN controllers seems well overdue. The proposed spec doesn't ensure ordering but does improve the rate of detection of out-of-order operations, and provides for a recovery method for the times when it fails. Alternatively a journal-based approach may improve serialised event ordering, but doesn't automatically solve the scenario where the SDN controller fails to apply the requested configuration.

Jacob Anders (jacobanders) wrote :

Apologies for the delay.

I've spoken to my colleagues at Mellanox and they highlighted one failure mode that doesn't seem to be addressed in the spec: leftover/garbage entries in the SDN DB. These can be caused by port deletion requests that failed silently, or by re-initialising the Neutron DB without clearing the SDN DB first. Do you think this is something that could be added to this spec, or is it out of scope? In my experience this failure mode (silent deletion failure) is not uncommon.
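
To illustrate the leftover-entry failure mode, a rough sketch in Python follows. It assumes both sides can list their port IDs; `find_orphans` and `purge_orphans` are hypothetical helpers made up for this example, and the SDN DB is modelled as a plain dict rather than any real controller API. It is not taken from the spec.

```python
def find_orphans(neutron_port_ids, sdn_port_ids):
    """Return SDN port IDs that Neutron has no record of."""
    return sorted(set(sdn_port_ids) - set(neutron_port_ids))


def purge_orphans(neutron_port_ids, sdn_ports, dry_run=True):
    """Report orphaned entries; delete them only when dry_run is False.

    sdn_ports is a dict standing in for the SDN DB (an assumption).
    """
    orphans = find_orphans(neutron_port_ids, sdn_ports)
    if not dry_run:
        for port_id in orphans:
            del sdn_ports[port_id]
    return orphans


# Hypothetical data: "port-old" was deleted in Neutron, but the delete
# request towards the SDN controller failed silently.
neutron_ids = ["port-a", "port-b"]
sdn_db = {"port-a": {}, "port-b": {}, "port-old": {}}

reported = purge_orphans(neutron_ids, sdn_db)                 # dry run only
removed = purge_orphans(neutron_ids, sdn_db, dry_run=False)   # actual cleanup
```

Defaulting to a dry run is a deliberate choice in this sketch: after a Neutron DB re-initialisation every SDN entry would look orphaned, so an operator would probably want to review the list before anything is deleted.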

Other than that I think this spec is a step in the right direction that can help bring Neutron/SDN closer to parity with Neutron/OVS.

Would it make sense to discuss this further in the neutron-drivers meeting on Friday?

Miguel Lavalle (minsel) wrote :

Hi Jacob,

Thanks for your follow-up. My recommendation is to propose, in https://review.opendev.org/#/c/565463/, handling the scenario that you point out in note #7 above. If for some reason that scenario doesn't fit in that spec, we can discuss it further in the drivers meeting. But let's first give it a try with the existing spec.

Miguel Lavalle (minsel) wrote :

Since we have agreed to pursue this RFE as part of https://review.opendev.org/#/c/565463/, I am going to mark it as approved

tags: added: rfe-approved
removed: rfe-confirmed