L2 extensions flows lost on openvswitch restart

Bug #1646526 reported by Thomas Morin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Thomas Morin

Bug Description

(taking networking-bgpvpn as an example)

On an openvswitch restart (desired restart or after a crash), neutron openvswitch agent detects the restart, and sets up br-int and br-tun again. However, the base setup for br-tun/br-mpls/br-int integration is done only at startup, via the initialize callback of the agent extension API. As a result on openvswitch restart the br-tun flows to forward traffic to and from br-mpls are lost, and the BGPVPN connectivity is interrupted on the compute node.

Changed in bgpvpn:
status: New → Confirmed
importance: Medium → High
Changed in bgpvpn:
assignee: nobody → Thomas Morin (tmmorin-orange)
milestone: none → 6.0.0
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Indeed we don't call ext_manager.initialize() on OVS restart detection, hence flows not set. I believe it's to be fixed on neutron side, so changing the component accordingly.

A logical fix could be moving .initialize() call under rpc_loop, but since .initialize() usually registers consumers, and this should happen before we call to consume_in_threads() [and this happens on agent __init__], it probably won't work.

An alternative to that could be making note of all flow operations made by each extension on OVSCookieBridge and replay those on OVS restart. There is a problem with it though: it may be very wasteful (most entries in the journal could be irrelevant), and even unsafe (we may expose the node to a state from the journal past).

To avoid that problem, we could try to keep the latest flow state in-memory and dumb just that. I am not sure if mere interception of [add|mod|delete]_flow gives enough information to maintain that state though. (That's another case where we suffer from lack of proper flow manager in neutron.)

If nothing works, we can add an 'OVS restarted' event that extensions could intercept (or entry point to be called by agent in such scenario). Of course, the extension would need to implement it, which will require from it to maintain the in-memory map of its flows.

affects: bgpvpn → neutron
Changed in neutron:
milestone: 6.0.0 → none
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Huh, it's ofc dumb -> dump in the last comment. ;)

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

The two ("replay" and "dump" approaches) are interesting.

On the "replay" approach: the journal would grow a long run time, and replaying it would take an increasing amount of time, since the replay would have to block other flow operations (I think), then this would become not viable.

On the "dump" approach:
- if understood as "interception of [add|mod|delete]_flow" then I guess you mean "build an internal representation of all flows based on intercepted [add|mod|delete]_flow" (or it would be just like the replay approach): the trouble is that to build a faithful representation¸ it would require perfectly reproducing the behavior of OVS, which may not be easy (e.g. determining which existing flows are impacted by a mod flows)
- if understood as an "ovs-ofctl dump-flows snapshot", then it seems that it may work; note however that the snapshot would have to be done after each flow operation (because doing it after we noticed OVS restarted would be too late), which may end up having a significant performance impact.

Without having thought about these approaches, my naive suggestion was very much according to your last proposal. Indeed it requires support by extensions, but well the extensions can keep track of flow state, and can even keep track of higher level information that they can translate as flow information whenever needed.

With the "restart event" approach, couldn't we have the following (as an alternative to requiring extensions to having specific code to be able to reinstall flows):
- rely on an "OVS restarted" event to have the extension setup again the flows that are independent of any specific port -- e.g. in BGPVPN we have such flows to map traffic between bridges (this step may go away when [1] is achieved, by the way)
- have the agent call handle_port for all ports (or handle_port with an ovs_restarted context flag ?), and let the extension setup everything for this port like it would to for a newly appearing port -- the assumption would be that handle_port would have to be idempotent and not "lazy" (not skip setting up flows because it would already have some internal higher-level state telling him things are already ready)

[1] https://review.openstack.org/#/c/320439/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/421771

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

networking-bgpvpn corresponding bug: https://bugs.launchpad.net/bgpvpn/+bug/1657689

summary: - bgpvpn functionality lost on openvswitch restart
+ L2 extensions flows lost on openvswitch restart
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/421771
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ea2cab0e15079490fb772ffa077f5aad236e8239
Submitter: Jenkins
Branch: master

commit ea2cab0e15079490fb772ffa077f5aad236e8239
Author: Thomas Morin <email address hidden>
Date: Wed Jan 18 10:50:44 2017 +0100

    openvswitch agent: add OVS_RESTARTED event

    This new event is aimed at informing that OVS has restarted, and in particular
    to let L2 extensions know that they may need to setup their flows again.

    Change-Id: I9aebe7ccc3e2f565b4339d42842d89b911131b1f
    Closes-Bug: 1646526
    Partial-Bug: 1657689

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b3

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/444333

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/444333
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dbf657a7998687e90fc79fa8fcf182a4bc5c3f28
Submitter: Jenkins
Branch: stable/newton

commit dbf657a7998687e90fc79fa8fcf182a4bc5c3f28
Author: Thomas Morin <email address hidden>
Date: Wed Jan 18 10:50:44 2017 +0100

    openvswitch agent: add OVS_RESTARTED event

    This new event is aimed at informing that OVS has restarted, and in particular
    to let L2 extensions know that they may need to setup their flows again.

    Conflicts:
     neutron/tests/functional/agent/test_l2_ovs_agent.py

    Change-Id: I9aebe7ccc3e2f565b4339d42842d89b911131b1f
    Closes-Bug: 1646526
    Partial-Bug: 1657689
    (cherry picked from commit ea2cab0e15079490fb772ffa077f5aad236e8239)

tags: added: in-stable-newton
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.3.0

This issue was fixed in the openstack/neutron 9.3.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.