Restarting neutron openvswitch agent causes network hiccup by throwing away all flows

Bug #1383674 reported by Robert van Leeuwen on 2014-10-21
This bug affects 27 people
Affects: neutron
Importance: High
Assigned to: Ann Taraday

Bug Description

If you reload/restart the neutron-openvswitch daemon it will first throw away all existing flows.
This will break all networking until the flows are re-created.

It can take some time before everything is back in the flow table on machines with quite a bit of instances or a busy neutron server.

Please make the clearing of the flows optional during a restart.

tags: added: neutron-openvswitch-agent
summary: - Restarting openvswitch daemon causes network hiccup by throwing away all
- flows
+ Restarting neutron openvswitch agent causes network hiccup by throwing
+ away all flows
Eugene Nikanorov (enikanorov) wrote :

Which version are you using? There was such an issue once, but it was fixed at the beginning of the Icehouse cycle.

tags: added: ovs
removed: neutron-openvswitch-agent
Changed in neutron:
status: New → Incomplete

I'm using the latest Icehouse from the RDO repo:
openstack-neutron-openvswitch-2014.1.3

The flows are still being thrown away and re-created with every restart of the agent.

Kyle Mestery (mestery) wrote :

Please read the description of the bug here:

https://bugs.launchpad.net/tripleo/+bug/1290486

Without reprogramming flows, the agent used to clear out existing flows and leave only the NORMAL flow in place, which is clearly undesirable for many reasons. I think what you're proposing is that we add an option not only to disable reprogramming flows on agent restart, but also to disable clearing all flows. To me this is doable, but it is perhaps a security issue as well. But if we make it optional (and not enabled by default), it may satisfy your use case.

Now that I've thought about this longer:

Making an option to not clear all flows is not a good enough long-term solution (for now: yes please!)

My reasoning for why it requires something better:
To be honest I think "my use-case" is pretty much the standard use-case.
This behaviour makes it impossible to do any config changes (e.g. logging) or version updates without creating a network outage.
Because of this I think disabling the clearing of flows will become the default for most users.
This makes a hole in the security fix from bug 1290486.

So I think the agent should do something smarter, e.g.:
Instead of blindly breaking all network traffic on startup, first create all the new flows and then throw away every flow except the ones just created.
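The install-then-prune idea can be sketched in a few lines of Python (illustrative only: the function name and the flow-table-as-list model are invented for this sketch; the real agent programs OpenFlow rules on the bridge via ovs-ofctl/ovsdb):

```python
# Sketch: replace flows without ever leaving the table empty.
# The flow table is modeled as a plain list of rule strings; in the real
# agent these would be OpenFlow rules installed on br-int/br-tun.

def graceful_replace(flow_table, new_flows):
    """Install new_flows first, then drop everything that is not new.

    The table is never empty at any point, so traffic keeps flowing
    during the swap (unlike delete-all-then-reinstall).
    """
    for flow in new_flows:
        if flow not in flow_table:
            flow_table.append(flow)          # step 1: add the new rules
    keep = set(new_flows)
    flow_table[:] = [f for f in flow_table   # step 2: prune stale rules
                     if f in keep]
    return flow_table

table = ["old:vlan1->tun", "old:vlan2->tun"]
graceful_replace(table, ["new:vlan1->tun", "new:vlan2->tun"])
print(table)  # ['new:vlan1->tun', 'new:vlan2->tun']
```

The ordering is the whole point: the deletion step runs only after the replacement rules are already in place.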

Disabling the clearing/reprogramming of flows upon restart is definitely doable, probably by skipping the bridge setup based on a new config option or some other means, but how do we ensure the agent cache is populated accurately so it stays in sync with the earlier flows? For example the local VLAN map, the iptables driver's firewall object, etc.?

Tom Fifield (fifieldt) on 2014-11-03
tags: added: ops
Tom Fifield (fifieldt) on 2014-11-03
Changed in neutron:
status: Incomplete → New
tags: removed: ops
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Changed in neutron:
assignee: nobody → maxiao@unitedstack.com (maxiao)
Miguel Angel Ajo (mangelajo) wrote :

In my opinion, the agent should not do this kind of clear/reset on restart by any means. It totally breaks the no-outage upgradability of the agents.

At agent boot we should always inspect the system and retrieve the old VLAN mappings and flows into the agent's own cache, along with the iptables rule zones (when we use them), etc.

maxiao@unitedstack.com (maxiao) wrote :

Hi, mangelajo

Thanks for your reply.
I will solve the problem by rebuilding the VLAN mapping during the first iteration of the agent's processing loop.
But some limitations exist: for example, if you change the mode from normal to DVR, you must make sure the flows are cleared. You can do this by setting the reset flag.

I will submit a commit for review today or tomorrow.

maxiao@unitedstack.com (maxiao) wrote :

Hi,all

I think the iptables zones do not matter, because in the loop setup_port_filters ensures the iptables rules are not changed.
So only the flows are the problem.

Fix proposed to branch: master
Review: https://review.openstack.org/142611

Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery) on 2015-01-08
Changed in neutron:
milestone: none → kilo-2
Kyle Mestery (mestery) on 2015-02-03
Changed in neutron:
milestone: kilo-2 → none

Is this still being worked on by someone? The merge request seems to be stalled...
If it is not being worked on, should the bug be set back to New/Confirmed?

As mentioned, this bug is crippling for doing updates on the agent because of the network outages it creates.
It would be very sad not to have it fixed in this release cycle.

Hi Robert,

I will update the patch this week.
And indeed, the patch that fixes this problem
has been in use in our public cloud service:

https://console.ustack.com/login

Thank you!


Haifeng.Yan (yanheven) wrote :

This patch is really important since it can decrease the downtime of VMs when restarting the ovs-agent.

Miguel Angel Ajo (mangelajo) wrote :

If you restart 3 agents at the same time, and in the meantime a multicast/broadcast
packet enters br-tun, we get a network storm (a loop between the 3 openvswitch br-tuns):

NodeA -> NodeB -> NodeC -> NodeA

This brings down the network.

This would be avoided if we at least re-created br-tun in fail-mode "secure",
because that does not install the default "NORMAL" switching rule on the switch
at creation (the origin of this problem).

We should raise the priority of this bug to the maximum available.

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/142611
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: In Progress → New
importance: Medium → High
Assaf Muller (amuller) on 2015-05-04
Changed in neutron:
assignee: maxiao@unitedstack.com (maxiao) → nobody
status: New → Confirmed
Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Miguel Angel Ajo (mangelajo) wrote :

There is another strategy which could work, using OpenFlow cookies:

1) At agent boot, mark the current flows with a cookie to make sure we can delete them later by reference.
2) Insert the new flows on the bridge.
3) Drop the old flows by cookie reference.

Ideally, if we recover the VLAN mapping from the openvswitch info at reboot, there should not be
any conflicts between new and old rules.
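The three steps above can be modeled in miniature (a toy sketch with invented names; real OpenFlow rules carry a 64-bit cookie field that ovs-ofctl can match and delete on):

```python
import uuid

# Toy model of cookie-based graceful restart: each flow carries a cookie
# identifying the agent run that installed it. A restarted agent picks a
# fresh cookie, reinstalls its flows, then deletes by stale cookie.

def restart_agent(flows, build_flows):
    """flows: list of (cookie, rule) tuples currently on the bridge."""
    new_cookie = uuid.uuid4().int & 0xFFFFFFFF  # step 1: fresh per-run stamp
    for rule in build_flows():                  # step 2: install new flows
        flows.append((new_cookie, rule))
    flows[:] = [(c, r) for c, r in flows        # step 3: drop stale cookies
                if c == new_cookie]
    return flows

bridge = [(0xAAAA, "vlan1->tun"), (0xAAAA, "vlan2->tun")]
restart_agent(bridge, lambda: ["vlan1->tun", "vlan2->tun"])
assert all(c != 0xAAAA for c, _ in bridge)  # old-run flows are gone
```

Because the prune keys on the cookie rather than on rule contents, any flow installed by a previous run (or by any other source without the new cookie) is swept away only after the replacement flows exist.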

Ann Taraday (akamyshnikova) wrote :

@Miguel Angel Ajo

Thanks for your comment! In fact, we are trying to adopt an idea similar to the one you described in https://review.openstack.org/182920.

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Ann Kamyshnikova (akamyshnikova)
Changed in neutron:
assignee: Ann Kamyshnikova (akamyshnikova) → Eugene Nikanorov (enikanorov)
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Ann Kamyshnikova (akamyshnikova)

In the context of Neutron upgrades with no VM downtime, this fix should be merged in Liberty to offer a better user experience when performing an OpenStack upgrade from Kilo to Liberty.
Can we mark this bug for the L3 milestone?

Reviewed: https://review.openstack.org/182920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=73673beacd75a2d9f51f15b284f1b458d32e992e
Submitter: Jenkins
Branch: master

commit 73673beacd75a2d9f51f15b284f1b458d32e992e
Author: Eugene Nikanorov <email address hidden>
Date: Mon May 11 03:10:29 2015 +0400

    Graceful ovs-agent restart

    When agent is restarted it drops all existing flows. This
    breaks all networking until the flows are re-created.

    This change adds an ability to drop only old flows.
    Agent_uuid_stamp is added for agents. This agent_uuid_stamp is set as
    cookie for flows and then flows with stale cookies are deleted during
    cleanup.

    Co-Authored-By: Ann Kamyshnikova<email address hidden>

    Closes-bug: #1383674

    DocImpact

    Change-Id: I95070d8218859d4fff1d572c1792cdf6019dd7ea

Changed in neutron:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/215596
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=053bfc5a4f03c620edf4d17a9bd7cebe001c142f
Submitter: Jenkins
Branch: master

commit 053bfc5a4f03c620edf4d17a9bd7cebe001c142f
Author: Ann Kamyshnikova <email address hidden>
Date: Fri Aug 21 15:13:25 2015 +0300

    Graceful OVS restart for DVR

    Graceful OVS restart that was introduced in I95070d8218859d4fff1d572c1792cdf6019dd7ea
    missed that flows are also dropped in setup_dvr_flows_on_integ_br.

    Related-bug: #1383674

    Change-Id: I7b24a159962af7b58c096a1b2766e2169e9f8aed


Reviewed: https://review.openstack.org/218710
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2c5f44e1b3bd4ed8a0b7232fd293b576cc8c1c87
Submitter: Jenkins
Branch: feature/pecan


tags: added: in-feature-pecan
Thierry Carrez (ttx) on 2015-09-03
Changed in neutron:
milestone: none → liberty-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-3 → 7.0.0
tags: added: kilo-backport-potential

As an operator I'd really like to see this backported to Kilo. We plan to upgrade to the latest stable/kilo before upgrading to Liberty, and having this backported would save us a network outage.

Kevin Benton (kevinbenton) wrote :

Unfortunately it's quite an invasive change[1] so it probably wouldn't survive a stable review.

1. https://review.openstack.org/#/c/182920/

Kyle Mestery (mestery) wrote :

In fact, a simple cherry-pick reveals it will require some work, so I tend to agree with Kevin's comments here.

Xav Paice (xavpaice) wrote :

That's a real shame, given that the vast majority of ops are still running Kilo (or earlier) and are only planning the Liberty upgrade now. Upgrading is only harder when we have to take data-plane downtime to do it.

Is it that the cherry-pick doesn't apply cleanly and needs a bit of refactoring, or is it because of the difference in behaviour? If it needs more effort, I'm happy to pitch in, but if it's not a desired change for that version then I'll consider carrying a local patch (ew).

Ann Taraday (akamyshnikova) wrote :

During Liberty a significant refactoring of the code was done, so applying the patch won't be easy. Also, in Kilo the local_vlan_mapping restoration is absent, so applying it to Kilo would need some extra work, and that is unacceptable for stable branches.

Joseph bajin (josephbajin) wrote :

This is definitely a big issue for many of us operators when we have to do any maintenance.
I have to agree with Xav and Clayton that while it may be a lot of work to get it backported, many of the large deployments are not going to be able to move quickly to Liberty or the latest and greatest code.

I would ask that the team reconsider and re-prioritize some work to allow this patch to be back-ported to at least Kilo.

Kyle Mestery (mestery) wrote :

The broader issue is that Kilo is now in security-only mode [1], meaning that even if cherry-picking the patches were easy, we couldn't apply them.

[1] https://wiki.openstack.org/wiki/Releases

David Medberry (med) wrote :

I guess the operators will just have to come up with their own backport.

Kevin Benton (kevinbenton) wrote :

Just to be clear, the upgrade impact only affects those upgrading from Juno to Kilo. The liberty agent supports the hitless startup so changing to the liberty agent from Kilo should not cause an outage.

So if you are already on Kilo, the only thing a backport would help with is maintenance restarts.

Something that may be worth testing is running the liberty agent with a Kilo server. In fact, making the liberty agent backwards-compatible with a Kilo server is something sellable as a backport since liberty is fresh and it shouldn't be too invasive of a change.

Kevin, are you sure that's the case? My understanding is that the restart feature depended on there being a cookie attached to each flow and that flows missing the cookie (all of them from Kilo) would be deleted.

Even if that's true (and it'd be great if it is), many operators are going to have to upgrade to Kilo.1 before going to Liberty because of the Nova DB issue. Since we're still using packages for nova and neutron, there isn't an easy way to upgrade just Nova without upgrading Neutron. Because of that we'll have to at least take one data plane outage when upgrading to Kilo.1 before going to Liberty.

Update: Reading over the patch again, it looks like maybe only flows with cookies are considered eligible to be pruned, so maybe this does help with Liberty upgrades?

Kevin Benton (kevinbenton) wrote :

Yes, the cookie is only used to determine which flows are stale. When the liberty agent starts it will overwrite all required flows with its own cookie value and then drop all flows that do not match the cookie. So flows from a Kilo agent would be handled the same as flows from a previous run of the Liberty agent.

Would running the liberty agent be feasible for you? It would be less work to make sure the Liberty agent is backwards compatible with a Kilo neutron server than to back-port all of the work that hitless restart depended on...

Kyle Mestery (mestery) wrote :

My comment in #29 was wrong, actually. Per the email thread here [1], the Neutron team is actively interested in continuing to do backports to Kilo. This particular change may be challenging, as Kevin says in #33.

[1] http://lists.openstack.org/pipermail/openstack-dev/2015-November/078557.html

Kyle: Thanks, it's good to hear that it's possible from a policy standpoint. I know nearly all operators are still on Kilo (at best) so it's good to hear it's still possible to get bug fixes also.

Kevin: It's probably not practical for us to run the Liberty agent at this point. We've been slowly working towards being able to do that, but we're not there yet for Neutron and probably won't be until after we have Liberty deployed. I suspect we're not unusual in this regard, I believe most operators are using packages which makes that sort of thing fairly difficult due to upstream support. Even if you are doing your own packages, it becomes difficult due to the python dependencies.

Erik Colnick (erikcolnick) wrote :

Kyle: Is a backport for this to stable/kilo currently being worked on? If not, I am willing to take this on.

Jian Wen (wenjianhn) wrote :

Erik: Don't forget to backport the patch that fixes bug 1515075.
Thanks.

Change abandoned by Doug Wiegley (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/249935
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.
