[SRU] iptables_manager can run very slowly when a large number of security group rules are present

Bug #1453264 reported by Brian Haley
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
Undecided
Unassigned
Icehouse
Fix Committed
Undecided
Unassigned
neutron
Fix Released
Undecided
Kevin Benton
Kilo
Fix Released
Undecided
Unassigned
neutron (Ubuntu)
Fix Released
Medium
Unassigned
Trusty
Fix Released
Medium
Unassigned

Bug Description

[Impact]

We have customers that typically add a few hundred security group rules or more. We also typically run 30+ VMs per compute node. When about 10+ VMs with a large SG set all get scheduled to the same node, the L2 agent (OVS) can spend many minutes in the iptables_manager.apply() code, so much so that by the time all the rules are updated, the VM has already tried DHCP and failed, leaving it in an unusable state.

While there have been some patches that tried to address this in Juno and Kilo, they've either not helped as much as necessary, or broken SGs completely due to re-ordering the of the iptables rules.

I've been able to show some pretty bad scaling with just a handful of VMs running in devstack based on today's code (May 8th, 2015) from upstream Openstack.

[Test Case]

Here's what I tested:

1. I created a security group with 1000 TCP port rules (you could alternately have a smaller number of rules and more VMs, but it's quicker this way)

2. I booted VMs, specifying both the default and "large" SGs, and timed from the second it took Neutron to "learn" about the port until it completed it's work

3. I got a :( pretty quickly

And here's some data:

1-3 VM - didn't time, less than 20 seconds
4th VM - 0:36
5th VM - 0:53
6th VM - 1:11
7th VM - 1:25
8th VM - 1:48
9th VM - 2:14

While it's busy adding the rules, the OVS agent is consuming pretty close to 100% of a CPU for most of this time (from top):

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25767 stack 20 0 157936 76572 4416 R 89.2 0.5 50:14.28 python

And this is with only ~10K rules at this point! When we start crossing the 20K point VM boot failures start to happen.

I'm filing this bug since we need to take a closer look at this in Liberty and fix it, it's been this way since Havana and needs some TLC.

I've attached a simple script I've used to recreate this, and will start taking a look at options here.

[Regression Potential]

Minimal since this has been running in upstream stable for several releases now (Kilo, Liberty, Mitaka).

Revision history for this message
Brian Haley (brian-haley) wrote :
Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Kevin Benton (kevinbenton)
status: New → In Progress
Revision history for this message
Brian Haley (brian-haley) wrote :

Kevin - I also think the change you mentioned at the bar might help too - make a chain for a particular SG containing it's rules, then have jump rules from the instance-specific chain to each SG chain it belongs too. That's deeper surgery in the firewall code, but should be doable. I can work on that after the summit (call me a slacker for not doing it during the summit :)

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/184318
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a3934d982ef29d8851450b5586319201baa0122
Submitter: Jenkins
Branch: master

commit 7a3934d982ef29d8851450b5586319201baa0122
Author: Kevin Benton <email address hidden>
Date: Fri May 15 17:10:15 2015 -0700

    Switch to dictionary for iptables find

    The code to find the matching entry was scanning through a
    list of all rules for every rule. This became extremely slow
    as the number of rules became large, leading to long delays
    waiting for firewall rules to be applied.

    This patch switches to the use of a dictionary so the cost
    becomes a hash lookup instead of a list scan.

    Closes-Bug: #1453264
    Closes-Bug: #1455675
    Change-Id: I1e6fe5e50b9c13066c966c252cadc8ed1d08f686

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/196701
Reason: This is lacking the functional fix [1], so I'll propose a new merge commit which includes that one.

[1] https://review.openstack.org/#/c/196711/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196920

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)
Download full text (171.5 KiB)

Reviewed: https://review.openstack.org/196920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f759c077f8f860c13db92d2ea6b353ef6b70900
Submitter: Jenkins
Branch: feature/pecan

commit 8123144fadd7c5d5e6e56a76ea860512619a2cf6
Author: Moshe Levi <email address hidden>
Date: Sun Jun 28 14:37:14 2015 +0300

    Fix Consolidate sriov agent and driver code

    This patch add mising __init to mech_sriov/mech_driver/
    and update the setup.cfg to the new agent entrypoint

    Trivial Fix

    Change-Id: I53a527081feb78472f496675bbb3c5121d38a14a

commit 8942fccf02e6e179d47582fdb2792a1ca972da21
Author: Assaf Muller <email address hidden>
Date: Mon Jun 29 11:38:51 2015 -0400

    Remove failing SafeFixture tests

    The fixtures 1.3 release attempted to fix the fixtures resource
    leak issue, but failed to do so completely. Our own SafeFixture
    is still needed: The 1.3 release broke our SafeFixture tests,
    but not the usage of SafeFixture itself. This patch removes
    those failing tests for now to unbreak the gate. Jakub reported
    a bug on fixtures 1.3:
    https://bugs.launchpad.net/python-fixtures/+bug/1469759

    We will continue to use SafeFixture until that bug is fixed
    in fixtures, at which point we will be able to require
    fixtures > 1.3.

    Change-Id: I59457c3bb198ff86d5ad55a1e623d008f0034b8f
    Closes-Bug: #1469734

commit 71dffb0a2c1720cd8233a329d32958a0160dd6f5
Author: Kevin Benton <email address hidden>
Date: Mon Jun 29 08:27:41 2015 +0000

    Revert "Removed test_lib module"

    This reverts commit 9a6536de6e1a7fe9b2552adc142e254426b82b6f.

    We pulled all of the plugins out of the tree, many of which still inherit
    from neutron test classes. This change then stated that we no longer
    support testing other plugins. I think this is a bit premature and should
    have been discussed under the subject
    "Neutron plugins can't use neutron plugin unit tests" or something
    similar.

    Change-Id: I68318589f010b731574ea3bfa8df98492bab31fc

commit b20fd81dbd497e058384a0af065dd0f1fdc4c728
Author: Jakub Libosvar <email address hidden>
Date: Fri Jun 5 14:32:51 2015 +0000

    Refactor NetcatTester class

    Following capabilities were added:
       - used transport protocol is passed as a constant instead of bool
       - src port for testing was added
       - connection can be established explicitly
       - change constructor parameters of NetcatTester

    As a part of removing bool for protocol definition
    get_free_namespace_port() was also modified to match the behavior.

    Change-Id: Id2ec322e7f731c05a3754a65411c9a5d8b258126

commit 83e37980dcd0b2bad6d64dd2cb23bcd2891cafca
Author: jingliuqing <email address hidden>
Date: Sat Jun 27 13:41:54 2015 +0800

    Use REST rather than ReST

    Change-Id: I06c9deaab58c5ec13bfeec39fb8fd4b1fe21f42d

commit 1b60df85ba3ad442c2e4e7e52538e1b9a1bf9378
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:34:38 2015 -0700

    Add a double-mock guard to the base test case

    Use mock to patch mock with a check to prevent multiple active
    patches to the...

tags: added: in-feature-pecan
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/224900

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/224900
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e6a0e7d27054d5d179c513b3c826a5eaef2077dd
Submitter: Jenkins
Branch: stable/kilo

commit e6a0e7d27054d5d179c513b3c826a5eaef2077dd
Author: Kevin Benton <email address hidden>
Date: Fri May 15 17:10:15 2015 -0700

    Switch to dictionary for iptables find

    The code to find the matching entry was scanning through a
    list of all rules for every rule. This became extremely slow
    as the number of rules became large, leading to long delays
    waiting for firewall rules to be applied.

    This patch switches to the use of a dictionary so the cost
    becomes a hash lookup instead of a list scan.

    Closes-Bug: #1453264
    Closes-Bug: #1455675
    (cherry picked from commit 7a3934d982ef29d8851450b5586319201baa0122)

    Conflicts:

     neutron/agent/linux/iptables_manager.py

    Change-Id: I1e6fe5e50b9c13066c966c252cadc8ed1d08f686

tags: added: in-stable-kilo
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-1 → 7.0.0
Revision history for this message
Billy Olsen (billy-olsen) wrote : Re: iptables_manager can run very slowly when a large number of security group rules are present

Uploading debdiff based on what is currently available in trusty-proposed since that has been verified and pending release.

description: updated
Mathew Hodson (mhodson)
tags: added: trusty
Changed in neutron (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Billy, Thanks for the patch! I've pushed it to lp:~ubuntu-server-dev/neutron/icehouse. Once 1:2014.1.5-0ubuntu5 gets promoted from trusty-proposed we can upload 1:2014.1.5-0ubuntu6.

Changed in neutron (Ubuntu):
status: New → Invalid
Changed in neutron (Ubuntu Trusty):
status: New → Fix Committed
Changed in cloud-archive:
status: New → Invalid
summary: - iptables_manager can run very slowly when a large number of security
- group rules are present
+ [SRU] iptables_manager can run very slowly when a large number of
+ security group rules are present
Mathew Hodson (mhodson)
Changed in neutron (Ubuntu Trusty):
importance: Undecided → Medium
Changed in neutron (Ubuntu):
status: Invalid → Fix Released
Changed in cloud-archive:
status: Invalid → Fix Released
Mathew Hodson (mhodson)
Changed in neutron (Ubuntu Trusty):
status: Fix Committed → In Progress
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Great thanks Matt & Corey

Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Hello Brian, or anyone else affected,

Accepted neutron into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/1:2014.1.5-0ubuntu6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I have received direct feedback from a user in production that the proposed package (1:2014.1.5-0ubuntu6) improves the performance and resolves this bug. Marking as verified.

tags: added: verification-done
removed: verification-needed
tags: removed: trusty
Revision history for this message
Martin Pitt (pitti) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu6

---------------
neutron (1:2014.1.5-0ubuntu6) trusty; urgency=medium

  * iptables_manager can run very slowly when a large number of security group
    rules are present (LP: #1453264)
    - d/p/use-dictionary-for-iptables-find.patch: Use a dictionary for looking
      up iptables rules rather than an iterator.

 -- Billy Olsen <email address hidden> Mon, 29 Aug 2016 15:06:06 -0700

Changed in neutron (Ubuntu Trusty):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.