Conntrack entry removal can take a long time on large deployments

Bug #1745468 reported by Brian Haley on 2018-01-25
This bug affects 3 people
Affects: neutron
Importance: High
Assigned to: Brian Haley

Bug Description

On a large deployment of about 1000 instances, instance deletion (neutron port deletion) or security group rule changes can take a very long time. We have seen it take hours in some cases.

Switching the IP conntrack manager to netlink-lib (https://review.openstack.org/#/c/470912/) will help, but it could still lead to long delays at higher instance counts. Also, that change might not be easily backportable to older releases. Doing the conntrack entry deletion in a thread, which has been proposed before, could help alleviate this by letting the caller (the OVS agent) get back to other work sooner.
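
To make the threading idea concrete, here is a minimal sketch using only the Python standard library. It is not the neutron implementation: ConntrackWorkerPool, delete_fn and num_workers are hypothetical names, and the real agent may use a different concurrency mechanism. The caller enqueues an entry and returns immediately, while worker threads perform the slow deletion in the background.

```python
import queue
import threading


class ConntrackWorkerPool:
    """Hypothetical helper: runs conntrack cleanup off the caller's thread."""

    def __init__(self, delete_fn, num_workers=2):
        # delete_fn is an assumed callable that removes the conntrack
        # state for one queued entry (for example, one IP address).
        self._delete_fn = delete_fn
        self._queue = queue.Queue()
        for _ in range(num_workers):
            # Daemon threads so an idle pool never blocks interpreter exit.
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, entry):
        """Enqueue one entry for deletion and return immediately."""
        self._queue.put(entry)

    def wait(self):
        """Block until every queued entry has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            entry = self._queue.get()
            try:
                self._delete_fn(entry)
            except Exception as exc:
                # One failed deletion must not kill the worker thread.
                print("conntrack cleanup failed for %r: %s" % (entry, exc))
            finally:
                self._queue.task_done()
```

With something like this, the agent would call submit() for each removed port or changed rule and carry on with other work. The trade-off is that deletions finish some time after the call returns, so anything that relies on the entries being gone must call wait() or tolerate the delay.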

Also, the netlink-lib change above only issues calls for entries it finds, whereas the current code does not: it can call 'conntrack -D' with arguments that match nothing. If we first checked the table for the given IPs, it might reduce the time cleanup takes.
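
The "check the table first" idea could look roughly like the sketch below. It assumes the table has already been dumped once as text (for example with 'conntrack -L') and that each entry carries src=/dst= tokens; ips_with_entries is a hypothetical helper, not a neutron function. Only the IPs it returns would then need a 'conntrack -D' call.

```python
import re

# Matches the address in the src=<addr> / dst=<addr> tokens of a listing line.
_ADDR_RE = re.compile(r"\b(?:src|dst)=(\S+)")


def ips_with_entries(conntrack_listing, candidate_ips):
    """Return the candidate IPs that appear in the dumped conntrack table.

    conntrack_listing is the text of a single table dump; candidate_ips is
    the list of addresses the caller intends to clean up. Input order is
    preserved so the result is deterministic.
    """
    seen = set(_ADDR_RE.findall(conntrack_listing))
    return [ip for ip in candidate_ips if ip in seen]
```

This swaps many per-IP delete calls for one listing plus a set lookup. Whether that is a net win depends on table size, and a real version would probably also need to match on zone and protocol, which this sketch ignores.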

Changed in neutron:
status: Confirmed → In Progress
Miguel Lavalle (minsel) on 2018-02-06
Changed in neutron:
milestone: none → queens-rc1
Miguel Lavalle (minsel) on 2018-02-06
Changed in neutron:
milestone: queens-rc1 → none

Change abandoned by Brian Haley (<email address hidden>) on branch: master
Review: https://review.openstack.org/538042
Reason: I don't think this is worth the trouble given the conntrack-lib change.

Reviewed: https://review.openstack.org/537654
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=65a81623fc0377b26d2d5800607f7c3acc08c45a
Submitter: Zuul
Branch: master

commit 65a81623fc0377b26d2d5800607f7c3acc08c45a
Author: Brian Haley <email address hidden>
Date: Wed Jan 24 15:55:56 2018 -0500

    Process conntrack updates in worker threads

    With a large number of instances and/or security group rules,
    conntrack updates when ports are removed or rules are changed
    can take a long time to process. By enqueuing these to a set
    of worker threads, the agent can continue with other work while
    they are processed in the background.

    This is a change in behavior in the agent since it could
    program a new set of security group rules before all existing
    conntrack entries are deleted, but since the iptables or OVSfw
    NAT rules will have been removed, it should not pose a
    security issue.

    Change-Id: Ibf858c7fdf7a822a30e4a0c4722d70fd272741b6
    Closes-bug: #1745468

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/545612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0dbd35df1bdaea7dec97fd976b6990f4b79a6b77
Submitter: Zuul
Branch: stable/queens

commit 0dbd35df1bdaea7dec97fd976b6990f4b79a6b77
Author: Brian Haley <email address hidden>
Date: Wed Jan 24 15:55:56 2018 -0500

    Process conntrack updates in worker threads

    With a large number of instances and/or security group rules,
    conntrack updates when ports are removed or rules are changed
    can take a long time to process. By enqueuing these to a set
    of worker threads, the agent can continue with other work while
    they are processed in the background.

    This is a change in behavior in the agent since it could
    program a new set of security group rules before all existing
    conntrack entries are deleted, but since the iptables or OVSfw
    NAT rules will have been removed, it should not pose a
    security issue.

    Change-Id: Ibf858c7fdf7a822a30e4a0c4722d70fd272741b6
    Closes-bug: #1745468
    (cherry picked from commit 65a81623fc0377b26d2d5800607f7c3acc08c45a)

tags: added: in-stable-queens

This issue was fixed in the openstack/neutron 12.0.1 release.

Does this patch have backport potential to Pike?

Piotr Misiak (piotr-misiak) wrote :

Looks like the patch should be directly applicable to Pike.

Please keep in mind that this patch introduces a new bug, https://bugs.launchpad.net/neutron/+bug/1750777, so it should be backported to Pike together with the fix for bug 1750777.

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.
