Packets getting lost during SNAT with too many connections using the same source and destination on Network Node

Bug #1814002 reported by Swaminathan Vasudevan
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Brian Haley

Bug Description

We appear to have a problem with SNAT on the network nodes when too many connections use the same source / destination.

We have reproduced the bug with DNS requests, but we assume that it affects other packets as well.

When we send a lot of DNS requests, we see that sometimes a packet does not pass through the NAT and simply "gets lost".

In addition, we can see in the conntrack statistics that the "insert_failed" counter increases.

ip netns exec snat-848819dc-efa2-45d9-9bc3-d96f093fa87a conntrack -S | grep insert_failed | grep -v insert_failed=0
cpu=0 searched=1166140 found=5587918 new=6659 invalid=5 ignore=0 delete=27726 delete_list=27712 insert=6645 insert_failed=14 drop=0 early_drop=0 error=0 search_restart=0
cpu=2 searched=12015 found=64626 new=2467 invalid=0 ignore=0 delete=15205 delete_list=15204 insert=2466 insert_failed=1 drop=0 early_drop=0 error=0 search_restart=0
cpu=3 searched=1348502 found=6097345 new=4093 invalid=0 ignore=0 delete=23200 delete_list=23173 insert=4066 insert_failed=27 drop=0 early_drop=0 error=0 search_restart=0
cpu=4 searched=1068516 found=5398514 new=3299 invalid=0 ignore=0 delete=14144 delete_list=14126 insert=3281 insert_failed=18 drop=0 early_drop=0 error=0 search_restart=0
cpu=5 searched=2280948 found=9908854 new=6770 invalid=0 ignore=0 delete=17224 delete_list=17185 insert=6731 insert_failed=39 drop=0 early_drop=0 error=0 search_restart=0
cpu=6 searched=1123341 found=5264368 new=9749 invalid=0 ignore=0 delete=17272 delete_list=17247 insert=9724 insert_failed=25 drop=0 early_drop=0 error=0 search_restart=0
cpu=7 searched=1553934 found=7234262 new=8734 invalid=0 ignore=0 delete=15658 delete_list=15634 insert=8710 insert_failed=24 drop=0 early_drop=0 error=0 search_restart=0

This might be a generic problem with conntrack and Linux.
We suspect that we are hitting the following "limitation / bug" in the kernel:
https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_core.c#L290-L291

There seems to be a workaround to alleviate this behavior by setting the --random-fully flag in iptables. Unfortunately, this flag is only available since iptables 1.6.2.

Also, this is not currently supported in neutron for the SNAT rules; they just use --to-source.
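
For illustration only (not part of the original report), the difference comes down to the SNAT rule that ends up in the nat table. The chain, interface, and addresses below are placeholders; neutron's actual rules live in its own chains inside the router/SNAT namespace:

iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o qg-XXXX -j SNAT --to-source 203.0.113.10
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o qg-XXXX -j SNAT --to-source 203.0.113.10 --random-fully

The second form (iptables >= 1.6.2 plus a kernel that supports it) makes the kernel pick the source port fully at random instead of trying to preserve the original port, which reduces the chance that two concurrent flows race for the same NAT tuple and trigger insert_failed.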

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This iptables patch may be required if your iptables version does not yet support --random-fully:

https://git.netfilter.org/iptables/commit/?id=8b0da2130b8af3890ef20afb2305f11224bb39ec

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/636473

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: New → In Progress
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Slawek Kaplonski (slaweq)
Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/636473
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=30f35e08f92e5262e7a9108684da048d11402b07
Submitter: Zuul
Branch: master

commit 30f35e08f92e5262e7a9108684da048d11402b07
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
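
As a quick sanity check after deploying a build with this fix (a sketch, reusing the namespace from the example above; your router ID will differ), the nat rules can be dumped inside the SNAT namespace to confirm the flag is present:

ip netns exec snat-848819dc-efa2-45d9-9bc3-d96f093fa87a iptables -t nat -S | grep -- --random-fully

Under the same DNS load, the insert_failed counters from the earlier conntrack -S command should then stop growing.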

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/655790

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/655791

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/655792

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/655794

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.opendev.org/655801

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/655790
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eded5d2d6ae7c1281ef868a193de67ac52d6daac
Submitter: Zuul
Branch: stable/stein

commit eded5d2d6ae7c1281ef868a193de67ac52d6daac
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/655792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=836f79e7b73f3241bbd209289683ed7dac0fc735
Submitter: Zuul
Branch: stable/queens

commit 836f79e7b73f3241bbd209289683ed7dac0fc735
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.
    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/655791
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1dd35515d4ea7f31878cfb60612bf68e523b9e69
Submitter: Zuul
Branch: stable/rocky

commit 1dd35515d4ea7f31878cfb60612bf68e523b9e69
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.opendev.org/655794
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d772b735fbc8441f876d6de1c31252db201acd73
Submitter: Zuul
Branch: stable/pike

commit d772b735fbc8441f876d6de1c31252db201acd73
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/655801
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ce628a123769f93fc0c1b2edbe20ec5325aab0f6
Submitter: Zuul
Branch: stable/ocata

commit ce628a123769f93fc0c1b2edbe20ec5325aab0f6
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/l3/dvr_edge_router.py
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.4

This issue was fixed in the openstack/neutron 13.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.2

This issue was fixed in the openstack/neutron 14.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.1.0

This issue was fixed in the openstack/neutron 12.1.0 release.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron pike-eol

This issue was fixed in the openstack/neutron pike-eol release.
