Packets getting lost during SNAT with too many connections using the same source and destination on Network Node

Bug #1814002 reported by Swaminathan Vasudevan
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Brian Haley

Bug Description

We appear to have a problem with SNAT on the network nodes when too many connections use the same source / destination.

We have reproduced the bug with DNS requests, but we assume that it affects other packets as well.

When we send a lot of DNS requests, we see that sometimes a packet does not pass through the NAT and simply "gets lost".

In addition, we can see in the conntrack statistics that the "insert_failed" counter increases.

ip netns exec snat-848819dc-efa2-45d9-9bc3-d96f093fa87a conntrack -S | grep insert_failed | grep -v insert_failed=0
cpu=0 searched=1166140 found=5587918 new=6659 invalid=5 ignore=0 delete=27726 delete_list=27712 insert=6645 insert_failed=14 drop=0 early_drop=0 error=0 search_restart=0
cpu=2 searched=12015 found=64626 new=2467 invalid=0 ignore=0 delete=15205 delete_list=15204 insert=2466 insert_failed=1 drop=0 early_drop=0 error=0 search_restart=0
cpu=3 searched=1348502 found=6097345 new=4093 invalid=0 ignore=0 delete=23200 delete_list=23173 insert=4066 insert_failed=27 drop=0 early_drop=0 error=0 search_restart=0
cpu=4 searched=1068516 found=5398514 new=3299 invalid=0 ignore=0 delete=14144 delete_list=14126 insert=3281 insert_failed=18 drop=0 early_drop=0 error=0 search_restart=0
cpu=5 searched=2280948 found=9908854 new=6770 invalid=0 ignore=0 delete=17224 delete_list=17185 insert=6731 insert_failed=39 drop=0 early_drop=0 error=0 search_restart=0
cpu=6 searched=1123341 found=5264368 new=9749 invalid=0 ignore=0 delete=17272 delete_list=17247 insert=9724 insert_failed=25 drop=0 early_drop=0 error=0 search_restart=0
cpu=7 searched=1553934 found=7234262 new=8734 invalid=0 ignore=0 delete=15658 delete_list=15634 insert=8710 insert_failed=24 drop=0 early_drop=0 error=0 search_restart=0

This might be a generic problem with conntrack and Linux.
We suspect that we are hitting the following "limitation / bug" in the kernel:
https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_core.c#L290-L291

There seems to be a workaround to alleviate this behavior by setting the --random-fully flag in iptables. Unfortunately, this flag is only available since iptables 1.6.2.

Also, this is not currently supported in neutron for the SNAT rules; they just use --to-source.
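
For illustration only (not part of the original report), the difference comes down to the SNAT rule that ends up in the nat table. The chain, interface, and addresses below are placeholders; neutron's actual rules live in its own chains inside the router/SNAT namespace:

iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o qg-XXXX -j SNAT --to-source 203.0.113.10
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o qg-XXXX -j SNAT --to-source 203.0.113.10 --random-fully

The second form (iptables >= 1.6.2 plus a kernel that supports it) makes the kernel pick the source port fully at random instead of trying to preserve the original port, which reduces the chance that two concurrent flows race for the same NAT tuple and trigger insert_failed.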

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This iptables patch may be required if your iptables version does not yet support --random-fully:

https://git.netfilter.org/iptables/commit/?id=8b0da2130b8af3890ef20afb2305f11224bb39ec

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/636473

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: New → In Progress
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Slawek Kaplonski (slaweq)
Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/636473
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=30f35e08f92e5262e7a9108684da048d11402b07
Submitter: Zuul
Branch: master

commit 30f35e08f92e5262e7a9108684da048d11402b07
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
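
As a quick sanity check after deploying a build with this fix (a sketch, reusing the namespace from the example above; your router ID will differ), the nat rules can be dumped inside the SNAT namespace to confirm the flag is present:

ip netns exec snat-848819dc-efa2-45d9-9bc3-d96f093fa87a iptables -t nat -S | grep -- --random-fully

Under the same DNS load, the insert_failed counters from the earlier conntrack -S command should then stop growing.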

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/655790

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/655791

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/655792

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/655794

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.opendev.org/655801

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/655790
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eded5d2d6ae7c1281ef868a193de67ac52d6daac
Submitter: Zuul
Branch: stable/stein

commit eded5d2d6ae7c1281ef868a193de67ac52d6daac
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/655792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=836f79e7b73f3241bbd209289683ed7dac0fc735
Submitter: Zuul
Branch: stable/queens

commit 836f79e7b73f3241bbd209289683ed7dac0fc735
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.
    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/655791
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1dd35515d4ea7f31878cfb60612bf68e523b9e69
Submitter: Zuul
Branch: stable/rocky

commit 1dd35515d4ea7f31878cfb60612bf68e523b9e69
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.opendev.org/655794
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d772b735fbc8441f876d6de1c31252db201acd73
Submitter: Zuul
Branch: stable/pike

commit d772b735fbc8441f876d6de1c31252db201acd73
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/655801
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ce628a123769f93fc0c1b2edbe20ec5325aab0f6
Submitter: Zuul
Branch: stable/ocata

commit ce628a123769f93fc0c1b2edbe20ec5325aab0f6
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Feb 12 11:27:51 2019 -0800

    Packets getting lost during SNAT with too many connections

    We have a problem with SNAT on the network nodes when too many
    connections use the same source and destination.

    In addition, we can see in the conntrack statistics that the
    "insert_failed" counter increases.

    This might be a generic problem with conntrack and Linux.
    We suspect that we are hitting the following "limitation / bug"
    in the kernel.

    There seems to be a workaround to alleviate this behavior by
    setting the --random-fully flag in iptables for port selection.

    This patch fixes the problem by adding --random-fully to
    the SNAT rules.

    Conflicts:
        neutron/agent/l3/dvr_edge_router.py
        neutron/agent/linux/iptables_manager.py
        neutron/common/constants.py
        neutron/tests/unit/agent/l3/test_agent.py

    Change-Id: I246c1f56df889bad9c7e140b56c3614124d80a19
    Closes-Bug: #1814002
    (cherry picked from commit 30f35e08f92e5262e7a9108684da048d11402b07)

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.4

This issue was fixed in the openstack/neutron 13.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.2

This issue was fixed in the openstack/neutron 14.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.1.0

This issue was fixed in the openstack/neutron 12.1.0 release.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron pike-eol

This issue was fixed in the openstack/neutron pike-eol release.
