nf_conntrack table fills on swift nodes

Bug #1479127 reported by James Dewey
This bug affects 3 people
Affects             Status        Importance  Assigned to          Milestone
OpenStack-Ansible   Invalid       Undecided   Unassigned
  Juno              Fix Released  Medium      Christopher H. Laco
  Trunk              Invalid       Undecided   Unassigned

Bug Description

Due to the NAT rules in iptables required by LXC, the conntrack table fills up and packets are then randomly dropped. This can manifest if you are using swift-recon to monitor your cluster: you will intermittently see timeouts like the following:

[2015-07-28 21:33:45] Checking swift.conf md5sum
-> http://172.29.244.72:6000/recon/swiftconfmd5: <urlopen error timed out>
4/5 hosts matched, 1 error[s] while checking hosts.

Upon investigating, I would see dmesg filled with the following:

[20765760.747582] nf_conntrack: table full, dropping packet
[20765761.251622] nf_conntrack: table full, dropping packet
[20765762.067443] nf_conntrack: table full, dropping packet
[20765762.067595] nf_conntrack: table full, dropping packet
[20765762.068828] nf_conntrack: table full, dropping packet
[20765762.070060] nf_conntrack: table full, dropping packet
[20765762.070393] nf_conntrack: table full, dropping packet
[20765762.070632] nf_conntrack: table full, dropping packet
[20765762.070847] nf_conntrack: table full, dropping packet

I have seen this issue in a couple of different environments. In all cases I raised nf_conntrack_max to a sufficiently large value (around 300,000 for these relatively small environments) and then committed it to /etc/sysctl.conf to prevent it from reverting on a server restart.
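
For reference, that manual workaround looks roughly like the following (a sketch; the 300,000 value is illustrative and should be sized for your cluster):

# sysctl -w net.netfilter.nf_conntrack_max=300000
# echo 'net.netfilter.nf_conntrack_max = 300000' >> /etc/sysctl.conf
# sysctl -p

The sysctl -w command raises the limit on the running kernel; the /etc/sysctl.conf entry keeps it from reverting on the next restart.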

Maybe we should raise this value, or parameterize it so that it is easier to adjust across larger environments?

Revision history for this message
Jordan Callicoat (jcallicoat) wrote :
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

It'd be useful to know the version of os-ansible-deployment that was in use here, and whether the patch from https://bugs.launchpad.net/openstack-ansible/+bug/1441363 was applied.

no longer affects: openstack-ansible/juno
no longer affects: openstack-ansible/kilo
Revision history for this message
Evan Callicoat (diopter) wrote :

I would like to have more information on the networking state involved when this issue manifests itself.

Specifically: whether the tcp_tw_reuse patch mentioned in the linked bug is in effect, the netfilter TCP tunings/timeouts, and a summary of connections and their states.

Please provide the output from the following:
sysctl -n net.ipv4.tcp_tw_reuse
sysctl -a | grep net.netfilter.nf_conntrack_tcp
ss -s

Revision history for this message
James Dewey (james-dewey) wrote :

# sysctl -n net.ipv4.tcp_tw_reuse
1

# sysctl -a | grep net.netfilter.nf_conntrack_tcp
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

# ss -s
Total: 612 (kernel 2652)
TCP: 96222 (estab 99, closed 96109, orphaned 3, synrecv 0, timewait 96109/0), ports 0

Transport Total IP IPv6
* 2652 - -
RAW 0 0 0
UDP 22 9 13
TCP 113 110 3
INET 135 119 16
FRAG 0 0 0

This environment is running os-ansible-deployment 10.1.8.

Let me know if you need any further information about this environment.
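
For diagnosis, the live conntrack usage can be compared against the configured limit, and the TIME-WAIT tracking timeout can be shortened if timewait entries dominate (a sketch; the 30-second value is illustrative, not a recommendation from this report):

# sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

With nf_conntrack_tcp_timeout_time_wait at 120 seconds as shown above, the roughly 96,000 TIME-WAIT sockets reported by ss -s can by themselves exceed a default-sized conntrack table (commonly 65,536 entries).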

Revision history for this message
Jordan Callicoat (jcallicoat) wrote :

See also LP:1451217

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

Switching status to new to re-discuss this in the bug triage meeting.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

It appears that this is a Juno-only issue, so setting other series to invalid.

no longer affects: openstack-ansible/kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (juno)

Reviewed: https://review.openstack.org/226880
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=73da84c4e676d60bdb5e8f470afb2b3f2fcb9843
Submitter: Jenkins
Branch: juno

commit 73da84c4e676d60bdb5e8f470afb2b3f2fcb9843
Author: Christopher H. Laco <email address hidden>
Date: Wed Sep 23 12:26:06 2015 -0400

    Add net.netfilter.nf_conntrack_max to Swift Storage

    With the default sysctl value, the nf_conntrack table fills and starts
    dropping packets on Swift storage nodes after a certain period of time.

    This is not a problem in Kilo as the value is set to 256k in all hosts
    by default. Adding this specifically to the storage setup to avoid
    adding the more complex solution used for nova/neutron that uses var
    files in playbooks.

    Change-Id: Ic9162eeb50523b32f477075b565f55bbf868d1d6
    Closes-Bug: #1451217
    Closes-Bug: #1479127
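
The merged change itself is at the review link above; purely as an illustration, a setting like this can also be applied ad hoc with Ansible's sysctl module (the swift_hosts group name and the 262144 value, i.e. 256k, are assumptions for the example, not taken from the patch):

ansible swift_hosts -m sysctl -a "name=net.netfilter.nf_conntrack_max value=262144 sysctl_set=yes state=present"

Here sysctl_set=yes applies the value to the running kernel in addition to persisting it in the sysctl configuration file.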
