nf_conntrack table fills on swift nodes

Bug #1479127 reported by James Dewey
This bug affects 3 people
Affects             Status        Importance  Assigned to          Milestone
OpenStack-Ansible   Invalid       Undecided   Unassigned
  Juno              Fix Released  Medium      Christopher H. Laco
  Trunk              Invalid       Undecided   Unassigned

Bug Description

Due to the NAT rules in iptables required by LXC, the conntrack table fills up and packets are then randomly dropped. This can manifest if you are using swift-recon to monitor your cluster: you will intermittently see timeouts like the following:

[2015-07-28 21:33:45] Checking swift.conf md5sum
-> http://172.29.244.72:6000/recon/swiftconfmd5: <urlopen error timed out>
4/5 hosts matched, 1 error[s] while checking hosts.

Upon investigating, I would see dmesg filled with the following:

[20765760.747582] nf_conntrack: table full, dropping packet
[20765761.251622] nf_conntrack: table full, dropping packet
[20765762.067443] nf_conntrack: table full, dropping packet
[20765762.067595] nf_conntrack: table full, dropping packet
[20765762.068828] nf_conntrack: table full, dropping packet
[20765762.070060] nf_conntrack: table full, dropping packet
[20765762.070393] nf_conntrack: table full, dropping packet
[20765762.070632] nf_conntrack: table full, dropping packet
[20765762.070847] nf_conntrack: table full, dropping packet

I have seen this issue in a couple of different environments. In all cases I raised nf_conntrack_max to a sufficiently large value (around 300,000 for these relatively small environments) and then committed it to /etc/sysctl.conf to prevent it from reverting on a server restart.
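
For reference, that manual workaround looks roughly like the following (a sketch; the 300,000 value is illustrative and should be sized for your cluster):

# sysctl -w net.netfilter.nf_conntrack_max=300000
# echo 'net.netfilter.nf_conntrack_max = 300000' >> /etc/sysctl.conf
# sysctl -p

The sysctl -w command raises the limit on the running kernel; the /etc/sysctl.conf entry keeps it from reverting on the next restart.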

Maybe we should raise this value, or parameterize it so that it is easier to adjust across larger environments?

Revision history for this message
Jordan Callicoat (jcallicoat) wrote :
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

It'd be useful to know the version of os-ansible-deployment that was in use here, and whether the patch from https://bugs.launchpad.net/openstack-ansible/+bug/1441363 was applied.

no longer affects: openstack-ansible/juno
no longer affects: openstack-ansible/kilo
Revision history for this message
Evan Callicoat (diopter) wrote :

I would like to have more information on the networking state involved when this issue manifests itself.

Specifically: whether the tcp_tw_reuse patch mentioned in the linked bug is in effect, the netfilter TCP tunings/timeouts, and a summary of connections and their states.

Please provide the output from the following:
sysctl -n net.ipv4.tcp_tw_reuse
sysctl -a | grep net.netfilter.nf_conntrack_tcp
ss -s

Revision history for this message
James Dewey (james-dewey) wrote :

# sysctl -n net.ipv4.tcp_tw_reuse
1

# sysctl -a | grep net.netfilter.nf_conntrack_tcp
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

# ss -s
Total: 612 (kernel 2652)
TCP: 96222 (estab 99, closed 96109, orphaned 3, synrecv 0, timewait 96109/0), ports 0

Transport Total IP IPv6
* 2652 - -
RAW 0 0 0
UDP 22 9 13
TCP 113 110 3
INET 135 119 16
FRAG 0 0 0

This environment is running os-ansible-deployment 10.1.8.

Let me know if you need any further information about this environment.
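
For diagnosis, the live conntrack usage can be compared against the configured limit, and the TIME-WAIT tracking timeout can be shortened if timewait entries dominate (a sketch; the 30-second value is illustrative, not a recommendation from this report):

# sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

With nf_conntrack_tcp_timeout_time_wait at 120 seconds as shown above, the roughly 96,000 TIME-WAIT sockets reported by ss -s can by themselves exceed a default-sized conntrack table (commonly 65,536 entries).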

Revision history for this message
Jordan Callicoat (jcallicoat) wrote :

See also LP:1451217

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

Switching status to new to re-discuss this in the bug triage meeting.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

It appears that this is a Juno-only issue, so setting other series to invalid.

no longer affects: openstack-ansible/kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (juno)

Reviewed: https://review.openstack.org/226880
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=73da84c4e676d60bdb5e8f470afb2b3f2fcb9843
Submitter: Jenkins
Branch: juno

commit 73da84c4e676d60bdb5e8f470afb2b3f2fcb9843
Author: Christopher H. Laco <email address hidden>
Date: Wed Sep 23 12:26:06 2015 -0400

    Add net.netfilter.nf_conntrack_max to Swift Storage

    With the default sysctl value, the nf_conntrack table fills and starts
    dropping packets on Swift storage nodes after a certain period of time.

    This is not a problem in Kilo as the value is set to 256k in all hosts
    by default. Adding this specifically to the storage setup to avoid
    adding the more complex solution used for nova/neutron that uses var
    files in playbooks.

    Change-Id: Ic9162eeb50523b32f477075b565f55bbf868d1d6
    Closes-Bug: #1451217
    Closes-Bug: #1479127
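
The merged change itself is at the review link above; purely as an illustration, a setting like this can also be applied ad hoc with Ansible's sysctl module (the swift_hosts group name and the 262144 value, i.e. 256k, are assumptions for the example, not taken from the patch):

ansible swift_hosts -m sysctl -a "name=net.netfilter.nf_conntrack_max value=262144 sysctl_set=yes state=present"

Here sysctl_set=yes applies the value to the running kernel in addition to persisting it in the sysctl configuration file.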
