juju should tune kernel socket options on apiserver

Bug #1656430 reported by Shane Peters
This bug affects 1 person
Affects          Status         Importance   Assigned to      Milestone
Canonical Juju   Fix Released   High         Horacio Durán
2.1              Fix Released   High         Ian Booth

Bug Description

In large deployments (400+ servers, 1700+ juju agents), the default TCP sysctl settings are quickly overwhelmed, which prevents 'dblogpruner' from connecting to mongod and trimming the logs collection as scheduled.

When this happens, the logs collection can grow very rapidly, resulting in high CPU, RAM, and I/O usage by mongod. As a result, 'juju status' takes upwards of 5 minutes to return and repeated 'i/o timeout' errors appear in the juju controller logs.

After experiencing this in our environment, we manually removed documents from the logs collection using the mongo shell and tuned some TCP sysctls (net.core.somaxconn, net.ipv4.tcp_max_syn_backlog). Performance improved significantly: mongod's RAM and CPU usage leveled out, 'dblogpruner' completed fully on schedule, and 'juju status' returned in under 2 seconds.

machine-0.log
--------------
ERROR exited "dblogpruner": worker "dblogpruner" exited: failed to prune logs by time: read tcp 10.1.100.114:48986->10.1.100.114:37017: i/o timeout
ERROR exited "dblogpruner": worker "dblogpruner" exited: failed to prune logs by time: read tcp 10.1.100.114:49914->10.1.100.114:37017: i/o timeout
ERROR exited "dblogpruner": worker "dblogpruner" exited: failed to prune logs by time: read tcp 10.1.100.114:49916->10.1.100.114:37017: i/o timeout
ERROR failed to write status history: read tcp 127.0.0.1:34328->127.0.0.1:37017: i/o timeout

kern.log
---------
TCP: request_sock_TCP: Possible SYN flooding on port 37017. Sending cookies. Check SNMP counters.
TCP: request_sock_TCP: Possible SYN flooding on port 17070. Sending cookies. Check SNMP counters.

mongodb logs collection (33GB):
-------------------------
33500684288 Jan 11 22:43 collection-66--9039449553220388758.wt
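
The kern.log messages above say to "Check SNMP counters". On Linux, the relevant listen-queue counters are exposed in the TcpExt section of /proc/net/netstat. The following is a minimal Python sketch, not part of the original report, for reading them; the function names are hypothetical, and the choice of SyncookiesSent, ListenOverflows, and ListenDrops as the counters of interest is an assumption about what is most useful here.

```python
from pathlib import Path


def parse_netstat(text: str) -> dict[str, dict[str, int]]:
    """Parse the /proc/net/netstat layout: a line of counter names followed
    by a line of values, both starting with the same "Prefix:" field."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    counters: dict[str, dict[str, int]] = {}
    # Lines come in (names, values) pairs that share their first field.
    for names, values in zip(lines[0::2], lines[1::2]):
        if names[0] != values[0]:
            raise ValueError(f"mismatched sections: {names[0]} / {values[0]}")
        section = names[0].rstrip(":")
        counters[section] = {n: int(v) for n, v in zip(names[1:], values[1:])}
    return counters


def listen_queue_stats(text: str) -> dict[str, int]:
    """Pick out the counters related to SYN cookies and full listen queues."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    wanted = ("SyncookiesSent", "ListenOverflows", "ListenDrops")
    # Counters missing from this kernel's output are reported as 0.
    return {name: tcp_ext.get(name, 0) for name in wanted}


if __name__ == "__main__":
    path = Path("/proc/net/netstat")
    if path.exists():
        print(listen_queue_stats(path.read_text()))
```

Counters that keep rising between two runs would be consistent with the "Possible SYN flooding" messages, though the sketch only reads the values and does not interpret them.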

Tags: sts
Shane Peters (shaner) wrote :

juju version:
2.0.0-xenial-amd64

Anastasia (anastasia-macmood) wrote :

According to the MongoDB documentation, https://docs.mongodb.com/manual/core/capped-collections/, you cannot delete documents from a capped collection (unless you are deleting all documents from it).

Changed in juju:
status: New → Invalid
Tim Penhey (thumper) wrote :

Out of curiosity, what values did you update net.core.somaxconn and net.ipv4.tcp_max_syn_backlog to?

summary: - juju logs should be a capped collection in mongodb
+ juju should tune kernel socket options on apiserver
Changed in juju:
status: Invalid → Triaged
importance: Undecided → High
milestone: none → 2.1-rc1
Shane Peters (shaner) wrote :

Tim,

We ended up setting the following socket opts. Whether they're correct or not is open to discussion.

net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 16384
net.core.netdev_max_backlog = 1000
net.ipv4.tcp_fin_timeout = 30

Thanks,
Shane
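
For anyone wanting to try the values from the comment above, here is a rough Python sketch, not taken from the juju fix itself, that renders them as a sysctl configuration fragment and reports which ones differ from the running kernel. The function names and the `root` parameter are hypothetical, and the key-to-path mapping assumes keys without dots inside a component, which holds for these four.

```python
from pathlib import Path

# Values quoted from the comment above; the reporter notes that they are
# open to discussion, so treat them as a starting point only.
REPORTED = {
    "net.ipv4.tcp_max_syn_backlog": 4096,
    "net.core.somaxconn": 16384,
    "net.core.netdev_max_backlog": 1000,
    "net.ipv4.tcp_fin_timeout": 30,
}


def proc_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a sysctl key to its file, e.g. net.core.somaxconn ->
    /proc/sys/net/core/somaxconn."""
    return Path(root, *key.split("."))


def render_conf(settings: dict[str, int]) -> str:
    """Render settings in the 'key = value' form used by sysctl config files."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


def differing(settings: dict[str, int], root: str = "/proc/sys") -> dict:
    """Return {key: (current, wanted)} for keys whose live value differs.
    A key that cannot be read is reported with current set to None."""
    result = {}
    for key, wanted in settings.items():
        try:
            current = int(proc_path(key, root).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            current = None
        if current != wanted:
            result[key] = (current, wanted)
    return result


if __name__ == "__main__":
    print(render_conf(REPORTED), end="")
    for key, (current, wanted) in differing(REPORTED).items():
        print(f"# {key}: current={current} wanted={wanted}")
```

The rendered text could be saved under /etc/sysctl.d/ (for example in a hypothetically named 90-juju-controller.conf) and loaded with `sysctl --system`; the sketch itself only reads values and never writes to /proc/sys.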

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.1-rc1 → 2.2.0-alpha1
Anastasia (anastasia-macmood) wrote :
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Horacio Durán (hduran-8)
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released