TCP keepalive timeouts too high in pods

Bug #1836232 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
The TCP keepalive timeouts in pods are currently set to the following:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

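This means that a dropped TCP connection can take more than 2 hours to be detected and removed: the first keepalive probe is sent only after 7200 seconds of idle time, followed by up to 9 probes at 75-second intervals. That can cause large delays in reacting to unexpected events like the uncontrolled reboot of a host.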
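
The arithmetic behind that figure can be sketched as follows. This is a rough approximation, not an exact model of the kernel's timers, and the function name is illustrative only. It also assumes the socket has SO_KEEPALIVE enabled, since these sysctls have no effect otherwise.

```python
def worst_case_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate seconds before an idle, dead connection is dropped.

    The kernel waits keepalive_time seconds of idle time before the first
    probe, then sends up to keepalive_probes probes, keepalive_intvl apart.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values currently seen in pods (from this report).
pod_defaults = worst_case_detection_seconds(7200, 75, 9)

# Values used by the host OS (from this report).
host_settings = worst_case_detection_seconds(5, 1, 5)

print(pod_defaults, host_settings)  # prints: 7875 10
```

With the pod values the connection lingers for roughly 2 hours 11 minutes, versus about 10 seconds with the host values.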

Severity
--------
Major: slow reaction to dropped TCP connections can delay recovery from process restarts, host reboots, etc.

Steps to Reproduce
------------------
N/A

Expected Behavior
------------------
When a TCP connection from inside a pod is dropped, it should be cleaned up in a reasonable amount of time. The current settings for the host OS should be used for pods:
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 5
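
One way to check whether a pod has picked up the host values is to compare `sysctl` output captured inside the pod against the expected settings. The sketch below is only an illustration: the helper names are hypothetical, and the sample text is the pod output quoted in this report rather than a fresh capture.

```python
# Expected values, taken from the host OS settings listed in this report.
EXPECTED = {
    "net.ipv4.tcp_keepalive_intvl": 1,
    "net.ipv4.tcp_keepalive_probes": 5,
    "net.ipv4.tcp_keepalive_time": 5,
}


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict (integer-valued keys only)."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no '=' separator
            values[key.strip()] = int(value.strip())
    return values


def mismatches(actual, expected=EXPECTED):
    """Return {key: (actual_value, expected_value)} for every differing key."""
    return {
        key: (actual.get(key), want)
        for key, want in expected.items()
        if actual.get(key) != want
    }


# Sample input: the pod values reported in the Brief Description.
POD_OUTPUT = """\
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
"""

print(mismatches(parse_sysctl(POD_OUTPUT)))
```

An empty result would indicate that the pod matches the host settings.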

Actual Behavior
----------------
See above

Reproducibility
---------------
Reproducible

System Configuration
--------------------
All

Branch/Pull Time/Commit
-----------------------
All

Last Pass
---------
Never

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670822

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/670822
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9a4b6b6a5d903482624f2f4b86041511d3dfa7e4
Submitter: Zuul
Branch: master

commit 9a4b6b6a5d903482624f2f4b86041511d3dfa7e4
Author: Bart Wensley <email address hidden>
Date: Mon Jul 15 07:03:46 2019 -0500

    Set TCP keepalive timeouts for cluster network

    The TCP keepalive timeouts in pods running on the cluster
    network are currently set to the following:
    net.ipv4.tcp_keepalive_intvl = 75
    net.ipv4.tcp_keepalive_probes = 9
    net.ipv4.tcp_keepalive_time = 7200

    This means that a dropped TCP connection can take more than
    two hours to be removed. That can cause large delays in reacting
    to unexpected events like the uncontrolled reboot of a host.

    This commit changes the TCP keepalive timeouts for the cluster
    network to match the timeouts for the host OS:
    net.ipv4.tcp_keepalive_intvl = 1
    net.ipv4.tcp_keepalive_probes = 5
    net.ipv4.tcp_keepalive_time = 5

    Change-Id: I23e2c9a733727e4059ac272e052dca0e6ec4f2e1
    Closes-bug: 1836232
    Signed-off-by: Bart Wensley <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating; issue results in delayed system recovery

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0 stx.containers
tags: added: stx.config