StarlingX

TCP keepalive timeouts too high in pods

Bug #1836232 reported by Bart Wensley on 2019-07-11

Affects     Status         Importance   Assigned to    Milestone
StarlingX   Fix Released   High         Bart Wensley

Bug Description

Brief Description
-----------------
The TCP keepalive timeouts in pods are currently set to the following:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

This means that a dropped TCP connection can take more than 2 hours to be removed. That can cause large delays in reacting to unexpected events like the uncontrolled reboot of a host.
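The "more than 2 hours" figure can be reproduced with simple arithmetic. The sketch below assumes the usual Linux keepalive behavior: a connection must be idle for tcp_keepalive_time seconds before the first probe is sent, and it is dropped once tcp_keepalive_probes probes, spaced tcp_keepalive_intvl seconds apart, have gone unanswered. The function name is illustrative only and is not part of StarlingX.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate seconds before an idle connection to a dead peer is dropped.

    Assumes the idle period elapses first, then every probe goes unanswered.
    """
    return time_s + intvl_s * probes


# Values currently seen in pods, taken from this report.
pod_timeout = keepalive_detection_seconds(time_s=7200, intvl_s=75, probes=9)

# Break the total down into hours, minutes, and seconds.
hours, remainder = divmod(pod_timeout, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{pod_timeout} s = {hours} h {minutes} min {seconds} s")
```

With the pod values above, the total is 7875 seconds, or 2 h 11 min 15 s.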

Severity
--------
Major: the reaction time to dropped TCP connections can impact recovery from process restarts, host reboots, and similar events.

Steps to Reproduce
------------------
N/A

Expected Behavior
-----------------
When a TCP connection from inside a pod is dropped, it should be cleaned up in a reasonable amount of time. The current settings for the host OS should be used for pods:
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 5
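One possible way to check which values a pod actually sees is to read them from /proc/sys/net/ipv4 inside the container. The following is a minimal sketch under that assumption. The helper names `read_keepalive_settings` and `mismatches`, and the `base` parameter, are hypothetical and exist only to keep the example self-contained.

```python
from pathlib import Path

# Expected values from this report (the host OS settings).
EXPECTED = {
    "tcp_keepalive_intvl": 1,
    "tcp_keepalive_probes": 5,
    "tcp_keepalive_time": 5,
}


def read_keepalive_settings(base: str = "/proc/sys/net/ipv4") -> dict:
    """Read the three keepalive sysctls as integers from the given directory."""
    return {
        name: int((Path(base) / name).read_text().strip())
        for name in EXPECTED
    }


def mismatches(actual: dict) -> dict:
    """Return {name: (actual, expected)} for every setting that differs."""
    return {
        name: (actual[name], want)
        for name, want in EXPECTED.items()
        if actual[name] != want
    }
```

Run inside a pod, `mismatches(read_keepalive_settings())` would return an empty dict when the pod matches the host OS settings, and the differing values otherwise.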

Actual Behavior
---------------
See above

Reproducibility
---------------
Reproducible

System Configuration
--------------------
All

Branch/Pull Time/Commit
-----------------------
All

Last Pass
---------
Never

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Tags: stx.2.0 stx.config stx.containers

Bart Wensley (bartwensley) on 2019-07-11

Changed in starlingx:
assignee:	nobody → Bart Wensley (bartwensley)

OpenStack Infra (hudson-openstack) wrote on 2019-07-15: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670822

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-07-15: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/670822
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9a4b6b6a5d903482624f2f4b86041511d3dfa7e4
Submitter: Zuul
Branch: master

commit 9a4b6b6a5d903482624f2f4b86041511d3dfa7e4
Author: Bart Wensley <email address hidden>
Date: Mon Jul 15 07:03:46 2019 -0500

Set TCP keepalive timeouts for cluster network

    The TCP keepalive timeouts in pods running on the cluster
    network are currently set to the following:
    net.ipv4.tcp_keepalive_intvl = 75
    net.ipv4.tcp_keepalive_probes = 9
    net.ipv4.tcp_keepalive_time = 7200

    This means that a dropped TCP connection can take more than
    two hours to be removed. That can cause large delays in reacting
    to unexpected events like the uncontrolled reboot of a host.

    This commit changes the TCP keepalive timeouts for the cluster
    network to match the timeouts for the host OS:
    net.ipv4.tcp_keepalive_intvl = 1
    net.ipv4.tcp_keepalive_probes = 5
    net.ipv4.tcp_keepalive_time = 5

    Change-Id: I23e2c9a733727e4059ac272e052dca0e6ec4f2e1
    Closes-bug: 1836232
    Signed-off-by: Bart Wensley <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) wrote on 2019-07-16:

Marking as stx.2.0 gating; issue results in delayed system recovery

Changed in starlingx:
importance:	Undecided → High
tags:	added: stx.2.0 stx.containers
tags:	added: stx.config
