curl glance-api.openstack.svc.cluster.local:9292 timeout

Bug #1882172 reported by chendongqi
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: zhao.shuai

Bug Description

Brief Description
-----------------
The command curl --connect-timeout 10 glance-api.openstack.svc.cluster.local:9292 times out.

Test script:
See attachment

Cause:
A change introduced in kernel 4.18 turns /proc/sys/net/ipv4/tcp_tw_reuse from a boolean into an integer:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=79e9fed460385a3d8ba0b5782e9e74405cb199b1
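
For reference, the linked commit adds a third value to tcp_tw_reuse. The sketch below maps each value to its meaning as given in the kernel's ip-sysctl documentation. The function name describe_tw_reuse is made up for illustration and is not part of StarlingX or of the attached test script.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): explain a tcp_tw_reuse value.
# Meanings follow the kernel ip-sysctl documentation for 4.18 and later.
describe_tw_reuse() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "enabled for all connections" ;;
        2) echo "enabled for loopback traffic only" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Example on a live system:
#   describe_tw_reuse "$(cat /proc/sys/net/ipv4/tcp_tw_reuse)"
```

On kernels from 4.18 onward the default is 2, which is why a fresh deployment on the new kernel reports a value that the old boolean could not hold.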

Solution: set net.ipv4.tcp_tw_reuse = 0 in initscripts-config/files/sysctl.conf.

Severity
--------
Major

Steps to Reproduce
------------------
After deploying StarlingX, run the test script.

Expected Behavior
------------------
The script runs to completion without a timeout.

Actual Behavior
----------------
Timeout

Reproducibility
---------------
100%

System Configuration
--------------------
All-in-One Simplex

Branch/Pull Time/Commit
-----------------------
With 4.18 kernel
BUILD_ID="20200529T033909Z"

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
On the 4.18 kernel, set net.ipv4.tcp_tw_reuse = 0 in initscripts-config/files/sysctl.conf.

chendongqi (chen-dq) wrote :
Changed in starlingx:
assignee: nobody → zhao.shuai (zhao.shuai.neusoft)
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 / high priority given this issue was recently introduced by the kernel upversion.

tags: added: stx.4.0 stx.distro.openstack stx.distro.other
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
chendongqi (chen-dq) wrote :

After redeploying the same ISO, no timeout occurred regardless of whether net.ipv4.tcp_tw_reuse was set to 2 or 0, which is inconsistent with the earlier test results. The results appear to be random and require further investigation.

chendongqi (chen-dq) wrote :

1. On AIO Simplex, deploying the same ISO sometimes produces the timeout (most of the time) and sometimes does not.

2. The curl timeout is not related to the value of net.ipv4.tcp_tw_reuse (the setting appeared to be effective only in the first test and had no effect afterwards).

3. Run the script until it times out, then wait 1-2 minutes and run it again (without waiting, the timeout recurs after about 100 requests). The timeout occurs again, and the curl count always stops at the same value for a given deployment. This value changes after redeploying: sometimes it is 1500, sometimes 2100.

4. If curl glance-api.openstack.svc.cluster.local:9292 is changed to curl glance.openstack.svc.cluster.local:80, the timeout does not occur.
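
The attached test script is not reproduced on this page. Below is a minimal sketch of the kind of loop the points above describe, assuming it simply repeats the curl command from the bug description and stops at the first failure. The function name run_loop and the URL and MAX variables are illustrative assumptions, not taken from the attachment.

```shell
#!/bin/sh
# Hypothetical reproduction loop; NOT the script attached to this bug.
URL="${URL:-glance-api.openstack.svc.cluster.local:9292}"
MAX="${MAX:-100000}"

run_loop() {
    i=1
    while [ "$i" -le "$MAX" ]; do
        rc=0
        # Same connect timeout as in the bug description; discard the body.
        curl --silent --output /dev/null --connect-timeout 10 "$URL" || rc=$?
        if [ "$rc" -ne 0 ]; then
            # curl exits with 28 when a timeout was reached.
            echo "curl failed at iteration $i (exit code $rc)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $MAX iterations without failure"
}
```

Calling run_loop with MAX=3000 should be enough to check whether the count stops near the 1500 or 2100 values mentioned in point 3. Setting URL to glance.openstack.svc.cluster.local:80 covers the comparison in point 4.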

yong hu (yhu6) wrote :

This issue was also seen with the old kernel (3.10.xx). It looks like the issue was not really triggered by the new kernel.

Austin Sun (sunausti) wrote :

Based on build "20200614T080013Z", I cannot reproduce the issue in 2*100000 runs. The script is attached.

chendongqi (chen-dq) wrote :

Setting the local port range to 1024-60999 mitigates this problem: the timeout then rarely occurs (at most about once in 100000 requests).
E.g.:
sudo sysctl net.ipv4.ip_local_port_range="1024 60999"
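
To put the change in numbers, a range string in the format used by net.ipv4.ip_local_port_range can be turned into a count of available source ports. The helper below is a sketch with a made-up name (port_count). The report does not establish that port exhaustion is the root cause, so the comparison is only context for why a wider range might help.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): count the ports in a "LOW HIGH"
# range, the format used by net.ipv4.ip_local_port_range.
port_count() {
    # Deliberately unquoted: split "LOW HIGH" on whitespace into $1 and $2.
    set -- $1
    echo $(( $2 - $1 + 1 ))
}

port_count "32768 60999"   # usual Linux default range: prints 28232
port_count "1024 60999"    # range from the comment above: prints 59976
```

The widened range roughly doubles the number of source ports available for outgoing connections.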

chendongqi (chen-dq) wrote :

On a bare metal deployment, the timeout does not occur in about 1 out of 10 cases.

yong hu (yhu6) wrote :

Austin will help track progress on this issue.

chendongqi (chen-dq) wrote :

Refer to the official documentation: https://docs.starlingx.io/deploy_install_guides/r3_release/bare_metal/aio_simplex_install_kubernetes.html#configure-controller-0
When configuring controller-0, you need to configure the mgmt network, assign mgmt to an actual physical network card, and connect the network port to a switch.

Test result: no timeout occurred
Number of deployments: 5
Number of test runs: (2*100000)*5

Changed in starlingx:
status: Triaged → Confirmed
yong hu (yhu6) wrote :

This issue only happens on Simplex in a virtual environment (with "lo" as the management interface), so we would like to downgrade its severity.

Changed in starlingx:
importance: High → Medium
yong hu (yhu6) wrote :

As analyzed above, we don't think this issue (which occurs only with the virtual "lo" interface) will block 4.0, so we are deferring it.

tags: removed: stx.4.0
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Austin Sun (sunausti) wrote :

Hi Ghada,
Since this issue only happens in AIO-Simplex and has a workaround, I suggest lowering the priority to Low.

Thanks.
BR
Austin Sun

Austin Sun (sunausti) wrote :

Since https://bugs.launchpad.net/starlingx/+bug/1880777 has been fixed, the impact of this issue will be low.

Changed in starlingx:
importance: Medium → Low
Ghada Khalil (gkhalil) wrote :

Given this is marked as a low priority, I am removing the stx.5.0 release tag.

tags: removed: stx.5.0
Thales Elero Cervi (tcervi) wrote :

Closing it now due to inactivity.

Changed in starlingx:
status: Confirmed → Won't Fix