Periodic message loss seen between VIM and OpenStack REST APIs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Austin Sun |
Bug Description
Title
-----
Periodic message loss seen between VIM and OpenStack REST APIs
Brief Description
-----------------
I am seeing periodic message loss when the VIM (on bare metal) is sending messages to the REST APIs for the OpenStack services (e.g. nova, neutron, glance). The VIM sends a message and never gets a reply. I see this happening every 10 or 20 minutes - the logs look something like this (nfv_vim.log):
Severity
--------
Minor
Steps to Reproduce
------------------
This seems to happen in some labs during steady state operation.
Expected Behavior
------------------
There should not be any message loss between the VIM and the OpenStack REST APIs.
Actual Behavior
----------------
See above
Reproducibility
---------------
Intermittent - only see this occasionally.
System Configuration
-------
2 + 2 system (kubernetes)
Branch/Pull Time/Commit
-------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"
JOB="STX_
<email address hidden>"
BUILD_NUMBER="54"
BUILD_HOST=
BUILD_DATE=
Timestamp/Logs
--------------
See above.
Changed in starlingx: | |
assignee: | nobody → Bart Wensley (bartwensley) |
Ghada Khalil (gkhalil) wrote : | #1 |
Marking as release gating until further investigation. Medium priority.
Changed in starlingx: | |
importance: | Undecided → Medium |
status: | New → Triaged |
tags: | added: stx.2019.05 stx.containers |
Bart Wensley (bartwensley) wrote : | #2 |
I have made some progress on the periodic message loss issue.
First, I believe the messages are being sent by the VIM, but are not being received by the nova-api-osapi pod. As an example, in one failure case, the VIM would have sent two GET requests to the nova-api here:
2019-02-
2019-02-
2019-02-
I can see that the second message is processed by the nova-api-osapi pod on controller-1 (co-located with the VIM):
2019-02-27 16:54:05,634.634 1 INFO nova.osapi_
The VIM handles the reply:
2019-02-
2019-02-
But the first message doesn’t appear in the nova-api-osapi logs and the VIM eventually times out waiting for the reply:
2019-02-
2019-02-
2019-02-
2019-02-
2019-02-
I next determined that this problem is not specific to the VIM. I wrote a simple script to send GET requests directly to the nova API using curl (run it from a shell with the OpenStack credentials set up):
#!/bin/bash
TENANT_
TOKEN_ID=`openstack token issue | grep "| id |" | cut -f3 -d'|' | tr -d '[[:space:]]'`
counter=1
while [ $counter -le 10000 ]
do
echo
echo "Request: $counter"
date +"%D %T %N"
curl -g -i -X GET http://
#curl -g -i -X GET http://
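The script above is truncated in this report (the request URL and one variable are cut off), so here is a minimal self-contained sketch of the same loop; the NOVA_URL value is a placeholder assumption, not the endpoint from the original script:
#!/bin/bash
# Hedged sketch of the request loop above; NOVA_URL is a placeholder assumption.
NOVA_URL="http://nova.openstack.svc.cluster.local:8774/"   # not the original endpoint
TOKEN_ID=$(openstack token issue -f value -c id)
counter=1
while [ "$counter" -le 10000 ]
do
    echo
    echo "Request: $counter"
    date +"%D %T %N"
    # Pass the keystone token; -g/-i match the flags used in the original script
    curl -g -i -X GET -H "X-Auth-Token: ${TOKEN_ID}" "${NOVA_URL}"
    counter=$((counter + 1))
done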
Changed in starlingx: | |
assignee: | Bart Wensley (bartwensley) → Matt Peters (mpeters-wrs) |
Matt Peters (mpeters-wrs) wrote : | #3 |
The connection fails only under specific conditions and is directly related to a conflict when setting up the NAT connection tracking information. This issue occurs in roughly 1 in 500 connections and impacts only that one connection; all other and subsequent connections are fine.
There seems to be a race condition or a source-port conflict in the NAT operation when connecting to the local pod. When a connection fails, the conntrack entry shows a generated port number being used in the reverse-path translation. This implies the original source port could not be used due to the conflict, but subsequent packets then fail to match the flow entry, causing the connection to fail.
tcp 6 86399 ESTABLISHED src=10.10.10.13 dst=10.103.101.91 sport=60171 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60171 [ASSURED] mark=0 use=1
tcp 6 114 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59721 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59721 [ASSURED] mark=0 use=1
tcp 6 115 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59753 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59753 [ASSURED] mark=0 use=1
tcp 6 118 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60053 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60053 [ASSURED] mark=0 use=1
tcp 6 117 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59961 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59961 [ASSURED] mark=0 use=2
tcp 6 114 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59717 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59717 [ASSURED] mark=0 use=1
tcp 6 115 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59739 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59739 [ASSURED] mark=0 use=1
tcp 6 117 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59985 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59985 [ASSURED] mark=0 use=2
tcp 6 119 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60147 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60147 [ASSURED] mark=0 use=2
tcp 6 118 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60047 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60047 [ASSURED] mark=0 use=1
tcp 6 117 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60009 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60009 [ASSURED] mark=0 use=1
tcp 6 119 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60161 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60161 [ASSURED] mark=0 use=1
tcp 6 118 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=60057 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=60057 [ASSURED] mark=0 use=1
tcp 6 116 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59833 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59833 [ASSURED] mark=0 use=1
tcp 6 115 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sport=59791 dport=8774 src=172.16.0.164 dst=10.10.10.13 sport=8774 dport=59791 [ASSURED] mark=0 use=1
tcp 6 118 TIME_WAIT src=10.10.10.13 dst=10.103.101.91 sp...
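For reference, conntrack entries like the ones above can be listed on the controller with the conntrack tool; a small hedged example, filtering on the nova-api port used above:
# List current conntrack entries whose original destination port is 8774 (nova-api)
sudo conntrack -L -p tcp --dport 8774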
Matt Peters (mpeters-wrs) wrote : | #4 |
The statistics also show the "invalid" counts being bumped every time the issue occurs.
controller-0:~$ sudo conntrack -S
cpu=0 searched=2278720 found=26257506 new=925893 invalid=1205 ignore=4451231 delete=1094225 delete_list=1094226 insert=925895 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=1 searched=2192697 found=26227245 new=929531 invalid=1392 ignore=4293678 delete=1069473 delete_list=1069464 insert=929527 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=2 searched=3038601 found=35074053 new=932092 invalid=1197 ignore=4435206 delete=1042689 delete_list=1042690 insert=932090 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
cpu=3 searched=12913387 found=154903356 new=2980111 invalid=5441 ignore=3586824 delete=2559266 delete_list=2428654 insert=2849496 insert_failed=0 drop=0 early_drop=0 error=5 search_restart=0
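A simple hedged way to watch whether those "invalid" counters climb while the reproduction script is running:
# Print the per-CPU "invalid" counters every 10 seconds; a jump that lines up
# with a failed request supports the conntrack-drop theory described above.
while true; do
    date
    sudo conntrack -S | grep -o 'invalid=[0-9]*'
    sleep 10
done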
tags: |
added: stx.2.0 removed: stx.2019.05 |
Austin Sun (sunausti) wrote : | #5 |
Bart and Matt:
I have tried to reproduce this issue in a virtual environment on the latest version.
I ran the script Bart mentioned more than 10000 times, but could not reproduce it.
Could you try the latest version to check whether the issue is still there?
Matt Peters (mpeters-wrs) wrote : | #6 |
This issue only occurs when the services (or test client) are co-located on the same host. Can you confirm that your reproduction attempts were all localized to a single host?
Austin Sun (sunausti) wrote : | #7 |
Hi Matt:
Do you mean running this in an AIO (All-in-One) Simplex setup?
Austin Sun (sunausti) wrote : | #8 |
I think you are right. In an AIO Simplex setup, the script stopped at:
Request: 2083
06/11/19 03:03:48 523488226
curl: (7) Failed connect to nova-api.
7
Austin Sun (sunausti) wrote : | #9 |
- mytestnossh.cap15 (9.5 MiB, application/octet-stream)
From the mytestnossh.cap15 tcpdump: packet No. 13523 is very odd; it is a response from nova-api, but I cannot find any corresponding request sent from the host beforehand.
Ghada Khalil (gkhalil) wrote : | #10 |
Assigning to Austin since he is actively working on this bug
Changed in starlingx: | |
assignee: | Matt Peters (mpeters-wrs) → Austin Sun (sunausti) |
Austin Sun (sunausti) wrote : | #11 |
- 1817936_iptrace.zip (286.7 KiB, application/zip)
In kern_2019-
Line 24000, mangle:
Line 24001, mangle:
The destination port was suddenly changed from 33727 to 53902.
After that:
Line 24014, it goes to filter:
tags: |
added: stx.networking removed: stx.containers |
Austin Sun (sunausti) wrote : | #12 |
Here is a finding: the readiness and liveness probes in the nova-api-osapi (and other *-api) deployments use a TCP check against the service port to determine pod readiness and liveness.
For example, the barbican-api readiness probe:
readinessProbe:
  tcpSocket:
    port: 9311
‘curl --connect-timeout 10 http://
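To see exactly which probe a given chart configured on a pod, something like the following can be used (the namespace, deployment name and container index are assumptions for illustration):
# Dump the readiness probe of the first container in the barbican-api deployment
kubectl -n openstack get deployment barbican-api \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'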
Host Name | readinessProbe periodSeconds | Error times in 50000 attempts | Comments
---|---|---|---
R22 | 10 | 29 | only barbican applied; no cinder, glance, compute-toolkit applied
R22 | 10 | 28 |
R22 | 300 | 0 |
R22 | 300 | 0 |
R22 | 300 | 0 |
Edgetest901 | 10 | 8 | default helm chart applied
Edgetest901 | 300 | 2 |
Edgetest901 | 300 | 1 |
It seems that increasing the readiness probe period improves the result.
And I changed the nova-api-osapi deployment's readiness and liveness probes from a TCP check to an exec command:
exec:
  command:
    - cat
    - /tmp/nova-api.sh
The result is very positive: there were 0 errors in 2*50000 attempts.
I will check how to fix this issue officially.
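For reference, a manual experiment of that kind could be applied with a JSON patch like the one below; this is only a sketch of the experiment described above, not the official fix, and the namespace, deployment name and container index are assumptions:
# Replace the TCP readiness probe on nova-api-osapi with an exec probe that
# simply cats a file, mirroring the manual change described in this comment.
kubectl -n openstack patch deployment nova-api-osapi --type='json' -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe/tcpSocket"},
  {"op": "add",    "path": "/spec/template/spec/containers/0/readinessProbe/exec",
   "value": {"command": ["cat", "/tmp/nova-api.sh"]}}
]'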
Austin Sun (sunausti) wrote : | #13 |
In 1817936_
Ln 23963, SPT=53902, a SYN packet from the kernel, but it is not found in the tcpdump capture. The tcpdump capture only shows a SYN packet with source port 33727.
Austin Sun (sunausti) wrote : | #14 |
- tcp_test.zip (101.3 KiB, application/zip)
Some more tests were performed based on the attached tcp_test.zip.
Set up Kubernetes with the Calico CNI on Ubuntu 18.04.1 LTS and on CentOS Linux release 7.6.1810:
'kubectl create ns policy-demo'
'kubectl apply -f tcp_echo.yaml'
and run test.sh from the attached zip (a hedged equivalent loop is sketched after the results below).
The Ubuntu 18.04.1 result is good: it connected 10000 times.
On CentOS 7.6.1810, the connection fails within 1000 attempts.
But if the readinessProbe is removed from the pod on CentOS 7.6.1810, it connects 10000 times without any failure.
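test.sh itself is only in the attached zip; as a rough stand-in, a repeated-connection loop against the echo service could look like the following (the service name "tcp-echo" and port 9000 are assumptions, not taken from the attachment):
#!/bin/bash
# Hedged sketch: open short-lived TCP connections to the echo service's ClusterIP,
# the same pattern test.sh is described as using.
SVC_IP=$(kubectl -n policy-demo get svc tcp-echo -o jsonpath='{.spec.clusterIP}')   # name is an assumption
for i in $(seq 1 10000); do
    # Pure connect test via bash's /dev/tcp, with a 10 second timeout
    if ! timeout 10 bash -c "echo hello > /dev/tcp/${SVC_IP}/9000"; then
        echo "connection $i failed"
        exit 1
    fi
done
echo "connected 10000 times"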
General info
CentOS Linux release 7.6.1810
docker version:19.03.0
k8s:"v1.15.1"
calico v3.6.4(https:/
Ubuntu 18.04.1 LTS
Docker version 18.09.8, build 0dd43dd87f
k8s:"v1.15.1"
calico v3.6.4(https:/
centos_trace.cap No 211, 10.10.10.
kern.log Line 7189 10.10.10.
In line 7241 the SYN packet's source port is 42934, but in line 7242 the SYN-ACK's destination port changed to 1237.
Line 7241 Jul 22 23:44:17 localhost kernel: [ 8013.941489] TRACE: nat:KUBE-
Line 7242 Jul 22 23:44:17 localhost kernel: [ 8013.941551] TRACE: raw:PREROUTING:
Austin Sun (sunausti) wrote : | #15 |
Another interesting test was performed, using the command:
"curl --local-port 47250-67500 --connect-timeout 10 http://
Frank Miller (sensfan22) wrote : | #16 |
Changed the priority to High based on the outcome of discussions with PLs and TLs (Ghada, Brent, Frank). Numerous LPs have been reported over the past few months related to openstack commands intermittently failing. Many of these look like message loss between openstack pods and could be due to this LP.
tags: | added: stx.containers |
Changed in starlingx: | |
importance: | Medium → High |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master) | #17 |
Fix proposed to branch: master
Review: https:/
Changed in starlingx: | |
status: | Triaged → In Progress |
Austin Sun (sunausti) wrote : | #18 |
It seems the curl command on CentOS is not controlled by net.ipv4.
When I run the curl command, the tcpdump log shows source ports ranging from 11728 to 65508.
So the valid test approach should be 'curl --local-port 49216-61000 xxxx'; 49216-61000 is the stx ip_local_port_range.
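For reference, the configured ephemeral range can be checked directly and curl can be pinned to it; a small hedged sketch (the target URL is a placeholder, since the endpoints in this thread are truncated):
# Show the configured ephemeral (local) port range; on this build it was 49216-61000.
sysctl net.ipv4.ip_local_port_range

# Pin curl's source port to that same range when driving the test
SERVICE_URL="http://10.10.10.3:9292/"   # placeholder endpoint, not from the original logs
curl --local-port 49216-61000 --connect-timeout 10 "$SERVICE_URL"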
Austin Sun (sunausti) wrote : | #19 |
Although the curl command turned out to be an invalid test case, the issue can still be shown with Python test code:
import urllib
import datetime

for i in range(1, 20000):
    now = datetime.datetime.now()
    print("start %s at: %s" % (i, now))
    try:
        response = urllib.urlopen("http://
    except:
        now = datetime.datetime.now()
        exit()
    now = datetime.datetime.now()
    print("end %s at: %s" % (i, now))
This issue is still reproducible even with the local port range set to 49216-61000, 45000-65500, or 32768-60999.
More investigation is needed.
Austin Sun (sunausti) wrote : | #20 |
If the glance-api readiness and liveness probes are removed, the test mentioned above runs successfully without failure in 50000 attempts. There must be some conflict between the application traffic and the TCP probe.
Austin Sun (sunausti) wrote : | #21 |
- glance-api.tar.gz (578.1 KiB, application/x-tar)
From the attached tcpdump log:
at No. 26151, there is a packet 10.10.10.
Austin Sun (sunausti) wrote : | #22 |
- k8s_probe.tar.gz (23.7 MiB, application/x-tar)
The attachment is a long trace monitoring port 9292; there are a lot of ports outside ip_local_
for example: No. 233525, the port is 10325, 10.10.10.
The ephemeral ports outside ip_local_
10187: 10325: 11283: 11702: 11891: 12797: 13324: 13336: 13341: 13447: 13470: 13553: 13664: 13685: 13864: 13934: 14393: 15440: 15496: 15520: 16558: 16695: 16794: 16909: 16932: 17017: 17110: 17117: 17269: 17343: 17589: 18024: 18166:
18213: 18274: 18342: 18959: 19275: 19845: 20230: 20499: 21086: 21121: 21141: 21302: 21477: 21816: 21987: 22117: 22296: 23081: 23233: 23621: 24344: 24877: 25243: 25297: 25634: 25637: 25909: 26443: 26454: 27055: 27428: 27597: 27703: 27786:
27860: 27935: 28010: 28105: 28224: 28403: 28847: 28922: 29385: 29714: 29912: 30493: 30547: 30641: 30736: 31264: 31479: 31734: 32003: 32070: 32234: 32531: 32670: 32987: 33032: 33486: 33507: 33509: 33611: 33806: 33926: 34073: 34505: 34522:
34574: 34589: 34819: 35228: 35442: 35779: 36030: 36036: 37493: 37538: 37782: 38567: 38756: 38872: 39561: 39642: 39700: 39903: 39989: 3exmp: 40624: 40858: 40874: 41165: 41231: 41343: 41511: 41687: 41701: 41709: 41773: 41799: 41864:
42103: 42525: 42698: 42837: 4334: 43705: 4388: 44435: 44726: 45212: 45313: 45613: 45752: 45789: 45925: 45927: 46201: 46225: 47263: 47679: 47830: 47945: 47985: 48007: 4826: 48431: 48473: 48646: 48729: 4873: 48852: 4921: 49216: 49217: 49218:
Austin Sun (sunausti) wrote : | #23 |
- echo-glance-api.tar.gz (16.7 KiB, application/x-tar)
When the issue occurred:
in echo-glance-api.cap (tcpdump -i any -vv port 9292 -w echo-glance-
No468: 10.10.10.3:13550 ---> 172.16.192.99:9292
in echo-glance-api.con (conntrack -o timestamp -E -p tcp --dport 9292)
[1564902343.325301] [NEW] tcp 6 120 SYN_SENT src=10.10.10.3 dst=10.98.65.211 sport=53720 dport=9292 [UNREPLIED] src=172.16.192.99 dst=10.10.10.3 sport=9292 dport=13550
[1564902343.325305] [UPDATE] tcp 6 60 SYN_RECV src=10.10.10.3 dst=10.98.65.211 sport=53720 dport=9292 src=172.16.192.99 dst=10.10.10.3 sport=9292 dport=13550
The sport was changed from 53720 to 13550; it seems NAT changed the port. I will check the netfilter code.
OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (master) | #24 |
Change abandoned by Austin Sun (<email address hidden>) on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master) | #25 |
Fix proposed to branch: master
Review: https:/
Austin Sun (sunausti) wrote : | #26 |
- conntrack_tcpdump_kern.zip (3.5 MiB, application/zip)
From this log, the root cause is found.
The probe connection was in TIME_WAIT (
The client sends a new connection 10.10.10.
10.10.10.
=======
describe controller service endpoint STATE
Probe *(50538)
Client New *(50538)----> *(9292)
After NAT
Client *(24479)
IN PREROUTING *(24479)
INPUT Chain *(50538)
=======
tcpdump_
No.24288 10.10.10.
No.24289 172.16.
conntrack_
[1565008856.898709] [UPDATE] tcp 6 120 TIME_WAIT src=10.10.10.3 dst=172.16.192.101 sport=50538 dport=9292 src=172.16.192.101 dst=10.10.10.3 sport=9292 dport=50538 [ASSURED]
[1565008911.774492] [NEW] tcp 6 120 SYN_SENT src=10.10.10.3 dst=10.109.43.235 sport=50538 dport=9292 [UNREPLIED] src=172.16.192.101 dst=10.10.10.3 sport=9292 dport=24479
[1565008911.774522] [UPDATE] tcp 6 60 SYN_RECV src=10.10.10.3 dst=10.109.43.235 sport=50538 dport=9292 src=172.16.192.101 dst=10.10.10.3 sport=9292 dport=24479
Because port 50538 is still in TIME_WAIT, waiting to be released,
NAT cannot reuse this conntrack entry for the new connection, so NAT changes it to 10.10.10.
10.10.10.
kern_9292.log:
mangle change the CT 172.16.
because this 172.16.
so 172.16.
2019-08-
2019-08-
Austin Sun (sunausti) wrote : | #27 |
The probe connection's actions before it goes into the TIME_WAIT state:
Probe connection
controller service endpoint Chain TCP FLAG SEQ ACK
10.10.10.3:50538 -------
10.10.10.3:50538 <------
10.10.10.3:50538 -------
10.10.10.3:50538 -------
10.10.10.3:50538 <------
10.10.10.3:50538 <------
10.10.10.3:50538 -------
And for the curl command connection with the same port 50538, it looks like:
controller service endpoint Chain TCP FLAG SEQ ACK
10.10.10.3:50538 --> 10.109.43.235:9292 raw:OUTPUT:policy:4 SYN 2917708674 0
10.10.10.3:50538 -------
10.10.10.3:24479 <------
10.10.10.3:50538 <------
10.10.10.3:50538 --> 10.109.43.235:9292 raw:OUTPUT:policy:4 ACK 2707980038 1599414187
10.10.10.3:50538 --> 10.109.43.235:9292 filter:
10.10.10.3:50538 --> 10.109.43.235:9292 filter:
The last ACK(10.
from https:/
Because this is not a correct SEQ/ACK, it is marked invalid and then dropped.
If tcp_tw_recycle is enabled, the previous socket should already be closed, and then the issue should be gone.
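The two TIME_WAIT related toggles mentioned here can be inspected with sysctl (tcp_tw_recycle still exists on the CentOS 7 kernel used here, though it was removed from newer kernels):
# Current values of the TIME_WAIT reuse/recycle toggles discussed above
sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle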
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 171c43dca8b419c
Author: Sun Austin <email address hidden>
Date: Mon Aug 5 16:46:43 2019 +0800
Fix Periodic message loss between VIM and Openstack REST APIs
set net.ipv4.
The probe connection action before going to time_wait state.
Probe connection
controller pod TCP FLAG SEQ ACK
controller:
controller:
2707980037
controller:
1599414186
controller:
1599414186
controller:
2707980038
controller:
2707980038
controller:
1599414187
And for the curl command connection with same port 50538: it will be
like
controller pod TCP FLAG SEQ ACK
controller:
controller:
controller:
2917708675
controller:
2917708675
controller:
1599414187
controller:
1599414187
controller:
1599414187
The last ACK(controller:
Probe TIME_WAIT latest ACK’s.
from
https:/
it only check (des ip , des port, src ip, and src port).Because this is
not
a correct SEQ/ACK , then it is set invalid and then dropped.
If enabling tcp_tw_recycle, the previous socket should be already
closed, then the issue should be gone.
Closes-Bug: 1817936
Change-Id: If6e66d85f08fc9
Signed-off-by: Sun Austin <email address hidden>
Changed in starlingx: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0) | #29 |
Fix proposed to branch: r/stx.2.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (r/stx.2.0) | #30 |
Change abandoned by Austin Sun (<email address hidden>) on branch: r/stx.2.0
Review: https:/
Cindy Xie (xxie1) wrote : | #31 |
The change needs to be reverted, due to https:/
Changed in starlingx: | |
status: | Fix Released → Confirmed |
Austin Sun (sunausti) wrote : | #32 |
Once the revert change is merged, I will submit a new change for another parameter, net.ipv4_
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master) | #33 |
Fix proposed to branch: master
Review: https:/
Changed in starlingx: | |
status: | Confirmed → In Progress |
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master) | #34 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit fbc09b8db8a14fb
Author: Sun Austin <email address hidden>
Date: Sat Sep 7 08:31:13 2019 +0800
Fix Periodic message loss between VIM and Openstack REST APIs
set net.ipv4.
and remove customizing ephemeral port range
The probe connection action before going to time_wait state.
Probe connection
controller pod TCP FLAG SEQ ACK
controller:
controller:
2707980037
controller:
1599414186
controller:
1599414186
controller:
2707980038
controller:
2707980038
controller:
1599414187
And for the curl command connection with same port 50538: it will be
like
controller pod TCP FLAG SEQ ACK
controller:
controller:
controller:
2917708675
controller:
2917708675
controller:
1599414187
controller:
1599414187
controller:
1599414187
The last ACK(controller:
Probe TIME_WAIT latest ACK’s.
from
https:/
it only check (des ip , des port, src ip, and src port).Because this is
not
a correct SEQ/ACK , then it is set invalid and then dropped.
If disable tcp_tw_reuse, the port nova-api will be always not same as
pod probe using, then the issue should be gone.
set back default(centos) ephemeral port range to avoid ephemeral port
exhaustion .
Closes-Bug: 1817936
Change-Id: I0b37e9829ac5d3
Signed-off-by: Sun Austin <email address hidden>
Changed in starlingx: | |
status: | In Progress → Fix Released |
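Based on the commit text above (partly truncated in this report), the resulting host configuration amounts to disabling tcp_tw_reuse and going back to the default ephemeral port range; a hedged way to verify that on a running controller:
# Expected values after the fix (inferred from the commit message, not verified here)
sysctl net.ipv4.tcp_tw_reuse            # expected: 0 (disabled)
sysctl net.ipv4.ip_local_port_range     # expected: the CentOS default (e.g. 32768 60999)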
Ghada Khalil (gkhalil) wrote : | #35 |
@Austin, the change has sufficient soak in master and sanity is green, please cherrypick to r/stx.2.0
Ghada Khalil (gkhalil) wrote : | #36 |
@Austin, when the change has sufficient soak in master and sanity is green, please cherrypick to r/stx.2.0
Austin Sun (sunausti) wrote : | #37 |
Hi Ghada:
The previous change was merged and sanity was green, but it still caused a problem, so I think sanity does not cover some cases. I will check with the test team whether they can perform some regression tests after the daily sanity is green.
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0) | #38 |
Fix proposed to branch: r/stx.2.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0) | #39 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: r/stx.2.0
commit 457b4747e92255f
Author: Sun Austin <email address hidden>
Date: Sun Sep 22 19:33:04 2019 +0800
Fix Periodic message loss between VIM and Openstack REST APIs
set net.ipv4.
and remove customizing ephemeral port range
The probe connection action before going to time_wait state.
Probe connection
controller pod TCP FLAG SEQ ACK
controller:
controller:
2707980037
controller:
1599414186
controller:
1599414186
controller:
2707980038
controller:
2707980038
controller:
1599414187
And for the curl command connection with same port 50538: it will be
like
controller pod TCP FLAG SEQ ACK
controller:
controller:
controller:
2917708675
controller:
2917708675
controller:
1599414187
controller:
1599414187
controller:
1599414187
The last ACK(controller:
Probe TIME_WAIT latest ACK’s.
from
https:/
it only check (des ip , des port, src ip, and src port).Because this is
not a correct SEQ/ACK , then it is set invalid and then dropped.
If disable tcp_tw_reuse, the port nova-api will be always not same as
pod probe using, then the issue should be gone.
set back default(centos) ephemeral port range to avoid ephemeral port
exhaustion .
Closes-Bug: 1817936
Change-Id: I9a39e3b439633f
Signed-off-by: Sun Austin <email address hidden>
tags: | added: in-r-stx20 |
Ghada Khalil (gkhalil) wrote : | #40 |
The following testing was done by Elio Martinez with an stx.2.0 designer build with this fix
Config | Sanity | Horizon | VM operation | Ping 10 Hrs | Script from bug
---|---|---|---|---|---
AIOX | GREEN | PASS | PASS | PASS | PASS
MULTI | GREEN | PASS | PASS | PASS | PASS
Important highlights:
1. All scenarios were executed on AIOX and MULTI (2+2) systems.
2. Sanity execution was successful without any kind of problem (GREEN).
3. Horizon tests were executed as well (pause, suspend, unpause, create, delete instances).
4. VM operations were done using Horizon as well; no problem was detected.
5. Ping between instances was executed for at least 10 hours on each configuration.
6. I decided to execute the script mentioned in https:/
Conclusion:
No problem was found with requested testing on http://