app port unreachable from external

Bug #1841802 reported by Peng Peng
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   High         zhipeng liu

Bug Description

Brief Description
-----------------
Apply a new test app, such as the hello-kitty app. After the apply succeeds, try to
wget the app via <oam_ip>:<targetPort>; the connection fails.

Severity
--------
Major

Steps to Reproduce
------------------
Upload hello-kitty helm charts
Apply hello-kitty
wget app via <oam_ip>:<targetPort>

TC-name: z_containers/test_custom_containers.py::test_launch_app_via_sysinv
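
The final reproduction step can also be checked without wget. The following is a minimal sketch, not part of the test case named above: `port_reachable` is a hypothetical helper that attempts a plain TCP connection with a timeout, so a hang like the one reported here shows up as `False` instead of blocking. The host and port are placeholders supplied by the caller.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

It would be called with the OAM IP and the NodePort reported by kubectl, for example `port_reachable("<oam_ip>", 31124)`.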

Expected Behavior
------------------
wget of the app succeeds

Actual Behavior
----------------
wget hangs while connecting to the app port

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system
Multi-node system

Lab-name: SM-3, WCP_63-66, WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190828T013000Z

Last Pass
---------
Lab: WCP_63_66
Load: 2019-08-26_20-59-00
Job: StarlingX_2.0_build

Timestamp/Logs
--------------
[2019-08-28 08:25:35,582] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-upload -n hello-kitty -v 1.0 /home/sysadmin//custom_apps/hello-kitty.tgz'

[2019-08-28 08:25:58,875] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-apply hello-kitty'

[2019-08-28 08:26:40,599] 301 DEBUG MainThread ssh.send :: Send 'kubectl get service/hk-hello-kitty-hello-kit -o jsonpath="{.spec.ports[0].nodePort}";echo'
[2019-08-28 08:26:40,817] 423 DEBUG MainThread ssh.expect :: Output:
31124

[2019-08-28 08:26:43,890] 301 DEBUG MainThread ssh.send :: Send 'wget http://128.224.151.85:31124 -O /sandbox/AUTOMATION_LOGS/wcp_63_66/201908280358/tmp_files//hello-kitty.html'
[2019-08-28 08:27:44,022] 394 WARNING MainThread ssh.expect :: No match found for ['svc\\-cgcsauto\\@yow\\-cgcs\\-test\\$\\ '].
expect timeout.
[2019-08-28 08:27:44,022] 779 DEBUG MainThread ssh.send_control:: Sending ctrl+c
[2019-08-28 08:27:44,024] 423 DEBUG MainThread ssh.expect :: Output:
--2019-08-28 04:26:43-- http://128.224.151.85:31124/
Connecting to 128.224.151.85:31124... ^C

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Yang Liu (yliu12) wrote :

Ports are open for tcp6 only.

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get service
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
hk-hello-kitty-hello-kit   NodePort    10.101.117.251   <none>        55555:31987/TCP   5h25m
kubernetes                 ClusterIP   10.96.0.1        <none>        443/TCP           7h16m

[sysadmin@controller-0 ~(keystone_admin)]$ sudo netstat -tulpn | grep LISTEN | grep 31987
tcp6 0 0 :::31987 :::* LISTEN 121595/kube-proxy

Note that the same issue is seen with the stx-openstack Horizon port 31000, which makes OpenStack Horizon unreachable.

[sysadmin@controller-0 ~(keystone_admin)]$ sudo netstat -tulpn | grep LISTEN | grep 31000
tcp6 0 0 :::31000 :::* LISTEN 121595/kube-proxy
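
When comparing listeners across hosts, `netstat -tulpn` lines like the ones above can be parsed mechanically. The sketch below is illustrative only: `Listener` and `parse_listener` are hypothetical names, and the code assumes the seven-column layout shown in the output above. Note also that on Linux a `tcp6` listener bound to `:::<port>` normally accepts IPv4 connections as well, unless `net.ipv6.bindv6only` is set to 1, so a tcp6-only entry does not by itself show that IPv4 clients are refused.

```python
from typing import NamedTuple, Optional


class Listener(NamedTuple):
    proto: str
    port: int
    pid: Optional[int]
    program: Optional[str]


def parse_listener(line: str) -> Optional[Listener]:
    """Parse one LISTEN line of `netstat -tulpn` output; return None otherwise."""
    fields = line.split()
    if len(fields) < 7 or fields[5] != "LISTEN":
        return None
    proto, local_addr, owner = fields[0], fields[3], fields[6]
    # The port is whatever follows the last colon, e.g. ":::31987" -> 31987.
    port = int(local_addr.rsplit(":", 1)[1])
    # The owner column is "PID/program", or "-" without sufficient privileges.
    if "/" in owner:
        pid_text, program = owner.split("/", 1)
        return Listener(proto, port, int(pid_text), program)
    return Listener(proto, port, None, None)
```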

Yang Liu (yliu12) wrote :

Note that this issue is seen only, and always, when stx-openstack is deployed.

The test app is reachable externally via <oam_ip>:<NodePort> when stx-openstack is not deployed.
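
For reference, the `PORT(S)` column that `kubectl get service` prints for a NodePort service has the form `port:nodePort/protocol`, and the second number is the one to combine with `<oam_ip>`. A small sketch of that parsing follows; `parse_port_mapping` is a hypothetical helper, not something taken from the test suite.

```python
from typing import Optional, Tuple


def parse_port_mapping(entry: str) -> Tuple[int, Optional[int], str]:
    """Split one PORT(S) entry into (service_port, node_port, protocol)."""
    ports, _, protocol = entry.partition("/")
    service_port, sep, node_port = ports.partition(":")
    # ClusterIP services list only the service port, e.g. "443/TCP".
    return int(service_port), (int(node_port) if sep else None), protocol


print(parse_port_mapping("55555:31987/TCP"))  # (55555, 31987, 'TCP')
print(parse_port_mapping("443/TCP"))          # (443, None, 'TCP')
```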

Frank Miller (sensfan22) wrote :

As this issue only occurs when the stx-openstack application is applied, tagging this with the stx.distro.openstack label even though the issue is reported on a generic application. This may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1841833.

Marking stx.3.0 gating and high priority as applications cannot be accessed.

Assigning to the distro.openstack PL to determine who should prime this bug.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
tags: added: stx.3.0 stx.distro.openstack
Yang Liu (yliu12) wrote :

I have to take back what I said in comment #3.

With more data from today's sanity run, I think this issue is intermittent overall.
Frequency since it was first seen on the Aug 26th load: 10 out of 15 times.

yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
zhipeng liu (zhipengs) wrote :

Hi all,

Based on the change history after 8/24 and the related scenario,
this may be caused by the patch below:
https://review.opendev.org/#/c/674719/
which enabled the following flag:
net.ipv4.tcp_tw_recycle = 1
This flag can cause many TCP SYN packets to be dropped in a NAT environment.

Thanks!
Zhipeng

zhipeng liu (zhipengs) wrote :

Hi Peng,

You can try the command below.
controller-0:~$ sudo sysctl -w net.ipv4.tcp_tw_recycle=0
It reverts the change made by #674719.
If the issue can no longer be reproduced after that, this bug can be marked as a duplicate of LP 1841833.

Thanks!
Zhipeng
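
Note that `sysctl -w` changes only the running kernel's value; it does not persist across a reboot. To confirm the current setting on a node, the value can be read back from `/proc/sys`. The sketch below assumes a Linux host; `read_sysctl` is a hypothetical helper, and it returns `None` when the key does not exist, which is the case for `net.ipv4.tcp_tw_recycle` on kernels 4.12 and later, where the option was removed.

```python
from pathlib import Path
from typing import Optional


def read_sysctl(name: str, proc_root: str = "/proc/sys") -> Optional[str]:
    """Return the value of a sysctl key such as 'net.ipv4.tcp_tw_recycle'."""
    # sysctl keys map to paths under /proc/sys with dots replaced by slashes.
    path = Path(proc_root).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # Missing key or insufficient permissions.
        return None
```

On an affected controller, `read_sysctl("net.ipv4.tcp_tw_recycle")` would be expected to return `"1"` before the command above is run and `"0"` afterwards.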

Changed in starlingx:
status: Triaged → Incomplete
zhipeng liu (zhipengs) wrote :

Any update, Peng Peng?

Thanks!

Zhipeng

Peng Peng (ppeng) wrote :

Executed the command:
[sysadmin@controller-0 ~(keystone_admin)]$ sudo sysctl -w net.ipv4.tcp_tw_recycle=0
Password:
net.ipv4.tcp_tw_recycle = 0

Then I reran the TC. It failed again, but the output looks different.

Rerun output:

[2019-09-09 14:06:19,972] 301 DEBUG MainThread ssh.send :: Send 'wget https://128.224.150.148:30433 -O /sandbox/AUTOMATION_LOGS/wcp_112/201909091002/tmp_files//hello-kitty.html'
[2019-09-09 14:06:20,096] 423 DEBUG MainThread ssh.expect :: Output:
--2019-09-09 10:06:20-- https://128.224.150.148:30433/
Connecting to 128.224.150.148:30433... connected.
OpenSSL: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
Unable to establish SSL connection.
svc-cgcsauto@yow-cgcs-test$
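
OpenSSL's `unknown protocol` error generally means that the bytes received in reply to the client's handshake did not parse as a TLS record; one common cause is a server answering in plain HTTP on that port. Whether that applies here is not established in this report. As an illustration only, the hypothetical helper below classifies the first bytes of a server reply using the TLS record header layout (content type, then major version 3).

```python
def classify_first_bytes(data: bytes) -> str:
    """Guess whether a server's first bytes are a TLS record or plain HTTP."""
    # A TLS record starts with a content type (20-23) and major version 3.
    if len(data) >= 3 and 20 <= data[0] <= 23 and data[1] == 3:
        return "tls"
    # An HTTP/1.x response starts with the literal status-line prefix.
    if data.startswith(b"HTTP/"):
        return "http"
    return "unknown"
```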

Yang Liu (yliu12)
Changed in starlingx:
status: Incomplete → Confirmed
Ghada Khalil (gkhalil) wrote :

It appears that Peng opened a new bug for the failure signature above:
https://bugs.launchpad.net/starlingx/+bug/1843908

So, based on Zhipeng's original analysis, I'll mark this as a duplicate of https://bugs.launchpad.net/starlingx/+bug/1841833

Ghada Khalil (gkhalil) wrote :

Bug #1841833 was fixed by backing out the commit that Zhipeng references above:
https://review.opendev.org/680907

Merged on 2019-09-10

Marking as Fix Released

Changed in starlingx:
status: Confirmed → Fix Released
Peng Peng (ppeng)
tags: removed: stx.retestneeded