StarlingX

reapplying stx-openstack application failed on swacted host

Bug #1837055 reported by Peng Peng on 2019-07-18

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Low	Matt Peters

Bug Description

Brief Description
-----------------
After host swact success, tried to reapply stx-openstack, but the status stuck at "applying application manifest", and eventually apply-failed by operation aborted.

Severity
--------
Major

Steps to Reproduce
------------------
host-swact
application-apply
application-list

TC-name: z_containers/test_openstack_services.py::test_reapply_stx_openstack_no_change[controller-1]

Expected Behavior
------------------
reapply success

Actual Behavior
----------------
reapply failed

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx master as of 20190718T013000Z

Last Pass
---------
2019-05-29_17-05-57

Timestamp/Logs
--------------
[2019-07-18 11:30:42,593] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-07-18 11:41:05,280] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-18 11:41:06,799] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applying | applying application manifest |

Test Activity
-------------
Sanity

Tags:

Revision history for this message

Peng Peng (ppeng) wrote on 2019-07-18:

ALL_NODES_20190718.135724.tar Edit (22.9 MiB, application/x-tar)

Revision history for this message

Yang Liu (yliu12) wrote on 2019-07-18:

Just a note that 'helm list' cmd was hanging when this issue was seen.

Ghada Khalil (gkhalil) on 2019-07-19

tags:	added: stx.containers
tags:	added: stx.2.0
tags:	removed: stx.2.0

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-07-19:

On SWACT, the TCP connections will timeout after 30 seconds and then get re-established as per the following review which merged on 2019-07-15: https://review.opendev.org/#/c/670822/

How long did the TC wait after the swact before re-applying stx-openstack?

Changed in starlingx:
status:	New → Incomplete
assignee:	nobody → Peng Peng (ppeng)

Revision history for this message

Peng Peng (ppeng) wrote on 2019-07-19:

wait for 5-6 mins

11:30 host-swact controller-0
11:31 controller-1 Login successful
11:36 application-apply stx-openstack
11:41 apply-failed

Ghada Khalil (gkhalil) on 2019-07-19

Changed in starlingx:
assignee:	Peng Peng (ppeng) → Bart Wensley (bartwensley)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-07-22:

Marking as stx.2.0 given that the TCP connections should have been re-established given the 5-6min wait.

tags:	added: stx.2.0
Changed in starlingx:
status:	Incomplete → Triaged
importance:	Undecided → High

Numan Waheed (nwaheed) on 2019-07-22

tags:

added: stx.retestneeded

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-25: Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672741

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-25: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672742

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-25: Fix merged to config (master)

Reviewed: https://review.opendev.org/672741
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ae145b78f8b8891ac718fa2a4ea4b5c5a510c306
Submitter: Zuul
Branch: master

commit ae145b78f8b8891ac718fa2a4ea4b5c5a510c306
Author: Bart Wensley <email address hidden>
Date: Wed Jul 24 14:47:56 2019 -0500

Revert "Revert "Changing tiller pod networking settings to improve swact time""

This reverts commit a5c236dc522c050b036e638955c03074a2963996.

    It was thought that setting the TCP timeouts for the cluster
    network was enough to address the issues with the helm commands
    hanging after a controller swact. This is not the case. In
    particular, swacting away from the controller with the
    tiller-deploy pod seems to cause tcp connection from that pod to
    the kube-apiserver to hang. Putting the tiller-deploy pod back on
    the host network "fixes" the issue.

    Change-Id: I8f37530e1f615afcffcf6cb1d629518436c99cb9
    Related-Bug: 1817941
    Partial-Bug: 1837055
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-25: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/672742
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=5a1fe1616e1c541a4cdd076b672f217c22d5c843
Submitter: Zuul
Branch: master

commit 5a1fe1616e1c541a4cdd076b672f217c22d5c843
Author: Bart Wensley <email address hidden>
Date: Wed Jul 24 14:41:57 2019 -0500

Revert "Revert "Changing tiller pod networking settings to improve swact time""

This reverts commit fe10dcbfed9fd4a6b0e4494cd6d414bf78f03bab.

    Change-Id: I89c4db6dc063f238c70fad4e913577046e5452f5
    Related-Bug: 1817941
    Partial-Bug: 1837055
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message

Bart Wensley (bartwensley) wrote on 2019-07-25:

#10

The issue should be fixed now, but I am assigning it to Matt in case he has time to look for the root cause of the issue.

Frank Miller (sensfan22) on 2019-07-25

Changed in starlingx:
assignee:	Bart Wensley (bartwensley) → Matt Peters (mpeters-wrs)

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-08-06:

#11

Based on Bart's change and analysis that the issue is now fixed, lowering priority of this LP and removing the stx.2.0 tag. It would be good to identify root cause eventually so adding an stx.3.0 tag for now.

Changed in starlingx:
importance:	High → Low
tags:	added: stx.3.0 removed: stx.2.0

Revision history for this message

Matt Peters (mpeters-wrs) wrote on 2019-08-09:

#12

Further details of the investigation as to why changing the TCP keepalive parameters of the container was not sufficient.

When the cluster networking is used, and a swact occurs, the TCP connection to the kube-apiserver remains active because the NATed connection is not able to route the packets to the new kube-apiserver endpoint on the other controller. The socket is therefore open, but not able to communicate with the far-end, which means the socket must timeout before being cleaned up. With host networking, the packets are routed to the new destination which results in an immediate TCP reset being sent which closes the connection immediately.

The TCP keepalive timer for the tiller connection to the kube-apiserver socket is 30s, combined with the 5 probes for a total of 2.5mins. Therefore, the system settings are not being used, therefore the client side of the connection would not close the connection faster than the options set on the socket. However, it is still taking 15mins to close the connection which is based on the failed retransmit timeout that is configured by the system setting of net.ipv4.tcp_retries2=15. This timer will be activated if a request is made on the socket, overriding the keepalive timer.
Therefore, if no request is made for 2.5mins, the request will be complete successfully since the connection would have been cleaned up by the keepalive timeout. If a request is made within that 2.5mins window, the connection will remain active until the tcp_restries2 timeout.

The kube-apiserver TCP keepalive timer is set to 300s with 5 probes, so it will not be able to detect the failed connection for 15mins.

Based on the above information, it is recommended to keep the hostNetwork configuration option for the Tiller Pod since it provides the fastest socket cleanup time, re-enabling the connectivity.