Containers: Reapply from controller-1 after swact hung on generating overrides for 10+ minutes due to RPC response timeout

Bug #1817941 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
After a swact to controller-1, reapplying stx-openstack without any chart changes hangs on "Generating application overrides" for 10+ minutes, with an RPC response timeout error in the sysinv log.

- The application does eventually apply successfully.
- This issue does not happen on subsequent reapplies from controller-1.
- This issue does not happen when reapplying after swacting to controller-0.

Severity
--------
Minor

Steps to Reproduce
------------------
- Install and configure the system, then deploy the stx-openstack application.
- Swact the active controller to controller-1.
- Run "system application-apply stx-openstack" from controller-1 after the swact (see the illustrative command sequence below).

Expected Behavior
------------------
- Reapply succeeds very quickly (normally within 1 minute)

Actual Behavior
----------------
- Reapply gets stuck at "generating application overrides" for about 10 minutes before eventually completing.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-25

Timestamp/Logs
--------------
[2019-02-26 18:59:55,310] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-02-26 19:02:14,727] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-apply stx-openstack'

[2019-02-26 19:17:47,162] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------+-----------------+------------------------+----------+----------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+----------+----------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | applying | generating application overrides |
+---------------+-----------------+------------------------+----------+----------------------------------+
controller-1:~$

sysinv.log:
2019-02-26 19:02:16.477 178122 INFO sysinv.conductor.kube_app [-] Generating application overrides...
2019-02-26 19:02:17.007 178122 INFO sysinv.helm.neutron [req-6f73c478-0811-4605-8b93-f44da59f9c6a admin admin] _get_neutron_ml2_config={'ml2': {'physical_network_mtus': 'group0-data0:1500'}}
2019-02-26 19:04:00.408 3710 INFO sysinv.agent.manager [-] ilvg_get_nova_ilvg_by_ihost() Timeout.
2019-02-26 19:04:00.417 3710 INFO sysinv.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2019-02-26 19:05:00.478 3710 ERROR sysinv.openstack.common.periodic_task [-] Error during AgentManager._agent_audit: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_host_ttys_dcd" info: "<unknown>"
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task Traceback (most recent call last):
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task task(self, context)
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1035, in _agent_audit
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task force_updates=None)
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 315, in inner
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task return f(*args, **kwargs)
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1134, in agent_audit
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task self._update_ttys_dcd_status(icontext, self._ihost_uuid)
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 234, in _update_ttys_dcd_status
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task ttys_dcd = rpcapi.get_host_ttys_dcd(context, host_id)
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1174, in get_host_ttys_dcd
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task ihost_id=ihost_id))
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task exc.info, real_topic, msg.get('method'))
2019-02-26 19:05:00.478 3710 TRACE sysinv.openstack.common.periodic_task Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_host_ttys_dcd" info: "<unknown>"

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; low priority given that the operation completes and is only slow right after a swact operation.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.2019.05 stx.containers stx.helpwanted
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Brent, this may be an issue with a stale TCP connection between tiller and the k8s API on swact. The connection is not cleaned up until a timeout occurs.

tags: removed: stx.helpwanted
Changed in starlingx:
importance: Low → Medium
assignee: nobody → Chris Friesen (cbf123)
importance: Medium → High
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Chris Friesen (cbf123) → Al Bailey (albailey1974)
Revision history for this message
Al Bailey (albailey1974) wrote :

The stale TCP connections can be observed through this netstat command:
netstat -ptown | grep kube-api

The floating IP is torn down and set up again on a swact.
The interaction between tiller and kube-apiserver is unable to detect this. The TCP keepalive (5 probes at 3 minutes each) takes 15 minutes to tear the connection down.
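
A minimal inspection sketch (run on the active controller; the exact socket list will vary per system):

  # -o shows the keepalive timer for each socket; a stale socket stays in ESTABLISHED
  # counting down keepalive probes until the dead peer is finally detected.
  netstat -ptown | grep kube-api
  # With 5 probes at a 3-minute interval, detection takes roughly:
  #   5 * 180 s = 900 s = 15 minutes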

Revision history for this message
Al Bailey (albailey1974) wrote :

Matt looked into this. Here are his findings...

The tiller-deploy pod is running in the cluster network namespace. Therefore it is not inheriting the host TCP keepalive parameters and is using the network namespace defaults (listed below). Because those values are large, it is the kube-apiserver that ends up detecting the timeout, and its settings put that at 15 minutes (3-minute interval * 5 probes).

namespace (cluster):
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

namespace (host):
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_time = 5

Matt has provided a solution for this.
Currently testing a fix to specify the host network when tiller is being set up.
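
A minimal sketch for comparing the two sets of values, assuming nsenter is available on the controller and <tiller-pid> stands in for the PID of the tiller-deploy container (hypothetical placeholder):

  # Host network namespace values (what the fix makes tiller inherit)
  sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

  # Values seen from inside the tiller pod's (cluster) network namespace
  nsenter -t <tiller-pid> -n sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes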

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657087

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Al Bailey (albailey1974) wrote :

Tested the swact to controller-1 and the (re)apply.
Not seeing the hangs reported in the original bug.

Application overrides occurred at: 2019-05-03 23:39:06.377

2019-05-03 23:41:12.247 474629 INFO sysinv.conductor.kube_app [-] All docker images for application stx-openstack were successfully downloaded in 114 seconds
2019-05-03 23:41:12.257 474629 INFO sysinv.conductor.kube_app [-] Starting Armada service...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/657087
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4802f1d96a1217124e39a057fd7a05e22177b81c
Submitter: Zuul
Branch: master

commit 4802f1d96a1217124e39a057fd7a05e22177b81c
Author: Al Bailey <email address hidden>
Date: Fri May 3 15:11:57 2019 -0500

    Changing tiller pod networking settings to improve swact time

    Based on investigation by Matt, the tiller-deploy pod was running
    in the cluster network namespace and therefore not inheriting host
    TCP keepalive parameters.

    During a swact, when the floating IP is taken down, tiller keepalive
    is so large its the kube-apiserver detects the timeout after 15 minutes
    (5 probes * 180 seconds)

    The cluster namespace values are 9 probes at 75 second intervals.
    The host TCP values are 5 consecutive probes at 1 second intervals.

    By changing the tiller pod to be deployed using the host network,
    it will inherit the host sysctl values and detect much more quickly.
    (10 seconds)

    Adding additional override settings during helm init for tiller
    helm init <params> --override spec.template.spec.hostNetwork=true

    These changes were added to the ansible playbook.

    Change-Id: I218e4ef37100950c8ac5a0cb9759d9df50d9e368
    Closes-Bug: 1817941
    Partial-Bug: 1818123
    Co-Authored-By: Matt Peters <email address hidden>
    Signed-off-by: Al Bailey <email address hidden>
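
A minimal verification sketch, assuming kubectl access on the active controller and that tiller is deployed in the kube-system namespace (the default for helm init):

  # Should print "true" once the override is in effect
  kubectl -n kube-system get deployment tiller-deploy -o jsonpath='{.spec.template.spec.hostNetwork}'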

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

Test passed on the following load: 2019-06-03_18-34-53.
helm list worked and the reapply completed shortly after the swact.

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/671095

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/671096

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/671095
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=fe10dcbfed9fd4a6b0e4494cd6d414bf78f03bab
Submitter: Zuul
Branch: master

commit fe10dcbfed9fd4a6b0e4494cd6d414bf78f03bab
Author: Bart Wensley <email address hidden>
Date: Tue Jul 16 10:53:08 2019 -0500

    Revert "Changing tiller pod networking settings to improve swact time"

    This reverts commit 4802f1d96a1217124e39a057fd7a05e22177b81c
    from the config repo. The change made is no longer necessary due
    to commit 9a4b6b6a (also in the config repo). The playbookconfig
    code from that commit was moved to the ansible-playbooks repo
    and is being removed here.

    Change-Id: I3b198edc3c8dc39646f3296089c6d554da5416e4
    Related-Bug: 1817941
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/671096
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=a5c236dc522c050b036e638955c03074a2963996
Submitter: Zuul
Branch: master

commit a5c236dc522c050b036e638955c03074a2963996
Author: Bart Wensley <email address hidden>
Date: Mon Jul 15 13:05:35 2019 -0500

    Revert "Changing tiller pod networking settings to improve swact time"

    This reverts commit 4802f1d96a1217124e39a057fd7a05e22177b81c.
    The change made is no longer necessary due to commit 9a4b6b6a.
    The playbookconfig code was moved to the ansible-playbooks repo
    and will be removed there.

    Conflicts:
     playbookconfig/centos/build_srpm.data
     playbookconfig/playbookconfig/playbooks/bootstrap/roles/bringup-essential-services/tasks/bringup_helm.yml
     puppet-manifests/centos/build_srpm.data
     puppet-manifests/src/modules/platform/manifests/helm.pp

    Change-Id: I20a38c1ad882bebb6e1208f43d6582bc399e9e87
    Related-Bug: 1817941
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/672741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/672742

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/672741
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ae145b78f8b8891ac718fa2a4ea4b5c5a510c306
Submitter: Zuul
Branch: master

commit ae145b78f8b8891ac718fa2a4ea4b5c5a510c306
Author: Bart Wensley <email address hidden>
Date: Wed Jul 24 14:47:56 2019 -0500

    Revert "Revert "Changing tiller pod networking settings to improve swact time""

    This reverts commit a5c236dc522c050b036e638955c03074a2963996.

    It was thought that setting the TCP timeouts for the cluster
    network was enough to address the issues with the helm commands
    hanging after a controller swact. This is not the case. In
    particular, swacting away from the controller with the
    tiller-deploy pod seems to cause tcp connection from that pod to
    the kube-apiserver to hang. Putting the tiller-deploy pod back on
    the host network "fixes" the issue.

    Change-Id: I8f37530e1f615afcffcf6cb1d629518436c99cb9
    Related-Bug: 1817941
    Partial-Bug: 1837055
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/672742
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=5a1fe1616e1c541a4cdd076b672f217c22d5c843
Submitter: Zuul
Branch: master

commit 5a1fe1616e1c541a4cdd076b672f217c22d5c843
Author: Bart Wensley <email address hidden>
Date: Wed Jul 24 14:41:57 2019 -0500

    Revert "Revert "Changing tiller pod networking settings to improve swact time""

    This reverts commit fe10dcbfed9fd4a6b0e4494cd6d414bf78f03bab.

    It was thought that setting the TCP timeouts for the cluster
    network was enough to address the issues with the helm commands
    hanging after a controller swact. This is not the case. In
    particular, swacting away from the controller with the
    tiller-deploy pod seems to cause tcp connection from that pod to
    the kube-apiserver to hang. Putting the tiller-deploy pod back on
    the host network "fixes" the issue.

    Change-Id: I89c4db6dc063f238c70fad4e913577046e5452f5
    Related-Bug: 1817941
    Partial-Bug: 1837055
    Signed-off-by: Bart Wensley <email address hidden>
