Containers: Re-apply openstack application without modification gets stuck

Bug #1815465 reported by Yang Liu
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: John Kung

Bug Description

Brief Description
-----------------
Reapplying the stx-openstack application without modification gets stuck at ‘generating application overrides’ for 15+ minutes. During that time the sysinv log shows a timeout waiting for an RPC response to get_host_ttys_dcd.

Severity
--------
Major

Steps to Reproduce
------------------
1. Install and configure the platform
2. Deploy stx-openstack application
3. Reapply stx-openstack via 'system application-apply stx-openstack' without modifying helm-charts
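
The wait in step 3 can be bounded with a small polling helper so that a stuck reapply fails fast instead of hanging. This is only a sketch: `wait_for_apply` and its `get_status` callable are hypothetical names, and the caller is assumed to supply a function that returns the `status` column from 'system application-list'. The only status value taken from this report is `applying`; any other value is treated as terminal.

```python
import time


def wait_for_apply(get_status, timeout_s=300.0, interval_s=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it stops returning 'applying'.

    get_status is a caller-supplied callable (hypothetical) that returns the
    current application status as a string. Returns the first status that is
    not 'applying', or raises TimeoutError once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status != "applying":
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "application still 'applying' after %.0f s" % timeout_s)
        sleep(interval_s)
```

The clock and sleep functions are parameters so the helper can be exercised without real delays.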

Expected Behavior
------------------
- Reapply should complete quickly, within a few minutes, because no changes were made to the helm-charts.

Actual Behavior
----------------
- Reapplying the openstack application without modification gets stuck at ‘generating application overrides’ for 15+ minutes
- After that, it is stuck at ‘retrieving docker images’ for 17+ hours

Reproducibility
---------------
Unknown.
This test cannot be re-executed while the system is in the bad state. I will try this again after the system is re-installed.

System Configuration
--------------------
Two-node Supermicro system

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-05

Timestamp/Logs
--------------
sysinv log:
[2019-02-07 22:45:54,788] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-apply stx-openstack'

[2019-02-07 22:51:09,259] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------+-----------------+------------------------+----------+----------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+----------+----------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | applying | generating application overrides |
+---------------+-----------------+------------------------+----------+----------------------------------+
controller-1:~$
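
For anyone scripting this check, the table above can be turned into records with a few lines of Python. This is a rough sketch, not part of any StarlingX tooling: `parse_cli_table` is a hypothetical helper, and it assumes the first `|`-delimited row is the header and that cells never contain a `|` character.

```python
def parse_cli_table(text):
    """Parse an ASCII table ('+---+' borders, '|' cell separators) into dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data-like row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Table copied from the log output above.
SAMPLE = """
+---------------+-----------------+------------------------+----------+----------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+----------+----------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | applying | generating application overrides |
+---------------+-----------------+------------------------+----------+----------------------------------+
controller-1:~$
"""

rows = parse_cli_table(SAMPLE)
print(rows[0]["status"], "/", rows[0]["progress"])
```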

Traceback in sysinv log:
2019-02-07 22:48:43.697 9195 ERROR sysinv.openstack.common.periodic_task [-] Error during AgentManager._agent_audit: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_host_ttys_dcd" info: "<unknown>"
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task Traceback (most recent call last):
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task task(self, context)
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1035, in _agent_audit
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task force_updates=None)
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 315, in inner
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task return f(*args, **kwargs)
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1134, in agent_audit
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task self._update_ttys_dcd_status(icontext, self._ihost_uuid)
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 234, in _update_ttys_dcd_status
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task ttys_dcd = rpcapi.get_host_ttys_dcd(context, host_id)
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1174, in get_host_ttys_dcd
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task ihost_id=ihost_id))
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task exc.info, real_topic, msg.get('method'))
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_host_ttys_dcd" info: "<unknown>"
2019-02-07 22:48:43.697 9195 TRACE sysinv.openstack.common.periodic_task

# Application overrides finally generated at:
2019-02-07 23:01:38.773 16509 INFO sysinv.conductor.kube_app [-] Application overrides generated.
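
The "15+ minutes" figure can be checked against the two timestamps quoted in this report: the application-apply command sent at 22:45:54,788 and the "Application overrides generated" entry at 23:01:38.773. The sketch below assumes both logs use the same clock and time zone, which the report does not state; `parse_ts` is a hypothetical helper.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse 'YYYY-MM-DD HH:MM:SS,mmm' or 'YYYY-MM-DD HH:MM:SS.mmm'."""
    return datetime.strptime(ts.replace(",", "."), "%Y-%m-%d %H:%M:%S.%f")


apply_sent = parse_ts("2019-02-07 22:45:54,788")      # application-apply sent
overrides_done = parse_ts("2019-02-07 23:01:38.773")  # overrides generated

elapsed = overrides_done - apply_sent
print("overrides took %.1f minutes" % (elapsed.total_seconds() / 60))
```

Under that assumption the gap comes out at just under 16 minutes, consistent with the description.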

# However, it is then stuck in ‘retrieving docker images’ for 17+ hours:
2019-02-07 23:02:49.183 16509 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:latest download failed from local registry: 500 Server Error: Internal Server Error ("Get http://192.168.204.2:9001/v2/: dial tcp 192.168.204.2:9001: getsockopt: connection refused")

Frank Miller (sensfan22)
tags: added: stx.containers
Yang Liu (yliu12) wrote :

Errors after 2019-02-07T23:02:23.000 can be ignored, since a swact was triggered after that.

Ghada Khalil (gkhalil) wrote :

Marking as release gating; related to container env.
Need to determine how reproducible this issue is. Asked the originator to monitor for a re-occurrence.

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Frank Miller (sensfan22)
status: New → Triaged
tags: added: stx.2019.05
Frank Miller (sensfan22) wrote :

Assigning to John to prime triage of all re-apply issues, with the goal of getting re-apply working reliably and repeatedly.

Changed in starlingx:
assignee: Frank Miller (sensfan22) → John Kung (john-kung)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
John Kung (john-kung) wrote :

This issue is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1817941

The issue with the overrides generation and image operation on application apply is similar to the issue described in 1817941, which is triggered by Tiller not responding to image list shortly after the host-swact (host-swact at 2019-02-07T22:41:18.089).

| 2019-02-07T22:41:18.089 | 4528 | node-scn | controller-1 | | swact | issued against host controller-0

2019-02-07 23:02:49.183 16509 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:latest download failed from local registry: 500 Server Error: Internal Server Error ("Get http://192.168.204.2:9001/v2/: dial tcp 192.168.204.2:9001: getsockopt: connection refused")

Ghada Khalil (gkhalil) wrote :

Duplicate bug was fixed on 2019-05-07
https://review.opendev.org/657087

Marking as Fix Released

Changed in starlingx:
status: Triaged → Fix Released
Yang Liu (yliu12) wrote :

Test passed on the following load: 2019-06-03_18-34-53.
helm list worked and the reapply completed shortly after the swact.

tags: removed: stx.retestneeded