helm chart upload fails with operation aborted

Bug #1830866 reported by Ralf Graefe
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tee Ngo

Bug Description

Installation details:
------------------
All-In-One duplex installation of build 20190523T013000Z
The following patch was already deployed, because the initial problem was the one described in that patch:
https://review.opendev.org/#/c/660893/2..2/sysinv/sysinv/sysinv/sysinv/helm/helm.py

Brief Description
------------------
Installation of helm charts fails for both the version accompanying build 20190523T013000Z and the latest version.

After running 'system application-upload stx-openstack-1.0-13-centos-stable-versioned.tgz', a subsequent 'system application-list' shows the following:

+---------------------+------------------------------+-------------------------------+---------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+---------------+---------------+------------------------------------------+
| platform-integ-apps | 1.0-5 | platform-integration-manifest | manifest.yaml | upload-failed | operation aborted, check logs for detail |
| stx-openstack | 1.0-13-centos-stable- | armada-manifest | manifest.yaml | upload-failed | operation aborted, check logs for detail |
| | versioned | | | | |
| | | | | | |
+---------------------+------------------------------+-------------------------------+---------------+---------------+------------------------------------------+

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
- follow installation guide for AIO system https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnAIODX
up to section 1.8 Using sysinv to bring up/down the containerized services
- apply patch https://review.opendev.org/#/c/660893/2..2/sysinv/sysinv/sysinv/sysinv/helm/helm.py
- download file stx-openstack-1.0-13-centos-stable-versioned.tgz from http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_docker_image_build/outputs/helm-charts/
- run 'system application-upload stx-openstack-1.0-13-centos-stable-versioned.tgz'

Expected behavior
-----------------
- helm charts are uploaded and successfully deployed

Actual Behavior
---------------
- upload fails
- see error logs above

Reproducibility
--------------
100% reproducible

System configuration
--------------------
Two node AIO duplex

Branch/Pull Time/Commit
-----------------------
20190523T013000Z

Last Pass
---------
never

Timestamp/Logs
--------------
Logfile /var/log/sysinv.log shows:

2019-05-29 08:10:42.018 3145613 ERROR sysinv.helm.helm [req-b863816e-033f-4b4c-a102-e63812361408 admin admin] chart kube-system not supported for system overrides
2019-05-29 08:10:42.018 3145613 TRACE sysinv.helm.helm Traceback (most recent call last):
2019-05-29 08:10:42.018 3145613 TRACE sysinv.helm.helm File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 151, in dispatch
2019-05-29 08:10:42.018 3145613 TRACE sysinv.helm.helm cb_namespace = proxyobj.RPC_API_NAMESPACE
2019-05-29 08:10:42.018 3145613 TRACE sysinv.helm.helm AttributeError: 'ConductorManager' object has no attribute 'RPC_API_NAMESPACE'
2019-05-29 08:10:42.018 3145613 TRACE sysinv.helm.helm
2019-05-29 08:10:42.019 3145613 ERROR sysinv.helm.helm [req-b863816e-033f-4b4c-a102-e63812361408 admin admin] chart kube-system not supported for system overrides
2019-05-29 08:10:42.019 3145613 TRACE sysinv.helm.helm Traceback (most recent call last):
2019-05-29 08:10:42.019 3145613 TRACE sysinv.helm.helm File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 151, in dispatch
2019-05-29 08:10:42.019 3145613 TRACE sysinv.helm.helm cb_namespace = proxyobj.RPC_API_NAMESPACE
2019-05-29 08:10:42.019 3145613 TRACE sysinv.helm.helm AttributeError: 'ConductorManager' object has no attribute 'RPC_API_NAMESPACE'
2019-05-29 08:10:42.019 3145613 TRACE sysinv.helm.helm

...

2019-05-29 08:13:15.801 3145613 ERROR sysinv.conductor.kube_app [-] generate_helm_application_overrides() takes at least 3 arguments (7 given)
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1131, in perform_app_upload
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app self._save_images_list(app)
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 447, in _save_images_list
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app armada_chart_info=app.charts, combined=True)
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 41, in _wrapper
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app return func(self, *args, **kwargs)
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app TypeError: generate_helm_application_overrides() takes at least 3 arguments (7 given)
2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app
2019-05-29 08:13:15.812 3145613 ERROR sysinv.conductor.kube_app [-] Application upload aborted!.

...

2019-05-29 08:15:36.138 3145613 ERROR sysinv.conductor.kube_app [-] generate_helm_application_overrides() takes at least 3 arguments (7 given)
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1131, in perform_app_upload
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app self._save_images_list(app)
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 447, in _save_images_list
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app armada_chart_info=app.charts, combined=True)
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 41, in _wrapper
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app return func(self, *args, **kwargs)
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app TypeError: generate_helm_application_overrides() takes at least 3 arguments (7 given)
2019-05-29 08:15:36.138 3145613 TRACE sysinv.conductor.kube_app
2019-05-29 08:15:36.144 3145613 ERROR sysinv.conductor.kube_app [-] Application upload aborted!.
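The excerpt above repeats the same ERROR several times between TRACE lines. As a rough aid for reading a full /var/log/sysinv.log, the following is a minimal sketch that counts distinct ERROR messages per logger. The line layout (timestamp, pid, level, logger, optional bracketed request context, message) is inferred only from the lines quoted in this report, and the function name summarize_errors is hypothetical, not part of sysinv.

```python
import re
from collections import Counter

# Layout inferred from the quoted log lines (an assumption, not a sysinv spec):
# <date> <time> <pid> <LEVEL> <logger> [<request context>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+)"
    r"(?: \[(?P<ctx>[^\]]*)\])?(?: (?P<msg>.*))?$"
)


def summarize_errors(lines):
    """Count ERROR lines, keyed by (logger, message); TRACE lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[(match.group("logger"), match.group("msg"))] += 1
    return counts


# Sample lines copied from the excerpt in this report.
sample = [
    "2019-05-29 08:13:15.801 3145613 ERROR sysinv.conductor.kube_app [-] "
    "generate_helm_application_overrides() takes at least 3 arguments (7 given)",
    "2019-05-29 08:13:15.801 3145613 TRACE sysinv.conductor.kube_app "
    "Traceback (most recent call last):",
    "2019-05-29 08:13:15.812 3145613 ERROR sysinv.conductor.kube_app [-] "
    "Application upload aborted!.",
    "2019-05-29 08:15:36.138 3145613 ERROR sysinv.conductor.kube_app [-] "
    "generate_helm_application_overrides() takes at least 3 arguments (7 given)",
]

if __name__ == "__main__":
    for (logger, msg), count in summarize_errors(sample).most_common():
        print(count, logger, msg)
```

To run it against a real log, pass an open file object instead of the sample list, for example summarize_errors(open("/var/log/sysinv.log")).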

Ghada Khalil (gkhalil) wrote :

Assigning to Bruce to have someone from Intel help with this issue which is reported from an Intel lab in Europe.

tags: added: stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
tags: added: stx.2.0
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → yong hu (yhu6)
yong hu (yhu6) wrote :

@rgraefe, there seem to be 2 problems:
1. "platform-integ-apps", which is supposed to be uploaded and applied automatically by the system, failed.

If you can upload the whole /var/log/sysinv.log file, that would be more helpful.

2. For "stx-openstack": generate_helm_application_overrides() takes at least 3 arguments (7 given). This was caused by missing parameters when you executed: system application-upload stx-openstack-1.0-13-centos-stable-versioned.tgz
The correct command should be: # system application-upload -n "stx-openstack" -v "1.0-13-centos-stable-versioned" stx-openstack-1.0-13-centos-stable-versioned.tgz
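
The suggested invocation passes the application name and version explicitly with -n and -v. A minimal sketch of deriving those two values from the tarball file name and assembling the argument list is shown here. It assumes the file is named <name>-<version>.tgz and that the version starts at the first hyphen followed by a digit; that naming rule is an assumption based only on the single file name in this report, and both helper names are hypothetical. The sketch only builds the command and does not run it.

```python
import os
import re

# Assumed naming rule: "<name>-<version>.tgz", where the version begins at the
# first hyphen that is followed by a digit. Based on one example file name only.
TARBALL_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d.*)\.tgz$")


def split_app_tarball(path):
    """Return (name, version) parsed from the tarball file name."""
    match = TARBALL_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected tarball name: %s" % path)
    return match.group("name"), match.group("version")


def build_upload_command(path):
    """Build the argument list for the upload command suggested in this comment."""
    name, version = split_app_tarball(path)
    return ["system", "application-upload", "-n", name, "-v", version, path]


if __name__ == "__main__":
    tarball = "stx-openstack-1.0-13-centos-stable-versioned.tgz"
    print(" ".join(build_upload_command(tarball)))
```

The resulting list can be handed to subprocess.call on a controller where the system CLI is available and credentials are sourced.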

-----------------------------------------------------------------
In addition, we used the 0524 build and the deployment worked well (see the attachment).
So please try again with the 0524 build.

Changed in starlingx:
status: Triaged → Incomplete
assignee: yong hu (yhu6) → Ralf Graefe (rgraefe)
Ralf Graefe (rgraefe) wrote :

I updated to build 0524 but still have issues. I set up a duplex system and I am behind a proxy. I suspect some form of proxy issue. I am attaching the sysinv.log and also my default.yml for the original Ansible bootstrap, showing my docker proxy settings. As you will see in the sysinv.log, there is a TLS timeout issue from a local IP, but running the same command from the shell works.

Ralf Graefe (rgraefe) wrote :

Adding Ansible config file as per last comment.

yong hu (yhu6) wrote :

After offline debugging with Ralf, we got the duplex deployment working, with stx-openstack applied and VMs created on OpenStack.

Some analysis captured from the emails we exchanged:

1. Controller-0 encountered some failures on a NIC VF, which triggered a host swact. You can see the errors in /var/log/sm.log on controller-0. Why the issue occurred on controller-0 but never on controller-1 still needs to be figured out. Since the 2 servers supposedly have the same HW spec, they should behave similarly. As an experiment, you could deploy controller-0 onto the server that is currently running controller-1 and check whether the stability issue still appears.

2. When the swact happened, application-apply was in progress. As a result, when the active controller switched to controller-1, application-apply did not continue. This is a known issue tracked by LP (https://bugs.launchpad.net/starlingx/+bug/1829936), which was recently fixed by a patch.

3. Following #2, if application-apply gets stuck somewhere, it cannot be removed by the user. This is tracked by another LP (https://bugs.launchpad.net/starlingx/+bug/1826047).

So in conclusion, we marked this issue as a duplicate of LP1826047, and as a next step Ralf will switch the servers for controller-0 and controller-1.

Changed in starlingx:
status: Incomplete → In Progress
yong hu (yhu6) wrote :

Changed the status to "In Progress"; waiting for the fix for https://bugs.launchpad.net/starlingx/+bug/1826047

Bill Zvonar (billzvonar) wrote :

Sorry, accidentally changed the priority - I changed it back.

Changed in starlingx:
importance: High → Medium
importance: Medium → High
chen haochuan (martin1982) wrote :

https://review.opendev.org/#/c/671552/

This command is now available: "system application-abort <app>"

Ghada Khalil (gkhalil) wrote :

As per Frank Miller:
Duplicate of https://bugs.launchpad.net/starlingx/+bug/1833323
Fixed by: https://review.opendev.org/#/c/671552/
Merged on 2019-07-25

Changed in starlingx:
assignee: Ralf Graefe (rgraefe) → Tee Ngo (teewrs)
status: In Progress → Fix Released