stx-openstack application apply failed with latest ISO (2019-Apr-28)

Bug #1826912 reported by Gopinath
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: yong hu

Bug Description

Brief Description
-----------------
When applying the stx-openstack application with the command
(system application-apply stx-openstack), using the latest ISO and helm charts,
the progress hangs at 70% and the apply fails.
Severity
--------
Critical: system/feature is not usable after the defect.

Steps to Reproduce
------------------
Reference link:
  https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard#Verify_the_cluster_endpoints

  1. Configure and set up all nodes
  2. Provision all nodes
  3. Add Ceph OSDs to the controllers
  4. Use sysinv to bring up/down the containerized services
     a. To bring up the services, the following commands were used:
     >> system application-apply stx-openstack
     >> system application-list
 Error: operation aborted, check logs for detail
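
For reference, a minimal way to monitor the apply and pull the failure detail once it reaches apply-failed (a sketch only, assuming the standard watch utility is available on the controller and that the Armada log sits at the in-container path shown under Timestamp/Logs below):

  $ watch -n 30 system application-list
  # once the status reads apply-failed, dump the tail of the Armada apply log
  $ sudo docker exec armada_service tail -n 100 stx-openstack-apply.log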

Expected Behavior
------------------
The application should reach 100% and end up in the applied state.
Actual Behavior
----------------
The application hung at 70% and ended up in the apply-failed state.
+---------------+-----------------+---------------+--------------+------------------------------------------+
| application   | manifest name   | manifest file | status       | progress                                 |
+---------------+-----------------+---------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+---------------+--------------+------------------------------------------+

Reproducibility
---------------
<Reproducible/Intermittent>

System Configuration
--------------------
Multi-node system, dedicated storage

Branch/Pull Time/Commit
-----------------------
bootimage.iso 2019-Apr-28 00:19:54 1.9G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/iso/
HelmChart:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/helm-charts/
helm-charts-manifest-centos-stable-latest.tgz

Last Pass
---------
Yes, it passed with an older ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190407T233001Z/outputs/
using the helm chart from the same link.

Timestamp/Logs
--------------
Attached logs: pods_status, sysinv, syslog and dmesg.
Docker logs:
[wrsroot@controller-0 ~(keystone_admin)]$ sudo docker exec armada_service tail -f stx-openstack-apply.log
Password:
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-04-28 21:48:36.600 14 ERROR armada.cli raise self._exception
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2019-04-28 21:48:36.600 14 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 252, in handle
2019-04-28 21:48:36.600 14 ERROR armada.cli return armada.sync()
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 249, in sync
2019-04-28 21:48:36.600 14 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2019-04-28 21:48:36.600 14 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openvswitch', 'libvirt', 'neutron', 'nova']
2019-04-28 21:48:36.600 14 ERROR armada.cli
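
The ChartDeployException above names the charts that failed (openvswitch, libvirt, neutron, nova). A quick way to see which of their pods are unhealthy is something like the following sketch; the openstack namespace and the pod-name patterns are assumptions based on a default stx-openstack deployment:

  $ kubectl -n openstack get pods | grep -E 'openvswitch|libvirt|neutron|nova'
  # describe any pod that is not Running/Completed to see its recent events
  $ kubectl -n openstack describe pod <pod-name>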

Revision history for this message
Gopinath (gprabakx) wrote :
description: updated
Revision history for this message
Elio Martinez (elio1979) wrote :

As a matter of fact, if we try to execute application-apply again after the failure, it gets stuck at 26%, showing the following message in sysinv.log:

 processing chart: osh-openstack-garbd, overall completion: 26.0%
2019-04-29 12:13:48.466 7345 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-04-29-12-13-48 patch
2019-04-29 12:13:48.467 7345 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:49.099 7344 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost_patch_start_2019-04-29-12-13-49 patch
2019-04-29 12:13:49.102 7344 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:49.201 7344 INFO sysinv.api.controllers.v1.host [-] compute-0 ihost_patch_start_2019-04-29-12-13-49 patch
2019-04-29 12:13:49.202 7344 INFO sysinv.api.controllers.v1.host [-] compute-0 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:58.117 7344 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost_patch_start_2019-04-29-12-13-58 patch
2019-04-29 12:13:58.118 7344 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost_patch_end. No changes from mtce/1.0.
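
One hedged way to dig into why the osh-openstack-garbd step stalls is to look at that chart's pods and the recent events directly; the openstack namespace and the garbd pod-name pattern are assumptions:

  $ kubectl -n openstack get pods | grep -i garbd
  # recent events often reveal scheduling or image-pull problems
  $ kubectl -n openstack get events --sort-by='.lastTimestamp' | tail -n 20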

Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as stx.2.0 release gating as the standard configuration is one of the main configurations supported for StarlingX.

This issue appears to be the same issue as https://bugs.launchpad.net/starlingx/+bug/1826445 that Mingyuan is currently investigating. Assigning to Cindy to determine if this is indeed a duplicate and if so to update the LP to be a duplicate of 1826445.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.2.0 stx.containers
Revision history for this message
Cindy Xie (xxie1) wrote :

We also have a similar bug, LP#1827521, that Yong is investigating. Assigning to Yong.

Changed in starlingx:
assignee: Cindy Xie (xxie1) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

This issue is different from LP#1827521, which was stuck in the middle of applying the mariadb helm chart. In this issue, the fatal error was:
2019-04-28 11:35:10.791 71 ERROR armada.cli [-] Caught internal exception: armada.exceptions.tiller_exceptions.TillerPodNotRunningException: No Tiller pods found in running state

Will fetch the 0428 Cengn image to attempt a reproduce.

@gprabakx, a few more questions:
 "With Latest ISO and helm charts." I suppose you meant the helm-chart tgz from the same build, didn't you?
"The status hung at 70% and it fails." I didn't see 70% progress in sysinv.log. What did you mean here?

Revision history for this message
yong hu (yhu6) wrote :

@gprabakx, before running "application-upload", please check the status of the "tiller" pod, deployment and service with:
$ kubectl get pods --all-namespaces | grep tiller
$ kubectl get services --all-namespaces | grep tiller
$ kubectl get deployments.apps --all-namespaces | grep tiller

The following outputs are from my environment:

controller-0:~$ kubectl get pods --all-namespaces | grep tiller
kube-system tiller-deploy-5b859c7dd8-n5t5m 1/1 Running 0 78m

controller-0:~$ kubectl get services --all-namespaces | grep tiller
kube-system tiller-deploy ClusterIP 10.105.228.145 <none> 44134/TCP 84m

controller-0:~$ kubectl get deployments.apps --all-namespaces | grep tiller
kube-system tiller-deploy 1/1 1 1 83m
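
A small sketch that automates the same check before kicking off the apply; it assumes the tiller deployment lives in the kube-system namespace, as in the outputs above:

  # block until the tiller pod reports Running, then start the apply
  $ until kubectl get pods -n kube-system | grep tiller | grep -q Running; do sleep 10; done
  $ system application-apply stx-openstack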

Revision history for this message
yong hu (yhu6) wrote :

The issue couldn't be reproduced with the Cengn 0428 build on virtual machines (not bare metal hosts); please see the details in the attachment.
So we need more info for further debugging.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Elio Martinez (elio1979) wrote :

Yong, could you provide your VM specs?

Revision history for this message
yong hu (yhu6) wrote :

@elio1979 please see VM specs for 2+2+2 nodes in the attachment.

Revision history for this message
yong hu (yhu6) wrote :

@elio1979 and @gprabakx, can you confirm that, with larger RAM sizes on the VMs, the application-apply works well?

If yes, we can close this issue.

Revision history for this message
Cindy Xie (xxie1) wrote :

Please check if this bug is a duplicate of LP: https://bugs.launchpad.net/starlingx/+bug/1826445

Revision history for this message
yong hu (yhu6) wrote :

The symptom is not the same.
In this failure, no Tiller pods were running at all. As we know, Tiller is the "daemon" of Helm and actually does the work of installing the "application" helm charts, so without Tiller pods running, none of the helm charts packed in stx-openstack could be installed.

In 1826445, stx-openstack was in the middle of applying (up to 26%).

So we don't have evidence that they are duplicates.

Since neither Martin nor I could see this issue with our VM configs (controller: 6 cores + 30GB RAM),
I would suggest Elio or Gopinath test it once more with an increased RAM size.

Revision history for this message
Cindy Xie (xxie1) wrote :

Please retest it under the minimal config documented on the wiki:

Controller:
4 CPUs
16GB Mem

Compute:
3 CPUs
10GB Mem
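
To confirm each node actually received the intended sizing before re-testing, a quick check with standard Linux tools (nothing StarlingX-specific) can be run on every node:

  $ nproc    # CPU count: expect 4 on controllers, 3 on computes
  $ free -g  # total memory in GiB: expect ~16 on controllers, ~10 on computes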

Revision history for this message
chen haochuan (martin1982) wrote :

Application apply fails with this configuration:
+---------------+-----------------+---------------+--------------+------------------------------------------+
| application   | manifest name   | manifest file | status       | progress                                 |
+---------------+-----------------+---------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+---------------+--------------+------------------------------------------+

with 2+2+2:
controller: 4 CPUs + 16G memory
storage and compute: 3 CPUs + 10G memory

Revision history for this message
Cindy Xie (xxie1) wrote :

@Martin, did you reproduce the issue using 20190428T013000Z, or are you using the latest Cengn ISO? Please specify the build#.

Revision history for this message
chen haochuan (martin1982) wrote :

I used the 0428 daily build.

Revision history for this message
Cindy Xie (xxie1) wrote :

@Martin, I think this LP may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827952; please discuss with Shuicheng.

Revision history for this message
chen haochuan (martin1982) wrote :

Today I checked again and the apply succeeded, with the configuration from the wiki:

controller: 4 CPUs + 16G
compute: 3 CPUs + 10G
storage: 3 CPUs + 10G

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list

+---------------+-----------------+---------------+---------+-----------+
| application   | manifest name   | manifest file | status  | progress  |
+---------------+-----------------+---------------+---------+-----------+
| stx-openstack | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

LP#1827952 is just about 2 benign error messages from stevedore. These 2 error messages will not cause any issue, so it is not the same issue as this one.

Revision history for this message
yong hu (yhu6) wrote :

@Elio, please update whether you can still see this issue or not.

On our side, with these minimal VM configs, the system worked as expected:
controller: 4 CPUs + 16G
compute: 3 CPUs + 10G
storage: 3 CPUs + 10G

Changed in starlingx:
assignee: yong hu (yhu6) → nobody
assignee: nobody → Elio Martinez (elio1979)
Revision history for this message
yong hu (yhu6) wrote :

It was tried with several minimal configs in a virtual environment and the issue was not reproduced for weeks, so closing this.

Changed in starlingx:
assignee: Elio Martinez (elio1979) → yong hu (yhu6)
status: Incomplete → Fix Released