stx-openstack application apply failed with latest ISO (2019-Apr-28)

Bug #1826912 reported by Gopinath
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: yong hu

Bug Description

Brief Description
-----------------
When applying the stx-openstack application with the command
(system application-apply stx-openstack), using the latest ISO and helm charts,
the progress hangs at 70% and the apply fails.
Severity
--------
Critical: system/feature is not usable after the defect.

Steps to Reproduce
------------------
Reference link:
  https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard#Verify_the_cluster_endpoints

  1. Configure and set up all nodes
  2. Provision all nodes
  3. Add Ceph OSDs to the controllers
  4. Use sysinv to bring up/down the containerized services
     a. To bring up the services, the following commands were used:
     >> system application-apply stx-openstack
     >> system application-list
 Error: operation aborted, check logs for detail
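
For reference, a minimal way to monitor the apply and pull the failure detail once it reaches apply-failed (a sketch only, assuming the standard watch utility is available on the controller and that the Armada log sits at the in-container path shown under Timestamp/Logs below):

  $ watch -n 30 system application-list
  # once the status reads apply-failed, dump the tail of the Armada apply log
  $ sudo docker exec armada_service tail -n 100 stx-openstack-apply.log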

Expected Behavior
------------------
The application should reach 100% and end up in the applied state.
Actual Behavior
----------------
The application hung at 70% and ended up in the apply-failed state.
+---------------+-----------------+---------------+--------------+------------------------------------------+
| application   | manifest name   | manifest file | status       | progress                                 |
+---------------+-----------------+---------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+---------------+--------------+------------------------------------------+

Reproducibility
---------------
<Reproducible/Intermittent>

System Configuration
--------------------
Multi-node system, dedicated storage

Branch/Pull Time/Commit
-----------------------
bootimage.iso 2019-Apr-28 00:19:54 1.9G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/iso/
HelmChart:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/helm-charts/
helm-charts-manifest-centos-stable-latest.tgz

Last Pass
---------
Yes, it passed with an older ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190407T233001Z/outputs/
using the helm chart from the same link.

Timestamp/Logs
--------------
Attached logs: pods_status, sysinv, syslog and dmesg.
Docker logs:
[wrsroot@controller-0 ~(keystone_admin)]$ sudo docker exec armada_service tail -f stx-openstack-apply.log
Password:
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-04-28 21:48:36.600 14 ERROR armada.cli raise self._exception
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2019-04-28 21:48:36.600 14 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 252, in handle
2019-04-28 21:48:36.600 14 ERROR armada.cli return armada.sync()
2019-04-28 21:48:36.600 14 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 249, in sync
2019-04-28 21:48:36.600 14 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2019-04-28 21:48:36.600 14 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openvswitch', 'libvirt', 'neutron', 'nova']
2019-04-28 21:48:36.600 14 ERROR armada.cli
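
The ChartDeployException above names the charts that failed (openvswitch, libvirt, neutron, nova). A quick way to see which of their pods are unhealthy is something like the following sketch; the openstack namespace and the pod-name patterns are assumptions based on a default stx-openstack deployment:

  $ kubectl -n openstack get pods | grep -E 'openvswitch|libvirt|neutron|nova'
  # describe any pod that is not Running/Completed to see its recent events
  $ kubectl -n openstack describe pod <pod-name>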

Revision history for this message
Gopinath (gprabakx) wrote :
description: updated
Revision history for this message
Elio Martinez (elio1979) wrote :

As a matter of fact, if we try to execute application-apply again after the failure, it gets stuck at 26%, showing the following message in sysinv.log:

 processing chart: osh-openstack-garbd, overall completion: 26.0%
2019-04-29 12:13:48.466 7345 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-04-29-12-13-48 patch
2019-04-29 12:13:48.467 7345 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:49.099 7344 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost_patch_start_2019-04-29-12-13-49 patch
2019-04-29 12:13:49.102 7344 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:49.201 7344 INFO sysinv.api.controllers.v1.host [-] compute-0 ihost_patch_start_2019-04-29-12-13-49 patch
2019-04-29 12:13:49.202 7344 INFO sysinv.api.controllers.v1.host [-] compute-0 ihost_patch_end. No changes from mtce/1.0.
2019-04-29 12:13:58.117 7344 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost_patch_start_2019-04-29-12-13-58 patch
2019-04-29 12:13:58.118 7344 INFO sysinv.api.controllers.v1.host [-] controller-1 ihost_patch_end. No changes from mtce/1.0.
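
One hedged way to dig into why the osh-openstack-garbd step stalls is to look at that chart's pods and the recent events directly; the openstack namespace and the garbd pod-name pattern are assumptions:

  $ kubectl -n openstack get pods | grep -i garbd
  # recent events often reveal scheduling or image-pull problems
  $ kubectl -n openstack get events --sort-by='.lastTimestamp' | tail -n 20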

Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as stx.2.0 release gating as the standard configuration is one of the main configurations supported for StarlingX.

This issue appears to be the same issue as https://bugs.launchpad.net/starlingx/+bug/1826445 that Mingyuan is currently investigating. Assigning to Cindy to determine if this is indeed a duplicate and if so to update the LP to be a duplicate of 1826445.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.2.0 stx.containers
Revision history for this message
Cindy Xie (xxie1) wrote :

We also have a similar bug, LP#1827521, that Yong is investigating. Assigning to Yong.

Changed in starlingx:
assignee: Cindy Xie (xxie1) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

This issue is different from LP#1827521, which was stuck in the middle of applying the mariadb helm chart. In this issue, the fatal error was:
2019-04-28 11:35:10.791 71 ERROR armada.cli [-] Caught internal exception: armada.exceptions.tiller_exceptions.TillerPodNotRunningException: No Tiller pods found in running state

Will fetch the 0428 Cengn image to attempt a reproduce.

@gprabakx, a few more questions:
 "With Latest ISO and helm charts." I suppose you meant the helm-chart tgz from the same build, didn't you?
"The status hung at 70% and it fails." I didn't see 70% progress in sysinv.log. What did you mean here?

Revision history for this message
yong hu (yhu6) wrote :

@gprabakx, before running "application-upload", please check the status of the "tiller" pod, deployment and service with:
$ kubectl get pods --all-namespaces | grep tiller
$ kubectl get services --all-namespaces | grep tiller
$ kubectl get deployments.apps --all-namespaces | grep tiller

The following outputs are from my environment:

controller-0:~$ kubectl get pods --all-namespaces | grep tiller
kube-system tiller-deploy-5b859c7dd8-n5t5m 1/1 Running 0 78m

controller-0:~$ kubectl get services --all-namespaces | grep tiller
kube-system tiller-deploy ClusterIP 10.105.228.145 <none> 44134/TCP 84m

controller-0:~$ kubectl get deployments.apps --all-namespaces | grep tiller
kube-system tiller-deploy 1/1 1 1 83m
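
A small sketch that automates the same check before kicking off the apply; it assumes the tiller deployment lives in the kube-system namespace, as in the outputs above:

  # block until the tiller pod reports Running, then start the apply
  $ until kubectl get pods -n kube-system | grep tiller | grep -q Running; do sleep 10; done
  $ system application-apply stx-openstack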

Revision history for this message
yong hu (yhu6) wrote :

The issue couldn't be reproduced with the Cengn 0428 build on virtual machines (not bare metal hosts); please see the details in the attachment.
So we need more info for further debugging.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Elio Martinez (elio1979) wrote :

Yong, could you provide your VM specs?

Revision history for this message
yong hu (yhu6) wrote :

@elio1979 please see VM specs for 2+2+2 nodes in the attachment.

Revision history for this message
yong hu (yhu6) wrote :

@elio1979 and @gprabakx, can you confirm that, with larger RAM sizes on the VMs, the application-apply works well?

If yes, we can close this issue.

Revision history for this message
Cindy Xie (xxie1) wrote :

Please check if this bug is a duplicate of LP: https://bugs.launchpad.net/starlingx/+bug/1826445

Revision history for this message
yong hu (yhu6) wrote :

The symptom is not the same.
In this failure, no Tiller pods were running at all. As we know, Tiller is the "daemon" of Helm and actually does the work of installing the "application" helm charts, so without Tiller pods running, none of the helm charts packed in stx-openstack could be installed.

In 1826445, stx-openstack was in the middle of applying (up to 26%).

So we don't have evidence that they are duplicates.

Since neither Martin nor I could see this issue with our VM configs (controller: 6 cores + 30GB RAM),
I would suggest Elio or Gopinath test it once more with an increased RAM size.

Revision history for this message
Cindy Xie (xxie1) wrote :

Please retest it under the minimal config documented on the wiki:

Controller:
4 CPUs
16GB Mem

Compute:
3 CPUs
10GB Mem
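
To confirm each node actually received the intended sizing before re-testing, a quick check with standard Linux tools (nothing StarlingX-specific) can be run on every node:

  $ nproc    # CPU count: expect 4 on controllers, 3 on computes
  $ free -g  # total memory in GiB: expect ~16 on controllers, ~10 on computes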

Revision history for this message
chen haochuan (martin1982) wrote :

Application apply fails with this configuration:
+---------------+-----------------+---------------+--------------+------------------------------------------+
| application   | manifest name   | manifest file | status       | progress                                 |
+---------------+-----------------+---------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+---------------+--------------+------------------------------------------+

with 2+2+2:
controller: 4 CPUs + 16G memory
storage and compute: 3 CPUs + 10G memory

Revision history for this message
Cindy Xie (xxie1) wrote :

@Martin, did you reproduce the issue using 20190428T013000Z, or are you using the latest Cengn ISO? Please specify the build#.

Revision history for this message
chen haochuan (martin1982) wrote :

I used the 0428 daily build.

Revision history for this message
Cindy Xie (xxie1) wrote :

@Martin, I think this LP may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827952; please discuss with Shuicheng.

Revision history for this message
chen haochuan (martin1982) wrote :

Today I checked again and the apply succeeded, with the configuration from the wiki:

controller: 4 CPUs + 16G
compute: 3 CPUs + 10G
storage: 3 CPUs + 10G

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list

+---------------+-----------------+---------------+---------+-----------+
| application   | manifest name   | manifest file | status  | progress  |
+---------------+-----------------+---------------+---------+-----------+
| stx-openstack | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

LP#1827952 is just about 2 benign error messages from stevedore. These 2 error messages will not cause any issue, so it is not the same issue as this one.

Revision history for this message
yong hu (yhu6) wrote :

@Elio, please update whether you can still see this issue or not.

On our side, with these minimal VM configs, the system worked as expected:
controller: 4 CPUs + 16G
compute: 3 CPUs + 10G
storage: 3 CPUs + 10G

Changed in starlingx:
assignee: yong hu (yhu6) → nobody
assignee: nobody → Elio Martinez (elio1979)
Revision history for this message
yong hu (yhu6) wrote :

It was tried with several minimal configs in a virtual environment and the issue was not reproduced for weeks, so closing this.

Changed in starlingx:
assignee: Elio Martinez (elio1979) → yong hu (yhu6)
status: Incomplete → Fix Released