stx-openstack application stuck at applying status

Bug #1827521 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
After "system application-apply stx-openstack", application stuck at "applying" status.

Severity
--------
Critical

Steps to Reproduce
------------------
....
TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Reproducible

System Configuration
--------------------
one- or two-node system

Lab-name: SM-2, WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190503T013000Z

Last Pass
---------
20190502T013000Z

Timestamp/Logs
--------------
2019-05-03 07:07:38 [admin@admin]> RUNNING: system application-apply stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T07:07:13.357067+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | None |
| status | applying |
| updated_at | 2019-05-03T07:07:33.336143+00:00 |
+---------------+----------------------------------+

[wrsroot@controller-0 ~(keystone_admin)]$ date
Fri May 3 13:50:22 UTC 2019
[wrsroot@controller-0 ~(keystone_admin)]$ system application-show stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T07:07:13.357067+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | retrieving docker images |
| status | applying |
| updated_at | 2019-05-03T07:07:44.409510+00:00 |
+---------------+----------------------------------+

Test Activity
-------------
lab setup

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Initial triage indicates this is related to the Ceph upversion; the ceph-mgr process is dying. Marking as stx.2.0 gating since application apply is failing.
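
A quick way to confirm that symptom on a running controller (illustrative; only the process name is taken from this comment):

# check whether ceph-mgr is (still) running
ps aux | grep '[c]eph-mgr'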

Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Daniel Badea (daniel.badea)
tags: added: stx.2.0 stx.distro.other stx.retestneeded
Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

It is also present on a BM 2+2+2 cluster.

[wrsroot@controller-0 ~(keystone_admin)]$ system application-apply stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T11:29:24.278612+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | None |
| status | applying |
| updated_at | 2019-05-03T11:29:41.223227+00:00 |
+---------------+----------------------------------+

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+
| application | version | manifest name | manifest | status | progress |
| | | | file | | |
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+
| stx-openstack | 1.0-11-centos-stable- | armada-manifest | manifest. | apply-failed | operation aborted, |
| | versioned | | yaml | | check logs for |
| | | | | | detail |
| | | | | | |
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

BM 2+2+2 sysinv.log

Revision history for this message
yong hu (yhu6) wrote :

@fernando, "stx-openstack" failed to apply due to a set of docker images missing from the registry:
1. stx-glance:master-centos-stable-20190429T233000Z
2. stx-keystone-api-proxy:master-centos-stable-20190429T233000Z
3. stx-nova-api-proxy:master-centos-stable-20190429T233000Z

All of these images share the version tag 20190429T233000Z, which indicates they were generated very recently (2019-04-29). Please check your local docker registry and make sure these images are present.
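
One hedged way to verify: query the local registry's standard v2 tags API for each image (the registry address and repository path layout are taken from the sysinv.log excerpt below and will differ per lab):

# list the tags the local registry knows for stx-glance
curl -k https://10.10.58.2:9001/v2/docker.io/starlingx/stx-glance/tags/list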

-------------------log from sysinv.log --------------------------------------------------

2019-05-03 11:26:31.413 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-glance:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.440 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 download succeeded in 0 seconds
2019-05-03 11:26:31.447 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-glance:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-glance:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.608 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/openstackhelm/openvswitch:v2.8.1 download succeeded in 0 seconds
2019-05-03 11:26:31.636 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.676 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.728 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.764 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.765 97338 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack (1.0-11-centos-stable-versioned) failed: failed to download one or more image(s).
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1000, in perform_app_apply
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-05-03 11:26:31.7...


Revision history for this message
yong hu (yhu6) wrote :

@ppeng, when you ran "system application-apply stx-openstack", how many controllers were available? Only controller-0, or both controller-0 and controller-1?
From your log, "ALL_NODES_20190503.134444.tar" actually contains only "controller-0_20190503.134444.tgz".

If that was the case, could you try the test again with the minimal setup: controller-0 (available) -> controller-1 (available) -> storage-0 (OSDs added and available)?

It would also be good to check Ceph health with "ceph -s" before running "system application-apply".
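
A minimal sketch of that pre-check (the wait loop itself is illustrative; only "ceph -s" and "system application-apply" are taken from this thread):

# wait until Ceph reports HEALTH_OK before applying the application
until ceph -s | grep -q HEALTH_OK; do
    echo "waiting for Ceph to become healthy..."
    sleep 10
done
system application-apply stx-openstack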

yong hu (yhu6)
Changed in starlingx:
assignee: Daniel Badea (daniel.badea) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@fhernan2, I installed a build from the latest codebase on a 2+2+1 and could successfully apply stx-openstack, AFTER I ensured all docker images were present in my local registry.

See the attached screenshot.

In addition, the images missing from your local registry can be pulled with the corresponding tags from the public docker hub:

1. stx-glance:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-glance/tags

2. stx-keystone-api-proxy:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-keystone-api-proxy/tags

3. stx-nova-api-proxy:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-nova-api-proxy/tags
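
A rough sketch of staging those images (the registry address 10.10.58.2:9001 and the path layout are taken from the sysinv.log excerpt above and will differ per lab):

# pull the missing images from docker hub, retag them for the local
# registry, and push them so sysinv can find them
TAG=master-centos-stable-20190429T233000Z
REGISTRY=10.10.58.2:9001
for IMG in stx-glance stx-keystone-api-proxy stx-nova-api-proxy; do
    docker pull docker.io/starlingx/${IMG}:${TAG}
    docker tag docker.io/starlingx/${IMG}:${TAG} ${REGISTRY}/docker.io/starlingx/${IMG}:${TAG}
    docker push ${REGISTRY}/docker.io/starlingx/${IMG}:${TAG}
done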

Revision history for this message
yong hu (yhu6) wrote :

Another 2+2+2 deployment was done successfully on the CENGN 0504 build, with the stx-openstack application applied. See the attachment.

Revision history for this message
yong hu (yhu6) wrote :

@ppeng, I deployed an AIO simplex (built from the 0505 codebase) and the stx-openstack application applied successfully. Please see the attachment.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced on the 20190505T233000Z load. Logs from both hosts are collected and attached.

[wrsroot@controller-0 ~(keystone_admin)]$ system application-show stx-openstack
+---------------+--------------------------------------------------------------------+
| Property | Value |
+---------------+--------------------------------------------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-06T07:49:13.048191+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | processing chart: osh-openstack-mariadb, overall completion: 21.0% |
| status | applying |
| updated_at | 2019-05-06T08:06:53.780579+00:00 |
+---------------+--------------------------------------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$ collect all
[sudo] password for wrsroot:
collecting data from 2 host(s): controller-0 controller-1
collecting controller-0_20190506.134534 ... done (00:02:00 35M)
collecting controller-1_20190506.134534 ... done (00:01:19 30M)
creating all-nodes tarball /scratch/ALL_NODES_20190506.134534.tar ... done (00:03:19 64M)
[wrsroot@controller-0 ~(keystone_admin)]$ date
Mon May 6 13:49:09 UTC 2019
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

After our docker registry was updated with the missing docker images, the BM 2+2+2 was installed successfully and "system application-apply stx-openstack" succeeded.

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 6 | storage-1 | storage | unlocked | enabled | available |
| 7 | storage-0 | storage | unlocked | enabled | available |
| 12 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 85261611-1245-4b21-bd0b-cc9afdc26cff
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum controller-0,controller-1,storage-0
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 2 up, 2 in
    rgw: 1 daemon active

  data:
    pools: 9 pools, 856 pgs
    objects: 1.59 k objects, 825 MiB
    usage: 1.8 GiB used, 890 GiB / 892 GiB avail
    pgs: 856 active+clean

  io:
    client: 360 KiB/s wr, 0 op/s rd, 68 op/s wr

Revision history for this message
yong hu (yhu6) wrote :

@ppeng, looking into the log, the application-apply is stuck here:

2149 2019-05-06 08:06:53.772 104648 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-mariadb, overall completion: 21.0%

One possible cause is that something went wrong with the Ceph OSD pool "kube-rbd".
Please provide some info from the following commands (a combined capture sketch follows the list):
1. ceph -s
2. ceph osd tree
3. ceph osd pool ls
4. ps aux | grep ceph
5. kubectl get secrets --all-namespaces | grep ceph
6. kubectl get jobs.batch --all-namespaces | grep ceph
7. kubectl get serviceaccounts --all-namespaces | grep ceph
8. kubectl get pods --all-namespaces | grep ceph
9. history > my_operations.log
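
One way to capture all of the above in a single file (illustrative only; run it in the interactive shell so "history" has content):

{
    ceph -s
    ceph osd tree
    ceph osd pool ls
    ps aux | grep '[c]eph'
    kubectl get secrets --all-namespaces | grep ceph
    kubectl get jobs.batch --all-namespaces | grep ceph
    kubectl get serviceaccounts --all-namespaces | grep ceph
    kubectl get pods --all-namespaces | grep ceph
    history
} > my_operations.log 2>&1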

In addition, with the same CENGN image, we also deployed a duplex with 2 OSDs (on controller-0 and controller-1 respectively), and application-apply worked well. You can see the screenshot in the attachment.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

This is likely an issue caused by the Ceph REST API not being available when stx-openstack is applied:

1. at 2019-05-03 07:07:38 system application-apply stx-openstack

2. but sm-customer.log reports mgr-restful-plugin started at 2019-05-03 07:13:47

  | 2019-05-03T07:13:47.582 | 299 | service-scn | mgr-restful-plugin | enabling-failed | enabled-active | enable success

3. and also mgr-restful-plugin.log reports response ok from ceph-mgr (REST API) at 2019-05-03 07:14:04

  2019-05-03 07:14:04,634 199978 INFO mgr-restful-plugin get url=https://controller-0:5001/, response=<Response [200]>

Chances are the mariadb chart failed because mgr-restful-plugin/ceph-mgr were not up and running, but there are no logs to confirm that (container logs were discarded before the log collect).
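
A quick way to confirm the REST API is actually up before (re)applying is to probe the same endpoint mgr-restful-plugin polls (URL taken from the log line above; the check itself is illustrative):

# expect 200 once ceph-mgr's restful plugin is serving requests
curl -k -s -o /dev/null -w '%{http_code}\n' https://controller-0:5001/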

Cindy Xie (xxie1)
tags: added: stx.storage
Revision history for this message
Daniel Badea (daniel.badea) wrote :

ceph-mgr and the restful plugin are restarted by mgr-restful-plugin because of a missing liveness request:

2019-05-03 07:12:00.465 7fdf1ae49700 0 log_channel(audit) log [DBG] : from='mon.0 192.168.204.3:0/84215' entity='mon.controller-0' cmd=[{"prefix": "osd df", "output_method": "tree", "format": "json"}]: dispatch
2019-05-03 07:12:00.465 7fdf1ae49700 1 mgr.server reply handle_command (0) Success
2019-05-03 07:12:00.542 7fdf1b64a700 1 mgr[restful] Unknown request '140596180385936:0'
2019-05-03 07:12:01.534 7fdf2a784700 1 mgr send_beacon active

then

2019-05-03 07:12:26.038 7fdf2e78c700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 119438) UID: 0
2019-05-03 07:12:26.038 7fdf2e78c700 -1 mgr handle_signal *** Got signal Terminated ***
2019-05-03 07:12:35.426 7f5ecbe17380 0 ceph version 13.2.2 (67ecc2fbbca4f602f1172fa8e561b813eb564df5) mimic (stable), process ceph-mgr, pid 190485

This is caused by a synchronization issue in ceph-mgr. See https://github.com/starlingx-staging/stx-ceph/pull/29

Changed in starlingx:
assignee: yong hu (yhu6) → Daniel Badea (daniel.badea)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → In Progress
Revision history for this message
Joseph Richard (josephrichard) wrote :

I've seen this a few times as well with a pull from yesterday.

The reason the application gets stuck in the applying state is that sm restarts sysinv (due to the mgr-restful-plugin failure that Daniel is looking at), which stops armada from running, and nothing cleans up the application state.
You can see this in sm logs:
2019-05-03T07:13:43.000 controller-0 sm: debug time[892.099] log<632> INFO: sm[88506]: sm_service_fsm.c(1032): Service (sysinv-conductor) received event (disable) was in the enabled-active state and is now in the disabling state.

The only way I could recover was to manually reset the application state in the psql database:
sudo -u postgres psql -d sysinv -c"update kube_app set status='applied' where name='stx-openstack';"
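
A plausible follow-up after that manual reset, using only commands already shown in this thread (illustrative; the reset is a workaround, not a fix):

# confirm the recorded state, then retry the apply
system application-list
system application-apply stx-openstack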

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Bob Church and Daniel Badea have identified an upstream fix that landed in Ceph v13.2.3 (StarlingX carries Ceph v13.2.2):

https://tracker.ceph.com/issues/37271
https://github.com/ceph/ceph/pull/25113/commits/6d7f6209b76f2d38ff832e556937510228d76bda

This fix was cherry-picked and merged into StarlingX via pull request:
https://github.com/starlingx-staging/stx-ceph/pull/30

The fix should be available in the May 15 loads.

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.sanity
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Using CENGN image 20190515T220331Z: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190515T220331Z/outputs/iso/bootimage.iso

During sanity execution, volumes still cannot be created (behavior described in https://bugs.launchpad.net/starlingx/+bug/1828059, which seems to be a duplicate of this one); the error looks like this:

Invalid volume: Invalid input received: Invalid volume: Volume attachments can not be created if the volume is in an error state. The Volume ea728fbf-dd46-4ae8-9ca7-48dd8fd7b421 currently has a status of: error (HTTP 400) (Request-ID: req-757d37f1-8eb3-4e3c-9695-1895959aecfa) (HTTP 400) (Request-ID: req-3b404fca-d7b0-4241-857b-9acf950be7c9)

I attached a collect from a 2+2+2; however, the issue is also present on Simplex, Duplex, and 2+2.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please add the above note to https://bugs.launchpad.net/starlingx/+bug/1828059, as that is the bug that reports the volume issue you are describing now. This bug reports an entirely different symptom, even if the root cause of both issues is the same.

Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue in recent loads.
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

Lab: WP_1_2
Load: 2019-05-18_06-36-50

tags: removed: stx.retestneeded