stx-openstack application stuck at applying status

Bug #1827521 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
After "system application-apply stx-openstack", application stuck at "applying" status.

Severity
--------
Critical

Steps to Reproduce
------------------
....
TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Reproducible

System Configuration
--------------------
one- or two-node system

Lab-name: SM-2, WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190503T013000Z

Last Pass
---------
20190502T013000Z

Timestamp/Logs
--------------
2019-05-03 07:07:38 [admin@admin]> RUNNING: system application-apply stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T07:07:13.357067+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | None |
| status | applying |
| updated_at | 2019-05-03T07:07:33.336143+00:00 |
+---------------+----------------------------------+

[wrsroot@controller-0 ~(keystone_admin)]$ date
Fri May 3 13:50:22 UTC 2019
[wrsroot@controller-0 ~(keystone_admin)]$ system application-show stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T07:07:13.357067+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | retrieving docker images |
| status | applying |
| updated_at | 2019-05-03T07:07:44.409510+00:00 |
+---------------+----------------------------------+

Test Activity
-------------
lab setup

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Initial triage indicates this is related to the Ceph upversion; the ceph-mgr process is dying. Marking as stx.2.0 gating since application apply is failing.
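
A quick way to confirm that symptom on a running controller (illustrative; only the process name is taken from this comment):

# check whether ceph-mgr is (still) running
ps aux | grep '[c]eph-mgr'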

Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Daniel Badea (daniel.badea)
tags: added: stx.2.0 stx.distro.other stx.retestneeded
Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

It is also present on a BM 2+2+2 cluster.

[wrsroot@controller-0 ~(keystone_admin)]$ system application-apply stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-03T11:29:24.278612+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | None |
| status | applying |
| updated_at | 2019-05-03T11:29:41.223227+00:00 |
+---------------+----------------------------------+

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+
| application | version | manifest name | manifest | status | progress |
| | | | file | | |
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+
| stx-openstack | 1.0-11-centos-stable- | armada-manifest | manifest. | apply-failed | operation aborted, |
| | versioned | | yaml | | check logs for |
| | | | | | detail |
| | | | | | |
+---------------+------------------------------+-----------------+-----------+--------------+--------------------+

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

BM 2+2+2 sysinv.log

Revision history for this message
yong hu (yhu6) wrote :

@fernando, "stx-openstack" failed to apply due to a set of docker images missing from the registry:
1. stx-glance:master-centos-stable-20190429T233000Z
2. stx-keystone-api-proxy:master-centos-stable-20190429T233000Z
3. stx-nova-api-proxy:master-centos-stable-20190429T233000Z

All of these images share the version tag 20190429T233000Z, which indicates they were generated very recently (2019-04-29). Please check your local docker registry and make sure these images are present.
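
One hedged way to verify: query the local registry's standard v2 tags API for each image (the registry address and repository path layout are taken from the sysinv.log excerpt below and will differ per lab):

# list the tags the local registry knows for stx-glance
curl -k https://10.10.58.2:9001/v2/docker.io/starlingx/stx-glance/tags/list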

-------------------log from sysinv.log --------------------------------------------------

2019-05-03 11:26:31.413 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-glance:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.440 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 download succeeded in 0 seconds
2019-05-03 11:26:31.447 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-glance:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-glance:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.608 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/openstackhelm/openvswitch:v2.8.1 download succeeded in 0 seconds
2019-05-03 11:26:31.636 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.676 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-keystone-api-proxy:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.728 97338 INFO sysinv.conductor.kube_app [-] Image 10.10.58.2:9001/docker.io/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z is not available in local registry, download started from public/private registry
2019-05-03 11:26:31.764 97338 ERROR sysinv.conductor.kube_app [-] Image 192.168.100.60/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z download failed from public/privateregistry: 404 Client Error: Not Found ("manifest for 192.168.100.60/starlingx/stx-nova-api-proxy:master-centos-stable-20190429T233000Z not found")
2019-05-03 11:26:31.765 97338 ERROR sysinv.conductor.kube_app [-] Deployment of application stx-openstack (1.0-11-centos-stable-versioned) failed: failed to download one or more image(s).
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1000, in perform_app_apply
2019-05-03 11:26:31.765 97338 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-05-03 11:26:31.7...


Revision history for this message
yong hu (yhu6) wrote :

@ppeng, when you ran "system application-apply stx-openstack", how many controllers were available? Only controller-0, or both controller-0 and controller-1?
From your log, "ALL_NODES_20190503.134444.tar" actually contains only "controller-0_20190503.134444.tgz".

If that was the case, could you try the test again with the minimal setup: controller-0 (available) -> controller-1 (available) -> storage-0 (OSDs added and available)?

It would also be good to check Ceph health with "ceph -s" before running "system application-apply".
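
A minimal sketch of that pre-check (the wait loop itself is illustrative; only "ceph -s" and "system application-apply" are taken from this thread):

# wait until Ceph reports HEALTH_OK before applying the application
until ceph -s | grep -q HEALTH_OK; do
    echo "waiting for Ceph to become healthy..."
    sleep 10
done
system application-apply stx-openstack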

yong hu (yhu6)
Changed in starlingx:
assignee: Daniel Badea (daniel.badea) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@fhernan2, I installed a build from the latest codebase on a 2+2+1 and could successfully apply stx-openstack, AFTER I ensured all docker images were present in my local registry.

See the attached screenshot.

In addition, the images missing from your local registry can be pulled with the corresponding tags from the public docker hub:

1. stx-glance:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-glance/tags

2. stx-keystone-api-proxy:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-keystone-api-proxy/tags

3. stx-nova-api-proxy:master-centos-stable-20190429T233000Z
https://hub.docker.com/r/starlingx/stx-nova-api-proxy/tags
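
A rough sketch of staging those images (the registry address 10.10.58.2:9001 and the path layout are taken from the sysinv.log excerpt above and will differ per lab):

# pull the missing images from docker hub, retag them for the local
# registry, and push them so sysinv can find them
TAG=master-centos-stable-20190429T233000Z
REGISTRY=10.10.58.2:9001
for IMG in stx-glance stx-keystone-api-proxy stx-nova-api-proxy; do
    docker pull docker.io/starlingx/${IMG}:${TAG}
    docker tag docker.io/starlingx/${IMG}:${TAG} ${REGISTRY}/docker.io/starlingx/${IMG}:${TAG}
    docker push ${REGISTRY}/docker.io/starlingx/${IMG}:${TAG}
done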

Revision history for this message
yong hu (yhu6) wrote :

Another 2+2+2 deployment was done successfully on the CENGN 0504 build, with the stx-openstack application applied. See the attachment.

Revision history for this message
yong hu (yhu6) wrote :

@ppeng, I deployed an AIO simplex (built from the 0505 codebase) and the stx-openstack application applied successfully. Please see the attachment.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced on the 20190505T233000Z load. Logs from both hosts are collected and attached.

[wrsroot@controller-0 ~(keystone_admin)]$ system application-show stx-openstack
+---------------+--------------------------------------------------------------------+
| Property | Value |
+---------------+--------------------------------------------------------------------+
| app_version | 1.0-11-centos-stable-versioned |
| created_at | 2019-05-06T07:49:13.048191+00:00 |
| manifest_file | manifest.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | processing chart: osh-openstack-mariadb, overall completion: 21.0% |
| status | applying |
| updated_at | 2019-05-06T08:06:53.780579+00:00 |
+---------------+--------------------------------------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$ collect all
[sudo] password for wrsroot:
collecting data from 2 host(s): controller-0 controller-1
collecting controller-0_20190506.134534 ... done (00:02:00 35M)
collecting controller-1_20190506.134534 ... done (00:01:19 30M)
creating all-nodes tarball /scratch/ALL_NODES_20190506.134534.tar ... done (00:03:19 64M)
[wrsroot@controller-0 ~(keystone_admin)]$ date
Mon May 6 13:49:09 UTC 2019
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

After our docker registry was updated with the missing docker images, the BM 2+2+2 was installed successfully and "system application-apply stx-openstack" succeeded.

Revision history for this message
Fernando Hernandez Gonzalez (fhernan2) wrote :

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 6 | storage-1 | storage | unlocked | enabled | available |
| 7 | storage-0 | storage | unlocked | enabled | available |
| 12 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 85261611-1245-4b21-bd0b-cc9afdc26cff
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum controller-0,controller-1,storage-0
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 2 up, 2 in
    rgw: 1 daemon active

  data:
    pools: 9 pools, 856 pgs
    objects: 1.59 k objects, 825 MiB
    usage: 1.8 GiB used, 890 GiB / 892 GiB avail
    pgs: 856 active+clean

  io:
    client: 360 KiB/s wr, 0 op/s rd, 68 op/s wr

Revision history for this message
yong hu (yhu6) wrote :

@ppeng, looking into the log, the application-apply is stuck here:

2149 2019-05-06 08:06:53.772 104648 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-mariadb, overall completion: 21.0%

One possible cause is that something went wrong with the Ceph OSD pool "kube-rbd".
Please provide some info from the following commands (a combined capture sketch follows the list):
1. ceph -s
2. ceph osd tree
3. ceph osd pool ls
4. ps aux | grep ceph
5. kubectl get secrets --all-namespaces | grep ceph
6. kubectl get jobs.batch --all-namespaces | grep ceph
7. kubectl get serviceaccounts --all-namespaces | grep ceph
8. kubectl get pods --all-namespaces | grep ceph
9. history > my_operations.log
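
One way to capture all of the above in a single file (illustrative only; run it in the interactive shell so "history" has content):

{
    ceph -s
    ceph osd tree
    ceph osd pool ls
    ps aux | grep '[c]eph'
    kubectl get secrets --all-namespaces | grep ceph
    kubectl get jobs.batch --all-namespaces | grep ceph
    kubectl get serviceaccounts --all-namespaces | grep ceph
    kubectl get pods --all-namespaces | grep ceph
    history
} > my_operations.log 2>&1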

In addition, with the same CENGN image, we also deployed a duplex with 2 OSDs (on controller-0 and controller-1 respectively), and application-apply worked well. You can see the screenshot in the attachment.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

This is likely an issue caused by the Ceph REST API not being available when stx-openstack is applied:

1. at 2019-05-03 07:07:38 system application-apply stx-openstack

2. but sm-customer.log reports mgr-restful-plugin started at 2019-05-03 07:13:47

  | 2019-05-03T07:13:47.582 | 299 | service-scn | mgr-restful-plugin | enabling-failed | enabled-active | enable success

3. and also mgr-restful-plugin.log reports response ok from ceph-mgr (REST API) at 2019-05-03 07:14:04

  2019-05-03 07:14:04,634 199978 INFO mgr-restful-plugin get url=https://controller-0:5001/, response=<Response [200]>

Chances are the mariadb chart failed because mgr-restful-plugin/ceph-mgr were not up and running, but there are no logs to confirm that (container logs were discarded before the log collect).
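
A quick way to confirm the REST API is actually up before (re)applying is to probe the same endpoint mgr-restful-plugin polls (URL taken from the log line above; the check itself is illustrative):

# expect 200 once ceph-mgr's restful plugin is serving requests
curl -k -s -o /dev/null -w '%{http_code}\n' https://controller-0:5001/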

Cindy Xie (xxie1)
tags: added: stx.storage
Revision history for this message
Daniel Badea (daniel.badea) wrote :

ceph-mgr and the restful plugin are restarted by mgr-restful-plugin because of a missing liveness request:

2019-05-03 07:12:00.465 7fdf1ae49700 0 log_channel(audit) log [DBG] : from='mon.0 192.168.204.3:0/84215' entity='mon.controller-0' cmd=[{"prefix": "osd df", "output_method": "tree", "format": "json"}]: dispatch
2019-05-03 07:12:00.465 7fdf1ae49700 1 mgr.server reply handle_command (0) Success
2019-05-03 07:12:00.542 7fdf1b64a700 1 mgr[restful] Unknown request '140596180385936:0'
2019-05-03 07:12:01.534 7fdf2a784700 1 mgr send_beacon active

then

2019-05-03 07:12:26.038 7fdf2e78c700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 119438) UID: 0
2019-05-03 07:12:26.038 7fdf2e78c700 -1 mgr handle_signal *** Got signal Terminated ***
2019-05-03 07:12:35.426 7f5ecbe17380 0 ceph version 13.2.2 (67ecc2fbbca4f602f1172fa8e561b813eb564df5) mimic (stable), process ceph-mgr, pid 190485

This is caused by a synchronization issue in ceph-mgr. See https://github.com/starlingx-staging/stx-ceph/pull/29

Changed in starlingx:
assignee: yong hu (yhu6) → Daniel Badea (daniel.badea)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → In Progress
Revision history for this message
Joseph Richard (josephrichard) wrote :

I've seen this a few times as well with a pull from yesterday.

The reason the application gets stuck in the applying state is that sm restarts sysinv (due to the mgr-restful-plugin failure that Daniel is looking at), which stops armada from running, and nothing cleans up the application state.
You can see this in sm logs:
2019-05-03T07:13:43.000 controller-0 sm: debug time[892.099] log<632> INFO: sm[88506]: sm_service_fsm.c(1032): Service (sysinv-conductor) received event (disable) was in the enabled-active state and is now in the disabling state.

The only way I could recover was to manually reset the application state in the psql database:
sudo -u postgres psql -d sysinv -c"update kube_app set status='applied' where name='stx-openstack';"
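
A plausible follow-up after that manual reset, using only commands already shown in this thread (illustrative; the reset is a workaround, not a fix):

# confirm the recorded state, then retry the apply
system application-list
system application-apply stx-openstack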

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Bob Church and Daniel Badea have identified an upstream fix that landed in Ceph v13.2.3 (StarlingX carries Ceph v13.2.2):

https://tracker.ceph.com/issues/37271
https://github.com/ceph/ceph/pull/25113/commits/6d7f6209b76f2d38ff832e556937510228d76bda

This fix was cherry-picked and merged into StarlingX via pull request:
https://github.com/starlingx-staging/stx-ceph/pull/30

The fix should be available in the May 15 loads.

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.sanity
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Using CENGN image 20190515T220331Z: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190515T220331Z/outputs/iso/bootimage.iso

During sanity execution, volumes still cannot be created (behavior described in https://bugs.launchpad.net/starlingx/+bug/1828059, which seems to be a duplicate of this one); the error looks like this:

Invalid volume: Invalid input received: Invalid volume: Volume attachments can not be created if the volume is in an error state. The Volume ea728fbf-dd46-4ae8-9ca7-48dd8fd7b421 currently has a status of: error (HTTP 400) (Request-ID: req-757d37f1-8eb3-4e3c-9695-1895959aecfa) (HTTP 400) (Request-ID: req-3b404fca-d7b0-4241-857b-9acf950be7c9)

I attached a collect from a 2+2+2; however, the issue is also present on Simplex, Duplex, and 2+2.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Please add the above note to https://bugs.launchpad.net/starlingx/+bug/1828059, as that is the bug that reports the volume issue you are describing now. This bug reports an entirely different symptom, even if the root cause of both issues is the same.

Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue in recent loads.
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

Lab: WP_1_2
Load: 2019-05-18_06-36-50

tags: removed: stx.retestneeded