stx-monitor stuck at applying status when apply is not possible - it should reach apply-failed

Bug #1867019 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kevin Smith

Bug Description

Brief Description
-----------------
The stx-monitor app was applied and deleted successfully on an SX system. After Backup and Restore (BnR), an attempt to apply stx-monitor left the app stuck at 'applying' status. Because of this, the host cannot be unlocked after it has been locked.

Severity
--------
Major

Steps to Reproduce
------------------
BnR on SX system
check stx-monitor app status

TC-name: sanity after BnR

Expected Behavior
------------------
stx-monitor applied

Actual Behavior
----------------
stx-monitor stuck as applying

Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor

System Configuration
--------------------
One node system

Lab-name: wcp-112

Branch/Pull Time/Commit
-----------------------
2020-03-09_04-10-00

Last Pass
---------
unknown

Timestamp/Logs
--------------
[2020-03-10 07:57:07,891] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-03-10 07:57:09,050] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+----------+-----------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | applied | completed |
+---------------------+---------+-------------------------------+------------------+----------+-----------+

[2020-03-10 07:58:40,715] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-delete stx-monitor'
[2020-03-10 07:58:43,088] 436 DEBUG MainThread ssh.expect :: Output:
Application stx-monitor deleted.
controller-0:~$

BnR ....

[2020-03-10 20:06:27,015] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-03-10 20:06:28,164] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | uploaded | completed |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
controller-0:~$
[2020-03-10 20:06:28,164] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2020-03-10 20:06:28,267] 436 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2020-03-10 20:06:28,268] 254 INFO MainThread container_helper.wait_for_apps_status:: ['stx-monitor'] reached expected status uploaded
[2020-03-10 20:06:28,268] 144 INFO MainThread container_helper.upload_app:: stx-monitor uploaded successfully
[2020-03-10 20:06:28,268] 287 INFO MainThread test_stx_monitor.app_upload_apply:: Apply stx-monitor
[2020-03-10 20:06:28,268] 296 INFO MainThread container_helper.apply_app:: Apply application: stx-monitor
[2020-03-10 20:06:28,269] 1604 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_112
[2020-03-10 20:06:28,269] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-03-10 20:06:28,269] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-apply stx-monitor'

[2020-03-10 21:05:45,153] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-03-10 21:05:46,275] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | applying | None |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+

system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-remove stx-monitor' failed to execute. Output: Application-remove rejected: operation is not allowed while the current status is applying.
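
For checking states like the ones in the tables above from a script, here is a minimal sketch that parses the ASCII table printed by `system application-list` and looks up one application's status. The function names are hypothetical (they are not the helpers from the test framework in the log), and the sample text is copied from the output above.

```python
def parse_application_list(table_text):
    """Parse the ASCII table printed by `system application-list` into dicts."""
    rows = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ borders, shell prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def app_status(table_text, app_name):
    """Return the status of app_name, or None if the app is not listed."""
    for row in parse_application_list(table_text):
        if row.get("application") == app_name:
            return row.get("status")
    return None


# Sample copied from the application-list output in this report.
SAMPLE = """\
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | applying | None |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
controller-0:~$
"""

print(app_status(SAMPLE, "stx-monitor"))  # applying
```

The parser assumes that no cell contains a `|` character, which holds for the output shown in this report.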

Test Activity
-------------
Sanity

Peng Peng (ppeng)
tags: added: stx.retestneeded
Revision history for this message
Peng Peng (ppeng) wrote :

It seems, after BnR, platform-deployment-manager pod is not running.

[root@controller-0 sysadmin(keystone_admin)]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-855577b7b5-c49fs   1/1     Running   7          24h
kube-system   calico-node-kg5jv                          1/1     Running   3          24h
kube-system   coredns-6889846b6b-fw7qh                   1/1     Running   3          24h
kube-system   kube-apiserver-controller-0                1/1     Running   3          24h
kube-system   kube-controller-manager-controller-0       1/1     Running   4          24h
kube-system   kube-multus-ds-amd64-x6vmc                 1/1     Running   3          24h
kube-system   kube-proxy-9flbc                           1/1     Running   3          24h
kube-system   kube-scheduler-controller-0                1/1     Running   4          24h
kube-system   kube-sriov-cni-ds-amd64-r67q5              1/1     Running   3          24h
kube-system   tiller-deploy-d6b59fcb-ldb9p               1/1     Running   3          24h
[root@controller-0 sysadmin(keystone_admin)]#

[root@controller-0 sysadmin(keystone_admin)]# KUBECONFIG=/etc/kubernetes/admin.conf /bin/kubectl apply -f andy_backup/deployment-config.yaml
namespace/deployment unchanged
secret/platform-certificate unchanged
secret/system-endpoint configured
secret/system-license unchanged
unable to recognize "andy_backup/deployment-config.yaml": no matches for kind "System" in version "starlingx.windriver.com/v1"
unable to recognize "andy_backup/deployment-config.yaml": no matches for kind "DataNetwork" in version "starlingx.windriver.com/v1"
unable to recognize "andy_backup/deployment-config.yaml": no matches for kind "HostProfile" in version "starlingx.windriver.com/v1"
unable to recognize "andy_backup/deployment-config.yaml": no matches for kind "Host" in version "starlingx.windriver.com/v1"

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Yang Liu (yliu12) wrote :

I want to clarify that the main complaint of this issue is not BnR. On the BnR side, it seems that no system application can be applied after restore, which could be similar to the following:
https://bugs.launchpad.net/starlingx/+bug/1866704 - SX after BnR, platform-integ-apps apply-failed

For this one, someone who works on stx-monitor should investigate why stx-monitor gets stuck at applying instead of reaching apply-failed.
The impact of the "applying" state is that once the controller is locked, it cannot be unlocked anymore, so the system basically becomes useless. BnR is not the only scenario that causes stx-monitor to behave like this. In the non-ceph case, where stx-monitor is not supported, it also gets stuck at applying.

summary: - After BnR, stx-monitor stuck at applying status
+ stx-monitor stuck at applying status when apply is not possible - it
+ should reach apply-failed
Mihnea Saracin (msaracin) wrote :

In this scenario, I think stx-monitor behaves like this because platform-integ-apps is in a failed state. I'll try to solve https://bugs.launchpad.net/starlingx/+bug/1866704 and then see whether this bug is still reproducible.

Ghada Khalil (gkhalil) wrote :

Assigning to Kevin Smith to investigate. The request here is to change the stx-monitor chart to time out and fail.
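
To illustrate the "time out and fail" behaviour requested here, the following is a minimal sketch of a deadline-based wait that turns a never-ending 'applying' state into 'apply-failed'. It is only an illustration of the idea, not how the stx-monitor chart, Armada, or sysinv implement timeouts. `wait_for_apply`, its parameters, and the status source passed in as `get_status` are all hypothetical.

```python
import time

# Statuses at which the apply is considered finished (names taken from this report).
TERMINAL_STATUSES = {"applied", "apply-failed"}


def wait_for_apply(get_status, timeout_s, poll_s=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it is terminal or timeout_s has elapsed.

    Returns the terminal status, or 'apply-failed' if the deadline passes
    while the application is still in a non-terminal state.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            # Report a failure instead of leaving the caller waiting forever.
            return "apply-failed"
        sleep(poll_s)


# Demonstration with a fake clock so that the example does not really sleep.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


result = wait_for_apply(lambda: "applying", timeout_s=60, poll_s=10,
                        clock=lambda: fake_now[0], sleep=fake_sleep)
print(result)  # apply-failed
```

The clock and sleep functions are injectable only so that the behaviour can be exercised without waiting in real time.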

tags: added: stx.4.0 stx.monitor
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: Mihnea Saracin (msaracin) → Kevin Smith (kevin.smith.wrs)
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_112
Load: 2020-03-19_04-10-00

Log added:
https://files.starlingx.kube.cengn.ca/launchpad/1867019
ALL_NODES_20200310.145754.tar is right after BnR
ALL_NODES_20200311.033530.tar is after BnR sanity

[2020-03-20 16:57:03,503] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-03-20 16:57:04,667] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+----------+-------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+----------+-------------------------------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | applying | applying application manifest |
+---------------------+---------+-------------------------------+------------------+----------+-------------------------------+
controller-0:~$
[2020-03-20 16:57:04,668] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2020-03-20 16:57:04,771] 436 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2020-03-20 16:58:04,831] 1604 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_112
[2020-03-20 16:58:04,831] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-03-20 16:58:04,832] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-03-20 16:58:05,974] 436 DEBUG MainThread ssh.expect :: Output:
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+---------+-------------------------------+------------------+--------------+------------------------------------------+
| oidc-auth-apps | 1.0-0 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-monitor | 1.0-1 | monitor-armada-manifest | stx-monitor.yaml | apply-failed | operation aborted, check logs for detail |
+------------------...


Kevin Smith (kevin.smith.wrs) wrote :

Not sure I understand. The previous comment shows stx-monitor in apply-failed status, which is what the lp is asking for. I also don't see stx-monitor stuck applying in the logs.

Kevin Smith (kevin.smith.wrs) wrote :

The problem is readily reproducible by applying any application that requires storage when no storage backend (ceph) is configured. The cause is that the exception string written to the sysinv kube_app table is longer than the 255-character limit.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714426

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/714426
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=08aa950393a7e3c5fd5299b88e134307800584aa
Submitter: Zuul
Branch: master

commit 08aa950393a7e3c5fd5299b88e134307800584aa
Author: Kevin Smith <email address hidden>
Date: Sun Mar 22 14:29:15 2020 -0400

    application-apply error string too long

    During application-apply exception handling, str(e) is
    used as the input to the progress column of the kube_app
    table in the database, which may be longer than the 255
    character limit. The result is an application stuck
    in 'applying' status. This update adds a more readable
    error message to just check logs.

    There are other instances where str(e) is used as input to
    the database and could cause a similar problem which should
    also be looked at.

    Change-Id: I01a5e8f56a628726163e2cfffc58143ae8d5f845
    Closes-Bug: 1867019
    Signed-off-by: Kevin Smith <email address hidden>
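
The failure mode described in the commit message can be sketched as follows. This is a simplified model, not the sysinv code: `FakeKubeAppRow`, `ProgressTooLong`, and both handler functions are hypothetical stand-ins, and the fixed message string is an assumption borrowed from the progress text seen in the application-list output earlier in this report.

```python
PROGRESS_MAX_LEN = 255  # character limit quoted in the commit message


class ProgressTooLong(Exception):
    """Stand-in for the database error raised on an over-long value."""


class FakeKubeAppRow:
    """Hypothetical in-memory stand-in for one row of the kube_app table."""

    def __init__(self):
        self.status = "applying"
        self.progress = None

    def save(self, status, progress):
        # Reject the whole update if progress does not fit the column.
        if len(progress) > PROGRESS_MAX_LEN:
            raise ProgressTooLong(
                "progress exceeds %d characters" % PROGRESS_MAX_LEN)
        self.status = status
        self.progress = progress


def handle_apply_error_buggy(row, exc):
    # str(exc) can exceed the limit; save() then raises and the row
    # keeps its old status, so the application stays 'applying'.
    row.save("apply-failed", str(exc))


def handle_apply_error_fixed(row, exc):
    # Store a short fixed message (assumed wording); the full exception
    # text belongs in the log file instead of the database column.
    row.save("apply-failed", "operation aborted, check logs for detail")


error = RuntimeError("x" * 300)  # exception text longer than 255 characters

buggy = FakeKubeAppRow()
try:
    handle_apply_error_buggy(buggy, error)
except ProgressTooLong:
    pass
print(buggy.status)  # applying

fixed = FakeKubeAppRow()
handle_apply_error_fixed(fixed, error)
print(fixed.status)  # apply-failed
```

The same pattern applies to the other places the commit message mentions where str(e) is written to the database: any value derived from an exception needs to be bounded before it is stored.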

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716137

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/716137
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cb4cf4299c2ec10fb2eb03cdee3f6d78a6413089
Submitter: Zuul
Branch: f/centos8

commit 16477935845e1c27b4c9d31743e359b0aa94a948
Author: Steven Webster <email address hidden>
Date: Sat Mar 28 17:19:30 2020 -0400

    Fix SR-IOV runtime manifest apply

    When an SR-IOV interface is configured, the platform's
    network runtime manifest is applied in order to apply the virtual
    function (VF) config and restart the interface. This results in
    sysinv being able to determine and populate the puppet hieradata
    with the virtual function PCI addresses.

    A side effect of the network manifest apply is that potentially
    all platform interfaces may be brought down/up if it is determined
    that their configuration has changed. This will likely be the case
    for a system which configures SR-IOV interfaces before initial
    unlock.

    A few issues have been encountered because of this, with some
    services not behaving well when the interface they are communicating
    over suddenly goes down.

    This commit makes the SR-IOV VF configuration much more targeted
    so that only the operation of setting the desired number of VFs
    is performed.

    Closes-Bug: #1868584
    Depends-On: https://review.opendev.org/715669
    Change-Id: Ie162380d3732eb1b6e9c553362fe68cbc313ae2b
    Signed-off-by: Steven Webster <email address hidden>

commit 45c9fe2d3571574b9e0503af108fe7c1567007db
Author: Zhipeng Liu <email address hidden>
Date: Thu Mar 26 01:58:34 2020 +0800

    Add ipv6 support for novncproxy_base_url.

    For ipv6 address, we need url with below format
    [ip]:port

    Partial-Bug: 1859641

    Change-Id: I01a5cd92deb9e88c2d31bd1e16e5bce1e849fcc7
    Signed-off-by: Zhipeng Liu <email address hidden>

commit d119336b3a3b24d924e000277a37ab0b5f93aae1
Author: Andy Ning <email address hidden>
Date: Mon Mar 23 16:26:21 2020 -0400

    Fix timeout waiting for CA cert install during ansible replay

    During ansible bootstrap replay, the ssl_ca_complete_flag file is
    removed. It expects puppet platform::config::runtime manifest apply
    during system CA certificate install to re-generate it. So this commit
    updated conductor manager to run that puppet manifest even if the CA cert
    has already installed so that the ssl_ca_complete_flag file is created
    and makes ansible replay to continue.

    Change-Id: Ic9051fba9afe5d5a189e2be8c8c2960bdb0d20a4
    Closes-Bug: 1868585
    Signed-off-by: Andy Ning <email address hidden>

commit 24a533d800b2c57b84f1086593fe5f04f95fe906
Author: Zhipeng Liu <email address hidden>
Date: Fri Mar 20 23:10:31 2020 +0800

    Fix rabbitmq could not bind port to ipv6 address issue

    When we use Armada to deploy openstack service for ipv6, rabbitmq
    pod could not start listen on [::]:5672 and [::]:15672.
    For ipv6, we need an override for configuration file.

    Upstream patch link is:
    https://review.opendev.org/#/c/714027/

    Test pass for deploying rabbitmq service on both ipv...

tags: added: in-f-centos8
Peng Peng (ppeng) wrote :

Verified on
Lab: WCP_112
Load: 2020-04-27_20-00-00

tags: removed: stx.retestneeded