Applications upload aborted by failed to validate application manifest

Bug #1879970 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
During system initialization, the uploads of both the platform-integ-apps and oidc-auth-apps applications were aborted. sysinv.log shows "Failed to validate application manifest".

Severity
--------
Major

Steps to Reproduce
------------------
Upload the platform-integ-apps and oidc-auth-apps applications.

Expected Behavior
------------------
Both applications, platform-integ-apps and oidc-auth-apps, upload successfully.

Actual Behavior
----------------
The uploads of both applications, platform-integ-apps and oidc-auth-apps, are aborted.

Reproducibility
---------------
Unknown. This is the first time the failure has been seen in sanity; will monitor.

System Configuration
--------------------
Two node system

Lab-name: R430_3-4

Branch/Pull Time/Commit
-----------------------
2020-05-20_21-00-00

Last Pass
---------
2020-05-19_20-00-00

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+-----------------------+---------------+-----------------------------------------------------+
| application              | version | manifest name                     | manifest file         | status        | progress                                            |
+--------------------------+---------+-----------------------------------+-----------------------+---------------+-----------------------------------------------------+
| cert-manager             | 1.0-0   | cert-manager-manifest             | certmanager-manifest. | applied       | completed                                           |
|                          |         |                                   | yaml                  |               |                                                     |
|                          |         |                                   |                       |               |                                                     |
| nginx-ingress-controller | 1.0-0   | nginx-ingress-controller-manifest | nginx_ingress_control | applied       | completed                                           |
|                          |         |                                   | ler_manifest.yaml     |               |                                                     |
|                          |         |                                   |                       |               |                                                     |
| oidc-auth-apps           | 1.0-0   | oidc-auth-manifest                | manifest.yaml         | upload-failed | Upload of application oidc-auth-apps (1.0-0) failed |
|                          |         |                                   |                       |               | : Failed to validate application manifest.          |
|                          |         |                                   |                       |               |                                                     |
| platform-integ-apps      | 1.0-8   | platform-integration-manifest     | manifest.yaml         | upload-failed | Upload of application platform-integ-apps (1.0-8)   |
|                          |         |                                   |                       |               | failed: Failed to validate application manifest.    |
|                          |         |                                   |                       |               |                                                     |

From sysinv.log:
sysinv 2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app [-] Upload of application platform-integ-apps (1.0-8) failed: Failed to validate application manifest.: KubeAppUploadFailure: Upload of application platform-integ-apps (1.0-8) failed: Failed to validate application manifest.

sysinv 2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app [-] Upload of application oidc-auth-apps (1.0-0) failed: Failed to validate application manifest.: KubeAppUploadFailure: Upload of application oidc-auth-apps (1.0-0) failed: Failed to validate application manifest.
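
When triaging a larger sysinv.log, it can help to pull out every upload failure with its timestamp, application, and version. The following is a minimal sketch, not part of sysinv: the regular expression is derived only from the two log lines quoted above, and the helper name `find_upload_failures` is hypothetical. Other sysinv log lines may be laid out differently.

```python
import re

# Pattern derived from the two sysinv.log lines quoted in this report.
# Assumption: other upload-failure lines follow the same layout.
UPLOAD_FAILURE = re.compile(
    r"sysinv (?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\d+ ERROR sysinv\.conductor\.kube_app \[-\] "
    r"Upload of application (?P<app>\S+) \((?P<version>[^)]+)\) failed: "
    r"(?P<reason>[^:]+?)\.?:"
)


def find_upload_failures(lines):
    """Return (timestamp, app, version, reason) for each upload-failure line."""
    failures = []
    for line in lines:
        match = UPLOAD_FAILURE.search(line)
        if match:
            failures.append(match.group("timestamp", "app", "version", "reason"))
    return failures


# The two lines from sysinv.log quoted in this report.
SAMPLE_LINES = [
    "sysinv 2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app [-] "
    "Upload of application platform-integ-apps (1.0-8) failed: Failed to validate "
    "application manifest.: KubeAppUploadFailure: Upload of application "
    "platform-integ-apps (1.0-8) failed: Failed to validate application manifest.",
    "sysinv 2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app [-] "
    "Upload of application oidc-auth-apps (1.0-0) failed: Failed to validate "
    "application manifest.: KubeAppUploadFailure: Upload of application "
    "oidc-auth-apps (1.0-0) failed: Failed to validate application manifest.",
]

for failure in find_upload_failures(SAMPLE_LINES):
    print(failure)
```

Both failures are reported less than a second apart with the same reason, which is consistent with a shared underlying cause rather than a problem in either application's manifest.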

Test Activity
-------------
installation

Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil)
tags: added: stx.conta
tags: added: stx.4.0 stx.containers
removed: stx.conta
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Paul-Ionut Vaduva (pvaduva) wrote :

An assessment of what happened in the lab when platform-integ-apps and oidc-auth-apps failed to upload:

At 09:12:29.137 platform-integ-apps failed to upload
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1928, in perform_app_upload
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app reason="Failed to validate application manifest.")
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app KubeAppUploadFailure: Upload of application platform-integ-apps (1.0-8) failed: Failed to validate application manifest.
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app

At 09:12:29.978 oidc-auth-apps failed to upload
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1928, in perform_app_upload
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app reason="Failed to validate application manifest.")
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app KubeAppUploadFailure: Upload of application oidc-auth-apps (1.0-0) failed: Failed to validate application manifest.
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app

Around 2020-05-21T09:12:04
The cert-manager pod container started to throw these errors in a loop:

2020-05-21T09:12:04.317805242Z stderr F E0521 09:12:04.317715 1 dynamic_source.go:87] "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
2020-05-21T09:12:05.308809217Z stderr F I0521 09:12:05.308645 1 dynamic_source.go:171] "msg"="Generating new ECDSA private key"
2020-05-21T09:12:05.315177628Z stderr F I0521 09:12:05.315036 1 dynamic_source.go:186] "msg"="Signing new serving certificate"

Around 2020-05-21T09:12:09Z
Things started to go wrong with one of the cert-manager application pods:
2020-05-21T09:12:09Z cm-cert-manager-webhook-7d5c897795-tstjz Pod Readiness probe failed: HTTP probe failed with statuscode: 500 Unhealthy Warning
2020-05-21T09:12:10Z calico-kube-controllers-5cd4695574-mtspd Pod Container image "registry.local:9001/quay.io/calico/kube-controllers:v3.12.0" already present on machine Pulled Normal
2020-05-21T09:12:10Z coredns-78d9fd7cb9-q5nxw Pod Readiness probe failed: HTTP probe failed with statuscode: 503 Unhealthy Warning
202...

Bob Church (rchurch) wrote :

This looks like another variation of https://bugs.launchpad.net/starlingx/+bug/1877582. It is not quite the same failure, but it is basically an issue with the armada container not being shut down properly on an unlock, which leaves the container unable to be started or restarted after the reboot.

Thu May 21 13:50:59 UTC 2020 : : docker container ps -a
--------------------------------------------------------------------
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ae7089822ac6 registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic "./entrypoint.sh ser…" 5 hours ago Exited (127) 5 hours ago armada_service

So the container exited at approximately May 21 08:50:59...
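
The estimate above is the collection time minus the "5 hours ago" reported by docker. A minimal sketch of that arithmetic follows. The helper name `exit_window` is hypothetical, and the 30-minute tolerance is an assumption that docker rounds the age to the nearest hour; it is not taken from the logs.

```python
from datetime import datetime, timedelta, timezone


def exit_window(collected_at, hours_ago, tolerance=timedelta(minutes=30)):
    """Return (earliest, estimate, latest) exit times for an 'N hours ago' status.

    The tolerance is an assumption about how coarsely the age is rounded.
    """
    estimate = collected_at - timedelta(hours=hours_ago)
    return estimate - tolerance, estimate, estimate + tolerance


# "Thu May 21 13:50:59 UTC 2020": when the container listing was collected.
collected_at = datetime(2020, 5, 21, 13, 50, 59, tzinfo=timezone.utc)
earliest, estimate, latest = exit_window(collected_at, hours_ago=5)
print(estimate.isoformat())  # 2020-05-21T08:50:59+00:00

# Timestamps of the two "shim reaped" events logged for this container.
shim_reaped = [
    datetime(2020, 5, 21, 9, 0, 30, tzinfo=timezone.utc),
    datetime(2020, 5, 21, 9, 4, 20, tzinfo=timezone.utc),
]
# True only under the assumed 30-minute tolerance.
print(all(earliest <= t <= latest for t in shim_reaped))
```

Under that assumed tolerance, both "shim reaped" events for the armada container fall inside the estimated exit window, so the "5 hours ago" status does not contradict the containerd log timestamps.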

2020-05-21T08:59:57.143 + Host Info +--------------------------------------+
2020-05-21T08:59:57.143 | action : unlock
2020-05-21T08:59:57.143 | personality: controller
2020-05-21T08:59:57.143 | hostname : controller-0
2020-05-21T08:59:57.143 | task : none
2020-05-21T08:59:57.143 | info : none
2020-05-21T08:59:57.143 | ip : face::2
2020-05-21T08:59:57.143 | mac : 3c:fd:fe:25:d5:c0
2020-05-21T08:59:57.143 | uuid : b726aaf0-96aa-42d3-a668-a22e360f9691
2020-05-21T08:59:57.143 | adminState: locked
2020-05-21T08:59:57.143 | operState: disabled
2020-05-21T08:59:57.143 | availStatus: online
2020-05-21T08:59:57.143 | bm ip : none
2020-05-21T08:59:57.143 | bm un : none
2020-05-21T08:59:57.143 | bm type : none
2020-05-21T08:59:57.143 | subFunction: controller,worker
2020-05-21T08:59:57.143 | operState: disabled
2020-05-21T08:59:57.143 | availStatus: online
2020-05-21T08:59:57.143 +------------+--------------------------------------+

2020-05-21T09:00:30.637 localhost containerd[108819]: info time="2020-05-21T09:00:30.636987457Z" level=info msg="shim reaped" id=ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6
2020-05-21T09:00:30.646 localhost dockerd[108828]: info time="2020-05-21T09:00:30.645241723Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
2020-05-21T09:00:30.663 localhost systemd[1]: info Unmounted /var/lib/docker/containers/ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6/mounts/shm.

2020-05-21T09:04:20.331 controller-0 containerd[2121]: info time="2020-05-21T09:04:20.331574760Z" level=info msg="shim reaped" id=ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6
2020-05-21T09:04:20.341 controller-0 dockerd[2130]: info time="2020-05-21T09:04:20.341878502Z" level=error msg="stream copy error: reading from a closed fifo"
2020-05-21T09:04:20.341 controller-0 dockerd[2130]: info time="2020-05-21T09:04:20.341873475Z" level=error msg="stream copy error: reading from a closed fifo"
2020-05-21T09:04:20.367 controller-0 dockerd[2130]: info time="2020-05-21T09:04:20.367669129Z" level=error msg="ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6 cleanup: failed to delete container fr...
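
To follow a single container through a long daemon log, the "shim reaped" events can be filtered by the short container ID shown in `docker container ps -a`. This is a rough sketch: the regular expression is based only on the containerd lines quoted above, and the helper name `shim_reaped_events` is hypothetical.

```python
import re

# Pattern based on the containerd lines quoted in this comment.
# Assumption: other "shim reaped" lines share this layout.
SHIM_REAPED = re.compile(
    r'^(?P<logged>\S+) (?P<host>\S+) containerd\[\d+\]: .*'
    r'msg="shim reaped" id=(?P<container_id>[0-9a-f]+)'
)


def shim_reaped_events(lines, short_id):
    """Return (log timestamp, hostname) for each reap of the given container."""
    events = []
    for line in lines:
        match = SHIM_REAPED.search(line)
        if match and match.group("container_id").startswith(short_id):
            events.append((match.group("logged"), match.group("host")))
    return events


# The two containerd lines from the excerpt above.
SAMPLE_LINES = [
    '2020-05-21T09:00:30.637 localhost containerd[108819]: info '
    'time="2020-05-21T09:00:30.636987457Z" level=info msg="shim reaped" '
    'id=ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6',
    '2020-05-21T09:04:20.331 controller-0 containerd[2121]: info '
    'time="2020-05-21T09:04:20.331574760Z" level=info msg="shim reaped" '
    'id=ae7089822ac60e810d9f348d57a31f03a64bf2c6a8932da82721d4e260f363e6',
]

# "ae7089822ac6" is the CONTAINER ID of armada_service in the listing above.
for event in shim_reaped_events(SAMPLE_LINES, "ae7089822ac6"):
    print(event)
```

In the excerpt, the same container is reaped once under hostname localhost and again under controller-0 with a different containerd PID, which fits the description of a failed start attempt after the reboot.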

Frank Miller (sensfan22) wrote :

The fix for LP 1877582 is expected to address this variant of the issue as well.

Ghada Khalil (gkhalil) wrote :

Duplicate LP has been addressed.
https://review.opendev.org/735374
Merged in stx master on 2020-06-15

Changed in starlingx:
status: Triaged → Fix Released
Yang Liu (yliu12) wrote :

This issue has not been seen in recent sanity runs on the same system (r430-3-4).
