Comment 2 for bug 1879970

Paul-Ionut Vaduva (pvaduva) wrote :

An assessment of what happened in the lab when platform-integ-apps and oidc-auth-apps failed to upload:

At 09:12:29.137 platform-integ-apps failed to upload
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1928, in perform_app_upload
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app reason="Failed to validate application manifest.")
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app KubeAppUploadFailure: Upload of application platform-integ-apps (1.0-8) failed: Failed to validate application manifest.
2020-05-21 09:12:29.137 104305 ERROR sysinv.conductor.kube_app

At 09:12:29.978 oidc-auth-apps failed to upload
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1928, in perform_app_upload
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app reason="Failed to validate application manifest.")
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app KubeAppUploadFailure: Upload of application oidc-auth-apps (1.0-0) failed: Failed to validate application manifest.
2020-05-21 09:12:29.978 104305 ERROR sysinv.conductor.kube_app

Around 2020-05-21T09:12:04 the cert-manager pod container started throwing these errors in a loop:

2020-05-21T09:12:04.317805242Z stderr F E0521 09:12:04.317715 1 dynamic_source.go:87] "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
2020-05-21T09:12:05.308809217Z stderr F I0521 09:12:05.308645 1 dynamic_source.go:171] "msg"="Generating new ECDSA private key"
2020-05-21T09:12:05.315177628Z stderr F I0521 09:12:05.315036 1 dynamic_source.go:186] "msg"="Signing new serving certificate"

Around 2020-05-21T09:12:09Z things started to go wrong with one of the cert-manager application pods:
2020-05-21T09:12:09Z cm-cert-manager-webhook-7d5c897795-tstjz Pod Readiness probe failed: HTTP probe failed with statuscode: 500 Unhealthy Warning
2020-05-21T09:12:10Z calico-kube-controllers-5cd4695574-mtspd Pod Container image "registry.local:9001/quay.io/calico/kube-controllers:v3.12.0" already present on machine Pulled Normal
2020-05-21T09:12:10Z coredns-78d9fd7cb9-q5nxw Pod Readiness probe failed: HTTP probe failed with statuscode: 503 Unhealthy Warning
2020-05-21T09:12:10Z calico-kube-controllers-5cd4695574-mtspd Pod Started container calico-kube-controllers Started Normal
2020-05-21T09:12:10Z calico-kube-controllers-5cd4695574-mtspd Pod Created container calico-kube-controllers Created Normal
2020-05-21T09:12:12Z calico-kube-controllers-5cd4695574-mtspd Pod Readiness probe failed: Failed to read status file status.json: open status.json: no such file or directory Unhealthy Warning
2020-05-21T09:12:21Z calico-kube-controllers-5cd4695574-mtspd Pod Back-off restarting failed container BackOff Warning
2020-05-21T09:12:24Z kube-scheduler Lease controller-0_df95482f-1ac0-474a-a491-c503622f57d1 became leader LeaderElection Normal
2020-05-21T09:12:24Z coredns-78d9fd7cb9-lsv8g Pod 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules. FailedScheduling Warning
2020-05-21T09:12:24Z kube-scheduler Endpoints controller-0_df95482f-1ac0-474a-a491-c503622f57d1 became leader LeaderElection Normal
2020-05-21T09:12:32Z cm-cert-manager-cainjector-56b68989b5-8xrw6 Pod Back-off restarting failed container BackOff Warning
2020-05-21T09:12:33Z cert-manager-controller ConfigMap cm-cert-manager-7b8b94bf9f-v5cmx-external-cert-manager-controller became leader LeaderElection Normal
2020-05-21T09:12:34Z platform-deployment-manager-0 Pod Back-off restarting failed container BackOff Warning

There were also some disk capacity issues around 2020-05-21T09:38:34. These occurred after the application upload failures, so my guess is that they cannot be the cause:

2020-05-21T09:38:34Z coredns-78d9fd7cb9-lsv8g Pod Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "07b16d823331e1ac4b326b65c01737d4eb2ce258835c2a5d74b4899664d78fc6": Multus: [kube-system/coredns-78d9fd7cb9-lsv8g]: error adding container to network "chain": delegateAdd: error invoking conflistAdd - "chain": conflistAdd: error in getting result from AddNetworkList: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/ FailedCreatePodSandBox Warning
2020-05-21T09:38:37Z ic-nginx-ingress-controller-hbs8r Pod Successfully pulled image "registry.local:9001/quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0" Pulled Normal
2020-05-21T09:38:39Z ic-nginx-ingress-controller-hbs8r Pod Created container nginx-ingress-controller Created Normal
2020-05-21T09:38:39Z ic-nginx-ingress-controller-hbs8r Pod Started container nginx-ingress-controller Started Normal
2020-05-21T09:38:41Z ic-nginx-ingress-controller ConfigMap ConfigMap kube-system/ic-nginx-ingress-controller CREATE Normal
2020-05-21T09:38:48Z calico-node-dpfzj Pod Successfully pulled image "registry.local:9001/quay.io/calico/node:v3.12.0" Pulled Normal
2020-05-21T09:38:55Z controller-1 Node Starting kubelet. Starting Normal
2020-05-21T09:38:55Z controller-1 Node invalid capacity 0 on image filesystem InvalidDiskCapacity Warning
2020-05-21T09:38:55Z controller-1 Node invalid capacity 0 on image filesystem ImageGCFailed Warning

My only conclusion so far is that the cert-manager malfunction caused the application uploads to fail.
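
The timing argument behind this can be sketched in a few lines of Python. This is only an illustration, not part of any tooling: it assumes the sysinv log timestamps and the Kubernetes event timestamps come from the same clock and timezone, the event descriptions are my own summaries of the log lines above, and the function name candidate_causes is hypothetical.

```python
from datetime import datetime

# Time of the first upload failure (platform-integ-apps), from the sysinv log.
# Assumption: sysinv and Kubernetes event timestamps share clock and timezone.
FIRST_UPLOAD_FAILURE = datetime(2020, 5, 21, 9, 12, 29, 137000)

# (timestamp, summary) pairs taken from the log excerpts above.
EVENTS = [
    ("2020-05-21T09:12:04", "cert-manager: failed to generate initial serving certificate"),
    ("2020-05-21T09:12:09", "cm-cert-manager-webhook: readiness probe failed (500)"),
    ("2020-05-21T09:38:34", "coredns: failed to create pod sandbox"),
    ("2020-05-21T09:38:55", "controller-1: invalid capacity 0 on image filesystem"),
]


def candidate_causes(events, failure_time):
    """Return summaries of events that happened before failure_time.

    Events after the failure cannot have caused it, so they are dropped.
    """
    return [
        summary
        for timestamp, summary in events
        if datetime.fromisoformat(timestamp) < failure_time
    ]


if __name__ == "__main__":
    for summary in candidate_causes(EVENTS, FIRST_UPLOAD_FAILURE):
        print(summary)
```

Only the two cert-manager events precede the upload failure. Preceding the failure makes them candidates, not proven causes; the 09:38 events are ruled out on timing alone.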