2020-07-15 10:08:53 |
Dan Voiculeasa |
bug |
|
|
added bug |
2020-07-15 10:25:20 |
Dan Voiculeasa |
starlingx: assignee |
|
Dan Voiculeasa (dvoicule) |
|
2020-07-15 15:12:57 |
OpenStack Infra |
starlingx: status |
New |
In Progress |
|
2020-07-16 01:45:34 |
Ghada Khalil |
tags |
|
stx.update |
|
2020-09-02 17:06:27 |
Frank Miller |
description |
Brief Description
-----------------
During restore, apps will fail to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false.
Unlock Controller-0.
Boot Controller-1 from pxe.
Expected Behavior
------------------
Apps in `applied` state after controller-1 is booted.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada will fail to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and never reaches `Ready`.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually. |
Brief Description
-----------------
During restore, apps may fail to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false.
Unlock Controller-0.
Boot Controller-1 from pxe.
Expected Behavior
------------------
Apps in `applied` state after controller-1 is booted.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada will fail to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and never reaches `Ready`.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually. |
|
2020-09-02 17:11:33 |
Frank Miller |
description |
Brief Description
-----------------
During restore, apps may fail to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false.
Unlock Controller-0.
Boot Controller-1 from pxe.
Expected Behavior
------------------
Apps in `applied` state after controller-1 is booted.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada will fail to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and never reaches `Ready`.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually. |
Brief Description
-----------------
During restore, apps may fail to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false.
Unlock Controller-0.
Boot Controller-1
Issue is observed after booting controller-1 but before unlocking controller-1
Expected Behavior
------------------
Apps in `applied` state after controller-1 is booted.
For a restore, apps that depend on controller-1 pods should not attempt to apply until after controller-1 is unlocked.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada fails to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and won't reach `Ready` until after controller-1 is unlocked.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually. |
|
2020-09-09 17:48:54 |
Frank Miller |
summary |
B&R: AIO-DX apps in `apply-failed` after controller-1 boots |
B&R: AIO-DX: apps may be in `apply-failed` after controller-1 boots |
|
2020-09-09 17:57:35 |
Frank Miller |
description |
Brief Description
-----------------
During restore, apps may fail to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false.
Unlock Controller-0.
Boot Controller-1
Issue is observed after booting controller-1 but before unlocking controller-1
Expected Behavior
------------------
Apps in `applied` state after controller-1 is booted.
For a restore, apps that depend on controller-1 pods should not attempt to apply until after controller-1 is unlocked.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada fails to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and won't reach `Ready` until after controller-1 is unlocked.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually. |
Brief Description
-----------------
During restore of AIO-DX, in some cases apps like cert-manager and/or platform-integ-apps may fail to apply after controller-0 is unlocked. This leads to the apps failing to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_osds=false (an example restore command follows these steps).
Unlock Controller-0.
Some conditions can lead to apps failing to apply (e.g., docker registry temporarily unavailable).
Boot controller-1 (issue occurs after boot and before unlock).
Unlock controller-1.
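A sketch of the restore step, assuming the stx-ansible platform restore playbook and its documented variables; the backup directory and tarball name are illustrative placeholders:
# Run on the freshly installed controller-0 (playbook path and variable names are assumptions from the B&R procedure)
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/opt/backups backup_filename=<backup tarball> wipe_ceph_osds=false"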
Expected Behavior
------------------
Apps in `applied` state after controller-0 is unlocked and even when controller-1 is booted.
For a restore, apps that depend on controller-1 pods should not attempt to apply until after controller-1 is unlocked.
Actual Behavior
----------------
Auto-apply of the apps times out, causing the apps' status to change to `apply-failed`. The timeout is long, ~1800 seconds.
1) armada fails to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.
2) The armada apply of cert-manager got the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and won't reach `Ready` until after controller-1 is unlocked.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once Kubernetes services on that node can communicate with Kubernetes services on controller-0 (did not pinpoint the exact service).
3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800-second wait.
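A diagnostic sketch for the two conditions above, assuming kubectl access on controller-0; the lock resource name reflects armada's `locks.armada.process` custom resource in kube-system and is an assumption for this environment:
# Pods Kubernetes still has scheduled on controller-1; these are the ones the armada wait blocks on
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=controller-1
# Check whether an armada apply currently holds the lock (a 409 in the logs means another apply owns it)
kubectl -n kube-system get locks.armada.process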
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes, the restore procedure is different; it needs a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>
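The pod listing above is standard kubectl wide output; assuming kubectl access, it can be regenerated with:
kubectl get pods -n cert-manager -o wide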
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/enabled/available
system application-apply each apply-failed app manually (a scripted sketch of these steps follows).
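A minimal shell sketch of the workaround, assuming the standard StarlingX `system` CLI; the grep/awk parsing of the table output and the 30-second poll interval are illustrative assumptions:
# Abort every app whose armada apply waits on a stuck pod (cert-manager shown as the example)
system application-abort cert-manager
# Unlock controller-1 and poll until it reports unlocked/enabled/available
system host-unlock controller-1
while ! system host-show controller-1 | grep -q available; do sleep 30; done
# Re-apply each app left in apply-failed (column 2 of the application-list table is assumed to be the app name)
for app in $(system application-list | awk '/apply-failed/ {print $2}'); do system application-apply "$app"; done
This parsing is a sketch, not the documented procedure; verify the column positions against the release in use. |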
|
2020-09-09 18:00:56 |
Ghada Khalil |
starlingx: importance |
Undecided |
Medium |
|
2020-09-09 18:01:07 |
Ghada Khalil |
tags |
stx.update |
stx.5.0 stx.update |
|
2020-09-18 18:27:27 |
Ghada Khalil |
bug |
|
|
added subscriber Allain Legacy |
2020-09-23 01:40:35 |
Austin Sun |
bug |
|
|
added subscriber Austin Sun |
2020-09-23 01:40:55 |
Austin Sun |
bug |
|
|
added subscriber chendongqi |
2020-10-09 20:30:16 |
OpenStack Infra |
starlingx: status |
In Progress |
Fix Released |
|
2021-06-07 17:50:24 |
OpenStack Infra |
tags |
stx.5.0 stx.update |
in-f-centos8 stx.5.0 stx.update |
|