Brief Description
-----------------
During restore, apps fail to auto-apply when controller-1 is brought up.
Severity
--------
Major: System/Feature is usable but degraded
Steps to Reproduce
------------------
Bring up an AIO-DX system.
Perform a backup.
Restore controller-0 with wipe_ceph_osds=false.
Unlock controller-0.
Boot controller-1 from PXE.
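As a minimal command sketch of the above (assuming the standard StarlingX backup/restore playbooks; paths, passwords and extra-vars may differ by release, only wipe_ceph_osds=false is taken from this report):

# Backup, run on the active controller
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
  -e "ansible_become_pass=<sysadmin password> admin_password=<admin password>"

# Restore controller-0, keeping the Ceph OSD data
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
  -e "initial_backup_dir=/home/sysadmin backup_filename=<backup tarball> wipe_ceph_osds=false"

# Unlock controller-0, then PXE-boot controller-1
system host-unlock controller-0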
Expected Behavior
------------------
Apps are in the `applied` state after controller-1 is booted.
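The state can be checked with the StarlingX CLI (the same output that appears under Timestamp/Logs below):

system application-list   # every auto-applied app should show status `applied` and progress `completed`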
Actual Behavior
----------------
Auto-apply of the apps times out, which changes the app status to `apply-failed`. The timeout is long, ~1800 seconds.
1) Armada fails to apply platform-integ-apps because it cannot take the Armada lock. This happens because another app apply is in progress.
2) The Armada apply of cert-manager got the Armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and never becomes `Ready`.
Stuck pods are those that Kubernetes knows are scheduled on a node other than controller-0 (see the kubectl sketch after this list).
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once the Kubernetes services on that node can communicate with those on controller-0 (I didn't dig in to pinpoint the exact service).
3) While an app is applying, controller-1 cannot be unlocked. Because of item 2, this adds an unnecessary ~1800-second wait.
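A quick way to spot the stuck pods described above, using standard kubectl (the cert-manager namespace and app label are taken from the apply log under Timestamp/Logs):

# pods scheduled somewhere other than controller-0; after the restore these sit in Terminating
kubectl get pods --all-namespaces -o wide | grep -v controller-0

# the specific pod the cert-manager apply is waiting on
kubectl get pods -n cert-manager -l app=cert-manager -o wide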
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIO-DX
[I think all deployment types other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing compute nodes need a separate analysis.
Deployment types containing storage nodes use a different restore procedure and also need a separate analysis.]
Branch/Pull Time/Commit
-----------------------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager pods:
NAMESPACE      NAME                                         READY   STATUS        RESTARTS   AGE    IP               NODE           NOMINATED NODE   READINESS GATES
cert-manager   cm-cert-manager-856678cfb7-pn84l             1/1     Running       0          77m    172.16.192.108   controller-0   <none>           <none>
cert-manager   cm-cert-manager-856678cfb7-vqvcw             1/1     Terminating   0          2d4h   172.16.166.141   controller-1   <none>           <none>
cert-manager   cm-cert-manager-cainjector-85849bd97-cvrgm   1/1     Running       0          77m    172.16.192.105   controller-0   <none>           <none>
cert-manager   cm-cert-manager-cainjector-85849bd97-q747l   1/1     Terminating   0          2d4h   172.16.166.140   controller-1   <none>           <none>
cert-manager   cm-cert-manager-webhook-5745478cbc-lqjls     1/1     Terminating   0          2d4h   172.16.166.142   controller-1   <none>           <none>
cert-manager   cm-cert-manager-webhook-5745478cbc-v6m54     1/1     Running       0          77m    172.16.192.107   controller-0   <none>           <none>
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']

platform-integ-apps apply log:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout
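For reference, Armada implements this lock as a Kubernetes custom resource. Assuming the upstream Armada lock handler (group armada.process, kept in the kube-system namespace; worth verifying against the installed Armada version), the contended lock can be inspected and, if left behind by a dead apply, removed:

kubectl -n kube-system get locks.armada.process
kubectl -n kube-system delete locks.armada.process <lock name>   # only if no apply is actually running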
system application-list:
+--------------------------+---------+------------------------------------+-----------------------------------------+--------------+-------------------------------------------+
| application              | version | manifest name                      | manifest file                           | status       | progress                                  |
+--------------------------+---------+------------------------------------+-----------------------------------------+--------------+-------------------------------------------+
| cert-manager             | 1.0-5   | cert-manager-manifest              | certmanager-manifest.yaml               | apply-failed | operation aborted, check logs for detail  |
| nginx-ingress-controller | 1.0-0   | nginx-ingress-controller-manifest  | nginx_ingress_controller_manifest.yaml  | applied      | completed                                 |
| oidc-auth-apps           | 1.0-27  | oidc-auth-manifest                 | manifest.yaml                           | uploaded     | completed                                 |
| platform-integ-apps      | 1.0-9   | platform-integration-manifest      | manifest.yaml                           | apply-failed | operation aborted, check logs for detail  |
+--------------------------+---------+------------------------------------+-----------------------------------------+--------------+-------------------------------------------+
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an Armada apply waiting on a stuck pod; abort all such apps]
system host-unlock controller-1
wait for controller-1 to become unlocked/enabled/available
system application-apply the apply-failed apps manually, as sketched below
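As a literal command sequence (a sketch; the app names besides cert-manager come from the application-list output above, and any method of waiting for controller-1 works):

system application-abort cert-manager            # repeat for every app stuck in applying
system host-unlock controller-1
watch -n 30 system host-list                     # wait until controller-1 is unlocked/enabled/available
system application-apply cert-manager            # repeat for every apply-failed app
system application-apply platform-integ-apps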