B&R: AIO-DX: apps may be in `apply-failed` after controller-1 boots
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Dan Voiculeasa |
Bug Description
Brief Description
-----------------
During restore of AIO-DX, in some cases apps like cert-manager and/or platform-integ-apps may fail to apply after controller-0 is unlocked. This leads to the apps failing to auto-apply when controller-1 is brought up.
Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>
Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore Controller-0 with wipe_ceph_
Unlock Controller-0.
Some conditions can cause apps to fail to apply (e.g. the docker registry is temporarily unavailable).
Boot controller-1 (the issue occurs after boot and before unlock).
Unlock controller-1
Expected Behavior
------------------
Apps in `applied` state after controller-0 is unlocked and even when controller-1 is booted.
For a restore, apps that depend on controller-1 pods should not attempt to apply until after controller-1 is unlocked.
Actual Behavior
----------------
Auto-apply of the apps times out, which changes the app status to `apply-failed`. The timeout is long, about 1800 seconds.
1) armada fails to apply platform-integ-apps because it cannot take the armada lock, which is held by another app apply already in progress.
2) The armada apply of cert-manager holds the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and will not reach `Ready` until after controller-1 is unlocked.
Stuck pods are those that kubernetes has scheduled on a node other than controller-0.
They become unstuck after the other node (controller-1 in this scenario) is unlocked, once the kubernetes services on that node can communicate with the kubernetes services on controller-0 (I did not dig further to pinpoint the exact service).
3) controller-1 cannot be unlocked while an app is applying. Because of item 2, this adds an unnecessary wait of about 1800 seconds.
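To illustrate the "stuck pod" condition described above, here is a minimal sketch in Python. It is not taken from the StarlingX or armada code: the `Pod` record, the `find_stuck_pods` helper, and the pod names are all hypothetical, and the pod list is hard-coded rather than read from the cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod (not a real Kubernetes object)."""
    name: str
    node: str    # node that kubernetes has recorded the pod as scheduled on
    ready: bool  # whether the pod has reached the Ready state


def find_stuck_pods(pods: List[Pod], active_node: str = "controller-0") -> List[str]:
    """Return names of pods that are not Ready and are scheduled off the active node."""
    return [p.name for p in pods if p.node != active_node and not p.ready]


# Illustrative data only; these pod names are made up.
pods = [
    Pod("cm-example-a", "controller-0", True),
    Pod("cm-example-b", "controller-1", False),
]
print(find_stuck_pods(pods))  # ['cm-example-b']
```

An armada apply that waits on any pod in this list would be expected to block until controller-1 is unlocked or the timeout expires.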
Reproducibility
---------------
100% reproducible
System Configuration
-------
AIO-DX
[I think all configurations other than AIO-SX will hit this issue, but there may be more. Better to track them separately.
Deployment types containing computes need a separate analysis.
For deployment types containing storage nodes the restore procedure is different, so they also need a separate analysis.
]
Branch/Pull Time/Commit
-------
7 Jul
Last Pass
---------
?
Timestamp/Logs
--------------
cert-manager cm-cert-
cert-manager cm-cert-
cert-manager cm-cert-
cert-manager cm-cert-
cert-manager cm-cert-
cert-manager cm-cert-
cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.
2020-07-09 23:25:25.242 68 ERROR armada.
5 platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.
2020-07-09 23:08:07.276 326 DEBUG armada.
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.
| cert-manager | 1.0-5 | cert-manager-
| nginx-ingress-
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-
Test Activity
-------------
Developer Testing
Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]
system host-unlock controller-1
wait for unlocked/
system application-apply <app> [manually, for each app in `apply-failed`]
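The workaround steps above can be sketched as a small Python helper that only builds the ordered list of commands and does not execute anything. The `workaround_commands` function is hypothetical, and it rests on two assumptions: that an app with an apply in progress reports the status `applying`, and that an aborted app has to be re-applied afterwards along with the apps already in `apply-failed`. Waiting for controller-1 to finish unlocking stays a manual step between the unlock and the re-applies.

```python
from typing import Dict, List


def workaround_commands(app_status: Dict[str, str], host: str = "controller-1") -> List[str]:
    """Build the workaround command sequence from a mapping of app name to status."""
    # Assumption: "applying" marks an app whose armada apply is still in progress.
    applying = [app for app, status in app_status.items() if status == "applying"]
    failed = [app for app, status in app_status.items() if status == "apply-failed"]

    # 1) Abort every in-progress apply so the host unlock is no longer blocked.
    commands = [f"system application-abort {app}" for app in applying]
    # 2) Unlock the second controller (the operator then waits for it to come up).
    commands.append(f"system host-unlock {host}")
    # 3) Re-apply the aborted apps and the apps that had already failed.
    commands += [f"system application-apply {app}" for app in applying + failed]
    return commands


# Statuses modelled on the scenario in this report.
status = {
    "cert-manager": "applying",
    "platform-integ-apps": "apply-failed",
    "oidc-auth-apps": "uploaded",
}
for command in workaround_commands(status):
    print(command)
```

For the statuses shown, this prints one abort for cert-manager, the unlock of controller-1, and then an apply for cert-manager and platform-integ-apps; oidc-auth-apps is left alone because it is only `uploaded`.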
Changed in starlingx: | |
assignee: | nobody → Dan Voiculeasa (dvoicule) |
tags: | added: stx.update |
description: | updated |
description: | updated |
summary: |
- B&R: AIO-DX apps in `apply-failed` after controller-1 boots
+ B&R: AIO-DX: apps may be in `apply-failed` after controller-1 boots |
description: | updated |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.5.0 |
Fix proposed to branch: master
Review: https://review.opendev.org/741238