B&R: AIO-DX: apps may be in `apply-failed` after controller-1 boots

Bug #1887648 reported by Dan Voiculeasa
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
During restore of AIO-DX, in some cases apps like cert-manager and/or platform-integ-apps may fail to apply after controller-0 is unlocked. This leads to the apps failing to auto-apply when controller-1 is brought up.

Severity
--------
Provide the severity of the defect.
<Critical: System/Feature is not usable due to the defect>
<Major: System/Feature is usable but degraded>
<Minor: System/Feature is usable with minor issue>

Steps to Reproduce
------------------
Bring up AIO-DX.
Do backup.
Restore controller-0 with wipe_ceph_osds=false.
Unlock controller-0.
Some conditions can lead to apps failing to apply (e.g. the docker registry being temporarily unavailable).
Boot controller-1 (the issue occurs after boot and before unlock).
Unlock controller-1.

Expected Behavior
------------------
Apps should be in the `applied` state after controller-0 is unlocked, and should remain `applied` when controller-1 is booted.
For a restore, apps that depend on controller-1 pods should not attempt to apply until after controller-1 is unlocked.

Actual Behavior
----------------
Auto apply of the apps times out, resulting in the app status changing to `apply-failed`. The timeout is long, ~1800 seconds.

1) armada fails to apply platform-integ-apps because it can't take the armada lock. This happens because another app apply is in progress.

2) The armada apply of cert-manager acquired the armada lock, but it is waiting for a pod to reach the `Ready` state. The pod is stuck and won't become `Ready` until after controller-1 is unlocked.
Stuck pods are those that kubernetes knows are scheduled on a node other than controller-0.
Pods become unstuck after the other node (controller-1 in this scenario) is unlocked, once kubernetes services on that node can communicate with kubernetes services on controller-0 (didn't dig in to pinpoint the exact service). See the readiness-wait sketch after this list.

3) While an app is applying, controller-1 can't be unlocked. Because of item 2, there is an unnecessary ~1800 second wait.
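
For reference, a minimal sketch of the kind of pod-readiness wait involved, using the Python kubernetes client. The helper name, namespace and label selector are illustrative; this is not armada's actual code, only an approximation of why a pod scheduled on a locked/offline node holds the apply until the ~1800 second timeout expires.

    import time
    from kubernetes import client, config

    def wait_for_ready_pods(namespace, label_selector, timeout=1800):
        """Poll until every matching pod reports the Ready condition, or time out."""
        config.load_kube_config()  # or load_incluster_config() when run in a pod
        v1 = client.CoreV1Api()
        deadline = time.time() + timeout
        not_ready = []
        while time.time() < deadline:
            pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
            not_ready = [
                p.metadata.name for p in pods
                if not any(c.type == "Ready" and c.status == "True"
                           for c in (p.status.conditions or []))
            ]
            if not not_ready:
                return
            # A pod scheduled on a node that is still locked/offline
            # (controller-1 here) never becomes Ready, so this loop spins
            # until the timeout expires.
            time.sleep(10)
        raise TimeoutError("pods not ready: %s" % not_ready)

    # Example, matching the logs below:
    # wait_for_ready_pods("cert-manager", "app=cert-manager")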

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-DX

[I think all deployment types other than AIO-SX will hit this issue, but there may be more; better to track them separately.
Deployment types containing compute nodes need a separate analysis.
For deployment types containing storage nodes the restore procedure is different; it also needs a separate analysis.
]

Branch/Pull Time/Commit
-----------------------
7 Jul

Last Pass
---------
?

Timestamp/Logs
--------------
cert-manager cm-cert-manager-856678cfb7-pn84l 1/1 Running 0 77m 172.16.192.108 controller-0 <none> <none>
cert-manager cm-cert-manager-856678cfb7-vqvcw 1/1 Terminating 0 2d4h 172.16.166.141 controller-1 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-cvrgm 1/1 Running 0 77m 172.16.192.105 controller-0 <none> <none>
cert-manager cm-cert-manager-cainjector-85849bd97-q747l 1/1 Terminating 0 2d4h 172.16.166.140 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-lqjls 1/1 Terminating 0 2d4h 172.16.166.142 controller-1 <none> <none>
cert-manager cm-cert-manager-webhook-5745478cbc-v6m54 1/1 Running 0 77m 172.16.192.107 controller-0 <none> <none>

cert-manager apply log:
2020-07-09 23:25:25.241 68 ERROR armada.handlers.wait [-] [chart=cert-manager]: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']
2020-07-09 23:25:25.242 68 ERROR armada.handlers.armada [-] Chart deploy [cert-manager] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=cert-manager, labels=(app=cert-manager)). These pods were not ready=['cm-cert-manager-856678cfb7-vqvcw']

platform-integ-apps logs:
2020-07-09 23:08:07.268 326 WARNING armada.handlers.lock [-] There is already an existing lock: kubernetes.client.rest.ApiException: (409)
2020-07-09 23:08:07.276 326 DEBUG armada.handlers.lock [-] Sleeping before attempting to acquire lock again acquire_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:167
2020-07-09 23:08:12.286 326 ERROR armada.cli [-] Caught unexpected exception: armada.handlers.lock.LockException: Unable to acquire lock before timeout

| application | version | manifest name | manifest file | status | progress |
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | apply-failed | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |

Test Activity
-------------
Developer Testing

Workaround
----------
system application-abort cert-manager [or any app that has an armada apply waiting for a stuck pod; abort all such apps]

system host-unlock controller-1
Wait for controller-1 to become unlocked/enabled/available.
Manually run `system application-apply` for each app left in `apply-failed`.

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741238

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.update
Frank Miller (sensfan22)
description: updated
Frank Miller (sensfan22)
description: updated
Frank Miller (sensfan22)
summary: - B&R: AIO-DX apps in `apply-failed` after controller-1 boots
+ B&R: AIO-DX: apps may be in `apply-failed` after controller-1 boots
Frank Miller (sensfan22)
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/752128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/753310

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/753311

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/753317

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by Dan Voiculeasa (<email address hidden>) on branch: master
Review: https://review.opendev.org/741238
Reason: Using the work from https://review.opendev.org/#/c/753317/
there is no need to delete the `Terminating` pods. The apps will no longer be applied during restore, so armada will not time out waiting for pods stuck in `Terminating`. It is the job of kubernetes to recover the pods when the nodes come up.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/752127
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=025817dfb1be546e798f939ea03ded990dc6a20f
Submitter: Zuul
Branch: master

commit 025817dfb1be546e798f939ea03ded990dc6a20f
Author: Dan Voiculeasa <email address hidden>
Date: Tue Sep 15 20:14:32 2020 +0300

    Synchronize auto applies

    One source for auto apply is an audit periodic thread.
    Another source for auto apply is regenerating hiera data.

    At the moment the auto apply of platform-integ-apps from the periodic
    thread doesn't check if any other app apply is in progress. When the
    armada apply begins it may starve waiting for the armada lock taken by
    other app applies.

    It also does not synchronize with user-executed application apply
    commands.

    Partial-Bug: 1887648
    Change-Id: I4d5b552c4e73f0918bbd46206dd2f8deb5d6039a
    Signed-off-by: Dan Voiculeasa <email address hidden>
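
For illustration only, a rough sketch of the synchronization this commit describes. The helper names and the in-process lock are hypothetical and do not reflect the actual sysinv code; the point is that the periodic audit should skip the auto apply while any other apply could be holding the armada lock.

    import threading

    _auto_apply_lock = threading.Lock()  # hypothetical; stands in for sysinv's own mechanism

    def any_apply_in_progress(apps):
        """True if any application apply is currently running."""
        return any(app.status == "applying" for app in apps)

    def audit_auto_apply(apps, do_apply):
        """Periodic-audit source of auto applies: do nothing while another
        source (user command, hiera regeneration) is already applying, so
        the armada apply does not starve waiting for the armada lock."""
        if not _auto_apply_lock.acquire(blocking=False):
            return
        try:
            if any_apply_in_progress(apps):
                return
            for app in apps:
                if app.name == "platform-integ-apps" and app.status == "uploaded":
                    do_apply(app)
        finally:
            _auto_apply_lock.release()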

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/752128
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=d74b5576d1bbadb1ea1b75d1c74501596f0c9114
Submitter: Zuul
Branch: master

commit d74b5576d1bbadb1ea1b75d1c74501596f0c9114
Author: Dan Voiculeasa <email address hidden>
Date: Wed Sep 16 01:09:18 2020 +0300

    Change auto reapply behavior during restore

    There is a scenario where ceph becomes HEALTH_OK after the `system
    host-unlock controller-1` command and before the controller-1 reboot.
    When this scenario is hit, platform-integ-apps is auto applied.
    The armada apply then fails because controller-1 reboots.
    Four more retries are attempted while controller-1 is down, with no
    success.

    Even though the scenario was observed during the restore procedure, I
    think a lock/unlock sequence or a lock/reboot/unlock sequence of the
    standby controller may also reproduce the issue.

    This commit fixes that scenario by gating the audit function with a
    check. The check waits for the nodes to be stable before attempting to
    do automatic actions on apps.

    Added exception handling when checking for node stability.

    Partial-Bug: 1887648
    Change-Id: I64307f75d3e693dfdb0f15828743d196aaa7ae04
    Signed-off-by: Dan Voiculeasa <email address hidden>
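
A minimal sketch of what such a node-stability gate could look like, assuming the usual StarlingX host state fields; the helper names are hypothetical and this is not the actual conductor code.

    def hosts_are_stable(hosts):
        """Stable only when every host is unlocked / enabled / available."""
        try:
            return all(h.administrative == "unlocked"
                       and h.operational == "enabled"
                       and h.availability == "available"
                       for h in hosts)
        except Exception:
            # Mirrors the added exception handling: if node stability cannot
            # be determined, do nothing this audit cycle.
            return False

    def audit_apps(hosts, apps, auto_action):
        if not hosts_are_stable(hosts):
            return  # e.g. controller-1 is rebooting right after host-unlock
        for app in apps:
            auto_action(app)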

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/755833

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/753310
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=5cb20a610ff2b76f5f196dd93928a9102a1055a1
Submitter: Zuul
Branch: master

commit 5cb20a610ff2b76f5f196dd93928a9102a1055a1
Author: Dan Voiculeasa <email address hidden>
Date: Fri Sep 18 12:59:22 2020 +0300

    Change restore procedure

    During the restore playbook, before the controller-0 unlock, a flag
    file is created to indicate that the system is going through a system
    restore.
    Exiting the system restore state is done through a command after all
    nodes are up and unlocked.
    The next commit will introduce sysinv commands to query and control
    the system restore state.

    Until all nodes are up, pods are stuck in `Terminating`. Armada will
    time out waiting for those pods if an armada apply is requested.
    This commit ensures auto-apply of apps does not occur during the
    system restore.

    While in the restore state, apps are still allowed to have their
    images downloaded. If an image download fails, the status of the app
    is reverted to APP_RESTORE_REQUESTED instead of APP_APPLY_FAILURE.
    The auto image download is retried for apps in APP_RESTORE_REQUESTED
    until the system restore state is exited.
    This leaves enough time for manual intervention to fix networking,
    docker registry connectivity or any other issues related to container
    images.

    Note: In the case of multi-node setups helm overrides may have been
    detected, so apps will be auto-applied after exiting the restore
    state. The auto apply is started by a periodic audit thread.

    Change-Id: I44fc4aaa528e372a84115714f271b4f5e063f86e
    Partial-Bug: 1887648
    Signed-off-by: Dan Voiculeasa <email address hidden>
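
A condensed sketch of the status handling described above. The flag-file path and status strings are assumptions made for illustration and may not match the real sysinv constants or implementation.

    import os

    RESTORE_FLAG = "/opt/platform/.restore_in_progress"  # hypothetical path
    APP_RESTORE_REQUESTED = "restore-requested"           # assumed value
    APP_APPLY_FAILURE = "apply-failed"
    APP_UPLOADED = "uploaded"

    def restore_in_progress():
        return os.path.exists(RESTORE_FLAG)

    def download_app_images(app, download_images):
        """During a restore, a failed image download keeps the app in a
        retryable state instead of failing it outright; the auto apply
        itself is deferred until `system restore-complete`."""
        try:
            download_images(app)
            app.status = APP_UPLOADED
        except Exception:
            if restore_in_progress():
                # Leaves time for manual fixes (networking, docker registry
                # connectivity); the periodic audit retries the download.
                app.status = APP_RESTORE_REQUESTED
            else:
                app.status = APP_APPLY_FAILURE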

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/753311
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=99616a7f72125c6fd5c86e5dd90af3d87ef3902c
Submitter: Zuul
Branch: master

commit 99616a7f72125c6fd5c86e5dd90af3d87ef3902c
Author: Dan Voiculeasa <email address hidden>
Date: Mon Sep 21 13:49:18 2020 +0300

    Introduce CLI commands for system restore control

    Introduce three new sysinv commands: restore-show, restore-start,
    restore-complete.

    When doing a restore, the system is put into a restore state
    automatically by the restore playbook. After all the nodes are up
    and unlocked the user must run `system restore-complete` to exit
    the system restore state.

    Note: In the case of multi-node setups helm overrides may have been
    detected, so apps will be auto-applied after exiting the restore
    state. The auto apply is started by a periodic audit thread.

    Depends-On: I44fc4aaa528e372a84115714f271b4f5e063f86e
    Partial-Bug: 1887648
    Change-Id: I7b7fab99d457056032dbbd612363cd5036736cda
    Signed-off-by: Dan Voiculeasa <email address hidden>
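
As a rough sketch of the semantics only (the flag-file path is an assumption; the real commands are implemented in sysinv and its client), the three commands map onto a simple restore-state flag:

    import os

    RESTORE_FLAG = "/opt/platform/.restore_in_progress"  # hypothetical path

    def restore_show():
        """`system restore-show`: report whether a restore is in progress."""
        return os.path.exists(RESTORE_FLAG)

    def restore_start():
        """`system restore-start`: normally done by the restore playbook."""
        open(RESTORE_FLAG, "w").close()

    def restore_complete():
        """`system restore-complete`: run after all nodes are up and unlocked;
        clearing the state lets the periodic audit auto-apply any apps with
        detected helm overrides."""
        if os.path.exists(RESTORE_FLAG):
            os.remove(RESTORE_FLAG)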

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/753317
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=b439b862bb374677729e3e8320499d8aa45dc93c
Submitter: Zuul
Branch: master

commit b439b862bb374677729e3e8320499d8aa45dc93c
Author: Dan Voiculeasa <email address hidden>
Date: Tue Sep 22 16:37:39 2020 +0300

    Change restore procedure

    Do not force platform-integ-apps to the uploaded state.
    The implication of this change is that if the backup was taken when
    platform-integ-apps was not in the applied state, then a manual apply
    of platform-integ-apps is needed at the end of the restore procedure.
    Otherwise the restore procedure takes care of it.

    As one of the final steps of the restore playbook, place the system in
    the restore state.

    This commit is the last one in a series of fixes for the automatic
    applies of apps and the restore procedure, and keeps track of the
    other commits: CLI commands to control the restore state were
    provided, and the periodic thread was changed to support the new
    restore procedure.

    Depends-On: I7b7fab99d457056032dbbd612363cd5036736cda
    Depends-On: I44fc4aaa528e372a84115714f271b4f5e063f86e
    Depends-On: I64307f75d3e693dfdb0f15828743d196aaa7ae04
    Depends-On: I4d5b552c4e73f0918bbd46206dd2f8deb5d6039a

    Closes-Bug: 1887648
    Change-Id: I8b978a341a388ffcdd154d39dc52b93e09658ef0
    Signed-off-by: Dan Voiculeasa <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
