DC installation failed at "Install dc root CA" on system controller

Bug #1881606 reported by Difu Hu
This bug affects 2 people
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Sabeel Ansari

Bug Description

Brief Description
-----------------
During DC installation, running Ansible bootstrap playbook failed
at "Install dc root CA".

Severity
--------
Major

Steps to Reproduce
------------------
Install controller-0
Run Ansible bootstrap playbook

Expected Behavior
------------------
Ansible bootstrap playbook completes without failure

Actual Behavior
----------------
Ansible bootstrap playbook failed
(Ansible replay passed without reinstalling)

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC system
Lab-name: WCP_80_91 (DC-1)

Branch/Pull Time/Commit
-----------------------
2020-05-29_20-00-00

Last Pass
---------
DC-1: 2020-05-22_20-00-00

Timestamp/Logs
--------------
bootstrap/bringup-bootstrap-applications : Install dc root CA
"stderr": "Error from server (InternalError): error when creating
 \"/etc/kubernetes/dc-ca.yaml\": Internal error occurred:
 failed calling webhook creating
\"/tmp/adminep/setup-sc-adminep-certs.yaml\":
Internal error occurred: failed calling webhook \"webhook.cert-manager.io\"

Test Activity
-------------
DC installation

Difu Hu (difuhu) wrote :
Yang Liu (yliu12)
description: updated
Difu Hu (difuhu)
description: updated
Ghada Khalil (gkhalil) wrote :

As per Yang Liu (test PL), this is seen with a frequency of 10%.
A similar failure is also reported on the system controller. See https://bugs.launchpad.net/starlingx/+bug/1880574

Marking for stx.4.0. This needs further investigation because it causes bootstrap to fail.

tags: added: stx.4.0 stx.apps stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Sabeel Ansari (sansariwr)
tags: added: stx.distcloud
Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
summary: - DC installation failed at "Install dc root CA"
+ DC installation failed at "Install dc root CA" on system controller
Ghada Khalil (gkhalil) wrote :

As discussed with Sabeel Ansari, this is the same underlying issue reported in: https://bugs.launchpad.net/starlingx/+bug/1880574

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736726

Ghada Khalil (gkhalil) wrote :

The issue has not been reproduced in many attempts. The proposed code change adds a preventative check that cert-manager is fully up: during bootstrap it sends a test command to cert-manager and repeats it until the command succeeds. This code will require more testing and soak time, so the proposal is to land it early in stx.5.0.

tags: added: stx.5.0
removed: stx.4.0
Ghada Khalil (gkhalil)
tags: added: stx.security
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/736726
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=f7e42ff787328e4b4525b6f7a393d48b315a6bee
Submitter: Zuul
Branch: master

commit f7e42ff787328e4b4525b6f7a393d48b315a6bee
Author: Sabeel Ansari <email address hidden>
Date: Thu Jun 18 09:54:37 2020 -0400

    Add test for cert-manager resource readiness

    In some cases, DC installations were trying to create
    cert-manager resources before app was ready. This adds a
    check to ensure cert-manager is applied and a test issuer
    can be created before moving on to next phase.

    While Armada claims all pods are ready (the readiness check
    for the pods transitions to Ready state), there are some aspects
    of the app that aren't ready to use. An issue was raised in
    upstream community, and it was confirmed that this type of
    indirect check was the only way to confirm readiness.
    See GitHub Issue here: https://github.com/jetstack/cert-manager/issues/3045

    Testing was performed with installs on AIO, standard & DC
    systems. B&R testing was also performed.

    Closes-bug: 1881606
    Closes-bug: 1880574

    Change-Id: Iaf84b43bb4ce20476a9bc66b4ad7ced21753e0ff
    Signed-off-by: Sabeel Ansari <email address hidden>
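
The indirect readiness check described in this commit message can be sketched in Python as a retry loop around a probe. This is only an illustration and not the code from the review: the actual change lives in the Ansible playbooks, and `wait_until_ready`, `kubectl_apply_probe`, the manifest path, and the retry and delay values are all hypothetical names and numbers chosen for the example.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     retries: int = 10,
                     delay: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() until it returns True and return the attempts used.

    Raises TimeoutError if the probe never succeeds within `retries` tries.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        # Wait before the next try, but not after the final failure.
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(f"probe still failing after {retries} attempts")


def kubectl_apply_probe(manifest_path: str) -> bool:
    """Hypothetical probe: try to create a test resource with kubectl.

    `kubectl apply -f` exits non-zero while the cert-manager webhook is
    still rejecting requests, so the exit code serves as the signal.
    """
    result = subprocess.run(
        ["kubectl", "apply", "-f", manifest_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Assumed usage (the manifest path is made up for this sketch):
# wait_until_ready(lambda: kubectl_apply_probe("/tmp/test-issuer.yaml"))
```

The point of probing with a real resource is the one made in the commit message: pod readiness alone does not show that the webhook will accept requests, so the sketch only treats a successful create as proof of readiness.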

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Changing back to stx.4.0 since the fix made it into that release.

tags: added: stx.4.0
removed: stx.5.0
Difu Hu (difuhu) wrote :

A similar issue was reproduced on DC-4 subcloud11 with build 2020-07-14_20-00-00.

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil) wrote :

Moving to stx.5.0 as the issue is intermittent and a retry usually passes.

tags: added: stx.5.0
removed: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749139

Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/749139
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=0db7c1f9426240e51c7ff20c429f53647b32e4f8
Submitter: Zuul
Branch: master

commit 0db7c1f9426240e51c7ff20c429f53647b32e4f8
Author: Sabeel Ansari <email address hidden>
Date: Mon Aug 31 15:56:18 2020 -0400

    Add retries during subcloud certificate setup

    During DC installation, some installations failed due to
    cert-manager not being available. Retries are being added
    during the installation procedure and detailed system info
    data capture in case of further failures.

    Closes-bug: 1881606

    Change-Id: I073d1d8547e6a2a51655efe4f4d69ba92a87cb9a
    Signed-off-by: Sabeel Ansari <email address hidden>
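
The retry-plus-diagnostics approach described in this commit message might look roughly like the following Python sketch. It is not the merged playbook change: `run_with_retries`, the `collect_diagnostics` hook, and the attempt and delay values are hypothetical, and the real fix is implemented as Ansible tasks.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T],
                     attempts: int = 3,
                     delay: float = 10.0,
                     collect_diagnostics: Optional[Callable[[List[str]], None]] = None,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run step() up to `attempts` times and return its first successful result.

    If every attempt raises, pass the recorded error messages to
    collect_diagnostics (when provided) and then raise RuntimeError.
    """
    errors: List[str] = []
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            errors.append(f"attempt {attempt}: {exc}")
            # Pause between attempts, but not after the last one.
            if attempt < attempts:
                sleep(delay)
    # All attempts failed: capture system state before giving up.
    if collect_diagnostics is not None:
        collect_diagnostics(errors)
    raise RuntimeError("; ".join(errors))
```

Capturing diagnostics only after the final failure keeps successful retries quiet while still leaving data behind for the intermittent cases that remain.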

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded