Report subcloud rehoming playbook failures as errors in 'dcmanager subcloud errors subcloud#'

Bug #2047645 reported by Fabrizio Perez
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Fabrizio Perez

Bug Description

Brief Description
-----------------

dcmanager can already extract some Ansible execution failures from playbooks and report them to the user as errors when 'dcmanager subcloud errors subcloud#' is run.

Failures in the rehoming playbook are not currently included. The intent of this story is to have Ansible failures in the rehoming playbook listed as well.

For instance:

Ansible playbook error:

[sysadmin@controller-0 dc-config(keystone_admin)]$ tail -n 40 /var/log/dcmanager/ansible/subcloud1_playbook_output.log

TASK [common/recover-subcloud-certificates : Verify if Rest API certificate is expired] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.039) 0:09:16.953 ****
skipping: [subcloud1]

TASK [common/recover-subcloud-certificates : Fail if Rest API or Docker Registry certificates are expired] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.066) 0:09:17.020 ****
fatal: [subcloud1]: FAILED! => changed=false
  msg: |2-
     Docker Registry certificate is expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` or run /usr/share/ansible/stx-ansible/playbooks/migrate_platform_certificates_to_certmanager.yml playbook following the section Migrate Platform Certificates to Use Cert Manager of the docs.

TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.059) 0:09:17.080 ****
skipping: [subcloud1]

PLAY RECAP *********************************************************************
subcloud1 : ok=77 changed=49 unreachable=0 failed=1 skipped=36 rescued=0 ignored=0

Saturday 26 September 2026 22:27:31 +0000 (0:00:00.031) 0:09:17.112 ****
===============================================================================
common/recover-subcloud-certificates : Wait pods to restart (become READY) on controller - 272.85s
common/recover-subcloud-certificates : async_status ------------------- 136.22s
common/recover-subcloud-certificates : Verify if Docker Registry certificate is expired -- 66.29s
common/recover-subcloud-certificates : Recover k8s control plane leaf certificates -- 13.29s
common/recover-subcloud-certificates : Wait till kubectl starts replying -- 11.36s
common/recover-subcloud-certificates : Pause for 10 seconds to wait k8s to start rolling out pods -- 10.05s
common/recover-subcloud-certificates : Verify if HTTPS is enabled ------- 3.96s
common/recover-subcloud-certificates : Trigger a restart of every pod (deployment,statefulset,daemonset rollout) --- 3.69s
common/recover-subcloud-certificates : Delete temporary files on subcloud --- 1.83s
common/recover-subcloud-certificates : Check if controller-1 is online --- 1.75s
common/recover-subcloud-certificates : Check if admin credentials can be sourced --- 1.61s
common/prepare-env : stat ----------------------------------------------- 1.50s
common/recover-subcloud-certificates : Delete temporary files on subcloud --- 1.48s
common/recover-subcloud-certificates : Save certificate signing request to /tmp/ansible.5g_5ms5jtmp_kubelet_conf_csr --- 1.26s
common/recover-subcloud-certificates : Restart kubelet ------------------ 1.13s
common/recover-subcloud-certificates : Verify if Kubernetes Root CA is expired --- 1.03s
common/recover-subcloud-certificates : Save the ICA key to file --------- 1.00s
common/recover-subcloud-certificates : Save current dc-adminep ICA cert to file --- 0.97s
common/recover-subcloud-certificates : Trigger restart of networking pods first to avoid pod scheduling issues --- 0.95s
common/recover-subcloud-certificates : Save the ICA certificate to file --- 0.94s
[sysadmin@controller-0 dc-config(keystone_admin)]$

For a failure like this, the command should show the subcloud error as below:

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud1
FAILED bootstrapping playbook of (subcloud1).
detail: fatal: [subcloud1]: FAILED! => changed=false
  msg: |2-
     Docker Registry certificate is expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` or run /usr/share/ansible/stx-ansible/playbooks/migrate_platform_certificates_to_certmanager.yml playbook following the section Migrate Platform Certificates to Use Cert Manager of the docs.
For bootstrap failures, please delete and re-add the subcloud after the cause of failure has been resolved.
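
To illustrate the kind of extraction involved, here is a minimal Python sketch that pulls each `fatal:` block out of playbook output like the log above. It is only an illustration under stated assumptions, not dcmanager's actual implementation: the function name `extract_fatal_blocks` is hypothetical, and the sketch assumes a failure block starts at a line beginning with `fatal: [` and ends at the next blank line, `TASK [` header, or `PLAY RECAP` line.

```python
def extract_fatal_blocks(log_text):
    """Return every 'fatal:' block found in Ansible playbook output.

    Hypothetical helper for illustration only; it is not taken from
    the dcmanager source code.
    """
    blocks = []
    current = None  # lines of the block being collected, if any
    for line in log_text.splitlines():
        if line.startswith("fatal: ["):
            # A new failure starts here.
            current = [line]
            blocks.append(current)
        elif current is not None:
            # A blank line, a new task header, or the recap ends the block.
            if not line.strip() or line.startswith(("TASK [", "PLAY RECAP")):
                current = None
            else:
                current.append(line)
    return ["\n".join(block) for block in blocks]


# Shortened sample based on the log in this report.
sample = """\
TASK [common/recover-subcloud-certificates : Fail if Rest API or Docker Registry certificates are expired] ***
fatal: [subcloud1]: FAILED! => changed=false
  msg: |2-
     Docker Registry certificate is expired. Manual action required!

TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
skipping: [subcloud1]
"""

failures = extract_fatal_blocks(sample)
print(failures[0])
```

Because a blank line ends a block, a message that itself contains blank lines would be truncated by this sketch; a real implementation would need a more careful end-of-block rule.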

This feature was suggested by Peters, Matt in a demo meeting.

Severity
--------
<Minor: System/Feature is usable with minor issue>

Steps to Reproduce
------------------
Force a failure in subcloud rehoming.
Check the output of 'dcmanager subcloud errors <subcloud>'.
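
When reproducing, it may help to first confirm from the playbook output log that the forced failure actually occurred. The Python sketch below parses a PLAY RECAP line such as the one in the log above and checks the `failed` and `unreachable` counters. The names `parse_recap_line` and `play_failed` are hypothetical, and the regular expression only assumes the `host : key=value ...` layout seen in this report.

```python
import re

# Matches lines like:
#   subcloud1 : ok=77 changed=49 unreachable=0 failed=1 skipped=36 ...
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap_line(line):
    """Return (host, counters dict) for a recap line, or None if no match.

    Hypothetical helper for illustration only.
    """
    match = RECAP_LINE.match(line.strip())
    if match is None:
        return None
    counters = {}
    for item in match.group("counters").split():
        key, value = item.split("=")
        counters[key] = int(value)
    return match.group("host"), counters


def play_failed(counters):
    """Treat any failed task or unreachable host as a failed play."""
    return counters.get("failed", 0) > 0 or counters.get("unreachable", 0) > 0


recap = "subcloud1 : ok=77 changed=49 unreachable=0 failed=1 skipped=36 rescued=0 ignored=0"
host, counters = parse_recap_line(recap)
print(host, play_failed(counters))
```

If the recap reports a failure but 'dcmanager subcloud errors <subcloud>' shows nothing, the behavior described in this report has been reproduced.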

Expected Behavior
------------------
The command displays the error.

Actual Behavior
----------------
The command doesn't display the error.

Reproducibility
---------------
100%

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------
Demo

Workaround
----------

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/904363
Committed: https://opendev.org/starlingx/distcloud/commit/04c8b51b4003986d6b979ccb1b20e62ae4cbd802
Submitter: "Zuul (22348)"
Branch: master

commit 04c8b51b4003986d6b979ccb1b20e62ae4cbd802
Author: fperez <email address hidden>
Date: Tue Dec 26 20:39:49 2023 -0300

    Report rehoming playbook failures

    This commit extends ansible error catching for rehoming
    subcloud operation.

    Test plan:
    PASS: Intentionally force a failure in the rehoming playbook.
          Verify that the error is displayed correctly

    Closes-bug: 2047645

    Change-Id: I4571e04247bdcf273f5de860ae5032597b173ed2
    Signed-off-by: fperez <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Fabrizio Perez (fperezwindriver)