StarlingX

Report subcloud rehoming playbook failures as errors in 'dcmanager subcloud errors subcloud#'

Bug #2047645 reported by Fabrizio Perez on 2023-12-28

This bug affects 1 person

Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Low          Fabrizio Perez

Bug Description

Brief Description
-----------------

dcmanager can already extract some ansible execution failures from playbooks and report them as errors to the user when 'dcmanager subcloud errors subcloud#' is run.

Failures in the rehome playbook are not included at the moment. The intent of this story is to have ansible failures in the rehoming playbook listed as well.

For instance:

Ansible playbook error:

[sysadmin@controller-0 dc-config(keystone_admin)]$ tail -n 40 /var/log/dcmanager/ansible/subcloud1_playbook_output.log

TASK [common/recover-subcloud-certificates : Verify if Rest API certificate is expired] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.039) 0:09:16.953 ****
skipping: [subcloud1]

TASK [common/recover-subcloud-certificates : Fail if Rest API or Docker Registry certificates are expired] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.066) 0:09:17.020 ****
fatal: [subcloud1]: FAILED! => changed=false
msg: |2-
Docker Registry certificate is expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` or run /usr/share/ansible/stx-ansible/playbooks/migrate_platform_certificates_to_certmanager.yml playbook following the section Migrate Platform Certificates to Use Cert Manager of the docs.

TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Saturday 26 September 2026 22:27:31 +0000 (0:00:00.059) 0:09:17.080 ****
skipping: [subcloud1]

PLAY RECAP *********************************************************************
subcloud1 : ok=77 changed=49 unreachable=0 failed=1 skipped=36 rescued=0 ignored=0

Saturday 26 September 2026 22:27:31 +0000 (0:00:00.031) 0:09:17.112 ****
===============================================================================
common/recover-subcloud-certificates : Wait pods to restart (become READY) on controller - 272.85s
common/recover-subcloud-certificates : async_status ------------------- 136.22s
common/recover-subcloud-certificates : Verify if Docker Registry certificate is expired -- 66.29s
common/recover-subcloud-certificates : Recover k8s control plane leaf certificates -- 13.29s
common/recover-subcloud-certificates : Wait till kubectl starts replying -- 11.36s
common/recover-subcloud-certificates : Pause for 10 seconds to wait k8s to start rolling out pods -- 10.05s
common/recover-subcloud-certificates : Verify if HTTPS is enabled ------- 3.96s
common/recover-subcloud-certificates : Trigger a restart of every pod (deployment,statefulset,daemonset rollout) --- 3.69s
common/recover-subcloud-certificates : Delete temporary files on subcloud --- 1.83s
common/recover-subcloud-certificates : Check if controller-1 is online --- 1.75s
common/recover-subcloud-certificates : Check if admin credentials can be sourced --- 1.61s
common/prepare-env : stat ----------------------------------------------- 1.50s
common/recover-subcloud-certificates : Delete temporary files on subcloud --- 1.48s
common/recover-subcloud-certificates : Save certificate signing request to /tmp/ansible.5g_5ms5jtmp_kubelet_conf_csr --- 1.26s
common/recover-subcloud-certificates : Restart kubelet ------------------ 1.13s
common/recover-subcloud-certificates : Verify if Kubernetes Root CA is expired --- 1.03s
common/recover-subcloud-certificates : Save the ICA key to file --------- 1.00s
common/recover-subcloud-certificates : Save current dc-adminep ICA cert to file --- 0.97s
common/recover-subcloud-certificates : Trigger restart of networking pods first to avoid pod scheduling issues --- 0.95s
common/recover-subcloud-certificates : Save the ICA certificate to file --- 0.94s
[sysadmin@controller-0 dc-config(keystone_admin)]$

'dcmanager subcloud errors' should then show the subcloud error as below:

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud1
FAILED bootstrapping playbook of (subcloud1).
detail: fatal: [subcloud1]: FAILED! => changed=false
msg: |2-
Docker Registry certificate is expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` or run /usr/share/ansible/stx-ansible/playbooks/migrate_platform_certificates_to_certmanager.yml playbook following the section Migrate Platform Certificates to Use Cert Manager of the docs.
For bootstrap failures, please delete and re-add the subcloud after the cause of failure has been resolved.
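
The extraction described above can be sketched roughly as follows. This is only an illustration of the idea and not the dcmanager implementation: the function name extract_ansible_failures and the rule "a failure block starts at a 'fatal: [' line and ends at the next blank line or task header" are assumptions made for this example, based on the log excerpt in this report.

```python
import re

# Hypothetical helper, NOT dcmanager code. It assumes the plain-text
# layout seen in the playbook output log quoted in this report.
FATAL_LINE = re.compile(r"^fatal: \[")
SECTION_HEADER = re.compile(r"^(TASK \[|PLAY \[|PLAY RECAP)")


def extract_ansible_failures(log_text):
    """Return the 'fatal:' blocks found in an ansible playbook log.

    Each block starts at a line beginning with 'fatal: [' and runs
    until the next blank line or the next TASK/PLAY header.
    """
    failures = []
    current = None  # lines of the failure block being collected
    for line in log_text.splitlines():
        if FATAL_LINE.match(line):
            # A new failure starts; close any block still open.
            if current is not None:
                failures.append("\n".join(current))
            current = [line]
        elif current is not None:
            if not line.strip() or SECTION_HEADER.match(line):
                # Blank line or next section ends the block.
                failures.append("\n".join(current))
                current = None
            else:
                current.append(line)
    # The log may end while a block is still open.
    if current is not None:
        failures.append("\n".join(current))
    return failures
```

Run against the contents of /var/log/dcmanager/ansible/subcloud1_playbook_output.log, a helper like this would be expected to return the 'fatal: [subcloud1]' block, which is the text the 'detail:' part of the command output should carry.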

This feature was suggested by Peters, Matt in a demo meeting.

Severity
--------
<Minor: System/Feature is usable with minor issue>

Steps to Reproduce
------------------
Force a failure in subcloud rehoming.
Check the output of 'dcmanager subcloud errors <subcloud>'.

Expected Behavior
------------------
The command displays the error.

Actual Behavior
----------------
The command doesn't display the error.

Reproducibility
---------------
100%

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------
Demo

Workaround
----------

Tags:

OpenStack Infra (hudson-openstack) on 2023-12-28

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2024-01-31: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/904363
Committed: https://opendev.org/starlingx/distcloud/commit/04c8b51b4003986d6b979ccb1b20e62ae4cbd802
Submitter: "Zuul (22348)"
Branch: master

commit 04c8b51b4003986d6b979ccb1b20e62ae4cbd802
Author: fperez <email address hidden>
Date: Tue Dec 26 20:39:49 2023 -0300

Report rehoming playbook failures

This commit extends ansible error catching for rehoming
subcloud operation.

    Test plan:
    PASS: Intentionally force a failure in the rehoming playbook.
          Verify that the error is displayed correctly

Closes-bug: 2047645

Change-Id: I4571e04247bdcf273f5de860ae5032597b173ed2
Signed-off-by: fperez <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2024-02-02

Changed in starlingx:
importance:	Undecided → Low
tags:	added: stx.9.0 stx.distcloud
Changed in starlingx:
assignee:	nobody → Fabrizio Perez (fperezwindriver)
