Subcloud rehoming fails mid-operation when subcloud has mgmt affecting alarms
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Low | Yuxing |
Bug Description
Brief Description
-----------------
When rehoming a subcloud that has a mgmt-affecting alarm, the rehoming playbook fails while applying the Kubernetes RootCA update strategy, leaving the subcloud unavailable on both system controllers.
Since the RootCA update strategy is expected to fail if any mgmt-affecting alarms are present, the rehoming operation needs to check for mgmt-affecting alarms during its initial validation, before any changes are made to the subcloud.
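The intended pre-check can be sketched as follows. This is an illustrative Python sketch only, not the actual dcmanager implementation: the `Alarm` record, its `mgmt_affecting` flag, and the function names are assumptions modeled on how FM classifies alarms.

```python
# Sketch of the proposed pre-rehome validation: reject the operation up front
# if the subcloud reports any management-affecting alarms, instead of letting
# the kube-rootca update strategy fail mid-rehome. All names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Alarm:
    alarm_id: str
    reason_text: str
    mgmt_affecting: bool  # FM marks alarms that block maintenance operations


class RehomeValidationError(Exception):
    """Raised when validation rejects the rehome before any changes are made."""


def validate_no_mgmt_affecting_alarms(alarms: List[Alarm]) -> None:
    """Fail fast if any active alarm on the subcloud is mgmt-affecting."""
    blocking = [a.alarm_id for a in alarms if a.mgmt_affecting]
    if blocking:
        raise RehomeValidationError(
            "cannot rehome: management-affecting alarms present %s" % blocking
        )


# Example mirroring this bug: the strategy's build-reason was
# "active alarms present [ 260.001 ]", so that alarm is treated as
# mgmt-affecting here (flags below are example values).
alarms = [
    Alarm("200.001", "compute-0 was administratively locked", False),
    Alarm("260.001", "Deployment Manager resource not reconciled", True),
]
try:
    validate_no_mgmt_affecting_alarms(alarms)
except RehomeValidationError as e:
    print(e)
```

With a check like this in the validation step, the subcloud is never left half-rehomed: the operation is refused while both system controllers still have a consistent view of it.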
Severity
----------------
Minor
Steps to Reproduce
------------------
Rehome a subcloud that has a mgmt-affecting alarm raised (e.g. 260.001).
Expected Behavior
-------
Subcloud rehoming should fail during its initial validation, before any changes are made to the subcloud, when a mgmt-affecting alarm is present.
Actual Behavior
-------
The rehoming fails during the Kubernetes RootCA update strategy task, making the subcloud unavailable on both system controllers.
Reproducibility
------------------
100% reproducible.
System Configuration
-------
Distributed cloud
Load info
-------
Master
Last Pass
---------
NA
Timestamp/Logs
-------
From the rehoming playbook log:
TASK [rehome-
Thursday 28 September 2023 20:19:29 +0000 (0:00:00.015) 0:05:57.788 ****
FAILED - RETRYING: Check if strategy is ready to apply (3 retries left).
FAILED - RETRYING: Check if strategy is ready to apply (2 retries left).
FAILED - RETRYING: Check if strategy is ready to apply (1 retries left).
fatal: [subcloud3]: FAILED! => changed=true
attempts: 3
cmd: source /etc/platform/
delta: '0:00:03.274137'
end: '2023-09-28 20:20:45.331956'
rc: 0
start: '2023-09-28 20:20:42.057819'
stderr: ''
stderr_lines: <omitted>
stdout: |-
Strategy Kubernetes RootCA Update Strategy:
strategy-
controlle
storage-
worker-
default-
alarm-
current-
current-
state: build-failed
build-result: failed
build-reason: active alarms present [ 260.001 ]
stdout_lines: <omitted>
Subcloud3 is offline with the rehome-failed deploy status on WRCPENG-DC1:
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
| id | name      | management | availability | deploy status | sync    | backup status    | backup datetime            |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
| 2  | subcloud1 | managed    | online       | complete      | in-sync | complete-central | 2023-09-26 22:09:04.017642 |
| 7  | subcloud3 | unmanaged  | offline      | rehome-failed | unknown | None             | None                       |
| 8  | subcloud4 | managed    | online       | complete      | in-sync | None             | None                       |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
It's also offline on WRCPENG-DC1_1:
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
| id | name      | management | availability | deploy status | sync    | backup status    | backup datetime            |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
| 5  | subcloud3 | managed    | offline      | complete      | unknown | None             | None                       |
| 6  | subcloud4 | unmanaged  | offline      | secondary     | unknown | None             | None                       |
| 8  | subcloud1 | unmanaged  | offline      | secondary     | unknown | None             | None                       |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
The audit.log from its original system controller (WRCPENG-DC1_1) shows SSL errors when it tries to audit the subcloud:
keystoneauth1.
Note: the subcloud is still shown as managed on WRCPENG-DC1_1 because the rehoming occurred while that system controller was powered down.
Alarms
-------
Subcloud3 alarms:
+----------+----------------------------------------------------------------------------------------+-------------------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                                            | Entity ID                     | Severity | Time Stamp                 |
+----------+----------------------------------------------------------------------------------------+-------------------------------+----------+----------------------------+
| 200.007  | compute-0 is reporting a 'critical' out-of-tolerance reading from the 'HpServerPowerS… | host=compute-0.…erSupply      | critical | 2023-09-29T1…271164        |
| 200.001  | compute-0 was administratively locked to take it out-of-service.                       | host=compute-0                | warning  | 2023-09-28T13:54:33.507198 |
| 260.001  | Deployment Manager resource not reconciled: resource=….windriver.…                     | …windriver.com,name=compute-0 | …        | …988189                    |
+----------+----------------------------------------------------------------------------------------+-------------------------------+----------+----------------------------+
Test Activity
-------
Dev testing
Workaround
-------
Clear the mgmt-affecting alarm on the subcloud and retry the rehoming.
Changed in starlingx:
assignee: nobody → Yuxing (yuxing)
importance: Undecided → Low
tags: added: stx.9.0 stx.distcloud
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/897577