Subcloud rehoming fails mid-operation when subcloud has mgmt affecting alarms

Bug #2038677 reported by Yuxing
Affects:     StarlingX
Status:      Fix Released
Importance:  Low
Assigned to: Yuxing

Bug Description

Brief Description
-----------------
When rehoming a subcloud that has a mgmt affecting alarm, the rehoming playbook fails when applying the Kubernetes RootCA update strategy, leaving the subcloud unavailable on both system controllers.

Since the rootca update strategy is expected to fail if any mgmt affecting alarms are present, the rehoming operation needs to check for mgmt affecting alarms during the 'validate-before-rehoming' role, so that it fails before rehoming is initiated.
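
For illustration, a minimal sketch of what such a pre-check could look like as tasks in the 'validate-before-rehoming' role (this is not the merged change; the task names, the fm '--mgmt_affecting' output format and the failure condition are assumptions):

# Hypothetical sketch only, not the actual fix.
- name: Check for management affecting alarms on the subcloud
  shell: source /etc/platform/openrc; fm alarm-list --nowrap --mgmt_affecting
  register: mgmt_alarm_check
  changed_when: false

- name: Fail before any rehoming changes are made to the subcloud
  fail:
    msg: >-
      Management affecting alarm(s) present on the subcloud; clear them
      and retry the rehoming.
  # Assumes the --mgmt_affecting column prints 'True' for such alarms.
  when: "'True' in mgmt_alarm_check.stdout"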

Severity
----------------
Minor

Steps to Reproduce
------------------
Rehome a subcloud that has a mgmt affecting alarm (e.g. 260.001)
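
For context, rehoming is typically initiated from the new system controller along the following lines (an illustrative sketch; the exact options and values are not taken from this report):

# On the new system controller (placeholder values; other required options
# such as credentials are omitted):
dcmanager subcloud add --migrate \
  --bootstrap-address <subcloud-bootstrap-ip> \
  --bootstrap-values <subcloud-bootstrap-values.yaml>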

Expected Behavior
----------------------
Subcloud rehoming should fail during the 'validate-before-rehoming' role, before the admin endpoint cert is updated on the subcloud. This way the subcloud would still be available to the original system controller.

Actual Behavior
---------------------
The rehoming fails during the Kubernetes RootCA update strategy task, making the subcloud unavailable on both system controllers.

Reproducibility
------------------
100% reproducible.

System Configuration
----------------------
Distributed cloud

Load info (eg: 2022-03-10_20-00-07)
---------------------------------------------
Master

Last Pass
---------
NA

Timestamp/Logs
---------------------

From the rehoming playbook log:
----------------------------------
TASK [rehome-subcloud/update-k8s-root-ca : Check if strategy is ready to apply] ***
Thursday 28 September 2023  20:19:29 +0000 (0:00:00.015)       0:05:57.788 ****
FAILED - RETRYING: Check if strategy is ready to apply (3 retries left).
FAILED - RETRYING: Check if strategy is ready to apply (2 retries left).
FAILED - RETRYING: Check if strategy is ready to apply (1 retries left).
fatal: [subcloud3]: FAILED! => changed=true
  attempts: 3
  cmd: source /etc/platform/openrc; sw-manager kube-rootca-update-strategy show
  delta: '0:00:03.274137'
  end: '2023-09-28 20:20:45.331956'
  rc: 0
  start: '2023-09-28 20:20:42.057819'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    Strategy Kubernetes RootCA Update Strategy:
      strategy-uuid:                          67f13dd8-b024-465f-97db-c74be0f790c1
      controller-apply-type:                  serial
      storage-apply-type:                     serial
      worker-apply-type:                      serial
      default-instance-action:                stop-start
      alarm-restrictions:                     relaxed
      current-phase:                          build
      current-phase-completion:               100%
      state:                                  build-failed
      build-result:                           failed
      build-reason:                           active alarms present [ 260.001 ]
  stdout_lines: <omitted>

Subcloud3 is offline and has the rehome-failed deploy status on WRCPENG-DC1:

──0 "wrcpeng-dc1-floating"──────────────────────────────────────────────────────────────────────────────────────────────────
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
| id | name      | management | availability | deploy status | sync    | backup status    | backup datetime            |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+
|  2 | subcloud1 | managed    | online       | complete      | in-sync | complete-central | 2023-09-26 22:09:04.017642 |
|  7 | subcloud3 | unmanaged  | offline      | rehome-failed | unknown | None             | None                       |
|  8 | subcloud4 | managed    | online       | complete      | in-sync | None             | None                       |
+----+-----------+------------+--------------+---------------+---------+------------------+----------------------------+

It's also offline on WRCPENG-DC1_1:

──1 "wrcpeng-dc1-1-floating"──────────────────────────────────────────────────────────────────────────────────────────────────
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list --all
+----+-----------+------------+--------------+---------------+---------+---------------+-----------------+
| id | name      | management | availability | deploy status | sync    | backup status | backup datetime |
+----+-----------+------------+--------------+---------------+---------+---------------+-----------------+
|  5 | subcloud3 | managed    | offline      | complete      | unknown | None          | None            |
|  6 | subcloud4 | unmanaged  | offline      | secondary     | unknown | None          | None            |
|  8 | subcloud1 | unmanaged  | offline      | secondary     | unknown | None          | None            |
+----+-----------+------------+--------------+---------------+---------+---------------+-----------------+

The audit.log from its original system controller (WRCPENG-DC1_1) shows SSL errors when it tries to audit the subcloud, consistent with the subcloud's admin endpoint certificate having already been updated during the failed rehoming:

keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://[fd00:8:55::2]:5001/v3/auth/tokens: HTTPSConnectionPool(host='fd00:8:55::2', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))

Note: the subcloud is still shown as managed on WRCPENG-DC1_1 because the rehoming was performed while that system controller was powered down.

Alarms
---------------------

Subcloud3 alarms:

+----------+---------------------------------------------------------------------+--------------------+----------+--------------+
| Alarm ID | Reason Text                                                         | Entity ID          | Severity | Time Stamp   |
+----------+---------------------------------------------------------------------+--------------------+----------+--------------+
| 200.007  | compute-0 is reporting a 'critical' out-of-tolerance reading from   | host=compute-0.    | critical | 2023-09-29T1 |
|          | the 'HpServerPowerSupply' sensor                                    | sensor=HpServerPow |          | 7:31:57.     |
|          |                                                                     | erSupply           |          | 271164       |
|          |                                                                     |                    |          |              |
| 200.001  | compute-0 was administratively locked to take it out-of-service.    | host=compute-0     | warning  | 2023-09-28T1 |
|          |                                                                     |                    |          | 3:54:33.     |
|          |                                                                     |                    |          | 507198       |
|          |                                                                     |                    |          |              |
| 260.001  | Deployment Manager resource not reconciled: resource=host.starlingx | resource=host.     | major    | 2023-09-28T1 |
|          | .windriver.com,name=compute-0                                       | starlingx.         |          | 3:54:18.     |
|          |                                                                     | windriver.com,name |          | 988189       |
|          |                                                                     | =compute-0         |          |              |
|          |                                                                     |                    |          |              |
+----------+---------------------------------------------------------------------+--------------------+----------+--------------+

Test Activity
---------------------
Dev testing

Workaround
---------------------
Clear the alarm and try again
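
For example, the blocking alarms can be confirmed on the subcloud before retrying (illustrative commands, assuming the fm CLI's --mgmt_affecting option):

# On the subcloud, list active alarms with the management-affecting column:
source /etc/platform/openrc
fm alarm-list --nowrap --mgmt_affecting

# Resolve the underlying condition (e.g. allow the Deployment Manager to
# reconcile compute-0 for alarm 260.001), confirm the alarm has cleared,
# then retry the rehoming from the new system controller.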

Yuxing (yuxing)
Changed in starlingx:
assignee: nobody → Yuxing (yuxing)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/897577
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/bf76c80da3c536f3eb7cf0268635287b34ab1a23
Submitter: "Zuul (22348)"
Branch: master

commit bf76c80da3c536f3eb7cf0268635287b34ab1a23
Author: Yuxing Jiang <email address hidden>
Date: Fri Oct 6 12:41:35 2023 -0400

    Disallow rehoming if there is a management affecting alarm

    After introducing the kube rootca update as a task in subcloud
    rehoming, the rehoming should fail during validation, before the
    actual process starts, to satisfy the requirements of the kube
    rootca update strategy.

    This commit adds a pre-check for mgmt affecting alarms in the
    subcloud before the actual operation in the subcloud rehoming
    playbook, and removes the controller-1 online check and the
    250.001 alarm check, because both are covered by the mgmt
    affecting alarm check.

    Test plan:
    1. Rehome an SX/DX subcloud without mgmt affecting alarms; rehoming
    completed successfully - passed.
    2. Rehome a DX subcloud with controller-1 powered off; rehoming
    failed as expected due to the mgmt affecting alarm raised for the
    missing standby services, and the error message appears in the
    ansible log - passed.
    3. Introduce mgmt affecting alarms [250.001 and 260.001] in the
    subcloud; rehoming failed as expected with the error message in the
    ansible log - passed.

    Closes-bug: 2038677
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I7a8fc37a189be1290a024b0127092d7ed33da484

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.distcloud