Controller-1 fails to come online during installation

Bug #2070487 reported by Bala Shankar MV
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Fabrizio Perez

Bug Description

Brief Description
-----------------
Controller-1 does not come online, and the attempt to reinstall it fails.

Severity
--------
Major.

Steps to Reproduce
------------------
Trigger the installation of the lab.

Expected Behavior
-----------------
All nodes become unlocked, enabled, and available.

Actual Behavior
---------------
controller-1 is locked, disabled and offline

Reproducibility
---------------
Reproducible. It is failing to reinstall controller-1.

Load info
---------
20240621T060058Z

[sysadmin@controller-0 ~(keystone_admin)]$ tail -f /var/log/mtcAgent.log
2024-06-18T19:25:48.720 fmAlarmUtils.cpp(657): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2024-06-18T19:25:58.682 [18393.00676] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4701) reinstall_handler :Error : controller-1 Reinstall wipedisk ACK timeout
2024-06-18T19:25:58.682 [18393.00677] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed (seq:203)
2024-06-18T19:25:58.688 fmAPI.cpp(501): Enqueue raise alarm request: UUID (de6aecdb-b4db-46a7-8db7-4dd260446b60) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2024-06-18T19:25:58.733 fmAlarmUtils.cpp(623): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=controller-1.status=reinstall-failed)
2024-06-18T19:25:58.776 fmAlarmUtils.cpp(657): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=controller-1.status=reinstall-failed)
2024-06-18T19:26:28.693 [18393.00678] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4810) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure
2024-06-18T19:26:28.693 [18393.00679] controller-0 mtcAgent inv mtcInvApi.cpp ( 434) mtcInvApi_force_task : Info : controller-1 task clear (seq:205) (was:Reinstall Failed)
2024-06-18T19:35:22.528 [18393.00680] controller-0 mtcAgent tok tokenUtil.cpp ( 720) tokenUtil_new_token : Info : controller-0 Requesting Authentication Token
2024-06-18T19:35:22.988 [18393.00681] controller-0 mtcAgent tok tokenUtil.cpp ( 357) tokenUtil_log_refresh : Info : Token Refresh: [e4937711b911aaa094a3efbecc1290bd] [Expiry: 2024-06-18 20:35:22]
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | locked         | disabled    | offline      |
+----+--------------+-------------+----------------+-------------+--------------+
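
For anyone re-running the sanity check, the comparison between the expected state (unlocked, enabled, available) and the `system host-list` output above can be scripted. The following is only a minimal sketch: it assumes the table text has already been captured as a string, and the helper names `parse_host_list` and `unhealthy_hosts` are hypothetical, not part of any StarlingX tool.

```python
# Hypothetical helper names; this only parses text shaped like the
# `system host-list` table shown in this report.

EXPECTED = {
    "administrative": "unlocked",
    "operational": "enabled",
    "availability": "available",
}


def parse_host_list(table_text):
    """Turn the ASCII table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip "+----+" border lines and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first "|" row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_hosts(rows):
    """Return hostnames whose state differs from EXPECTED in any column."""
    return [
        row["hostname"]
        for row in rows
        if any(row.get(key) != value for key, value in EXPECTED.items())
    ]


SAMPLE = """
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | locked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+
"""

if __name__ == "__main__":
    print(unhealthy_hosts(parse_host_list(SAMPLE)))  # prints ['controller-1']
```

With the table from this report as input, only controller-1 is flagged, which matches the Actual Behavior section.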

System Configuration
--------------------
AIO-DX

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.10.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Fabrizio Perez (fperezwindriver)
tags: added: stx.config
Bala Shankar MV (bshankar) wrote :

Issue was seen in stx load 20240630T060059Z

Refer to the attachment: https://bugs.launchpad.net/starlingx/+bug/2070487/+attachment/5794164/+files/20240630T060059Z.tar

Ghada Khalil (gkhalil) wrote :

A code change was merged on July 2 to mitigate this sanity failure. However, the root cause is still not understood.

LP: https://bugs.launchpad.net/starlingx/+bug/2071619
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/923161

Waiting for a sanity re-run

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/923956
Committed: https://opendev.org/starlingx/config/commit/6014f5c46e415f6fd65600cd8f4b7f92004e01a5
Submitter: "Zuul (22348)"
Branch: master

commit 6014f5c46e415f6fd65600cd8f4b7f92004e01a5
Author: fperez <email address hidden>
Date: Thu Jul 11 10:53:45 2024 -0300

    Prevent incorrect runtime entries creation

    Enhance _create_runtime_config_entries method with additional checks
    to prevent incorrect configurations from being prematurely created in
    the database.

    Test plan:
             PASS:
              - Deploy Dx system. Verify that bootstrap is
                completed successfully. Both controllers were
                unlocked, enabled and available.
                No presence of config out of date alarm (250.001)

    partial-bug: 2070487

    Change-Id: Iddc4996243cfac66f4be692cbb74ea0d8026ef36
    Signed-off-by: fperez <email address hidden>

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/starlingx/config/+/923682
Committed: https://opendev.org/starlingx/config/commit/b84ed695f86c08b314be55ca38dc78538d0d9ef7
Submitter: "Zuul (22348)"
Branch: master

commit b84ed695f86c08b314be55ca38dc78538d0d9ef7
Author: fperez <email address hidden>
Date: Mon Jul 8 20:26:16 2024 -0300

    Fix wrong hieradata generation

    Some configs containing the force flag could generate incorrect
    hieradata files for unintended hosts. This commit introduces additional
    verification to prevent this issue, mainly by checking key
    configurations that should be present at the moment hieradata is
    generated.
    In addition, this commit introduces some refactoring to better handle
    configurations with the force flag.

    Test plan:
              PASS:
              Install DX and verify that the system is healthy.
              Ensure no presence of alarms. Verify that all hosts are in
              the status: 'unlocked | enabled | available'.
              Check sysinv.log to confirm that this verification is
              preventing the incorrect generation of data.

              PASS:
              Install Dx + compute(s) and verify that the puppet
              config is not being created in controller-0 at an
              inappropriate time.
              Ensure No presence of alarms and verify all hosts are in
              status 'unlocked | enabled | available'.

              PASS:
              Check collected ssl_ca certificates installed for the compute
              node and the active controller. Compare and verify that both
              are equal.

    Partial-Bug: 2070487

    Signed-off-by: fperez <email address hidden>
    Change-Id: Ic49252b5c08e0d7e2a1212a52d8949a05bd55927

Changed in starlingx:
status: In Progress → Fix Released