Unexpected reboot while running lock command on worker nodes

Bug #2058426 reported by Reinildes Oliveira
This bug affects 1 person
Affects    Status        Importance  Assigned to         Milestone
StarlingX  Fix Released  Medium      Reinildes Oliveira

Bug Description

Brief Description
------------------------------------------------

Unexpected reboot while running lock command on worker nodes

Scenario:

    Subcloud rehoming testing in cert recovery scenario

        cmd: dcmanager subcloud add --migrate --bootstrap-address 2620:10a:a001:d49::12 --bootstrap-values subcloud-3/subcloud3_ipv6-bootstrap-values.yaml --sysadmin-password Li69nux*
    Subcloud type: AIO-DX+

Error condition:

    Have an AIO-DX+ subcloud ready to be rehomed
        Controller-1 was offline and compute-0 was unlocked and online
    Trigger rehome playbook
        The rehome playbook has a task to lock nodes and later another task to unlock these nodes
    Lock task:
        TASK [common/recover-subcloud-certificates : Lock standby controller and compute nodes] ***
        Tuesday 27 February 2035 22:43:30 +0000 (0:00:00.026) 0:01:54.421 ******
        changed: [subcloud3] => (item=controller-1)
        changed: [subcloud3] => (item=compute-0)

    Unlock task: (FAILED)
        TASK [common/recover-subcloud-certificates : Unlock standby controller and compute nodes] ***
        stderr: Host compute-0 is not online. Wait for 'online' availability status and then 'unlock'
        end: '2035-02-27 22:52:29.201927'

    Compute-0 mtc logs:
        compute-0:
        2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg mtcCompMsg.cpp ( 322) mtc_service_command : Info : reboot command received (Mgmnt)
        ++
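
The timing in the logs above can be checked with a short script. The sketch below, in Python, extracts the timestamp from the compute-0 mtcClient line and measures how long after the lock task's printed timestamp the reboot command arrived. The regular expression is an assumption inferred from the single log line in this report, not a documented mtcClient log format, and both timestamps are assumed to be in the same time zone. The function names are hypothetical.

```python
import re
from datetime import datetime

# Timestamp Ansible printed with the lock task (taken from the task output).
LOCK_TASK_START = datetime(2035, 2, 27, 22, 43, 30)

# mtcClient log line from compute-0, copied from this report.
MTC_LINE = (
    "2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg "
    "mtcCompMsg.cpp ( 322) mtc_service_command : Info : "
    "reboot command received (Mgmnt)"
)

# Assumed pattern: a timestamp at the start of the line, followed somewhere
# later by the literal text "reboot command received".
REBOOT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}) "
    r".*reboot command received"
)


def reboot_time(line):
    """Return the timestamp of a reboot-command log line, or None."""
    match = REBOOT_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")


def seconds_after_lock(line, lock_start=LOCK_TASK_START):
    """Seconds between the lock task timestamp and the reboot command."""
    ts = reboot_time(line)
    if ts is None:
        return None
    return (ts - lock_start).total_seconds()
```

With the two values from this report, seconds_after_lock(MTC_LINE) evaluates to 29.299, so the reboot command reached compute-0 about 29 seconds after the lock task's timestamp.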

Severity
------------------------------------------------

<Major: System/Feature is usable but degraded>

Note: The condition described above caused the rehome playbook to fail.

Steps to Reproduce
------------------------------------------------

Described above

Expected Behavior
------------------------------------------------

The lock command should not trigger a reboot of the host.

Actual Behavior
------------------------------------------------

The lock command triggered a reboot of compute-0.

Reproducibility
------------------------------------------------
Seen once

I'll check the reproducibility rate

System Configuration
------------------------------------------------

DC

sysadmin@controller-0:~$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"

Last Pass
------------------------------------------------
N/A

Timestamp/Logs
------------------------------------------------
N/A

Alarms
------------------------------------------------

fm alarm-list --mgmt_affecting
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| Alarm | Reason Text                                                            | Entity ID                             | Severity | Management Affecting | Time Stamp     |
| ID    |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| 200.  | compute-0 is reporting a 'critical' out-of-tolerance reading from the  | host=compute-0.sensor=                | critical | False                | 2035-02-28T13: |
| 007   | 'HpServerPowerSupply' sensor                                           | HpServerPowerSupply                   |          |                      | 29:53.365641   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 'ntp' process has failed. Manual recovery is required.       | host=compute-0.process=ntp            | minor    | True                 | 2035-02-27T23: |
| 006   |                                                                        |                                       |          |                      | 02:48.431900   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 was administratively locked to take it out-of-service.       | host=compute-0                        | warning  | True                 | 2035-02-27T22: |
| 001   |                                                                        |                                       |          |                      | 44:09.006717   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | controller-0 'ntp' process has failed. Manual recovery is required.    | host=controller-0.process=ntp         | minor    | True                 | 2035-02-27T22: |
| 006   |                                                                        |                                       |          |                      | 42:52.516070   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0                     | major    | False                | 2035-02-27T22: |
| 114   |                                                                        |                                       |          |                      | 20:38.679981   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:a103::2 is not a valid or a reachable NTP    | host=controller-0=2620:10a:a001:a103: | minor    | False                | 2035-02-27T22: |
| 114   | server.                                                                | :2                                    |          |                      | 20:38.581092   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:d43:3efd:feff:feb2:4362 is not a valid or a  | host=controller-0=2620:10a:a001:d43:  | minor    | False                | 2035-02-27T22: |
| 114   | reachable NTP server.                                                  | 3efd:feff:feb2:4362                   |          |                      | 20:38.561503   |
|       |                                                                        |                                       |          |                      |                |
| 500.  | Certificate 'system certificate-show aa41aa65-c502-499b-               | system.certificate.mode=ssl_ca.uuid=  | critical | False                | 2035-02-27T19: |
| 210   | b4a1-707110e07d56' (mode=ssl_ca) expired.                              | aa41aa65-c502-499b-b4a1-707110e07d56  |          |                      | 27:06.540978   |
|       |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+

Test Activity
------------------------------------------------
Regression Testing

Workaround
------------------------------------------------
Re-try rehome playbook.

Changed in starlingx:
assignee: nobody → Reinildes Oliveira (rjosemat)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913725
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/d441c0afc7ae8de7f853469719fc85c7d4697bea
Submitter: "Zuul (22348)"
Branch: master

commit d441c0afc7ae8de7f853469719fc85c7d4697bea
Author: Rei Oliveira <email address hidden>
Date: Tue Mar 19 20:12:34 2024 -0300

    Increase timeout to unlock host on cert recovery

    The certificate recovery role may try to unlock a host while it's
    in a reboot loop. For the host-unlock command to be accepted,
    the host needs to be online.

    For slower hardware the current 2 min timeout is not enough.
    This commit changes it to 5 min.

    Test case:

    PASS: Run certificate recovery rehoming with multi node systems.

    Closes-Bug: 2058426

    Signed-off-by: Rei Oliveira <email address hidden>
    Change-Id: I7a7faf7707a19f477a968766cbe383e7dfdbd1cd
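
The commit message above describes waiting for a host to come online before unlocking it, with the wait raised from 2 to 5 minutes. The actual change lives in the ansible-playbooks role and is not reproduced here. As a rough illustration of the idea only, the Python sketch below polls a caller-supplied function until it reports 'online' or a deadline passes. The name wait_until_online, the get_availability callable, and the 10-second polling interval are all assumptions made for this example, not part of the fix.

```python
import time


def wait_until_online(get_availability, timeout_s=300.0, interval_s=10.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until get_availability() returns 'online' or timeout_s elapses.

    get_availability is a hypothetical callable supplied by the caller; how
    it obtains the host's availability status is outside this sketch.
    Returns True if the host was seen online in time, otherwise False.
    """
    deadline = clock() + timeout_s
    while True:
        # Check first, so a host that is already online returns immediately.
        if get_availability() == "online":
            return True
        # Give up once the deadline has been reached.
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default of 300 seconds mirrors the 5 minute value from the commit message; passing timeout_s=120.0 corresponds to the earlier 2 minute limit. The clock and sleep parameters exist only so the loop can be exercised without real waiting.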

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.config stx.security