Brief Description
------------------------------------------------
Unexpected reboot while running lock command on worker nodes
Scenario:
Subcloud rehoming testing in cert recovery scenario
cmd: dcmanager subcloud add --migrate --bootstrap-address 2620:10a:a001:d49::12 --bootstrap-values subcloud-3/subcloud3_ipv6-bootstrap-values.yaml --sysadmin-password Li69nux*
Subcloud type: AIO-DX+
Error condition:
Have an AIO-DX+ subcloud ready to be rehomed.
Controller-1 was offline; compute-0 was unlocked and online.
Trigger the rehome playbook.
The rehome playbook has one task that locks the nodes and a later task that unlocks them.
Lock task:
TASK [common/recover-subcloud-certificates : Lock standby controller and compute nodes] ***
Tuesday 27 February 2035 22:43:30 +0000 (0:00:00.026) 0:01:54.421 ******
changed: [subcloud3] => (item=controller-1)
changed: [subcloud3] => (item=compute-0)
Unlock task: (FAILED)
TASK [common/recover-subcloud-certificates : Unlock standby controller and compute nodes] ***
stderr: Host compute-0 is not online. Wait for 'online' availability status and then 'unlock'
end: '2035-02-27 22:52:29.201927'
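The unlock task failed because compute-0 had not yet returned to the 'online' availability status named in the error message. A minimal sketch of one way to guard against this is to poll the availability before unlocking. The function name, its parameters, and the injected `get_availability` callable are hypothetical and only illustrate the idea; the fix proposed for the playbook may take a different approach.

```python
import time
from typing import Callable


def wait_until_online(
    get_availability: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a host's availability until it is 'online' or the timeout expires.

    get_availability is a hypothetical callable supplied by the caller that
    returns the host's current availability string (for example 'offline'
    or 'online'). How it obtains that value is left to the caller.
    """
    deadline = clock() + timeout_s
    while True:
        if get_availability() == "online":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated host that comes back online on the third poll.
    states = iter(["offline", "offline", "online"])
    ready = wait_until_online(lambda: next(states), timeout_s=5.0, interval_s=0.0)
    print("safe to unlock" if ready else "host never came online")
```

The caller would issue the unlock only when the function returns True, and report a timeout otherwise.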
Compute-0 mtc logs:
compute-0:
2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg mtcCompMsg.cpp ( 322) mtc_service_command : Info : reboot command received (Mgmnt)
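To make the timeline explicit, the timestamps quoted above can be compared directly. The sketch below assumes the Ansible task timestamps and the mtc log use the same clock and timezone (UTC); the helper name `mtc_timestamp` is hypothetical.

```python
import re
from datetime import datetime

# Leading timestamp of an mtc log line, e.g. 2035-02-27T22:43:59.299
MTC_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})")


def mtc_timestamp(line: str) -> datetime:
    """Extract the leading timestamp from an mtc log line."""
    match = MTC_TS.match(line)
    if match is None:
        raise ValueError("line does not start with an mtc timestamp")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


# Values copied from the report; assumed to share one timezone.
lock_task_start = datetime(2035, 2, 27, 22, 43, 30)
unlock_cmd_end = datetime(2035, 2, 27, 22, 52, 29, 201927)
reboot_line = (
    "2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg "
    "mtcCompMsg.cpp ( 322) mtc_service_command : Info : "
    "reboot command received (Mgmnt)"
)

reboot_time = mtc_timestamp(reboot_line)
lock_to_reboot_s = (reboot_time - lock_task_start).total_seconds()
reboot_to_unlock_s = (unlock_cmd_end - reboot_time).total_seconds()

print(f"reboot received {lock_to_reboot_s:.3f} s after the lock task started")
print(f"unlock attempt ended {reboot_to_unlock_s:.3f} s after the reboot command")
```

Under that assumption, the reboot command arrived about 29 seconds after the lock task started, and the unlock attempt ended about eight and a half minutes later with compute-0 still not online.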
Severity
------------------------------------------------
<Major: System/Feature is usable but degraded>
Note: The condition described above caused the rehome playbook to fail.
Steps to Reproduce
------------------------------------------------
Described above
Expected Behavior
------------------------------------------------
The lock command should not trigger a reboot of the node.
Actual Behavior
------------------------------------------------
The lock command triggered a reboot of compute-0.
Reproducibility
------------------------------------------------
Seen once
I'll check the reproducibility rate
System Configuration
------------------------------------------------
DC
sysadmin@controller-0:~$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
Last Pass
------------------------------------------------
N/A
Timestamp/Logs
------------------------------------------------
N/A
Alarms
------------------------------------------------
fm alarm-list --mgmt_affecting
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| Alarm | Reason Text                                                            | Entity ID                             | Severity | Management Affecting | Time Stamp     |
| ID    |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| 200.  | compute-0 is reporting a 'critical' out-of-tolerance reading from the  | host=compute-0.sensor=                | critical | False                | 2035-02-28T13: |
| 007   | 'HpServerPowerSupply' sensor                                           | HpServerPowerSupply                   |          |                      | 29:53.365641   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 'ntp' process has failed. Manual recovery is required.       | host=compute-0.process=ntp            | minor    | True                 | 2035-02-27T23: |
| 006   |                                                                        |                                       |          |                      | 02:48.431900   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 was administratively locked to take it out-of-service.       | host=compute-0                        | warning  | True                 | 2035-02-27T22: |
| 001   |                                                                        |                                       |          |                      | 44:09.006717   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | controller-0 'ntp' process has failed. Manual recovery is required.    | host=controller-0.process=ntp         | minor    | True                 | 2035-02-27T22: |
| 006   |                                                                        |                                       |          |                      | 42:52.516070   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0                     | major    | False                | 2035-02-27T22: |
| 114   |                                                                        |                                       |          |                      | 20:38.679981   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:a103::2 is not a valid or a reachable NTP    | host=controller-0=2620:10a:a001:a103: | minor    | False                | 2035-02-27T22: |
| 114   | server.                                                                | :2                                    |          |                      | 20:38.581092   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:d43:3efd:feff:feb2:4362 is not a valid or a  | host=controller-0=2620:10a:a001:d43:  | minor    | False                | 2035-02-27T22: |
| 114   | reachable NTP server.                                                  | 3efd:feff:feb2:4362                   |          |                      | 20:38.561503   |
|       |                                                                        |                                       |          |                      |                |
| 500.  | Certificate 'system certificate-show aa41aa65-c502-499b-               | system.certificate.mode=ssl_ca.uuid=  | critical | False                | 2035-02-27T19: |
| 210   | b4a1-707110e07d56' (mode=ssl_ca) expired.                              | aa41aa65-c502-499b-b4a1-707110e07d56  |          |                      | 27:06.540978   |
|       |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
Test Activity
------------------------------------------------
Regression Testing
Workaround
------------------------------------------------
Retry the rehome playbook.
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913725