Activity log for bug #2058426

Date Who What changed Old value New value Message
2024-03-19 23:10:30 Reinildes Oliveira bug added bug
2024-03-19 23:11:05 Reinildes Oliveira starlingx: assignee Reinildes Oliveira (rjosemat)
2024-03-19 23:11:54 Reinildes Oliveira description
Old value:

Brief Description
------------------------------------------------
Unexpected reboot while running lock command on worker nodes

Scenario:
    Subcloud rehoming testing in cert recovery scenario
        cmd: dcmanager subcloud add --migrate --bootstrap-address 2620:10a:a001:d49::12 --bootstrap-values subcloud-3/subcloud3_ipv6-bootstrap-values.yaml --sysadmin-password Li69nux*
    Subcloud type: AIO-DX+

Error condition:
    Have a AIO-DX+ subcloud ready to be rehomed
        Controller-1 was offline and compute-0 was unlocked and online
    Trigger rehome playbook
        The rehome playbook has a task to lock nodes and later another task to unlock these nodes
    Lock task:
        TASK [common/recover-subcloud-certificates : Lock standby controller and compute nodes] ***
        Tuesday 27 February 2035 22:43:30 +0000 (0:00:00.026) 0:01:54.421 ******
        changed: [subcloud3] => (item=controller-1)
        changed: [subcloud3] => (item=compute-0)
    Unlock task: (FAILED)
        TASK [common/recover-subcloud-certificates : Unlock standby controller and compute nodes] ***
        stderr: Host compute-0 is not online. Wait for 'online' availability status and then 'unlock'
        end: '2035-02-27 22:52:29.201927'
    Compute-0 mtc logs:
        compute-0:
        2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg mtcCompMsg.cpp ( 322) mtc_service_command : Info : reboot command received (Mgmnt)
        ++

Severity
------------------------------------------------
<Major: System/Feature is usable but degraded>
Note: The condition described above failed rehome playbook

Steps to Reproduce
------------------------------------------------
Described above

Expected Behavior
------------------------------------------------
There should not be a reboot in lock command.

Actual Behavior
------------------------------------------------
There was a reboot in lock command

Reproducibility
------------------------------------------------
Seen once
I'll check the reproducibility rate

System Configuration
------------------------------------------------
DC

sysadmin@controller-0:~$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
SRC_BUILD_ID="38"
JOB="wrcp-22.12-debian"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_HOST="yow-wrcp-lx.wrs.com"
BUILD_DATE="2022-12-19 07:22:00 +0000"

Last Pass
------------------------------------------------
N/A

Timestamp/Logs
------------------------------------------------
N/A

Alarms
------------------------------------------------
fm alarm-list --mgmt_affecting
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| Alarm | Reason Text                                                            | Entity ID                             | Severity | Management Affecting | Time Stamp     |
| ID    |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| 200.  | compute-0 is reporting a 'critical' out-of-tolerance reading from the  | host=compute-0.sensor=                | critical | False                | 2035-02-28T13: |
| 007   | 'HpServerPowerSupply' sensor                                           | HpServerPowerSupply                   |          |                      | 29:53.365641   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 'ntp' process has failed. Manual recovery is required.       | host=compute-0.process=ntp            | minor    | True                 | 2035-02-27T23: |
| 006   |                                                                        |                                       |          |                      | 02:48.431900   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 was administratively locked to take it out-of-service.       | host=compute-0                        | warning  | True                 | 2035-02-27T22: |
| 001   |                                                                        |                                       |          |                      | 44:09.006717   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | controller-0 'ntp' process has failed. Manual recovery is required.    | host=controller-0.process=ntp         | minor    | True                 | 2035-02-27T22: |
| 006   |                                                                        |                                       |          |                      | 42:52.516070   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0                     | major    | False                | 2035-02-27T22: |
| 114   |                                                                        |                                       |          |                      | 20:38.679981   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:a103::2 is not a valid or a reachable NTP    | host=controller-0=2620:10a:a001:a103: | minor    | False                | 2035-02-27T22: |
| 114   | server.                                                                | :2                                    |          |                      | 20:38.581092   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:d43:3efd:feff:feb2:4362 is not a valid or a  | host=controller-0=2620:10a:a001:d43:  | minor    | False                | 2035-02-27T22: |
| 114   | reachable NTP server.                                                  | 3efd:feff:feb2:4362                   |          |                      | 20:38.561503   |
|       |                                                                        |                                       |          |                      |                |
| 500.  | Certificate 'system certificate-show aa41aa65-c502-499b-               | system.certificate.mode=ssl_ca.uuid=  | critical | False                | 2035-02-27T19: |
| 210   | b4a1-707110e07d56' (mode=ssl_ca) expired.                              | aa41aa65-c502-499b-b4a1-707110e07d56  |          |                      | 27:06.540978   |
|       |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+

Test Activity
------------------------------------------------
Regression Testing

Workaround
------------------------------------------------
Re-try rehome playbook.
New value:

Brief Description
------------------------------------------------
Unexpected reboot while running lock command on worker nodes

Scenario:
    Subcloud rehoming testing in cert recovery scenario
        cmd: dcmanager subcloud add --migrate --bootstrap-address 2620:10a:a001:d49::12 --bootstrap-values subcloud-3/subcloud3_ipv6-bootstrap-values.yaml --sysadmin-password Li69nux*
    Subcloud type: AIO-DX+

Error condition:
    Have a AIO-DX+ subcloud ready to be rehomed
        Controller-1 was offline and compute-0 was unlocked and online
    Trigger rehome playbook
        The rehome playbook has a task to lock nodes and later another task to unlock these nodes
    Lock task:
        TASK [common/recover-subcloud-certificates : Lock standby controller and compute nodes] ***
        Tuesday 27 February 2035 22:43:30 +0000 (0:00:00.026) 0:01:54.421 ******
        changed: [subcloud3] => (item=controller-1)
        changed: [subcloud3] => (item=compute-0)
    Unlock task: (FAILED)
        TASK [common/recover-subcloud-certificates : Unlock standby controller and compute nodes] ***
        stderr: Host compute-0 is not online. Wait for 'online' availability status and then 'unlock'
        end: '2035-02-27 22:52:29.201927'
    Compute-0 mtc logs:
        compute-0:
        2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg mtcCompMsg.cpp ( 322) mtc_service_command : Info : reboot command received (Mgmnt)
        ++

Severity
------------------------------------------------
<Major: System/Feature is usable but degraded>
Note: The condition described above failed rehome playbook

Steps to Reproduce
------------------------------------------------
Described above

Expected Behavior
------------------------------------------------
There should not be a reboot in lock command.

Actual Behavior
------------------------------------------------
There was a reboot in lock command

Reproducibility
------------------------------------------------
Seen once
I'll check the reproducibility rate

System Configuration
------------------------------------------------
DC

sysadmin@controller-0:~$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"

Last Pass
------------------------------------------------
N/A

Timestamp/Logs
------------------------------------------------
N/A

Alarms
------------------------------------------------
fm alarm-list --mgmt_affecting
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| Alarm | Reason Text                                                            | Entity ID                             | Severity | Management Affecting | Time Stamp     |
| ID    |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+
| 200.  | compute-0 is reporting a 'critical' out-of-tolerance reading from the  | host=compute-0.sensor=                | critical | False                | 2035-02-28T13: |
| 007   | 'HpServerPowerSupply' sensor                                           | HpServerPowerSupply                   |          |                      | 29:53.365641   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 'ntp' process has failed. Manual recovery is required.       | host=compute-0.process=ntp            | minor    | True                 | 2035-02-27T23: |
| 006   |                                                                        |                                       |          |                      | 02:48.431900   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | compute-0 was administratively locked to take it out-of-service.       | host=compute-0                        | warning  | True                 | 2035-02-27T22: |
| 001   |                                                                        |                                       |          |                      | 44:09.006717   |
|       |                                                                        |                                       |          |                      |                |
| 200.  | controller-0 'ntp' process has failed. Manual recovery is required.    | host=controller-0.process=ntp         | minor    | True                 | 2035-02-27T22: |
| 006   |                                                                        |                                       |          |                      | 42:52.516070   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0                     | major    | False                | 2035-02-27T22: |
| 114   |                                                                        |                                       |          |                      | 20:38.679981   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:a103::2 is not a valid or a reachable NTP    | host=controller-0=2620:10a:a001:a103: | minor    | False                | 2035-02-27T22: |
| 114   | server.                                                                | :2                                    |          |                      | 20:38.581092   |
|       |                                                                        |                                       |          |                      |                |
| 100.  | NTP address 2620:10a:a001:d43:3efd:feff:feb2:4362 is not a valid or a  | host=controller-0=2620:10a:a001:d43:  | minor    | False                | 2035-02-27T22: |
| 114   | reachable NTP server.                                                  | 3efd:feff:feb2:4362                   |          |                      | 20:38.561503   |
|       |                                                                        |                                       |          |                      |                |
| 500.  | Certificate 'system certificate-show aa41aa65-c502-499b-               | system.certificate.mode=ssl_ca.uuid=  | critical | False                | 2035-02-27T19: |
| 210   | b4a1-707110e07d56' (mode=ssl_ca) expired.                              | aa41aa65-c502-499b-b4a1-707110e07d56  |          |                      | 27:06.540978   |
|       |                                                                        |                                       |          |                      |                |
+-------+------------------------------------------------------------------------+---------------------------------------+----------+----------------------+----------------+

Test Activity
------------------------------------------------
Regression Testing

Workaround
------------------------------------------------
Re-try rehome playbook.
2024-03-19 23:13:00 OpenStack Infra starlingx: status New In Progress
2024-03-20 20:39:36 OpenStack Infra starlingx: status In Progress Fix Released
2024-03-25 15:29:51 Ghada Khalil starlingx: importance Undecided Medium
2024-03-25 15:30:23 Ghada Khalil tags stx.10.0 stx.config stx.security