StarlingX

Unexpected reboot while running lock command on worker nodes

Bug #2058426 reported by Reinildes Oliveira on 2024-03-19

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Reinildes Oliveira

Bug Description

Brief Description
------------------------------------------------

Unexpected reboot while running lock command on worker nodes

Scenario:

Subcloud rehoming testing in cert recovery scenario

cmd: dcmanager subcloud add --migrate --bootstrap-address 2620:10a:a001:d49::12 --bootstrap-values subcloud-3/subcloud3_ipv6-bootstrap-values.yaml --sysadmin-password Li69nux*
Subcloud type: AIO-DX+

Error condition:

    Have an AIO-DX+ subcloud ready to be rehomed
        Controller-1 was offline and compute-0 was unlocked and online
    Trigger rehome playbook
        The rehome playbook has a task to lock nodes and later another task to unlock these nodes
    Lock task:
        TASK [common/recover-subcloud-certificates : Lock standby controller and compute nodes] ***
        Tuesday 27 February 2035 22:43:30 +0000 (0:00:00.026) 0:01:54.421 ******
        changed: [subcloud3] => (item=controller-1)
        changed: [subcloud3] => (item=compute-0)

    Unlock task: (FAILED)
        TASK [common/recover-subcloud-certificates : Unlock standby controller and compute nodes] ***
        stderr: Host compute-0 is not online. Wait for 'online' availability status and then 'unlock'
        end: '2035-02-27 22:52:29.201927'

    Compute-0 mtc logs:
        compute-0:
        2035-02-27T22:43:59.299 [9978.00069] compute-0 mtcClient msg mtcCompMsg.cpp ( 322) mtc_service_command : Info : reboot command received (Mgmnt)
        ++
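
For reference, the timestamps quoted above can be lined up to see how the events relate in time. The sketch below only does the arithmetic on values copied from the logs in this report. The variable names are illustrative, and it assumes all three timestamps are in the same timezone (the Ansible line shows +0000).

```python
from datetime import datetime

# Values copied from the task output and mtc log quoted above.
lock_task_start = datetime(2035, 2, 27, 22, 43, 30)            # Ansible lock task header
reboot_received = datetime(2035, 2, 27, 22, 43, 59, 299000)    # mtcClient "reboot command received"
unlock_task_end = datetime(2035, 2, 27, 22, 52, 29, 201927)    # 'end' of the failed unlock command

# Elapsed time between the events.
lock_to_reboot = reboot_received - lock_task_start
reboot_to_unlock_failure = unlock_task_end - reboot_received

print(f"lock task start -> reboot command: {lock_to_reboot.total_seconds():.3f} s")
print(f"reboot command -> unlock failure:  {reboot_to_unlock_failure.total_seconds():.3f} s")
```

This puts the reboot command roughly 29 seconds after the lock task started, and the failed unlock roughly 8.5 minutes after the reboot command.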

Severity
------------------------------------------------

<Major: System/Feature is usable but degraded>

Note: The condition described above caused the rehome playbook to fail.

Steps to Reproduce
------------------------------------------------

Described above

Expected Behavior
------------------------------------------------

The lock command should not trigger a reboot of the node.

Actual Behavior
------------------------------------------------

The node rebooted after the lock command was issued.

Reproducibility
------------------------------------------------
Seen once

I'll check the reproducibility rate

System Configuration
------------------------------------------------

sysadmin@controller-0:~$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"

Last Pass
------------------------------------------------
N/A

Timestamp/Logs
------------------------------------------------
N/A

Alarms
------------------------------------------------

fm alarm-list --mgmt_affecting
+----------+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------+----------------------------+
| Alarm ID | Reason Text                                                                                        | Entity ID                                                                | Severity | Management Affecting | Time Stamp                 |
+----------+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------+----------------------------+
| 200.007  | compute-0 is reporting a 'critical' out-of-tolerance reading from the 'HpServerPowerSupply' sensor | host=compute-0.sensor=HpServerPowerSupply                                | critical | False                | 2035-02-28T13:29:53.365641 |
| 200.006  | compute-0 'ntp' process has failed. Manual recovery is required.                                   | host=compute-0.process=ntp                                               | minor    | True                 | 2035-02-27T23:02:48.431900 |
| 200.001  | compute-0 was administratively locked to take it out-of-service.                                   | host=compute-0                                                           | warning  | True                 | 2035-02-27T22:44:09.006717 |
| 200.006  | controller-0 'ntp' process has failed. Manual recovery is required.                                | host=controller-0.process=ntp                                            | minor    | True                 | 2035-02-27T22:42:52.516070 |
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.                             | host=controller-0                                                        | major    | False                | 2035-02-27T22:20:38.679981 |
| 100.114  | NTP address 2620:10a:a001:a103::2 is not a valid or a reachable NTP server.                        | host=controller-0=2620:10a:a001:a103::2                                  | minor    | False                | 2035-02-27T22:20:38.581092 |
| 100.114  | NTP address 2620:10a:a001:d43:3efd:feff:feb2:4362 is not a valid or a reachable NTP server.        | host=controller-0=2620:10a:a001:d43:3efd:feff:feb2:4362                  | minor    | False                | 2035-02-27T22:20:38.561503 |
| 500.210  | Certificate 'system certificate-show aa41aa65-c502-499b-b4a1-707110e07d56' (mode=ssl_ca) expired.  | system.certificate.mode=ssl_ca.uuid=aa41aa65-c502-499b-b4a1-707110e07d56 | critical | False                | 2035-02-27T19:27:06.540978 |
+----------+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------+----------------------------+

Test Activity
------------------------------------------------
Regression Testing

Workaround
------------------------------------------------
Re-try rehome playbook.

Reinildes Oliveira (rjosemat) on 2024-03-19

Changed in starlingx:
assignee:	nobody → Reinildes Oliveira (rjosemat)
description:	updated

OpenStack Infra (hudson-openstack) wrote on 2024-03-19: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913725

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2024-03-20: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913725
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/d441c0afc7ae8de7f853469719fc85c7d4697bea
Submitter: "Zuul (22348)"
Branch: master

commit d441c0afc7ae8de7f853469719fc85c7d4697bea
Author: Rei Oliveira <email address hidden>
Date: Tue Mar 19 20:12:34 2024 -0300

Increase timeout to unlock host on cert recovery

    The certificate recovery role may try to unlock a host while it's
    in a reboot loop. In order for the host-unlock command to be accepted,
    the host needs to be online.

For slower hardware the current 2 min timeout is not enough.
This commit changes it to 5 min.

Test case:

PASS: Run certificate recovery rehoming with multi node systems.

Closes-Bug: 2058426

Signed-off-by: Rei Oliveira <email address hidden>
Change-Id: I7a7faf7707a19f477a968766cbe383e7dfdbd1cd

Changed in starlingx:
status:	In Progress → Fix Released
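
The idea behind the fix (wait longer for the host to come back online before issuing the unlock) can be sketched as a polling loop with a deadline. This is a minimal Python illustration, not the code from the Ansible role. `wait_until_online` and its `get_availability` parameter are hypothetical names introduced here; in practice the callable might wrap a query of the host's availability field. The 300-second default mirrors the 5 min value stated in the commit message, and the 10-second poll interval is an assumption.

```python
import time
from typing import Callable


def wait_until_online(
    get_availability: Callable[[], str],
    timeout_s: float = 300.0,   # 5 min, per the commit message (previously 2 min)
    interval_s: float = 10.0,   # assumed poll interval, not taken from the playbook
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the host reports 'online' or the timeout expires.

    Returns True if the host became online in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if get_availability() == "online":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated host that is offline for two polls, then online.
    states = iter(["offline", "offline", "online"])
    ok = wait_until_online(lambda: next(states), interval_s=0.01)
    print("safe to unlock" if ok else "host never came online")
```

The clock and sleep functions are injectable so the loop can be exercised without real waiting. A caller would only issue the unlock when the function returns True.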

Ghada Khalil (gkhalil) on 2024-03-25

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.10.0 stx.config stx.security
