StarlingX provision failed for all bare-metal configurations

Bug #1973888 reported by Alexandru Dimofte
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Leonardo Fagundes Luz Serrano
Milestone: (none)

Bug Description

Brief Description
-----------------
StarlingX provisioning failed on all bare-metal configurations. No controllers are unlocked.
Command: source /etc/platform/openrc
Output: Openstack Admin credentials can only be loaded from the active controller.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Install StarlingX master build 20220517T213711Z on a bare-metal configuration.

Expected Behavior
-----------------
The StarlingX installation should complete successfully.

Actual Behavior
---------------
The setup stage completed successfully, but provisioning now fails:

20220518 10:18:45.502 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
20220518 10:18:45.508 - INFO - source /etc/platform/openrc
20220518 10:18:45.509 - INFO - +------- END KW: SSHLibrary.Write (7)
20220518 10:18:45.509 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20220518 10:18:45.517 - INFO - Openstack Admin credentials can only be loaded from the active controller.
controller-0:~$

or:
controller-0:~$
controller-0:~$ . /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.
controller-0:~$

ONLY THE BARE-METAL SERVERS ARE AFFECTED BY THIS ISSUE!

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
All bare-metal configurations

Branch/Pull Time/Commit
-----------------------
master: 20220517T213711Z

Last Pass
---------
master - 20220420T033744Z

Timestamp/Logs
--------------
Will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Alexandru Dimofte (adimofte) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.7.0
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
John Kung (john-kung) wrote :

/var/log/puppet/puppet.log indicates the following Error:

2022-05-18-07-49-31_controller/puppet.log:2022-05-18T07:56:18.566 Error: 2022-05-18 07:56:18 +0000 Command exceeded timeout
2022-05-18-07-49-31_controller/puppet.log:2022-05-18T07:56:18.676 Error: 2022-05-18 07:56:18 +0000 /Stage[main]/Platform::Ntp/Exec[ntp-initial-config]/returns: change from notrun to 0 1 failed: Command exceeded timeout
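
For triage, Error lines like the two above can be pulled out of puppet.log with a short script. This is a minimal sketch, assuming the line layout shown in this report (timestamp, "Error:", a second date stamp, then an optional resource path). The function name and both regular expressions are illustrative and not part of StarlingX.

```python
import re

# Assumed layout, taken from the two log lines quoted in this report:
#   <timestamp> Error: <date> <time> <tz> [<resource path>: ]<detail>
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) Error: "
    r"\S+ \S+ \S+ (?P<msg>.*)$"
)
# A resource path starts with "/" and ends at the first ": " after it.
RESOURCE_RE = re.compile(r"^(?P<resource>/\S+): (?P<detail>.*)$")


def parse_puppet_error(line):
    """Return a dict for a puppet Error line, or None if the line is not one."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    msg = match.group("msg")
    res = RESOURCE_RE.match(msg)
    if res is not None:
        return {
            "timestamp": match.group("ts"),
            "resource": res.group("resource"),
            "detail": res.group("detail"),
        }
    # No resource path: the whole message is the detail.
    return {"timestamp": match.group("ts"), "resource": None, "detail": msg}


if __name__ == "__main__":
    sample = (
        "2022-05-18T07:56:18.676 Error: 2022-05-18 07:56:18 +0000 "
        "/Stage[main]/Platform::Ntp/Exec[ntp-initial-config]/returns: "
        "change from notrun to 0 1 failed: Command exceeded timeout"
    )
    print(parse_puppet_error(sample))
```

Grouping the parsed entries by resource makes it easier to see which Exec is timing out when the log contains many errors.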

OS="centos"
SW_VERSION="22.06"
BUILD_ID="20220517T213711Z"

FLOCK_BUILD_DATE="2022-05-17 21:37:11 +0000"

A recent update to the ntp configuration introduced Exec[ntp-initial-config], which warrants follow-up investigation:

https://review.opendev.org/c/starlingx/stx-puppet/+/839992 (Merged May 11, 2022)

Changed in starlingx:
assignee: John Kung (john-kung) → Leonardo Fagundes Luz Serrano (lfagunde)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
tags: added: stx.config
Leonardo Fagundes Luz Serrano (lfagunde) wrote :

Most likely the system isn't getting a response from the NTP servers.

We have seen a similar occurrence on Debian, and the new ntp code causes unlock to fail when that happens.

I'm looking into a solution.

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/841784
Committed: https://opendev.org/starlingx/stx-puppet/commit/cc4ff43301b8e5b2984a5fac9660f095bc4a04d6
Submitter: "Zuul (22348)"
Branch: master

commit cc4ff43301b8e5b2984a5fac9660f095bc4a04d6
Author: Leonardo Fagundes Luz Serrano <email address hidden>
Date: Fri May 13 14:10:30 2022 -0300

    Fix ntp timeout puppet error

    ntpd command freezes when it can't get a response
    from the servers in the file given to it as
    a parameter, triggering a timeout

    if a puppet exec causes a timeout, the exec is
    said to fail, which breaks system unlock

    this change moves the timeout mechanism
    so that it doesn't cause unlock to fail

    Test Plan - Debian:
    PASS: unlock despite giving ntp-modify a broken server

    Test Plan - Centos:
    PASS: unlock despite giving ntp-modify a broken server

    Story: 2009965
    Task: 45370

    Closes-Bug: #1973888

    Signed-off-by: Leonardo Fagundes Luz Serrano <email address hidden>
    Change-Id: I0dc6de32407e02b1687578bb10b77b5242e19730
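
The commit message above describes a timeout that bounds the command without failing the caller. The general idea can be sketched in Python as follows. This is not the Puppet change from the review: run_tolerating_timeout is a hypothetical helper named only for illustration, and the sleeping child process stands in for a command that never gets a reply.

```python
import subprocess
import sys


def run_tolerating_timeout(cmd, timeout_s):
    """Run cmd with a time limit.

    Returns True if the command finished within timeout_s seconds and
    False if it was killed on timeout. A timeout is reported to the
    caller as a value instead of being raised as an error, so the
    caller can log it and carry on.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the limit is exceeded.
        subprocess.run(cmd, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in for a command that hangs waiting on an unreachable server.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    finished = run_tolerating_timeout(hanging, timeout_s=1)
    print("finished in time" if finished else "timed out, continuing anyway")
```

The point of the design is that the time limit lives inside the wrapped step, so an unresponsive server degrades to a logged warning instead of aborting the surrounding operation.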

Changed in starlingx:
status: In Progress → Fix Released
Alexandru Dimofte (adimofte) wrote :

This issue is fixed starting with build 20220521T032715Z.
