Critical issue installing stx, ssh connection to controller is not possible

Bug #1928341 reported by Alexandru Dimofte
This bug affects 1 person
Affects    Status   Importance  Assigned to  Milestone
StarlingX  Invalid  Critical    Unassigned   -

Bug Description

Brief Description
-----------------
Installation of StarlingX fails because the SSH connection to the controller cannot be established during installation: "Unable to connect to port 22 on 192.168.200.52". This might be an OpenSSH issue.
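
To narrow down whether sshd is simply not listening (as opposed to an authentication or key problem), a plain TCP probe of port 22 can help. The following is a minimal sketch, not part of the sanity suite: the function names `port_is_open` and `wait_for_port` are hypothetical, and the 300-second default only mirrors the 5-minute retry reported by the automation.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def wait_for_port(host, port, total_timeout=300.0, interval=5.0):
    """Poll host:port until it accepts a connection or total_timeout elapses."""
    deadline = time.monotonic() + total_timeout
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port("192.168.200.52", 22)` returning False would suggest that nothing is accepting connections on the controller's SSH port, which points at the installation not completing rather than at SSH credentials.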

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install STX image RC5.0 20210512T230401Z or Master image 20210513T042821Z.

Expected Behavior
------------------
Installation should work.

Actual Behavior
----------------
StarlingX installation fails during setup:
15:24:08 [2021-05-13T12:24:08.671Z] Install ISO BareMetal :: Installation of controller node and defin... | FAIL |
15:24:08 [2021-05-13T12:24:08.671Z] TIMEOUT: Timeout exceeded.
15:24:08 [2021-05-13T12:24:08.671Z] <pexpect.pty_spawn.spawn object at 0x7f8c9d8934d0>
15:24:08 [2021-05-13T12:24:08.671Z] command: /usr/bin/ipmitool
15:24:08 [2021-05-13T12:24:08.671Z] args: ['/usr/bin/ipmitool', '-I', 'lanplus', '-H', '192.168.100.52', '-U', 'starlingx', '-P', 'Passw0rd', 'sol', 'activate']
15:24:08 [2021-05-13T12:24:08.671Z] buffer (last 100 chars): b'plain attachments\x1b[1;24r\x1b[H\x1b[5B\x1b[1;23r\x1b[H\x1b[5B12:03:10 Running pre-installation scripts\x1b[1;24r\x1b[H\x1b[6B'
15:24:08 [2021-05-13T12:24:08.671Z] before (last 100 chars): b'plain attachments\x1b[1;24r\x1b[H\x1b[5B\x1b[1;23r\x1b[H\x1b[5B12:03:10 Running pre-installation scripts\x1b[1;24r\x1b[H\x1b[6B'
15:24:08 [2021-05-13T12:24:08.671Z] after: <class 'pexpect.exceptions.TIMEOUT'>
15:24:08 [2021-05-13T12:24:08.671Z] match: None
15:24:08 [2021-05-13T12:24:08.671Z] match_index: None
15:24:08 [2021-05-13T12:24:08.671Z] exitstatus: None
15:24:08 [2021-05-13T12:24:08.671Z] flag_eof: False
15:24:08 [2021-05-13T12:24:08.671Z] pid: 35763
15:24:08 [2021-05-13T12:24:08.671Z] child_fd: 15
15:24:08 [2021-05-13T12:24:08.671Z] closed: False
15:24:08 [2021-05-13T12:24:08.671Z] timeout: 1200
15:24:08 [2021-05-13T12:24:08.671Z] delimiter: <class 'pexpect.exceptions.EOF'>
15:24:08 [2021-05-13T12:24:08.671Z] logfile: <_io.BufferedWriter name='/localdisk/starlingx/s1_duplex/test/automated-robot-suite/Results/20210513135824_Setup/iso_setup_installation.txt'>
15:24:08 [2021-05-13T12:24:08.671Z] logfile_read: None
15:24:08 [2021-05-13T12:24:08.671Z] logfile_send: None
15:24:08 [2021-05-13T12:24:08.671Z] maxread: 2000
15:24:08 [2021-05-13T12:24:08.671Z] ignorecase: False
15:24:08 [2021-05-13T12:24:08.671Z] searchwindowsize: None
15:24:08 [2021-05-13T12:24:08.671Z] delaybeforesend: 0.05
15:24:08 [2021-05-13T12:24:08.671Z] delayafterclose: 0.1
15:24:08 [2021-05-13T12:24:08.671Z] delayafterterminate: 0.1
15:24:08 [2021-05-13T12:24:08.671Z] searcher: searcher_re:
15:24:08 [2021-05-13T12:24:08.671Z] 0: re.compile(b'Performing post-installation setup tasks')
15:24:08 [2021-05-13T12:24:08.671Z] ------------------------------------------------------------------------------
15:29:00 [2021-05-13T12:29:00.147Z] Ansible Bootstrap Configuration :: Configure controller with local... | FAIL |
15:29:00 [2021-05-13T12:29:00.147Z] Keyword 'Connect to Controller Node' failed after retrying for 5 minutes. The last error was: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.200.52
15:29:00 [2021-05-13T12:29:00.147Z] ------------------------------------------------------------------------------
15:34:06 [2021-05-13T12:34:06.650Z] Copy Install Packages :: Copy packages required to install seconda... | FAIL |
15:34:06 [2021-05-13T12:34:06.650Z] Keyword 'Connect to Controller Node' failed after retrying for 5 minutes. The last error was: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.200.52
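
The pexpect buffer in the log above is hard to read because of the terminal escape sequences from the serial-over-LAN console. A small sketch like the following can strip them to show the last text the installer printed before the timeout. The helper name `clean_console` is hypothetical, and the regular expression only targets CSI-style sequences (ESC followed by `[`), which is all that appears in this buffer.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
CSI_RE = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")

# Buffer copied from the pexpect dump in the log above.
SAMPLE = (
    b"plain attachments\x1b[1;24r\x1b[H\x1b[5B\x1b[1;23r\x1b[H\x1b[5B"
    b"12:03:10 Running pre-installation scripts\x1b[1;24r\x1b[H\x1b[6B"
)


def clean_console(raw):
    """Replace CSI escape sequences with spaces and collapse whitespace."""
    text = CSI_RE.sub(b" ", raw).decode("utf-8", errors="replace")
    return " ".join(text.split())


print(clean_console(SAMPLE))
```

With the escape sequences removed, the buffer ends at "12:03:10 Running pre-installation scripts", while the automation was waiting for "Performing post-installation setup tasks". That is consistent with the installer stalling before post-installation, so the later SSH failures are likely a consequence rather than the root cause.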

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
One node system, Two node system, Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
RC5.0 and Master

Last Pass
---------
RC5.0 20210511T230342Z
Master 20210512T043622Z

Timestamp/Logs
--------------
-

Test Activity
-------------
Sanity

Workaround
----------
-

Alexandru Dimofte (adimofte) wrote :
Frank Miller (sensfan22) wrote :

It's unclear if this could be related to a very old commit that was allowed to merge. As there is no need for this commit, the recommendation is to revert it:

https://review.opendev.org/c/starlingx/metal/+/736950

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/791103
Committed: https://opendev.org/starlingx/metal/commit/5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Submitter: "Zuul (22348)"
Branch: master

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

screening: marking as critical for stx.5.0 & stx.6.0 since this issue is causing a red sanity

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Critical
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.5.0 stx.6.0
Ghada Khalil (gkhalil) wrote :

I'm unsure how the commits listed above apply to stx.5.0 as they are only in stx master:
https://review.opendev.org/c/starlingx/metal/+/736950
https://review.opendev.org/c/starlingx/metal/+/791103

Frank Miller (sensfan22) wrote :

Reverting that old commit did not address the issue. Sanity reported the same issue with the following loads:
master: 20210514T042932Z
stx.5.0: 20210513T230342Z

Therefore re-opening this LP.

Changed in starlingx:
status: Fix Released → Triaged
assignee: Eric MacDonald (rocksolidmtce) → nobody
Ghada Khalil (gkhalil) wrote :

As per email from Scott Little to starlingx-discuss, it turns out the rc/5.0 builds were not building the correct content and were picking up content from the stx master branch:
http://lists.starlingx.io/pipermail/starlingx-discuss/2021-May/011450.html
https://bugs.launchpad.net/starlingx/+bug/1928511

This explains why the same issue is seen on both builds. We'll need to see the sanity results for the valid stx.5.0 build to determine if this issue is applicable to the r/stx.5.0 branch or not.
http://mirror.starlingx.cengn.ca/mirror/starlingx/rc/5.0/centos/flock/20210514T224518Z/outputs/iso/

@Alexandru, please try the above 5.0 load and update with the sanity results. Thanks.

Alexandru Dimofte (adimofte) wrote :
Ghada Khalil (gkhalil) wrote :

Removing the stx.5.0 release tag based on Alexandru's note above.
@Alexandru, please also confirm if this is still an issue in the stx master builds or not

tags: removed: stx.5.0
Alexandru Dimofte (adimofte) wrote :

This issue was not seen in the latest master image either.

Ghada Khalil (gkhalil) wrote :

Closing as the issue is no longer reproducible

Changed in starlingx:
status: Triaged → Invalid
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.

    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host looses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8