Default timeout of enable_ssh_admin is too short for multi-server deployments

Bug #1805725 reported by Yossi Ovadia
This bug affects 1 person
Affects   Status         Importance   Assigned to    Milestone
tripleo   Fix Released   Medium       Yossi Ovadia

Bug Description

Description
===========
In a deployment of multiple servers (8 in my case), some of which are storage computes, the enable_ssh_admin workflow timed out with the following error:

---
2018-11-27 22:43:45Z [overcloud]: CREATE_COMPLETE Stack CREATE completed successfully
Stack overcloud/76f8a758-f995-410f-812f-33830b1c9696 CREATE_COMPLETE
Deploying overcloud configuration
Enabling ssh admin (tripleo-admin) for hosts:
172.31.0.9 172.31.0.24 172.31.0.5 172.31.0.6 172.31.0.21 172.31.0.15 172.31.0.13 172.31.0.11
Using ssh user heat-admin for initial connection.
Using ssh key at /home/stack/.ssh/id_rsa for initial connection.
Inserting TripleO short term key for 172.31.0.9
Inserting TripleO short term key for 172.31.0.24
Inserting TripleO short term key for 172.31.0.5
Inserting TripleO short term key for 172.31.0.6
Inserting TripleO short term key for 172.31.0.21
Inserting TripleO short term key for 172.31.0.15
Inserting TripleO short term key for 172.31.0.13
Inserting TripleO short term key for 172.31.0.11
Starting ssh admin enablement workflow
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
..
ssh admin enablement workflow - RUNNING.
Removing short term keys locallyException caught during deployment: ssh admin enablement workflow - TIMED OUT.
Exception caught during deployment: ssh admin enablement workflow - TIMED OUT.
stack named overcloud found.

exit status: 255

----

Looking at the relevant workflow execution shows that it took **slightly** more than 300 seconds (compare "Created at" with "Updated at"):

(undercloud) [stack@undercloud templates]$ os workflow execution list |grep enable_ssh_admin
(undercloud) [stack@undercloud templates]$ os workflow execution show b1d74029-d846-4b6c-b745-8d4aa0c36d3a
+--------------------+--------------------------------------+
| Field              | Value                                |
+--------------------+--------------------------------------+
| ID                 | b1d74029-d846-4b6c-b745-8d4aa0c36d3a |
| Workflow ID        | 0a607914-da20-4ed0-9c68-c790aff14157 |
| Workflow name      | tripleo.access.v1.enable_ssh_admin   |
| Workflow namespace |                                      |
| Description        |                                      |
| Task Execution ID  | <none>                               |
| Root Execution ID  | <none>                               |
| State              | SUCCESS                              |
| State info         | None                                 |
| Created at         | 2018-11-28 06:06:43                  |
| Updated at         | 2018-11-28 06:11:55                  |
+--------------------+--------------------------------------+
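
As a quick check of the claim above, the elapsed time can be computed from the "Created at" and "Updated at" values in the table. This is only an illustrative sketch: the variable names are made up here, and the 300-second figure is the default timeout mentioned in this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the workflow execution shown above.
created = datetime.strptime("2018-11-28 06:06:43", FMT)
updated = datetime.strptime("2018-11-28 06:11:55", FMT)

# Default timeout in seconds, as stated in this report.
default_timeout = 300

elapsed = (updated - created).total_seconds()
print(f"elapsed: {elapsed:.0f}s, over the default by {elapsed - default_timeout:.0f}s")
```

So the workflow itself finished successfully, but only after the client had already given up waiting.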

Deploying with fewer servers passed successfully.

Manually increasing the timeout constant from 300 to 600 seconds resolved the problem.
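
To illustrate why the larger value helps, here is a minimal simulation of a client polling a workflow until a deadline. It is not the tripleoclient code: `FakeClock`, `wait_for_workflow`, `simulate` and the 3-second polling interval are all hypothetical, and the 312-second duration is taken from the execution timestamps above. A simulated clock is used so the example runs instantly.

```python
class FakeClock:
    """Simulated clock so the example does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def wait_for_workflow(get_state, timeout, interval, clock):
    """Poll get_state() until it leaves RUNNING or `timeout` seconds pass."""
    deadline = clock.time() + timeout
    while clock.time() < deadline:
        state = get_state()
        if state != "RUNNING":
            return state
        clock.sleep(interval)
    return "TIMED OUT"


def simulate(timeout, duration=312, interval=3):
    """Run a pretend workflow that needs `duration` seconds to finish."""
    clock = FakeClock()

    def get_state():
        return "SUCCESS" if clock.time() >= duration else "RUNNING"

    return wait_for_workflow(get_state, timeout, interval, clock)


print(simulate(300))  # the wait ends before the workflow finishes
print(simulate(600))  # the workflow finishes well inside the wait
```

With a 300-second limit the client stops waiting 12 seconds before the workflow would have reported SUCCESS, which matches the behaviour described in this report.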

Steps to reproduce
==================
Deploy on a relatively large number of real servers (8 in my case).

Expected result
===============
The enable_ssh_admin workflow completes within the timeout and the deployment succeeds.

Actual result
=============
The deployment fails with "ssh admin enablement workflow - TIMED OUT." and exit status 255.

Environment:
============
Nokia Airframe: 3 storage servers + 5 computes.

Version:
=========
Containerized Rocky

Yossi Ovadia (jabadia)
Changed in tripleo:
assignee: nobody → Yossi Ovadia (jabadia)
Yossi Ovadia (jabadia) wrote :
Changed in tripleo:
importance: Undecided → Medium
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/709843

OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.opendev.org/709843
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=1458e2750942b4656fbf12d25610200a4db18657
Submitter: Zuul
Branch: master

commit 1458e2750942b4656fbf12d25610200a4db18657
Author: Alex Schultz <email address hidden>
Date: Tue Feb 25 14:47:14 2020 -0700

    Increase ssh port timeout

    It's been reported that we still failing too quickly for some
    environments with a mix of vms and physical hardware. Let's increase the
    default ssh port timeout to 600 seconds to work around this.

    See also: https://bugzilla.redhat.com/show_bug.cgi?id=1805429
    Related-Bug: #1805725

    Change-Id: I95ddc0b7cb6342c367772b0cf296ee372dbb92dd
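
For readers unfamiliar with what an "ssh port timeout" means in practice, the following is a rough sketch of waiting for a TCP port to accept connections, with 600 seconds as the default. It is not the code from the commit above: the function name `wait_for_port`, its parameters and the retry interval are assumptions made for illustration only.

```python
import socket
import time


def wait_for_port(host, port, timeout=600, interval=1.0):
    """Return True once host:port accepts a TCP connection.

    Returns False if no connection succeeds within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller might use `wait_for_port("172.31.0.9", 22)` before attempting the first ssh connection. Slow-booting physical hosts simply need a longer deadline than virtual machines.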

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/710282

OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/train)

Reviewed: https://review.opendev.org/710282
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=760b354d05e6e9d3c2ae2c3a013104e9cecbcaec
Submitter: Zuul
Branch: stable/train

commit 760b354d05e6e9d3c2ae2c3a013104e9cecbcaec
Author: Alex Schultz <email address hidden>
Date: Tue Feb 25 14:47:14 2020 -0700

    Increase ssh port timeout

    It's been reported that we still failing too quickly for some
    environments with a mix of vms and physical hardware. Let's increase the
    default ssh port timeout to 600 seconds to work around this.

    See also: https://bugzilla.redhat.com/show_bug.cgi?id=1805429
    Related-Bug: #1805725

    Change-Id: I95ddc0b7cb6342c367772b0cf296ee372dbb92dd
    (cherry picked from commit 1458e2750942b4656fbf12d25610200a4db18657)

tags: added: in-stable-train