[tripleo] master baremetal overcloud deploy fails running script enable-ssh-admin.sh

Bug #1769230 reported by Raoul Scarazzini
Affects   Status         Importance   Assigned to    Milestone
tripleo   Fix Released   Critical     James Slagle

Bug Description

Description
===========
Master baremetal overcloud deployments in the RDOPhase2 CI are failing right after the Heat stack events.

Steps to reproduce
==================
Just run the overcloud deployment with a command line similar to this one:

openstack overcloud deploy \
    --verbose \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --libvirt-type kvm \
    --control-flavor baremetal \
    --compute-flavor baremetal \
    --ceph-storage-flavor baremetal \
    --block-storage-flavor oooq_blockstorage \
    --swift-storage-flavor oooq_objectstorage \
    --timeout 90 \
    -e /home/stack/cloud-names.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
    -e /home/stack/containers-default-parameters.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /home/stack/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml \
    -e /home/stack/inject-trust-anchor.yaml \
    --validation-warnings-fatal \
    --roles-file /home/stack/overcloud_roles.yaml \
    --ntp-server 10.5.26.10 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/config-debug.yaml

So nothing different from a "usual" deploy.

Expected result
===============
Successful deployment.

Actual result
=============
This is the message I'm getting:

2018-05-04 15:19:26 | /usr/share/openstack-tripleo-heat-templates/deployed-server/scripts/enable-ssh-admin.sh failed.

Environment
===========
This appears to be a race condition: sshd has not yet started on the provisioned nodes when the script tries to connect. In the master baremetal environments in RDOPhase2 [1], where we test HA, it happens systematically.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-150/

Raoul Scarazzini (rasca)
Changed in tripleo:
importance: Undecided → High
status: New → Triaged
Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
status: Triaged → In Progress
Changed in tripleo:
milestone: none → rocky-2
wes hayutin (weshayutin)
tags: added: promotion-blocker
Changed in tripleo:
importance: High → Critical
James Slagle (james-slagle) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (master)

Reviewed: https://review.openstack.org/566129
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=6cc67c33984135afc458ad6482eb16507a0c5a88
Submitter: Zuul
Branch: master

commit 6cc67c33984135afc458ad6482eb16507a0c5a88
Author: James Slagle <email address hidden>
Date: Thu May 3 12:57:52 2018 -0400

    Convert enable-ssh-admin.sh to python

    Instead of using the script from the templates, use python for the
    enable-ssh-admin logic. This will allow for more properly handling
    failures.

    This also fixes a race condition where sshd has not already started on
    some of the nodes before we try and connect via ssh. A timeout is added
    where we wait for the port to come up. If the timeout has passed and the
    port is still not up, then an exception is raised.

    Change-Id: I3431d2ec724a880baf0de8f586490d145bedf870
    Closes-Bug: #1769230
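
The wait-for-port behaviour described in the commit message above can be sketched roughly as follows. This is only an illustration of the technique, not the code merged in python-tripleoclient: the function name wait_for_port, the default timeout and the polling interval are all assumptions made for the example.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Block until a TCP connection to host:port succeeds.

    Hypothetical helper: retries until `timeout` seconds have passed,
    then raises RuntimeError if the port never accepted a connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something (e.g. sshd) is listening.
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError as exc:
            # Refused, unreachable or timed out: give up once the deadline
            # has passed, otherwise wait a little and try again.
            if time.monotonic() >= deadline:
                raise RuntimeError(
                    "port %d on %s not reachable after %.0f seconds"
                    % (port, host, timeout)) from exc
            time.sleep(interval)
```

Calling something like wait_for_port(node_ip) for each provisioned node before the first ssh attempt would turn the race into either a short delay or an explicit error, which is the behaviour the commit message describes.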

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/572155

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 10.2.0

This issue was fixed in the openstack/python-tripleoclient 10.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (stable/queens)

Reviewed: https://review.openstack.org/572155
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=7c518a40d9bd2cc41e29b04c4e89d76f8eeae376
Submitter: Zuul
Branch: stable/queens

commit 7c518a40d9bd2cc41e29b04c4e89d76f8eeae376
Author: Emilien Macchi <email address hidden>
Date: Mon Jun 18 09:08:15 2018 -0700

    Convert enable-ssh-admin.sh to python

    Instead of using the script from the templates, use python for the
    enable-ssh-admin logic. This will allow for more properly handling
    failures.

    This also fixes a race condition where sshd has not already started on
    some of the nodes before we try and connect via ssh. A timeout is added
    where we wait for the port to come up. If the timeout has passed and the
    port is still not up, then an exception is raised.

    Change-Id: I3431d2ec724a880baf0de8f586490d145bedf870
    Closes-Bug: #1769230
    (cherry picked from commit I3431d2ec724a880baf0de8f586490d145bedf870)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 9.2.3

This issue was fixed in the openstack/python-tripleoclient 9.2.3 release.
