tripleo

create_admin_via_nova returns before the ssh key is installed on all nodes

Bug #1720793 reported by John Fulton on 2017-10-02


Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	High	Giulio Fidente	tripleo queens-1

Bug Description

When Mistral kicks off ceph-ansible, I am seeing failures like the following:

2017-09-29 15:38:10,768 p=19459 u=mistral | TASK [ceph-defaults : is ceph running already?] ********************************
2017-09-29 15:38:10,780 p=19459 u=mistral | [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
2017-09-29 15:38:11,180 p=19459 u=mistral | fatal: [192.168.24.56]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,181 p=19459 u=mistral | fatal: [192.168.24.71]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,188 p=19459 u=mistral | [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..

This causes the deployment to fail because the hosts are reported as unreachable.

However, I am able to log in to the hosts that are reported with unreachable=1.

This has only become a problem since growing the overcloud deployment to 3 controllers, 3 ceph nodes, and 26 compute nodes (all deployed at once).
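For triage, the affected hosts can be pulled out of a log like the one above. This is a minimal Python sketch, assuming the "fatal: [host]: UNREACHABLE!" line format shown in this report. The function name unreachable_hosts and the sample text (with the JSON payload abbreviated) are illustrative only and are not part of any TripleO or Ansible tooling.

```python
import re

# Matches Ansible "fatal" lines that flag a host as unreachable, e.g.
#   fatal: [192.168.24.56]: UNREACHABLE! => {...}
UNREACHABLE_RE = re.compile(r"fatal: \[([^\]]+)\]: UNREACHABLE!")


def unreachable_hosts(log_text):
    """Return hosts flagged UNREACHABLE, in first-seen order, without duplicates."""
    seen = []
    for line in log_text.splitlines():
        match = UNREACHABLE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample lines modelled on the log in this report (JSON payload abbreviated).
sample_log = """\
2017-09-29 15:38:11,180 p=19459 u=mistral | fatal: [192.168.24.56]: UNREACHABLE! => {"changed": false, "unreachable": true}
2017-09-29 15:38:11,181 p=19459 u=mistral | fatal: [192.168.24.71]: UNREACHABLE! => {"changed": false, "unreachable": true}
2017-09-29 15:38:11,188 p=19459 u=mistral | [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
"""

print(unreachable_hosts(sample_log))
# prints ['192.168.24.56', '192.168.24.71']
```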




John Fulton (jfulton-org) wrote on 2017-10-02:

Workaround: in /usr/share/ceph-ansible/ansible.cfg, set retry = 5
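To illustrate what a connection retry setting of this kind buys, here is a rough Python sketch of bounded retries around a flaky connection attempt. It is not how Ansible implements its retries; call_with_retries, flaky_connect, and the parameter names are hypothetical. Note that a retry may only paper over the timing problem: if the key is still missing after the last attempt, the failure comes back.

```python
import time


def call_with_retries(operation, retries=5, delay=1.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(delay)


# Hypothetical connection that is refused twice before it succeeds,
# standing in for a node whose ssh key is not installed yet.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection closed by remote host")
    return "connected"


result = call_with_retries(flaky_connect, retries=5, delay=0, sleep=lambda s: None)
print(result, calls["count"])
# prints: connected 3
```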

John Fulton (jfulton-org) on 2017-10-02

description: updated

Giulio Fidente (gfidente) on 2017-10-02

summary:

- ceph-ansible starts before hosts are ready
+ create_admin_via_nova returns before the ssh key is installed on all
+ nodes


OpenStack Infra (hudson-openstack) wrote on 2017-10-02: Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/509001

Changed in tripleo:
status:	Triaged → In Progress

Emilien Macchi (emilienm) on 2017-10-02

Changed in tripleo:
milestone:	none → queens-1

Giulio Fidente (gfidente) on 2017-10-03

tags: added: pike-backport-potential


OpenStack Infra (hudson-openstack) wrote on 2017-10-10: Fix proposed to tripleo-common (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/510970


OpenStack Infra (hudson-openstack) wrote on 2017-10-14: Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/509001
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=47e66a81681b8327b5d1c284e54bb0495e2f4872
Submitter: Jenkins
Branch: master

commit 47e66a81681b8327b5d1c284e54bb0495e2f4872
Author: Giulio Fidente <email address hidden>
Date: Mon Oct 2 22:44:18 2017 +0200

Ensure ssh key is active before returning from create_admin_via_nova

We need to make sure os-collect-config has pulled in the new
software deployment and committed the changes before returning.

Also sets the ceph-ansible playbook retries to 3 to make sure we
don't fail unnecessarily on unpredictable network issues.

Change-Id: I544abf5053f18984d93cf381812372029f4ce498
Closes-Bug: #1720793

Changed in tripleo:
status:	In Progress → Fix Released
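The idea in the commit message above (do not return until every node has actually applied the change) can be sketched as a poll-until-ready loop. This is only an illustration under assumptions, not the code merged in tripleo-common: wait_for_all, the is_ready callback, the node names, and the timeout values are all hypothetical.

```python
import time


def wait_for_all(nodes, is_ready, timeout=600.0, interval=5.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Block until is_ready(node) is true for every node, else raise TimeoutError."""
    pending = set(nodes)
    deadline = clock() + timeout
    while True:
        # Only re-check nodes that have not reported ready yet.
        pending = {node for node in pending if not is_ready(node)}
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError("nodes not ready: %s" % sorted(pending))
        sleep(interval)


# Simulated readiness check: each (hypothetical) node becomes ready
# after a fixed number of polls.
polls_needed = {"controller-0": 1, "ceph-0": 2, "compute-0": 3}
polls_seen = {name: 0 for name in polls_needed}


def fake_is_ready(node):
    polls_seen[node] += 1
    return polls_seen[node] >= polls_needed[node]


wait_for_all(polls_needed, fake_is_ready, timeout=60, interval=0,
             sleep=lambda s: None)
print("all nodes ready")
```

With a larger deployment, such as the 32 nodes in this report, the slowest node sets the total wait, which is consistent with the problem only showing up at scale.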


OpenStack Infra (hudson-openstack) wrote on 2017-10-22: Fix merged to tripleo-common (stable/pike)

Reviewed: https://review.openstack.org/510970
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=f27b723b0a3ebade2f687a6c7f81a5df7aedddcc
Submitter: Zuul
Branch: stable/pike

commit f27b723b0a3ebade2f687a6c7f81a5df7aedddcc
Author: Giulio Fidente <email address hidden>
Date: Mon Oct 2 22:44:18 2017 +0200

Ensure ssh key is active before returning from create_admin_via_nova

We need to make sure os-collect-config has pulled in the new
software deployment and committed the changes before returning.

Also sets the ceph-ansible playbook retries to 3 to make sure we
don't fail unnecessarily on unpredictable network issues.

Change-Id: I544abf5053f18984d93cf381812372029f4ce498
Closes-Bug: #1720793
(cherry picked from commit 47e66a81681b8327b5d1c284e54bb0495e2f4872)