ssh timeout after flush net cache tasks

Bug #1476885 reported by Stanley@Linux Simba on 2015-07-22
This bug affects 1 person
Affects            Importance  Assigned to
openstack-ansible  Low         Jesse Pretorius
Kilo               Low         Jesse Pretorius
Trunk              Low         Jesse Pretorius

Bug Description

When building openstack-ansible in Vagrant, I keep encountering SSH timeouts after the "Flush net cache" task.

Example:

PLAY [Install galera server] **************************************************

GATHERING FACTS ***************************************************************
ok: [stackserver_galera_container-7e1cec08]

TASK: [Galera extra lxc config] ***********************************************
changed: [stackserver_galera_container-7e1cec08 -> stackserver]

TASK: [Flush net cache] *******************************************************
changed: [stackserver_galera_container-7e1cec08 -> stackserver]

TASK: [Wait for container ssh] ************************************************
ok: [stackserver_galera_container-7e1cec08 -> stackserver]

TASK: [apt_package_pinning | Add apt pin preferences] *************************
skipping: [stackserver_galera_container-7e1cec08]

TASK: [pip_install | Create pip config directory] *****************************
fatal: [stackserver_galera_container-7e1cec08] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

FATAL: all hosts have already failed -- aborting

Attached is a simple patch that resolves the issue. It checks not only that the port is up but also that SSH is ready.
The code is borrowed from the wait_for module example docs.
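
To illustrate the difference between "the port is open" and "SSH is ready", here is a minimal Python sketch of the same idea. It is not the attached patch and not the wait_for module itself; the function name `ssh_banner_ready`, its parameters, and the default pattern are hypothetical. It connects to the port, reads whatever greeting the server sends, and only reports success if that greeting matches a regex.

```python
import re
import socket


def ssh_banner_ready(host, port=22, pattern=r"OpenSSH", timeout=5.0):
    """Return True only if host:port sends a greeting that matches `pattern`.

    A bare TCP connect can succeed before sshd is ready to talk, so this
    probe also requires a matching banner (hypothetical helper, for
    illustration only).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # An SSH server sends its identification string first,
            # e.g. "SSH-2.0-OpenSSH_...". An empty read means the peer
            # closed the connection without greeting us.
            banner = conn.recv(256).decode("ascii", errors="replace")
    except OSError:
        # Refused, reset, or timed out: treat as not ready.
        return False
    return re.search(pattern, banner) is not None
```

A port that accepts the connection but closes it without sending a banner is reported as not ready, which is the case a plain port check would miss.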

Kevin Carter (kevin-carter) wrote :

@linuxsimba this looks like a sensible fix, could you send this up for review to the master branch of openstack-ansible? With that we can gate the change and get it in for master / backported to kilo.

Changed in openstack-ansible:
status: New → Incomplete
status: Incomplete → Triaged
importance: Undecided → Low

I'm new to Launchpad. Is there a simple how-to on sending code up for review via Launchpad?

Back to the drawing board. I ran the Vagrant install of openstack-ansible and it died 3 out of 10 times at the spots that were patched yesterday. I'm not sure why the SSH connection is dying even though the Ansible wait_for check passes. For now I will just increase the SSH delay from 5 to 10 and see if my installs go well.

Increasing the SSH delay from 5 to 10 seems to produce more consistent results. I now get far fewer failures, only one per install. It would be nice to know why these SSH timeouts happen in Vagrant. I assume they do not occur when deploying on bare metal.
Attached is the patch I'm using in my Vagrant setup.
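
As a rough illustration of what a longer delay buys, the sketch below waits a fixed settle time before the first probe and then polls until a deadline. It is written in Python rather than as an Ansible task, and it is not the attached patch; `wait_for_ready` and all of its parameters are hypothetical names. The `sleep` and `clock` arguments exist only so the timing can be replaced in tests.

```python
import time


def wait_for_ready(check, delay=5.0, timeout=300.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Wait `delay` seconds, then poll `check()` until it is True.

    Returns True as soon as `check()` succeeds, or False once `timeout`
    seconds have passed since polling began (hypothetical helper).
    """
    # Give the restarted container time to settle before the first probe.
    sleep(delay)
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With `delay=10` instead of `delay=5`, the first probe simply happens later, which gives a slow container more time to finish restarting. It does not explain why a probe can pass and the next connection still fail.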

Project website for my Vagrant install of openstack-ansible:
http://github.com/skamithi/vagrant-osad

Thank you for submitting the patch. I've added it into the project's review system for review by the team members: https://review.openstack.org/207793

Reviewed: https://review.openstack.org/207793
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=9e08d31fe2ea7cab057f933984ce32d0835393d8
Submitter: Jenkins
Branch: master

commit 9e08d31fe2ea7cab057f933984ce32d0835393d8
Author: Stanley Karunditu <email address hidden>
Date: Fri Jul 31 10:23:26 2015 +0100

    Add regex check for ssh connection

    This patch adds a check for the appropriate OpenSSH Daemon
    response when waiting for the container to restart. This is
    an optimisation over simply waiting for the TCP port.

    Change-Id: Ie25af4f57bb98fb1d846d579b58b4d479b476675
    Closes-Bug: #1476885

Changed in openstack-ansible:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/216196
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=fcdf558447724a3996d0c493c99e0a559a1f6869
Submitter: Jenkins
Branch: kilo

commit fcdf558447724a3996d0c493c99e0a559a1f6869
Author: Stanley Karunditu <email address hidden>
Date: Fri Jul 31 10:23:26 2015 +0100

    Add regex check for ssh connection

    This patch adds a check for the appropriate OpenSSH Daemon
    response when waiting for the container to restart. This is
    an optimisation over simply waiting for the TCP port.

    Change-Id: Ie25af4f57bb98fb1d846d579b58b4d479b476675
    Closes-Bug: #1476885
    (cherry picked from commit 9e08d31fe2ea7cab057f933984ce32d0835393d8)

Reviewed: https://review.openstack.org/215907
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=57c5f2c77e95b1cc4277ca36c72d3627f38555ab
Submitter: Jenkins
Branch: master

commit 57c5f2c77e95b1cc4277ca36c72d3627f38555ab
Author: Stanley Karunditu <email address hidden>
Date: Sat Aug 22 11:40:01 2015 +0100

    Add configurable ssh_delay

    This patch adds a configurable delay time for retrying the
    ssh connection when waiting for the containers to restart.

    This is useful for environments where resources are constrained
    and containers may take longer to restart.

    Change-Id: I0383e34a273b93e1b2651460c853cf1ceba89029
    Closes-Bug: #1476885

Reviewed: https://review.openstack.org/216429
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=9f3f063aca39dbc5e717ea486a7b64e1d525c9ae
Submitter: Jenkins
Branch: kilo

commit 9f3f063aca39dbc5e717ea486a7b64e1d525c9ae
Author: Stanley Karunditu <email address hidden>
Date: Sat Aug 22 11:40:01 2015 +0100

    Add configurable ssh_delay

    This patch adds a configurable delay time for retrying the
    ssh connection when waiting for the containers to restart.

    This is useful for environments where resources are constrained
    and containers may take longer to restart.

    Change-Id: I0383e34a273b93e1b2651460c853cf1ceba89029
    Closes-Bug: #1476885
    (cherry picked from commit 57c5f2c77e95b1cc4277ca36c72d3627f38555ab)

Stanley Karunditu (stanleyk-f) wrote :

Thanks for committing the patch. With the configurable SSH delay set much higher, os-ansible-deployment in Vagrant now installs without any SSH errors.

@Stanley Thank you for submitting the patch in the first place - even if the final version ended up a little different. :)

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.
