External CI failures due to SSH issues

Bug #1404343 reported by Hugh Saunders
This bug affects 1 person
Affects            Status        Importance  Assigned to      Milestone
OpenStack-Ansible  Fix Released  High        Hugh Saunders
Juno               Fix Released  High        Jesse Pretorius
Trunk              Fix Released  High        Hugh Saunders

Bug Description

Many runs fail due to SSH issues, for example:

    16:27:58 fatal: [node17_kibana_container-169f3f0f] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh
    16:27:58 fatal: [node17_logstash_container-4e8eba29] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

I have a couple of theories so far.

1) Multiplexing issues
1.1) I have caught one exception where the ssh client dies because it can't connect to the controlmaster to request a new session. That's a client-side issue which I can't think of a solution for, apart from disabling multiplexing (controlmaster=no).

1.2) I have no evidence for this, but connections could be failing if MaxSessions is hit, which is plausible as MaxSessions defaults to 10 and forks is 15. This could happen when a task targets all containers and is then delegated to the host.

2) Too many unauthenticated connections
I have enabled ssh logging on the hosts and run ansible with -vvvv to try to determine the cause, but this is not straightforward. The server reports "didn't receive identification string from client" and the client reports "connection closed by server" (because identification wasn't received). So the client is sending identification but the server isn't receiving it. One possibility is that MaxStartups is being hit, though this seems unlikely if ControlMaster is set to auto.
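
Theories 1.2 and 2 both come down to comparing the sshd limits on a host with the number of Ansible forks. The following is a minimal sketch of that comparison, not something taken from the playbooks: the function names are hypothetical, the defaults of 10 are the ones quoted in this report (check sshd_config(5) for your OpenSSH version), and Match blocks are ignored.

```python
import re

# Defaults as quoted in this report; they may differ between OpenSSH versions.
REPORTED_DEFAULTS = {"maxsessions": 10, "maxstartups": 10}


def sshd_limits(config_text):
    """Return the MaxSessions/MaxStartups values found in sshd_config text."""
    limits = dict(REPORTED_DEFAULTS)
    seen = set()
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) != 2:
            continue
        key = parts[0].lower()
        if key == "match":
            break  # conditional blocks are out of scope for this sketch
        if key in limits and key not in seen:
            # sshd uses the first value it finds for each keyword.
            # MaxStartups may be "start:rate:full"; only "start" is compared.
            limits[key] = int(parts[1].split(":")[0])
            seen.add(key)
    return limits


def limits_below_forks(config_text, forks):
    """List the limits that are lower than the number of Ansible forks."""
    return sorted(k for k, v in sshd_limits(config_text).items() if v < forks)


# With no explicit settings and forks = 15, both limits are candidates.
print(limits_below_forks("", 15))
```

A non-empty result only shows that a limit could be hit; it does not prove that it is the cause of the failures.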

Changed in openstack-ansible:
importance: Undecided → High
assignee: nobody → Hugh Saunders (hughsaunders)
Changed in openstack-ansible:
status: New → In Progress
Miguel Alejandro Cantu (miguel-cantu) wrote :

I saw the same thing happen when we were running the playbooks on a 160-node cluster. Some tasks within the plays would intermittently fail with this error, forcing us to re-run the play with the --rejoin flag. At first I thought we were saturating the network. You can't see this problem unless you run the plays on a big cluster. I'll keep an eye out for any of the theories stated above next time I'm running the playbooks on this cluster.

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/143151
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=1dae4cca9f9cfb8230385c303f6c1b0292934bf6
Submitter: Jenkins
Branch: master

commit 1dae4cca9f9cfb8230385c303f6c1b0292934bf6
Author: Hugh Saunders <email address hidden>
Date: Fri Dec 19 16:55:30 2014 +0000

    Add configure ssh task

    This task will run on all nodes as part of the common role, it will:
     * enable logging to /var/log/sshd
     * set MaxSessions to 500 (Default 10)
     * set MaxStartups to 500 (Default 10)

    I believe that the low default values of MaxS* may be causing us problems
    when we are delegating container tasks to the host, and forks > MaxS*.

    Logging Bug:
    Closes-Bug: #1404219

    Gate Ssh Failures Bug:
    Closes-Bug: #1404343

    Change-Id: Ia97e2a90d5a9fed0d9bfdd3e575c8cdc01b83dab
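
For readers who want to see what the two sshd settings named in the commit message amount to, here is a small sketch that applies them to sshd_config text. It is an illustration only and is not the Ansible task from the commit: `set_sshd_option` is a hypothetical helper, it handles only the whitespace-separated `Keyword value` form, and the sample config is made up.

```python
def set_sshd_option(config_text, keyword, value):
    """Return config_text with `keyword` set to `value`.

    An existing uncommented setting is rewritten in place. Otherwise the
    setting is added at the top, so that it precedes any Match block.
    """
    new_line = f"{keyword} {value}"
    lines = config_text.splitlines()
    for i, raw in enumerate(lines):
        fields = raw.split()
        # sshd keywords are case-insensitive; commented lines start with "#".
        if fields and fields[0].lower() == keyword.lower():
            lines[i] = new_line
            break
    else:
        lines.insert(0, new_line)
    return "\n".join(lines) + "\n"


# Hypothetical starting config, for illustration only.
config = "Port 22\n#MaxSessions 10\nMaxStartups 10\n"
for key in ("MaxSessions", "MaxStartups"):
    config = set_sshd_option(config, key, 500)
print(config)
```

sshd has to be reloaded before new values take effect.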

Changed in openstack-ansible:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145205

OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-ansible-deployment (master)

Change abandoned by Hugh Saunders (<email address hidden>) on branch: master
Review: https://review.openstack.org/145205
Reason: Abandoning in favour of https://review.openstack.org/#q,Ib3d9e86b6434f686eba1cbea08b5ee8a8cb74a50,n,z

Changed in openstack-ansible:
milestone: none → next
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/152454

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/152454
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=c5d488059d9407f1b9b96552159ffc298c8dc547
Submitter: Jenkins
Branch: juno

commit c5d488059d9407f1b9b96552159ffc298c8dc547
Author: Hugh Saunders <email address hidden>
Date: Fri Dec 19 16:55:30 2014 +0000

    Add configure ssh task

    This task will run on all nodes as part of the common role, it will:
     * enable logging to /var/log/sshd
     * set MaxSessions to 500 (Default 10)
     * set MaxStartups to 500 (Default 10)

    I believe that the low default values of MaxS* may be causing us problems
    when we are delegating container tasks to the host, and forks > MaxS*.

    Logging Bug:
    Closes-Bug: #1404219

    Gate Ssh Failures Bug:
    Closes-Bug: #1404343

    Change-Id: Ia97e2a90d5a9fed0d9bfdd3e575c8cdc01b83dab
    (cherry picked from commit 1dae4cca9f9cfb8230385c303f6c1b0292934bf6)

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/152936
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=e9f7a0dec128ffc0244f11196642e290c1eb727d
Submitter: Jenkins
Branch: master

commit e9f7a0dec128ffc0244f11196642e290c1eb727d
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.

    Change-Id: Ic187fb154cfa7b6fa95b19bee4757ec976f3f368
    Co-Authored-By: Hugh Saunders <email address hidden>
    Closes-Bug: #1404343
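
The retry idea described in the commit message can be sketched in a few lines. This is not the ssh_retry plugin's code: `read_retries` and `run_with_retries` are hypothetical names, `ConnectionError` stands in for whatever error the real connection raises, and the sketch assumes that `retries` means additional attempts after the first, which the plugin may count differently.

```python
import configparser
import time

# The ansible.cfg options quoted in the commit message.
ANSIBLE_CFG = """
[defaults]
connection_plugins = plugins/connection_plugins
transport = ssh_retry

[ssh_retry]
retries = 3
"""


def read_retries(cfg_text, default=3):
    """Read [ssh_retry] retries, falling back to the stated default of 3."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getint("ssh_retry", "retries", fallback=default)


def run_with_retries(connect, retries, delay=0.0):
    """Call connect() once, then up to `retries` more times on ConnectionError."""
    for attempt in range(retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: let the task fail as before
            time.sleep(delay)
```

A caller would pass a function that opens the connection, for example `run_with_retries(open_connection, read_retries(ANSIBLE_CFG))`, where `open_connection` is whatever the caller uses to connect.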

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (icehouse)

Fix proposed to branch: icehouse
Review: https://review.openstack.org/158675

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/158676

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (icehouse)

Reviewed: https://review.openstack.org/158675
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=2337284c16efdbc6c3768c401c805761fb987a42
Submitter: Jenkins
Branch: icehouse

commit 2337284c16efdbc6c3768c401c805761fb987a42
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.

    Change-Id: Ic187fb154cfa7b6fa95b19bee4757ec976f3f368
    Co-Authored-By: Hugh Saunders <email address hidden>
    Closes-Bug: #1404343
    (cherry picked from commit e9f7a0dec128ffc0244f11196642e290c1eb727d)

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/158676
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=10562b410d1f726dbf656dd566ea5604b29d9c1b
Submitter: Jenkins
Branch: juno

commit 10562b410d1f726dbf656dd566ea5604b29d9c1b
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.

    Change-Id: Ic187fb154cfa7b6fa95b19bee4757ec976f3f368
    Co-Authored-By: Hugh Saunders <email address hidden>
    Closes-Bug: #1404343
    (cherry picked from commit e9f7a0dec128ffc0244f11196642e290c1eb727d)
