OpenStack-Ansible

External CI failures due to SSH issues

Bug #1404343 reported by Hugh Saunders on 2014-12-19

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack-Ansible	Fix Released	High	Hugh Saunders
Juno	Fix Released	High	Jesse Pretorius	OpenStack-Ansible 10.1.2
Trunk	Fix Released	High	Hugh Saunders

Bug Description

Many runs fail due to SSH issues, for example:

16:27:58 fatal: [node17_kibana_container-169f3f0f] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh
16:27:58 fatal: [node17_logstash_container-4e8eba29] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

I have a couple of theories so far.

1) Multiplexing issues
1.1) I have caught one exception where the ssh client dies because it can't connect to the ControlMaster to request a new session. That's a client-side issue, and I can't think of a solution for it apart from disabling multiplexing (controlmaster=no).

1.2) I have no evidence for this, but connections could be failing if MaxSessions is hit, which is plausible as MaxSessions defaults to 10 and forks is 15. This could happen when a task targets all containers and is then delegated to the host.

2) Too many unauthenticated connections
I have enabled ssh logging on the hosts and run ansible with -vvvv to try to determine the cause, but this is not straightforward. The server reports "didn't receive identification string from client" and the client reports "connection closed by server" (because identification wasn't received). So the client is sending identification but the server isn't receiving it. One possibility is that MaxStartups is being hit, though this seems unlikely if ControlMaster is set to auto.
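
To make theory 1.2 and the MaxStartups idea easier to check, here is a minimal Python sketch that compares Ansible's forks value with the limits found in an sshd_config. It is only an illustration: the function names are hypothetical, the fallback of 10 is the default quoted in this report, and Match blocks are not handled.

```python
def parse_limits(sshd_config_text, default=10):
    """Return MaxSessions and MaxStartups from sshd_config-style text.

    Assumption: `default` is the value quoted in this bug report.
    """
    limits = {}
    for raw in sshd_config_text.splitlines():
        # Drop comments, then split "Keyword value" on whitespace.
        words = raw.split("#", 1)[0].split()
        if len(words) < 2:
            continue
        key = words[0].lower()
        # sshd uses the first value it finds for a keyword, so keep the first.
        if key in ("maxsessions", "maxstartups") and key not in limits:
            # MaxStartups may be written start:rate:full; keep the first field.
            limits[key] = int(words[1].split(":")[0])
    limits.setdefault("maxsessions", default)
    limits.setdefault("maxstartups", default)
    return limits


def limits_below_forks(sshd_config_text, forks):
    """Return the names of the limits that are smaller than `forks`."""
    limits = parse_limits(sshd_config_text)
    return sorted(name for name, value in limits.items() if value < forks)
```

With forks at 15 and an sshd_config that sets neither option, both limits are reported as too low, which is the situation theory 1.2 describes.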

Jesse Pretorius (jesse-pretorius) on 2014-12-22

Changed in openstack-ansible:
importance:	Undecided → High
assignee:	nobody → Hugh Saunders (hughsaunders)

OpenStack Infra (hudson-openstack) on 2014-12-22

Changed in openstack-ansible:
status:	New → In Progress

Miguel Alejandro Cantu (miguel-cantu) wrote on 2014-12-29:

I saw the same thing happen when we were running the playbooks on a 160-node cluster. Some tasks within the plays would intermittently fail with this error, forcing us to re-run the play with the --rejoin flag. At first I thought we were saturating the network. You can't see this problem unless you run the plays on a big cluster. I'll keep an eye out for any of the theories stated above the next time I'm running the playbooks on this cluster.

OpenStack Infra (hudson-openstack) wrote on 2015-01-05: Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/143151
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=1dae4cca9f9cfb8230385c303f6c1b0292934bf6
Submitter: Jenkins
Branch: master

commit 1dae4cca9f9cfb8230385c303f6c1b0292934bf6
Author: Hugh Saunders <email address hidden>
Date: Fri Dec 19 16:55:30 2014 +0000

    Add configure ssh task

    This task will run on all nodes as part of the common role, it will:
     * enable logging to /var/log/sshd
     * set MaxSessions to 500 (Default 10)
     * set MaxStartups to 500 (Default 10)

    I believe that the low default values of MaxS* may be causing us problems
    when we are delegating container tasks to the host, and forks > MaxS*.

    Logging Bug:
    Closes-Bug: #1404219

    Gate Ssh Failures Bug:
    Closes-Bug: #1404343

    Change-Id: Ia97e2a90d5a9fed0d9bfdd3e575c8cdc01b83dab
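
As a rough illustration of the settings this commit message describes, the Python sketch below rewrites sshd_config text so that the given options are set. This is not the Ansible task from the commit, which is not shown in this report; the function name is hypothetical and Match blocks are not handled.

```python
def set_sshd_options(config_text, options):
    """Return config_text with each option in `options` set exactly once."""
    wanted = {name.lower(): f"{name} {value}" for name, value in options.items()}
    seen = set()
    body = []
    for line in config_text.splitlines():
        words = line.split()
        key = words[0].lower() if words else ""
        if key in wanted:
            # Replace the first occurrence and drop any later duplicates.
            if key not in seen:
                body.append(wanted[key])
                seen.add(key)
        else:
            body.append(line)
    # Options that were absent go at the top so they cannot end up
    # inside a trailing Match block.
    missing = [text for key, text in wanted.items() if key not in seen]
    return "\n".join(missing + body) + "\n"
```

For example, calling it with {"MaxSessions": 500, "MaxStartups": 500} on a file that only sets MaxSessions replaces that line and adds MaxStartups at the top. sshd would still need to be reloaded for the new values to take effect.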

Changed in openstack-ansible:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2015-01-06: Fix proposed to os-ansible-deployment (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145205

OpenStack Infra (hudson-openstack) wrote on 2015-01-06: Change abandoned on os-ansible-deployment (master)

Change abandoned by Hugh Saunders (<email address hidden>) on branch: master
Review: https://review.openstack.org/145205
Reason: Abandoning in favour of https://review.openstack.org/#q,Ib3d9e86b6434f686eba1cbea08b5ee8a8cb74a50,n,z

Jesse Pretorius (jesse-pretorius) on 2015-02-03

Changed in openstack-ansible:
milestone:	none → next

OpenStack Infra (hudson-openstack) wrote on 2015-02-03: Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/152454

OpenStack Infra (hudson-openstack) wrote on 2015-02-03: Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/152454
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=c5d488059d9407f1b9b96552159ffc298c8dc547
Submitter: Jenkins
Branch: juno

commit c5d488059d9407f1b9b96552159ffc298c8dc547
Author: Hugh Saunders <email address hidden>
Date: Fri Dec 19 16:55:30 2014 +0000

    Add configure ssh task

    This task will run on all nodes as part of the common role, it will:
     * enable logging to /var/log/sshd
     * set MaxSessions to 500 (Default 10)
     * set MaxStartups to 500 (Default 10)

    I believe that the low default values of MaxS* may be causing us problems
    when we are delegating container tasks to the host, and forks > MaxS*.

    Logging Bug:
    Closes-Bug: #1404219

    Gate Ssh Failures Bug:
    Closes-Bug: #1404343

    Change-Id: Ia97e2a90d5a9fed0d9bfdd3e575c8cdc01b83dab
    (cherry picked from commit 1dae4cca9f9cfb8230385c303f6c1b0292934bf6)

OpenStack Infra (hudson-openstack) wrote on 2015-02-16: Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/152936
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=e9f7a0dec128ffc0244f11196642e290c1eb727d
Submitter: Jenkins
Branch: master

commit e9f7a0dec128ffc0244f11196642e290c1eb727d
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.

    Change-Id: Ic187fb154cfa7b6fa95b19bee4757ec976f3f368
    Co-Authored-By: Hugh Saunders <email address hidden>
    Closes-Bug: #1404343
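
The retry behaviour this commit message describes can be sketched in plain Python as follows. This is not the ssh_retry plugin's code, which is not shown in this report. The function and parameter names are hypothetical, and the sketch assumes that retries counts total connection attempts and that a failed connection surfaces as OSError.

```python
import time


def connect_with_retries(connect, retries=3, delay=0.0, sleep=time.sleep):
    """Call connect() until it succeeds or `retries` attempts have failed.

    Assumption: `retries` is the total number of attempts, not the number
    of extra attempts after the first failure.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            # Pause between attempts, but not after the final failure.
            if attempt < retries:
                sleep(delay)
    raise last_error
```

A connection that fails on the first attempt but succeeds on a later one is returned normally; only when every attempt fails is the last error raised, which is when the task would be reported as failed.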

OpenStack Infra (hudson-openstack) wrote on 2015-02-24: Fix proposed to os-ansible-deployment (icehouse)

Fix proposed to branch: icehouse
Review: https://review.openstack.org/158675

OpenStack Infra (hudson-openstack) wrote on 2015-02-24: Fix proposed to os-ansible-deployment (juno)

Fix proposed to branch: juno
Review: https://review.openstack.org/158676

OpenStack Infra (hudson-openstack) wrote on 2015-02-24: Fix merged to os-ansible-deployment (icehouse)

Reviewed: https://review.openstack.org/158675
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=2337284c16efdbc6c3768c401c805761fb987a42
Submitter: Jenkins
Branch: icehouse

commit 2337284c16efdbc6c3768c401c805761fb987a42
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.

    Change-Id: Ic187fb154cfa7b6fa95b19bee4757ec976f3f368
    Co-Authored-By: Hugh Saunders <email address hidden>
    Closes-Bug: #1404343
    (cherry picked from commit e9f7a0dec128ffc0244f11196642e290c1eb727d)

OpenStack Infra (hudson-openstack) wrote on 2015-02-24: Fix merged to os-ansible-deployment (juno)

Reviewed: https://review.openstack.org/158676
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=10562b410d1f726dbf656dd566ea5604b29d9c1b
Submitter: Jenkins
Branch: juno

commit 10562b410d1f726dbf656dd566ea5604b29d9c1b
Author: git-harry <email address hidden>
Date: Wed Feb 4 14:41:35 2015 +0000

    Add ssh_retry connection plugin

    The default ssh connection plugin will cause a task to fail if a
    connection cannot be made first time. The failures have been found to
    cause a number of builds to fail.

    This patch adds a new connection plugin called ssh_retry and sets it as
    the default one to use.

    The plugin can be enabled by setting the following options in
    ansible.cfg:

        [defaults]
        connection_plugins = plugins/connection_plugins
        transport = ssh_retry

        [ssh_retry]
        retries = 3

    Note, the default retries is 3.
