ssh connection failures during deployments

Bug #1479812 reported by Jesse Pretorius
This bug affects 3 people
Affects            Status        Importance  Assigned to      Milestone
OpenStack-Ansible  Fix Released  Low         Jesse Pretorius
Kilo               Fix Released  Low         Jesse Pretorius
Trunk              Fix Released  Low         Jesse Pretorius

Bug Description

Physical deployments, gate checks, and AIO deployments all regularly suffer from SSH connectivity errors between Ansible and its targets, even in an AIO, where the targets are containers on the same host!

This is rather frustrating.

Reducing the number of forks Ansible uses seems to have helped in some cases, so it may improve deployment success.

Tags: in-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (master)

Fix proposed to branch: master
Review: https://review.openstack.org/207474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (master)

Reviewed: https://review.openstack.org/207474
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=7b10c64007a14e78b52b0afcfc2fa2f7b4e17ef3
Submitter: Jenkins
Branch: master

commit 7b10c64007a14e78b52b0afcfc2fa2f7b4e17ef3
Author: Jesse Pretorius <email address hidden>
Date: Thu Jul 30 15:11:14 2015 +0100

    Change ansible forks used

    This patch changes the number of forks used by ansible when
    using any of the convenience (and thus gate check) scripts
    to the number of processors available on the deployment
    system.

    The previous values used were found to cause ssh connection
    errors and it was found that reducing the number improved
    the chances of success.

    This patch also removes the forks setting from ansible.conf
    so that ansible will use the default value when run in any
    other way. This leaves the decision of setting the number
    of forks to the deployer, as it should be.

    Change-Id: I31ad7353344f7994063127ecfce8f4733769234c
    Closes-Bug: #1479812

Changed in openstack-ansible:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-ansible-deployment (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/209426

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-ansible-deployment (kilo)

Reviewed: https://review.openstack.org/209426
Committed: https://git.openstack.org/cgit/stackforge/os-ansible-deployment/commit/?id=e3742c04ba9f29ae95ee0766398e8182dafe2249
Submitter: Jenkins
Branch: kilo

commit e3742c04ba9f29ae95ee0766398e8182dafe2249
Author: Jesse Pretorius <email address hidden>
Date: Thu Jul 30 15:11:14 2015 +0100

    Change ansible forks used

    This patch changes the number of forks used by ansible when
    using any of the convenience (and thus gate check) scripts
    to the number of processors available on the deployment
    system.

    The previous values used were found to cause ssh connection
    errors and it was found that reducing the number improved
    the chances of success.

    This patch also removes the forks setting from ansible.conf
    so that ansible will use the default value when run in any
    other way. This leaves the decision of setting the number
    of forks to the deployer, as it should be.

    Change-Id: I31ad7353344f7994063127ecfce8f4733769234c
    Closes-Bug: #1479812
    (cherry picked from commit 7b10c64007a14e78b52b0afcfc2fa2f7b4e17ef3)

Revision history for this message
Byron McCollum (byron-g-mccollum) wrote :

The SSH MaxStartups default of 10:30:100 can cause SSH connection issues when Ansible uses more than 10 forks.

In v10, MaxStartups was being set to 500 (500:30:500), however this setting is no longer being managed in v11.

The underlying issue is with the way delegate_to works. If a task runs for multiple hosts but is delegated to a single host, Ansible doesn't serialize the delegated tasks for that single host. The result can be a flood of SSH connections to the delegated-to host, up to the number of allowable forks. This is exactly what happens with some of the container management tasks that use delegate_to.

With the MaxStartups default of 10:30:100, if there are more than 10 simultaneous unauthenticated connections, new connections will be refused with a probability of 30%. This percentage increases linearly as you approach the maximum number of simultaneous unauthenticated connections, which is 100 by default (10:30:100).
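
To make the start:rate:full behaviour described above concrete, here is a minimal Python sketch of the refusal probability. It is a model of the documented behaviour, not sshd's own code. The function names are hypothetical. It assumes the three-part form of the setting, and that throttling begins once the count of unauthenticated connections reaches the start value.

```python
def parse_max_startups(value):
    """Split a 'start:rate:full' MaxStartups string into three integers."""
    start, rate, full = (int(part) for part in value.split(":"))
    return start, rate, full


def refusal_percent(unauthenticated, max_startups="10:30:100"):
    """Approximate chance (in percent) that a new connection is refused.

    Assumption: below 'start' nothing is refused, at 'full' everything is
    refused, and in between the chance rises linearly from 'rate' to 100.
    """
    start, rate, full = parse_max_startups(max_startups)
    if unauthenticated < start:
        return 0
    if unauthenticated >= full or rate == 100:
        return 100
    # Linear interpolation between 'rate' at 'start' and 100 at 'full'.
    return rate + (100 - rate) * (unauthenticated - start) // (full - start)


if __name__ == "__main__":
    for pending in (5, 10, 40, 100):
        print(pending, refusal_percent(pending))
```

Under this model, 40 forks all delegating to one host at once could leave each new connection with a chance of refusal well above 30% under the default setting, while a setting of 100:100:100 refuses nothing until 100 unauthenticated connections are pending.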

In an AIO deployment with high container affinity, and a high number of forks, you will encounter lots of SSH connection failures.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

@Byron That makes some sense - resetting this back to new for further investigation.

Revision history for this message
Byron McCollum (byron-g-mccollum) wrote :

Did several AIO tests with OnMetal Compute Instances (40 Cores, 27 Containers)

40 Forks
MaxStartups 10:30:100 (Default)
MaxSessions 10 (Default)
Several SSH connection errors, Several Retries

40 Forks
MaxStartups 100:100:100
MaxSessions 100
0 SSH connection errors, 0 Retries

Revision history for this message
Serge van Ginderachter (svg) wrote :

Interesting and intriguing.

About MaxStartups, I'd expect that with the default ControlMaster=auto there would only be 1 connection to the host, so this would not be a problem.

Could increasing MaxSessions be the relevant solution? Did you test only that change?

Revision history for this message
Byron McCollum (byron-g-mccollum) wrote :

@svg... I'll do another round of testing and see.

Revision history for this message
Byron McCollum (byron-g-mccollum) wrote :

@svg...all three combinations seem to be equally effective compared to the results using the default values, and with very little difference in run time for setup-hosts.yml...

MaxSessions 100
MaxStartups 10:30:100
SSH Errors: 0
Retries: 0
Time: 3m5.850s

MaxSessions 10
MaxStartups 100:100:100
SSH Errors: 0
Retries: 0
Time: 3m11.509s

MaxSessions 100
MaxStartups 100:100:100
SSH Errors: 0
Retries: 0
Time: 3m20.736s

MaxSessions 10
MaxStartups 10:30:100
SSH Errors: 27
Retries: 5 (Fatal)
Time: 7m10.559s

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

@Byron I think it makes sense for us to default to staying within the limits of the default settings for sshd, but then to provide some sort of documented guide around how a deployer can improve performance.

Could you please prepare a patch for review that provides both the adjustment to the convenience scripts and a documentation note? If not, can you perhaps provide a comment in this bug with your recommendations for change?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/229786

Changed in openstack-ansible:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/229786
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=fe3b328023b946413a675a0ffdbafcad353aca44
Submitter: Jenkins
Branch: master

commit fe3b328023b946413a675a0ffdbafcad353aca44
Author: Jesse Pretorius <email address hidden>
Date: Thu Oct 1 10:12:03 2015 +0100

    Limit the number of Ansible forks used to 10

    The default MaxSessions setting for SSHD is 10. Each Ansible fork makes use of
    a Session, so this patch still uses the CPU number to set the number of forks
    used but limits it to 10 forks when the number of CPU's is larger.

    Developer Docs and Install Guide Docs entries have been included.

    Closes-Bug: #1479812
    Change-Id: I9abd33e184c706796ede9963393876a8aae9837c
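
The rule described in the commit message above (use the CPU count, but never more than 10 forks) can be sketched in Python as follows. This is only an illustration of the rule, not the code from the patch, which changes the project's convenience scripts. The function name and the `max_sessions` parameter are hypothetical.

```python
import os

# Default sshd MaxSessions value, as stated in the commit message.
SSHD_DEFAULT_MAX_SESSIONS = 10


def choose_forks(cpu_count=None, max_sessions=SSHD_DEFAULT_MAX_SESSIONS):
    """Return a fork count equal to the CPU count, capped at max_sessions."""
    if cpu_count is None:
        # os.cpu_count() may return None, so fall back to a single fork.
        cpu_count = os.cpu_count() or 1
    return max(1, min(cpu_count, max_sessions))


if __name__ == "__main__":
    print("forks for this machine:", choose_forks())
```

A deployer who has raised MaxSessions on the target hosts could pass a larger `max_sessions` value, in line with the idea of leaving the final choice of forks to the deployer.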

Changed in openstack-ansible:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/232387

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (kilo)

Reviewed: https://review.openstack.org/232387
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=5d2c8b537a3f553a3e6f93d67465415e1564bb00
Submitter: Jenkins
Branch: kilo

commit 5d2c8b537a3f553a3e6f93d67465415e1564bb00
Author: Jesse Pretorius <email address hidden>
Date: Thu Oct 1 10:12:03 2015 +0100

    Limit the number of Ansible forks used to 10

    The default MaxSessions setting for SSHD is 10. Each Ansible fork makes use of
    a Session, so this patch still uses the CPU number to set the number of forks
    used but limits it to 10 forks when the number of CPU's is larger.

    Developer Docs and Install Guide Docs entries have been included.

    Closes-Bug: #1479812
    Change-Id: I9abd33e184c706796ede9963393876a8aae9837c
    (cherry picked from commit fe3b328023b946413a675a0ffdbafcad353aca44)

tags: added: in-kilo
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.11

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.12

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.14

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.15

This issue was fixed in the openstack/openstack-ansible 11.2.15 release.
