Overcloud deployment fails with "TimeoutError: Timer expired after 10 seconds" when using ansible 2.6.2

Bug #1806073 reported by Giulio Fidente
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Giulio Fidente

Bug Description

As seen in the deployment log [1], overcloud deployments run with ansible 2.6.2 frequently fail with the timeout error "TimeoutError: Timer expired after 10 seconds".
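
The error is raised when a fact-gathering step exceeds its time budget, and 10 seconds is Ansible's default `gather_timeout`. To illustrate the mechanism only, here is a simplified, hypothetical Python sketch of a SIGALRM-based timeout decorator of that kind. It is not Ansible's actual implementation, `fact_timeout`, `slow_fact` and `fast_fact` are illustrative names, and it only works on Unix and in the main thread.

```python
import functools
import signal
import time


def fact_timeout(seconds):
    """Hypothetical decorator: abort the wrapped call after `seconds` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def _handle_timeout(signum, frame):
                # Same shape as the message seen in the deployment log.
                raise TimeoutError("Timer expired after %d seconds" % seconds)

            previous = signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)  # deliver SIGALRM after `seconds` seconds
            try:
                return func(*args, **kwargs)
            finally:
                # Cancel any pending alarm and restore the previous handler.
                signal.alarm(0)
                signal.signal(signal.SIGALRM, previous)
        return wrapper
    return decorator


@fact_timeout(1)
def slow_fact():
    # Stands in for a fact collector that needs longer than its budget.
    time.sleep(2)
    return "done"


@fact_timeout(1)
def fast_fact():
    # Finishes well within the budget.
    return "done"


if __name__ == "__main__":
    print(fast_fact())
    try:
        slow_fact()
    except TimeoutError as exc:
        print(exc)
```

The work itself is unchanged by the timer; raising the budget (the workaround below) only gives a slow step more time before the alarm fires.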

This appears to be a known issue in ansible 2.6.2 [2]; the only effective workaround is to increase gather_timeout [3].

1. https://logs.rdoproject.org/60/14960/17/check/legacy-rdoinfo-tripleo-master-testing-centos-7-multinode-1ctlr-featureset016/b93ed7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-11-08_18_43_14
2. https://github.com/ansible/ansible/issues/43884
3. https://github.com/ansible/ansible/issues/43884#issuecomment-419650710
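
Besides the `gather_timeout` key in the `[defaults]` section of `ansible.cfg`, Ansible reads the same setting from the `ANSIBLE_GATHER_TIMEOUT` environment variable, which takes precedence over the config file. A minimal Python sketch, assuming you invoke `ansible-playbook` yourself; `playbook_command` and its arguments are hypothetical names, and 30 seconds is only an example value.

```python
import os


def playbook_command(playbook, inventory, gather_timeout=30):
    """Hypothetical helper: build an ansible-playbook call with a larger
    fact-gathering timeout, set only for the child process."""
    env = os.environ.copy()
    # Environment variables override ansible.cfg for this run only.
    env["ANSIBLE_GATHER_TIMEOUT"] = str(gather_timeout)
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    return cmd, env
```

The result would then be passed to `subprocess.run(cmd, env=env, check=True)`. Exporting the variable in your own shell only helps if it reaches the process that actually runs `ansible-playbook`; in this deployment that process runs inside the mistral-executor container, so treat the shell-export route as an assumption to verify.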

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/621207

Changed in tripleo:
assignee: nobody → Giulio Fidente (gfidente)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/621207
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=25183dd0b155fb7c5b3b3c7837fcfdb5df1e6691
Submitter: Zuul
Branch: master

commit 25183dd0b155fb7c5b3b3c7837fcfdb5df1e6691
Author: Giulio Fidente <email address hidden>
Date: Fri Nov 30 17:13:30 2018 +0100

    Increase ansible gather_timeout to 30secs for config-download

    With ansible 2.6.2 the facts gathering run time seems to be much
    longer [1], causing unpredictable failures like [2]

    1. https://github.com/ansible/ansible/issues/43884
    2. https://logs.rdoproject.org/60/14960/17/check/legacy-rdoinfo-tripleo-master-testing-centos-7-multinode-1ctlr-featureset016/b93ed7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-11-08_18_43_14

    Change-Id: Ia4aeac06d4c0e237180e4ba60063828b0d1c5350
    Closes-Bug: 1806073
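
The commit raises the timeout in the `ansible.cfg` that config-download uses. Its diff is not reproduced in this report, so the following is only a hypothetical Python sketch, assuming the file is generated with the standard `configparser` module; `write_ansible_cfg` is an illustrative name, not tripleo-common's API.

```python
import configparser
import os
import tempfile


def write_ansible_cfg(work_dir, gather_timeout=30):
    """Hypothetical helper: write an ansible.cfg with a raised gather_timeout."""
    config = configparser.ConfigParser()
    config.add_section("defaults")
    # 30 seconds instead of Ansible's 10-second default, as in the commit title.
    config.set("defaults", "gather_timeout", str(gather_timeout))
    path = os.path.join(work_dir, "ansible.cfg")
    with open(path, "w") as cfg_file:
        config.write(cfg_file)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        path = write_ansible_cfg(work_dir)
        with open(path) as cfg_file:
            print(cfg_file.read())
```

Ansible picks up such a file when the `ANSIBLE_CONFIG` environment variable points to it, or when it sits in the directory `ansible-playbook` is run from.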

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/622477

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/623069

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/623069
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=44ec3a62676670f19b8f9e6cfc47d4971e54f99a
Submitter: Zuul
Branch: stable/queens

commit 44ec3a62676670f19b8f9e6cfc47d4971e54f99a
Author: Giulio Fidente <email address hidden>
Date: Fri Nov 30 17:13:30 2018 +0100

    Increase ansible gather_timeout to 30secs for config-download

    With ansible 2.6.2 the facts gathering run time seems to be much
    longer [1], causing unpredictable failures like [2]

    1. https://github.com/ansible/ansible/issues/43884
    2. https://logs.rdoproject.org/60/14960/17/check/legacy-rdoinfo-tripleo-master-testing-centos-7-multinode-1ctlr-featureset016/b93ed7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-11-08_18_43_14

    Change-Id: Ia4aeac06d4c0e237180e4ba60063828b0d1c5350
    Closes-Bug: 1806073
    (cherry picked from commit 25183dd0b155fb7c5b3b3c7837fcfdb5df1e6691)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/622477
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=9801115090db18ca1cf7b3f8be61fe04274ac174
Submitter: Zuul
Branch: stable/rocky

commit 9801115090db18ca1cf7b3f8be61fe04274ac174
Author: Giulio Fidente <email address hidden>
Date: Fri Nov 30 17:13:30 2018 +0100

    Increase ansible gather_timeout to 30secs for config-download

    With ansible 2.6.2 the facts gathering run time seems to be much
    longer [1], causing unpredictable failures like [2]

    1. https://github.com/ansible/ansible/issues/43884
    2. https://logs.rdoproject.org/60/14960/17/check/legacy-rdoinfo-tripleo-master-testing-centos-7-multinode-1ctlr-featureset016/b93ed7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-11-08_18_43_14

    Change-Id: Ia4aeac06d4c0e237180e4ba60063828b0d1c5350
    Closes-Bug: 1806073
    (cherry picked from commit 25183dd0b155fb7c5b3b3c7837fcfdb5df1e6691)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/623205
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=55a22c5caf191d092d3cc1c3d558a90365efeb3d
Submitter: Zuul
Branch: master

commit 55a22c5caf191d092d3cc1c3d558a90365efeb3d
Author: Giulio Fidente <email address hidden>
Date: Thu Dec 6 14:38:41 2018 +0100

    Lower mistral-executor nofile to 1024

    Containers inherit file descriptor limit from docker daemon (currently:1048576)
    which is very high causing python2 subprocess to take very long and ansible
    facts gathering to time out.

    This patch defaults nofile limit to 1024 for mistral-executor, like it is
    on the baremetal node.

    Co-Authored-By: Yatin Karel <email address hidden>

    Change-Id: Ia76fcb87fc98fd93d6f487dd40d407c0bc875ffd
    Related-Bug: 1806073
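
A likely mechanism behind the slowdown described in this commit: with `close_fds=True`, Python 2's `subprocess` closes every possible descriptor up to the process limit in the child, one `close()` call each, so the cost grows with `nofile`. Python 3 on Linux instead lists the descriptors that are actually open. The following minimal Python sketch, assuming a Unix host, caps the soft `RLIMIT_NOFILE` for a child process using the standard `resource` module. `run_with_nofile` is a hypothetical helper; the commit itself sets the limit on the mistral-executor container, not in Python code.

```python
import resource
import subprocess


def run_with_nofile(cmd, nofile=1024):
    """Hypothetical helper: run `cmd` with its soft RLIMIT_NOFILE capped at
    `nofile`. The limit is applied in the child between fork() and exec(),
    so the parent keeps its own limits."""
    def _lower_limit():
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Only ever lower the soft limit; lowering below hard is always allowed.
        if soft == resource.RLIM_INFINITY:
            new_soft = nofile
        else:
            new_soft = min(soft, nofile)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

    return subprocess.run(cmd, preexec_fn=_lower_limit, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # `ulimit -n` in the child shell reports the lowered soft limit.
    result = run_with_nofile(["sh", "-c", "ulimit -n"])
    print(result.stdout.strip())
```

At the container level the equivalent knob is the runtime's ulimit setting (for example `docker run --ulimit nofile=1024`), which is the kind of limit this commit applies to mistral-executor.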

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/623961

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/623961
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=19e9b6a2287e9afbe26d06b63c3d2f30efd87cde
Submitter: Zuul
Branch: stable/rocky

commit 19e9b6a2287e9afbe26d06b63c3d2f30efd87cde
Author: Giulio Fidente <email address hidden>
Date: Thu Dec 6 14:38:41 2018 +0100

    Lower mistral-executor nofile to 1024

    Containers inherit file descriptor limit from docker daemon (currently:1048576)
    which is very high causing python2 subprocess to take very long and ansible
    facts gathering to time out.

    This patch defaults nofile limit to 1024 for mistral-executor, like it is
    on the baremetal node.

    Co-Authored-By: Yatin Karel <email address hidden>

    Change-Id: Ia76fcb87fc98fd93d6f487dd40d407c0bc875ffd
    Related-Bug: 1806073
    (cherry picked from commit 55a22c5caf191d092d3cc1c3d558a90365efeb3d)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.2.0

This issue was fixed in the openstack/tripleo-common 10.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.6.7

This issue was fixed in the openstack/tripleo-common 8.6.7 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.5.0

This issue was fixed in the openstack/tripleo-common 9.5.0 release.
