Overcloud Deploy Hangs During a Large Deployment

Bug #1872823 reported by Luke Short
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Luke Short

Bug Description

Description
===========
When deploying a large number of nodes, the output from Ansible during the config_download_deploy workflow stops. The ansible.log file continues to be written until completion, and it shows that the deployment does complete successfully. However, the tripleoclient CLI hangs indefinitely. This leads operators to think that the deployment is stuck or has failed.

We have ruled out the problem being related to our usage of Python's subprocess to execute the ansible-playbook command. This means that the issue is definitely isolated to Mistral and/or Zaqar. It is believed that the large amount of data held in memory/buffers overloads the services when they send messages to/from Zaqar.
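For readers unfamiliar with the subprocess pattern mentioned above, here is a minimal sketch of streaming a child process's output line by line. It is not the tripleoclient implementation; the function name `stream_command` is hypothetical, and the demo runs a trivial Python child rather than ansible-playbook.

```python
import subprocess
import sys


def stream_command(cmd):
    """Run cmd, echo its combined stdout/stderr line by line.

    Returns a (returncode, lines) tuple. Hypothetical helper for illustration.
    """
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so only one pipe needs draining
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    ) as proc:
        # Reading until EOF keeps the pipe drained, so the child never
        # blocks on a full pipe buffer.
        for line in proc.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()
            lines.append(line.rstrip("\n"))
    # Leaving the "with" block waits for the child, so returncode is set.
    return proc.returncode, lines


if __name__ == "__main__":
    rc, _ = stream_command([sys.executable, "-c", "print('TASK [demo]')"])
    print("exit status:", rc)
```

Draining a single merged pipe until EOF is the usual way to avoid a deadlock in the parent process itself, which is the kind of failure that was ruled out here.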

As a workaround, running `openstack overcloud deploy --quiet` can help. The config-download playbooks can also be run manually.

Steps to reproduce
==================
This issue is difficult to reproduce consistently. The more Overcloud nodes, the more likely it is to happen.

* Deploy an Overcloud with at least 50 Compute nodes using Train.

Expected result
===============
The full Ansible output should be displayed and tripleoclient should exit with status 0.

Actual result
=============
Normally, during step 4 or 5, the CLI hangs and stops outputting the Ansible stdout/stderr. It stops at a different, random point in every re-deployment. No errors are reported in the Mistral or Zaqar logs.
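Because ansible.log keeps being written while the CLI is silent, an operator could check the log directly to see whether a run has finished. The sketch below is only an illustration: the helper name `summarize_ansible_log` is hypothetical, the location of ansible.log is not given in this report, and the sample line prefix (timestamp, `p=`, `u=`, `n=ansible |`) is an assumption about the log layout. It relies only on Ansible's standard `PLAY RECAP` section and its per-host `unreachable=` / `failed=` counters.

```python
import re

# Matches a per-host recap entry such as:
#   compute-0 : ok=10 changed=2 unreachable=0 failed=0 skipped=1 ...
RECAP_HOST = re.compile(
    r"(?P<host>\S+)\s+:\s+ok=\d+.*?"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def summarize_ansible_log(text):
    """Report whether the log contains a PLAY RECAP and which hosts failed.

    Hypothetical helper: only the last PLAY RECAP section is inspected.
    """
    marker = text.rfind("PLAY RECAP")
    if marker == -1:
        # No recap yet: the playbook is still running or never finished.
        return {"finished": False, "failed_hosts": []}
    failed_hosts = []
    for line in text[marker:].splitlines():
        match = RECAP_HOST.search(line)
        if match and (int(match["unreachable"]) or int(match["failed"])):
            failed_hosts.append(match["host"])
    return {"finished": True, "failed_hosts": failed_hosts}


if __name__ == "__main__":
    # Assumed log line prefix; real logs may differ.
    prefix = "2020-04-14 12:00:00,000 p=1 u=mistral n=ansible | "
    sample = "\n".join([
        prefix + "PLAY RECAP *********",
        prefix + "compute-0 : ok=10 changed=2 unreachable=0 failed=0 skipped=1",
        prefix + "compute-1 : ok=8 changed=2 unreachable=0 failed=1 skipped=1",
    ])
    print(summarize_ansible_log(sample))
```

A recap with no failed or unreachable hosts suggests that the playbook run completed even though the CLI produced no further output. If several playbooks write to the same log, a recap may mark the end of only one of them.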

Environment
===========
All OpenStack releases using Mistral and Zaqar for the deployment (<= Train).

Logs & Configs
==============
BZ with more information: https://bugzilla.redhat.com/show_bug.cgi?id=1792500

OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (master)

Fix proposed to branch: master
Review: https://review.opendev.org/720083

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/720845

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/724181

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/train)

Change abandoned by Luke Short (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/720845
Reason: I will hold off on this patch to see what the results of using only https://review.opendev.org/#/c/720083/ provides. This patch may be unnecessary.

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc1 → ussuri-rc3
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (master)

Change abandoned by Luke Short (<email address hidden>) on branch: master
Review: https://review.opendev.org/720083

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/728974

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/728974
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5c3e736e409e661b7e1db51749719eafb86f2f9a
Submitter: Zuul
Branch: master

commit 5c3e736e409e661b7e1db51749719eafb86f2f9a
Author: Luke Short <email address hidden>
Date: Mon May 18 14:06:47 2020 -0400

    Allow the Mistral tunnel timeout to be configurable.

    Change-Id: Ibfd5587476d5a411206f62e8b4b886db662bf7d1
    Related-Bug: #1872823
    Signed-off-by: Luke Short <email address hidden>

tags: added: queens-backport-potential train-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/730805

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/731031

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/730805
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9bd8331053afbd2a3dd7056d2ba377641ac8dd7a
Submitter: Zuul
Branch: stable/train

commit 9bd8331053afbd2a3dd7056d2ba377641ac8dd7a
Author: Luke Short <email address hidden>
Date: Mon May 18 14:06:47 2020 -0400

    Allow the Mistral tunnel timeout to be configurable.

    Change-Id: Ibfd5587476d5a411206f62e8b4b886db662bf7d1
    Related-Bug: #1872823
    Signed-off-by: Luke Short <email address hidden>
    (cherry picked from commit 5c3e736e409e661b7e1db51749719eafb86f2f9a)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/731031
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=ec4e58927f33c77f65ec3b7a8f580b54cf4589e4
Submitter: Zuul
Branch: stable/ussuri

commit ec4e58927f33c77f65ec3b7a8f580b54cf4589e4
Author: Luke Short <email address hidden>
Date: Mon May 18 14:06:47 2020 -0400

    Allow the Mistral tunnel timeout to be configurable.

    Change-Id: Ibfd5587476d5a411206f62e8b4b886db662bf7d1
    Related-Bug: #1872823
    Signed-off-by: Luke Short <email address hidden>
    (cherry picked from commit 5c3e736e409e661b7e1db51749719eafb86f2f9a)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/train)

Change abandoned by Luke Short (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/724181
Reason: Abandoning due to preferring the patches mentioned in my previous comment.

Luke Short (ekultails) wrote :

This issue has also been reported a few times recently on small deployments. Adding the `--quiet` argument continues to be the recommended workaround.

Changed in tripleo:
milestone: victoria-1 → victoria-3
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/train)

Change abandoned by Luke Short (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/720845

Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Rabi Mishra (rabi) wrote :
Rabi Mishra (rabi) wrote :
Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 12.5.0

This issue was fixed in the openstack/python-tripleoclient 12.5.0 release.
