SSH enablement workflow timeout during deploy of a large overcloud using config-download

Bug #1842102 reported by James Slagle
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: James Slagle

Bug Description

Description of problem:
Trying to scale up a 103-node overcloud to 207 nodes, on multiple attempts we see the heat stack update finish successfully, but the deploy fails due to a timeout of the ssh enablement workflow.

ssh admin enablement workflow - TIMED OUT.

We saw the same result even after bumping ENABLE_SSH_ADMIN_TIMEOUT and ENABLE_SSH_ADMIN_SSH_PORT_TIMEOUT from 300 to 600.

James Slagle looked at the ansible log file, and it appears ansible itself succeeded, but in the logs we see:

2019-08-29 16:40:59.867 409069 ERROR mistral.db.utils [req-704d2539-505b-4259-b8a3-9f82c1ffe4da 2a6d10bddc274e00b00ad4d4adeffda5 c67ce78faf0643708bc7b067eb7525bd - default default] DB error detected, operation will be retried: <function on_action_complete at 0x7f1332047140>: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(32, 'Broken pipe'))") [SQL: u'UPDATE action_executions_v2 SET updated_at=%(updated_at)s, state=%(state)s, accepted=%(accepted)s, output=%(output)s WHERE action_executions_v2.id = %(action_executions_v2_id)s'] [parameters: {'output': '{"result": {"log_path": "/tmp/ansible-mistral-actionwGCOhN/ansible.log", "stderr": "ansible-playbook 2.6.11\\n config file = /tmp/ansible-mistral-ac ... (24673890 characters truncated) ... +0000 (0:00:20.791) 0:02:14.533 ******* \\n=============================================================================== \\n", "stdout": ""}}', 'state': 'SUCCESS', 'accepted': 1, 'updated_at': datetime.datetime(2019, 8, 29, 16, 40, 59), 'action_executions_v2_id': u'6712d0f7-0c20-4239-b03d-b4560193bf46'}] (Background on this error at: http://sqlalche.me/e/e3q8)

James feels this could be related to the stdout generated by the command.
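
To illustrate why the size of the captured output could matter: the UPDATE in the log above carries the whole ansible output (about 24.6 million characters truncated) as a single parameter. A statement larger than the server's max_allowed_packet is one well-known cause of "MySQL server has gone away". The sketch below only measures the encoded size of a result shaped like the one in the log. The 16 MiB limit and the helper names are assumptions for illustration; the report does not state the actual server setting.

```python
import json

# Assumed limit, for illustration only. Check the real server value of
# max_allowed_packet; this report does not state it.
ASSUMED_MAX_PACKET_BYTES = 16 * 1024 * 1024


def serialized_size(result: dict) -> int:
    """Return the size in bytes of the JSON-encoded action result."""
    return len(json.dumps(result).encode("utf-8"))


def fits_in_packet(result: dict, limit: int = ASSUMED_MAX_PACKET_BYTES) -> bool:
    """True if the encoded result is smaller than the assumed packet limit."""
    return serialized_size(result) < limit


# Results shaped like the 'output' parameter in the log, with a stderr
# payload of the same order of magnitude as the truncated one.
big = {"result": {"log_path": "ansible.log", "stderr": "x" * 24_673_890, "stdout": ""}}
small = {"result": {"log_path": "ansible.log", "stderr": "", "stdout": ""}}

print(fits_in_packet(big), fits_in_packet(small))
```

Under that assumed limit, the large result would not fit in one packet while the empty one would.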

Version-Release number of selected component (if applicable):
13

How reproducible:
100% on an overcloud of this size

Steps to Reproduce:
1. Deploy a large overcloud using config-download

Actual results:
The deploy fails during the ssh enablement workflow, after a successful heat stack create/update.

Expected results:
SSH enablement should succeed, as should the overcloud deployment.

Additional info:

Changed in tripleo:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → James Slagle (james-slagle)
milestone: none → train-3
tags: added: stein-backport-potential
tags: added: queens-backport-potential rocky-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679475

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/679476

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/679475
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c7d44bc2ca67c40098982c88e9625aacd25b780d
Submitter: Zuul
Branch: master

commit c7d44bc2ca67c40098982c88e9625aacd25b780d
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:11:08 2019 -0400

    Honor trash_output when not using queue

    Previously, trash_output was not honored if a queue was not being used
    to post messages.

    This patch changes the behavior so that trash_output will be honored
    even if a queue is not being used, and all stdout/stderr will be
    discarded.

    Change-Id: I4fccfa0cb2a5382a52d63598f66dae446ff29c25
    Closes-Bug: #1842102
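
The behaviour change described in this commit message can be sketched roughly as follows. This is not the tripleo-common code: the function name, its parameters, and the `patched` switch are hypothetical and exist only to contrast the old and new behaviour.

```python
from typing import Optional


def build_result(stdout: str, stderr: str, log_path: str,
                 trash_output: bool, queue_name: Optional[str] = None,
                 patched: bool = True) -> dict:
    """Build the value an action would return (hypothetical helper).

    patched=False models the old behaviour: trash_output only takes
    effect when a queue is in use. patched=True models the behaviour
    the commit message describes: trash_output is always honored.
    """
    using_queue = queue_name is not None
    discard = trash_output and (patched or using_queue)
    if discard:
        # Drop all captured output; only the log file path is kept.
        stdout, stderr = "", ""
    return {"log_path": log_path, "stdout": stdout, "stderr": stderr}


old = build_result("lots of output", "", "ansible.log",
                   trash_output=True, patched=False)
new = build_result("lots of output", "", "ansible.log",
                   trash_output=True, patched=True)
print(len(old["stdout"]), len(new["stdout"]))
```

With no queue, the old model still returns the full output even though trash_output is set, while the patched model returns empty strings.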

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/680707

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/680708

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/680709

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/679476
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=76d3ccc4457a3c23d5637951b5d26824bc26247e
Submitter: Zuul
Branch: master

commit 76d3ccc4457a3c23d5637951b5d26824bc26247e
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:18:56 2019 -0400

    Use trash_output in create_admin_via_ssh workflow

    When deploying a large amount of nodes, the create_admin_via_ssh
    workflow could fail due to the large amount of ansible output generated.

    This patch updates the tripleo.ansible-playbook action in the workflow
    with trash_output:true so that the output is not saved in the mistral
    DB.

    There is a log file saved already in case the output is needed for debug
    purposes.

    Change-Id: I078b22fb0a0e7116f87419b444b8b4039db73ef8
    Closes-Bug: #1842102
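
Since the output is no longer stored once trash_output is set, debugging relies on the saved log file. A minimal sketch of reading the end of such a log is shown below. The helper name and the default line count are assumptions, and the path to pass in is whatever log_path the action reported.

```python
from collections import deque
from typing import List


def tail_log(log_path: str, max_lines: int = 50) -> List[str]:
    """Return up to the last max_lines lines of a log file (hypothetical helper)."""
    # deque with maxlen keeps only the most recent lines while the file
    # is streamed, so a very large ansible.log is never held in memory.
    with open(log_path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=max_lines)]
```

Streaming the file this way avoids loading tens of megabytes of ansible output just to inspect the final play recap.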

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/680707
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=d7c06cbdf06b43b54c9f47b3fc874b9f8c405ed3
Submitter: Zuul
Branch: stable/stein

commit d7c06cbdf06b43b54c9f47b3fc874b9f8c405ed3
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:11:08 2019 -0400

    Honor trash_output when not using queue

    Previously, trash_output was not honored if a queue was not being used
    to post messages.

    This patch changes the behavior so that trash_output will be honored
    even if a queue is not being used, and all stdout/stderr will be
    discarded.

    Change-Id: I4fccfa0cb2a5382a52d63598f66dae446ff29c25
    Closes-Bug: #1842102
    (cherry picked from commit c7d44bc2ca67c40098982c88e9625aacd25b780d)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/681019

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/681020

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/681021

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.opendev.org/680709
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=84b68fc7afa061f288e93d826f677373f200c7e9
Submitter: Zuul
Branch: stable/queens

commit 84b68fc7afa061f288e93d826f677373f200c7e9
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:11:08 2019 -0400

    Honor trash_output when not using queue

    Previously, trash_output was not honored if a queue was not being used
    to post messages.

    This patch changes the behavior so that trash_output will be honored
    even if a queue is not being used, and all stdout/stderr will be
    discarded.

    Change-Id: I4fccfa0cb2a5382a52d63598f66dae446ff29c25
    Closes-Bug: #1842102
    (cherry picked from commit c7d44bc2ca67c40098982c88e9625aacd25b780d)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.opendev.org/680708
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ace90b9419c0b6766abbb000ef14bae738dfe81e
Submitter: Zuul
Branch: stable/rocky

commit ace90b9419c0b6766abbb000ef14bae738dfe81e
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:11:08 2019 -0400

    Honor trash_output when not using queue

    Previously, trash_output was not honored if a queue was not being used
    to post messages.

    This patch changes the behavior so that trash_output will be honored
    even if a queue is not being used, and all stdout/stderr will be
    discarded.

    Change-Id: I4fccfa0cb2a5382a52d63598f66dae446ff29c25
    Closes-Bug: #1842102
    (cherry picked from commit c7d44bc2ca67c40098982c88e9625aacd25b780d)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/681020
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=02f8c7aece0fb1f52b7412808cfba88d11baaac8
Submitter: Zuul
Branch: stable/stein

commit 02f8c7aece0fb1f52b7412808cfba88d11baaac8
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:18:56 2019 -0400

    Use trash_output in create_admin_via_ssh workflow

    When deploying a large amount of nodes, the create_admin_via_ssh
    workflow could fail due to the large amount of ansible output generated.

    This patch updates the tripleo.ansible-playbook action in the workflow
    with trash_output:true so that the output is not saved in the mistral
    DB.

    There is a log file saved already in case the output is needed for debug
    purposes.

    Change-Id: I078b22fb0a0e7116f87419b444b8b4039db73ef8
    Closes-Bug: #1842102
    (cherry picked from commit 76d3ccc4457a3c23d5637951b5d26824bc26247e)

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.opendev.org/681019
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ba073e4dc31e260df742abc6e02b463ffde3bb8e
Submitter: Zuul
Branch: stable/rocky

commit ba073e4dc31e260df742abc6e02b463ffde3bb8e
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:18:56 2019 -0400

    Use trash_output in create_admin_via_ssh workflow

    When deploying a large amount of nodes, the create_admin_via_ssh
    workflow could fail due to the large amount of ansible output generated.

    This patch updates the tripleo.ansible-playbook action in the workflow
    with trash_output:true so that the output is not saved in the mistral
    DB.

    There is a log file saved already in case the output is needed for debug
    purposes.

    Change-Id: I078b22fb0a0e7116f87419b444b8b4039db73ef8
    Closes-Bug: #1842102
    (cherry picked from commit 76d3ccc4457a3c23d5637951b5d26824bc26247e)

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.opendev.org/681021
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=aed7deb2e099659fa52e0dd31b702e6848a8b08c
Submitter: Zuul
Branch: stable/queens

commit aed7deb2e099659fa52e0dd31b702e6848a8b08c
Author: James Slagle <email address hidden>
Date: Fri Aug 30 12:18:56 2019 -0400

    Use trash_output in create_admin_via_ssh workflow

    When deploying a large amount of nodes, the create_admin_via_ssh
    workflow could fail due to the large amount of ansible output generated.

    This patch updates the tripleo.ansible-playbook action in the workflow
    with trash_output:true so that the output is not saved in the mistral
    DB.

    There is a log file saved already in case the output is needed for debug
    purposes.

    Change-Id: I078b22fb0a0e7116f87419b444b8b4039db73ef8
    Closes-Bug: #1842102
    (cherry picked from commit 76d3ccc4457a3c23d5637951b5d26824bc26247e)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.8.1

This issue was fixed in the openstack/tripleo-common 10.8.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.7.1

This issue was fixed in the openstack/tripleo-common 8.7.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 11.2.0

This issue was fixed in the openstack/tripleo-common 11.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.8.2

This issue was fixed in the openstack/tripleo-common 10.8.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common rocky-eol

This issue was fixed in the openstack/tripleo-common rocky-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common queens-eol

This issue was fixed in the openstack/tripleo-common queens-eol release.
