N-> Upgrade: check0 doesn't time out and fails with heat timeout.

Bug #1680477 reported by Sofer Athlan-Guyot
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Sofer Athlan-Guyot

Bug Description

Hi,

During the composable upgrade, I was stuck at step0 for hours:

  overcloud-AllNodesDeploySteps-xbbw5kmcvktz-ControllerUpgrade_Step0-uul7fe3kjozd

and the command left waiting on all three overcloud nodes was:

  ansible|
         |- pcs cluster status

As a matter of fact, pcs status was working on all three nodes, but for some reason the pcsd daemon on controller-2 wasn't responding to the network request sent by "pcs cluster status". More details here: https://bugzilla.redhat.com/show_bug.cgi?id=1439767

After digging with Michele, it appears that we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1292858.

All in all, this makes the check hang forever, which is a very bad user experience.

To have it working again, we had to kill -9 the pcsd on controller-2 and restart it.
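
For illustration, here is a minimal Python sketch of the general idea of bounding such a check with a timeout, so that an unresponsive pcsd produces a quick failure instead of an endless wait. It is not the actual tripleo-heat-templates change. The helper names run_with_timeout and check_cluster_status are hypothetical, and the 120-second default is an assumption borrowed from the "2 minutes" mentioned in the fix for this bug.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=120):
    """Run cmd and return (finished, returncode).

    finished is False (and returncode is None) when the command did not
    complete within timeout_s seconds; subprocess.run kills the child
    process in that case before raising TimeoutExpired.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode


def check_cluster_status(timeout_s=120):
    """Hypothetical check0-style wrapper: fail fast if pcs hangs."""
    finished, returncode = run_with_timeout(
        ["pcs", "cluster", "status"], timeout_s
    )
    if not finished:
        print("pcs cluster status timed out", file=sys.stderr)
        return False
    return returncode == 0
```

With a guard like this, a hung "pcs cluster status" is reported as a failed check after the timeout rather than blocking the upgrade step until the overall heat timeout is reached.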

Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → pike-1
Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/454766

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/454224
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0ea21f51a8128e536404ffd87f741443c9287593
Submitter: Jenkins
Branch: master

commit 0ea21f51a8128e536404ffd87f741443c9287593
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Apr 6 16:55:08 2017 +0200

    Timeout early on pcs cluster status check0 during upgrade.

    There is a window in which pcs cluster status can hang forever[1]. We
    add a timeout during check0 to avoid this situation. 2 minutes should
    be more than enough to get all the pcsd nodes to reply.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1292858

    Closes-Bug: #1680477

    Change-Id: Icb3dc76e031a3d4f26294f37d169f2f61d30973e

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/454766
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=75a5a4681eadf8eff214e81aff65c6ca13e12133
Submitter: Jenkins
Branch: stable/ocata

commit 75a5a4681eadf8eff214e81aff65c6ca13e12133
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Apr 6 16:55:08 2017 +0200

    Timeout early on pcs cluster status check0 during upgrade.

    There is a window in which pcs cluster status can hang forever[1]. We
    add a timeout during check0 to avoid this situation. 2 minutes should
    be more than enough to get all the pcsd nodes to reply.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1292858

    Closes-Bug: #1680477

    Change-Id: Icb3dc76e031a3d4f26294f37d169f2f61d30973e
    (cherry picked from commit 0ea21f51a8128e536404ffd87f741443c9287593)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.2.0

This issue was fixed in the openstack/tripleo-heat-templates 6.2.0 release.
