Replacing a controller node takes a very long time (1h)

Bug #1659741 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari
Milestone: none listed

Bug Description

Replacing a controller node takes a long time to complete.

Steps to reproduce, following the procedure documented at:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/director-installation-and-usage/94-replacing-controller-nodes

1. Create a remove.yaml file as described in that document, for example:
   $ cat ~/remove.yaml
   parameters:
     ControllerRemovalPolicies:
         [{'resource_list': ['1']}]
   Here '1' is the index of the node to be replaced.

2. Redeploy the overcloud as described in that document.

The redeployment ran for a long time and then failed:

stack@director$ time openstack overcloud deploy --templates --control-scale 3 -e ~/remove.yaml
Deploying templates in the directory /home/stack/controller_replacetest_templates/templates
Stack failed with status: resources.ControllerNodesPostDeployment: resources.ControllerLoadBalancerDeployment_Step1: Error: resources[3]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
ERROR: openstack Heat Stack update failed.

real 85m53.449s
user 0m2.076s
sys 0m0.320s

The time spent waiting for a controller redeployment should be shortened. Nearly an hour and a half is too long to wait.

The reason is that the exec-wait-for-settle timeout (1 hour) currently cannot be overridden.

Changed in tripleo:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/426113

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/426114

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/426115

tags: added: mitaka-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/426113
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=48692127d09ef577b5b691fc12ccb9055b73759b
Submitter: Jenkins
Branch: master

commit 48692127d09ef577b5b691fc12ccb9055b73759b
Author: Michele Baldessari <email address hidden>
Date: Fri Jan 27 08:10:39 2017 +0100

    Allow the override of pacemaker::corosync::settle_tries

    When replacing a controller node, Exec['wait-for-settle'] needs to
    timeout, which means that the command pcs cluster auth will be executed
    360 times with 10 seconds in between. So that means waiting for an hour
    for no reason. Let's allow to override the settle_tries counter so
    an operator can shorten it accordingly.

    Tested this by setting CorosyncSettleTries to 100 and I correctly get
    proper hiera settings:
    $ hiera pacemaker::corosync::settle_tries
    100

    And effectively we try a number of 100 times as opposed to the 360
    default:
    /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns
    (debug): Exec try 1/100

    Change-Id: I5e21b4215cb0b8686d2059b3d71e2444a96719dc
    Closes-Bug: #1659741
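
As a minimal sketch of how an operator might use the new override, an environment file along the following lines could be passed to the deploy command with an additional -e option. The file name settle-tries.yaml is hypothetical, the parameter_defaults section is assumed to be the right place for the parameter on the branch in use, and the value 100 is simply the one used in the test described in the commit message. With 10 seconds between tries, 100 tries would bound the wait at about 1000 seconds (roughly 17 minutes) instead of 3600 seconds.

```yaml
# ~/settle-tries.yaml (hypothetical file name)
# Assumption: CorosyncSettleTries is accepted under parameter_defaults
# on the tripleo-heat-templates branch being deployed.
parameter_defaults:
  # Number of tries before the settle step gives up.
  # The commit message gives 360 as the default.
  CorosyncSettleTries: 100
```

It would be supplied alongside the removal file, for example by appending -e ~/settle-tries.yaml to the deploy command shown in the bug description. The resulting value can then be checked on a controller with the hiera lookup quoted in the commit message.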

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/426114
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=07490069795056909b8b70d67d5abc75f99b565f
Submitter: Jenkins
Branch: stable/newton

commit 07490069795056909b8b70d67d5abc75f99b565f
Author: Michele Baldessari <email address hidden>
Date: Fri Jan 27 08:10:39 2017 +0100

    Allow the override of pacemaker::corosync::settle_tries

    When replacing a controller node, Exec['wait-for-settle'] needs to
    timeout, which means that the command pcs cluster auth will be executed
    360 times with 10 seconds in between. So that means waiting for an hour
    for no reason. Let's allow to override the settle_tries counter so
    an operator can shorten it accordingly.

    Tested this by setting CorosyncSettleTries to 100 and I correctly get
    proper hiera settings:
    $ hiera pacemaker::corosync::settle_tries
    100

    And effectively we try a number of 100 times as opposed to the 360
    default:
    /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns
    (debug): Exec try 1/100

    Change-Id: I5e21b4215cb0b8686d2059b3d71e2444a96719dc
    Closes-Bug: #1659741
    (cherry picked from commit 48692127d09ef577b5b691fc12ccb9055b73759b)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.0.0.0rc1

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/mitaka)

Reviewed: https://review.openstack.org/426115
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1b1a64feab45a2af221c746968b3de7be5593b72
Submitter: Jenkins
Branch: stable/mitaka

commit 1b1a64feab45a2af221c746968b3de7be5593b72
Author: Michele Baldessari <email address hidden>
Date: Fri Jan 27 08:35:56 2017 +0100

    Allow the override of pacemaker::corosync::settle_tries

    When replacing a controller node, Exec['wait-for-settle'] needs to
    timeout, which means that the command pcs cluster auth will be executed
    360 times with 10 seconds in between. So that means waiting for an hour
    for no reason. Let's allow to override the settle_tries counter so
    an operator can shorten it accordingly.

    Tested this by setting CorosyncSettleTries to 100 and I correctly get
    proper hiera settings:
    $ hiera pacemaker::corosync::settle_tries
    100

    And effectively we try a number of 100 times as opposed to the 360
    default:
    /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns
    (debug): Exec try 1/100

    Note that a straight cherry-pick from 48692127d09ef577b5b691fc12ccb9055b73759b
    was not possible in this case

    Change-Id: I5e21b4215cb0b8686d2059b3d71e2444a96719dc
    Closes-Bug: #1659741

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 2.2.0

This issue was fixed in the openstack/tripleo-heat-templates 2.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.3.0

This issue was fixed in the openstack/tripleo-heat-templates 5.3.0 release.
