cluster recheck interval should be configurable (and lower by default when using pacemaker remote)

Bug #1679753 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari

Bug Description

From https://bugzilla.redhat.com/show_bug.cgi?id=1438879

When a pacemaker-remote node (OS::TripleO::Services::PacemakerRemote profile) is deployed in a composable roles topology, we should lower the pacemaker cluster-recheck-interval property well below its default of 15 minutes. Although we set the reconnect_interval attribute of ocf:pacemaker:remote to 60s, the reconnect is not guaranteed to be attempted more frequently than the value of the cluster-recheck-interval cluster option. As a consequence, after a failover the pacemaker_remote node is reported as down for at least those 15 minutes, even though it may be back online within a minute of being fenced.
We already set that property to 60s in the Instance HA topology.
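
To make the interaction between the two timers concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model in which a reconnect is attempted no more often than the larger of reconnect_interval and cluster-recheck-interval. The function name and the model are illustrative assumptions, not pacemaker's actual scheduling logic.

```python
# Simplified, hypothetical model of the behaviour described above:
# a reconnect to the remote node is not attempted more often than the
# larger of the two intervals. This is NOT pacemaker's real scheduler.

DEFAULT_RECHECK_S = 15 * 60  # pacemaker default cluster-recheck-interval (15 minutes)
RECONNECT_S = 60             # reconnect_interval set on ocf:pacemaker:remote


def effective_reconnect_period(reconnect_interval_s: int,
                               cluster_recheck_interval_s: int) -> int:
    """Return the assumed period (seconds) between reconnect attempts."""
    return max(reconnect_interval_s, cluster_recheck_interval_s)


# With the default recheck interval, the 60s reconnect_interval is masked.
default_period = effective_reconnect_period(RECONNECT_S, DEFAULT_RECHECK_S)

# With the recheck interval lowered to 60s, the reconnect_interval is honoured.
lowered_period = effective_reconnect_period(RECONNECT_S, 60)
```

Under this model, lowering cluster-recheck-interval to 60s is what lets the 60s reconnect_interval take effect.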

Steps to Reproduce:
1. Deploy composable roles with a node which has OS::TripleO::Services::PacemakerRemote
2. Reset the node

Actual results:
The node is reported as offline for more than 15 minutes even though it comes back online within a minute

Expected results:
Quicker recovery.

We also want to make this configurable so that an operator can always override it.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/453250

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: ongoing → pike-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/453250
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f464e9f703b824f8971ade50c32884748caffefc
Submitter: Jenkins
Branch: master

commit f464e9f703b824f8971ade50c32884748caffefc
Author: Michele Baldessari <email address hidden>
Date: Tue Apr 4 18:15:06 2017 +0200

    Make the cluster-check property configurable

    This change will make the global cluster-check property configurable
    and will pick a lower default (60s) in case a pacemaker remote node
    is deployed.

    The cluster-recheck-interval is set to default to 15minutes by
    pacemaker. This value is too high when a pacemaker remote service
    is deployed. With this default value a reboot of a pacemaker remote
    node will be reported as offline by pacemaker for up to 15minutes.

    With this change we do the following:
    1) Do nothing in case pacemaker remote is not deployed
    2) When pacemaker remote is deployed and the operator has not
       specified otherwise, we set the recheck interval to 60s.
    3) When the operator specifies the recheck interval we set that.

    Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
    Closes-Bug: #1679753
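
The three cases listed in the commit message above can be sketched as a small Python function. This is only an illustration of the decision logic, not the puppet-tripleo implementation: the function and constant names are hypothetical, and it assumes that an operator-supplied value takes precedence whether or not pacemaker remote is deployed.

```python
from typing import Optional

# Default applied when pacemaker remote is deployed (value taken from the
# commit message); the constant name itself is hypothetical.
REMOTE_DEFAULT_RECHECK = "60s"


def resolve_recheck_interval(remote_deployed: bool,
                             operator_value: Optional[str] = None) -> Optional[str]:
    """Return the cluster-recheck-interval to set, or None to leave it alone."""
    # Case 3: an explicit operator setting is used as given
    # (assumed to apply regardless of remote deployment).
    if operator_value is not None:
        return operator_value
    # Case 2: pacemaker remote deployed, no override -> lower default.
    if remote_deployed:
        return REMOTE_DEFAULT_RECHECK
    # Case 1: no pacemaker remote, no override -> do nothing,
    # so pacemaker keeps its own 15 minute default.
    return None
```

A return value of None means the property is not touched at all, which matches the "do nothing" case.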

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/454554

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ocata)

Reviewed: https://review.openstack.org/454554
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=0a12215f908acaabba3de5ca32cea76d8a990494
Submitter: Jenkins
Branch: stable/ocata

commit 0a12215f908acaabba3de5ca32cea76d8a990494
Author: Michele Baldessari <email address hidden>
Date: Tue Apr 4 18:15:06 2017 +0200

    Make the cluster-check property configurable

    This change will make the global cluster-check property configurable
    and will pick a lower default (60s) in case a pacemaker remote node
    is deployed.

    The cluster-recheck-interval is set to default to 15minutes by
    pacemaker. This value is too high when a pacemaker remote service
    is deployed. With this default value a reboot of a pacemaker remote
    node will be reported as offline by pacemaker for up to 15minutes.

    With this change we do the following:
    1) Do nothing in case pacemaker remote is not deployed
    2) When pacemaker remote is deployed and the operator has not
       specified otherwise, we set the recheck interval to 60s.
    3) When the operator specifies the recheck interval we set that.

    Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
    Closes-Bug: #1679753
    (cherry picked from commit f464e9f703b824f8971ade50c32884748caffefc)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.0.0

This issue was fixed in the openstack/puppet-tripleo 7.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 6.4.0

This issue was fixed in the openstack/puppet-tripleo 6.4.0 release.
