tripleo

cluster recheck interval should be configurable (and lower by default when using pacemaker remote)

Bug #1679753 reported by Michele Baldessari on 2017-04-04

Affects	Status	Importance	Assigned to	Milestone
tripleo	Fix Released	High	Michele Baldessari	pike-1

Bug Description

From https://bugzilla.redhat.com/show_bug.cgi?id=1438879

When a pacemaker-remote node (OS::TripleO::Services::PacemakerRemote profile) is deployed in a composable roles topology, we should lower the pacemaker cluster-recheck-interval property well below its default of 15 minutes. Although we set the reconnect_interval attribute of ocf:pacemaker:remote to 60s, the reconnect is not guaranteed to be attempted more frequently than the cluster-recheck-interval cluster option allows. As a consequence, after a failover the pacemaker_remote node is reported as down for at least those 15 minutes, even though it may be back online within a minute of being fenced.
We already set that property to 60s in the Instance HA topology.
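
To illustrate the interaction described above, here is a deliberately simplified sketch. It assumes the worst-case time before a reconnect is attempted is bounded by whichever of the two intervals is larger; this is a model of the behaviour reported here, not pacemaker's actual scheduling logic. The function name and the use of plain seconds are illustrative only.

```python
def worst_case_reconnect_delay(reconnect_interval_s: int,
                               recheck_interval_s: int) -> int:
    """Simplified model: a reconnect is not guaranteed to happen more
    often than the cluster recheck, so the larger interval dominates."""
    return max(reconnect_interval_s, recheck_interval_s)


# reconnect_interval=60s with the default 15-minute recheck interval
default_case = worst_case_reconnect_delay(60, 15 * 60)

# reconnect_interval=60s with the recheck interval lowered to 60s
lowered_case = worst_case_reconnect_delay(60, 60)

print(default_case, lowered_case)
```

Under this model, lowering cluster-recheck-interval to 60s is what lets the 60s reconnect_interval take effect.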

Steps to Reproduce:
1. Deploy composable roles with a node which has OS::TripleO::Services::PacemakerRemote
2. Reset the node

Actual results:
The node is reported as offline for more than 15 minutes even though it comes back online within a minute.

Expected results:
Quicker recovery.

We also want to make this configurable so that an operator can always override it.

OpenStack Infra (hudson-openstack) wrote on 2017-04-04: Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/453250

Changed in tripleo:
status:	Triaged → In Progress

Emilien Macchi (emilienm) on 2017-04-05

Changed in tripleo:
milestone:	ongoing → pike-1

OpenStack Infra (hudson-openstack) wrote on 2017-04-07: Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/453250
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f464e9f703b824f8971ade50c32884748caffefc
Submitter: Jenkins
Branch: master

commit f464e9f703b824f8971ade50c32884748caffefc
Author: Michele Baldessari <email address hidden>
Date: Tue Apr 4 18:15:06 2017 +0200

Make the cluster-check property configurable

    This change will make the global cluster-check property configurable
    and will pick a lower default (60s) in case a pacemaker remote node
    is deployed.

    The cluster-recheck-interval is set to default to 15minutes by
    pacemaker. This value is too high when a pacemaker remote service
    is deployed. With this default value a reboot of a pacemaker remote
    node will be reported as offline by pacemaker for up to 15minutes.

    With this change we do the following:
    1) Do nothing in case pacemaker remote is not deployed
    2) When pacemaker remote is deployed and the operator has not
       specified otherwise, we set the recheck interval to 60s.
    3) When the operator specifies the recheck interval we set that.

    Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
    Closes-Bug: #1679753
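
The three cases listed in the commit message can be sketched as a small decision function. This is an illustration only, not the puppet-tripleo implementation: the function and parameter names are hypothetical, and it assumes an operator-supplied value takes precedence whether or not pacemaker remote is deployed (the bug description says the operator should always be able to override).

```python
from typing import Optional

# Default picked by the fix when pacemaker remote is deployed.
REMOTE_DEFAULT_RECHECK = "60s"


def resolve_recheck_interval(remote_deployed: bool,
                             operator_value: Optional[str] = None) -> Optional[str]:
    """Return the cluster-recheck-interval to set, or None to leave
    pacemaker's own default (15 minutes) untouched."""
    # Case 3: the operator specified a value, so use it (assumed to win).
    if operator_value is not None:
        return operator_value
    # Case 2: pacemaker remote is deployed and nothing was specified.
    if remote_deployed:
        return REMOTE_DEFAULT_RECHECK
    # Case 1: pacemaker remote is not deployed, so do nothing.
    return None


print(resolve_recheck_interval(remote_deployed=True))
print(resolve_recheck_interval(remote_deployed=True, operator_value="30s"))
print(resolve_recheck_interval(remote_deployed=False))
```

Returning None for the first case mirrors "do nothing": the property is simply not managed, so the cluster keeps whatever value it already has.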

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2017-04-07: Fix proposed to puppet-tripleo (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/454554

OpenStack Infra (hudson-openstack) wrote on 2017-04-07: Fix merged to puppet-tripleo (stable/ocata)

Reviewed: https://review.openstack.org/454554
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=0a12215f908acaabba3de5ca32cea76d8a990494
Submitter: Jenkins
Branch: stable/ocata

commit 0a12215f908acaabba3de5ca32cea76d8a990494
Author: Michele Baldessari <email address hidden>
Date: Tue Apr 4 18:15:06 2017 +0200

Make the cluster-check property configurable

    This change will make the global cluster-check property configurable
    and will pick a lower default (60s) in case a pacemaker remote node
    is deployed.

    Change-Id: I900952b33317b7998a1f26a65f4d70c1726df19c
    Closes-Bug: #1679753
    (cherry picked from commit f464e9f703b824f8971ade50c32884748caffefc)