Use pacemaker maintenance mode when scaling controllers up/down

Bug #1555203 reported by Bogdan Dobrelya
Affects             Status         Importance  Assigned to                Milestone
Fuel for OpenStack  Fix Committed  Wishlist    Michael Polenchuk
Mitaka              Won't Fix      Wishlist    Fuel Library (Deprecated)
Newton              Fix Committed  Wishlist    Michael Polenchuk

Bug Description

When deploy changes are applied (controllers added or removed), Pacemaker resources stay managed and may be restarted needlessly, causing avoidable downtime for cloud operations.

We should modify the existing cluster deploy tasks, or introduce new ones, to put Corosync/Pacemaker into maintenance mode (MM) during critical operations on the Corosync cluster, namely adding or removing members. While the cluster runs in MM, resources remain unmanaged and will not suffer additional restarts.
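A minimal sketch of the idea, not the eventual fix: wrap the membership change in a context manager that switches the cluster-wide `maintenance-mode` property on and always switches it off again. The helper names (`run_cmd`, `maintenance_cmd`, `maintenance_mode`) are made up for illustration, and the use of `crm_attribute` is an assumption; a deployment might use `crm` or `pcs` instead.

```python
import subprocess
from contextlib import contextmanager


def run_cmd(argv):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)


def maintenance_cmd(enabled):
    """Build the crm_attribute call that sets the cluster-wide property."""
    value = "true" if enabled else "false"
    return ["crm_attribute", "--type", "crm_config",
            "--name", "maintenance-mode", "--update", value]


@contextmanager
def maintenance_mode(runner=run_cmd):
    """Keep resources unmanaged for the duration of the with-block.

    `runner` is injectable so the logic can be exercised without a cluster.
    """
    runner(maintenance_cmd(True))
    try:
        yield
    finally:
        # Hand control back to Pacemaker even if the membership change failed.
        runner(maintenance_cmd(False))
```

Usage would look like `with maintenance_mode(): add_or_remove_members()`, where `add_or_remove_members` stands for whatever step changes the Corosync membership.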

Changed in fuel:
importance: Undecided → High
milestone: none → 9.0
tags: added: area-library corosync ha life-cycle-management pacemaker
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Example snapshot for removing the node-1 controller and adding the node-3 controller. As you can see, there was undesired MySQL downtime on the remaining node-2; see the events after 2016-03-09 15:54:54:
Mar 9 15:55:46 notice: notice: process_lrm_event: Operation p_mysqld_monitor_60000: unknown error (node=node-2.test.domain.local, call=352, rc=1, cib-update=378, confirmed=false)

Matthew Mosesohn (raytrac3r) wrote :

This bug can't be high because deployment isn't broken.

Changed in fuel:
importance: High → Medium
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)
Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
status: In Progress → Confirmed
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

version

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
tags: added: keep-in-9.0
Kyrylo Galanov (kgalanov) wrote :

What I can propose is a new task that runs after rsync_core_puppet. It would set maintenance mode on all (or selected) existing Pacemaker resources. If no Pacemaker resources exist in the system (a new deployment), it will do nothing.
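A rough sketch of how such a task could plan its work, under stated assumptions: the function name `plan_unmanage` is hypothetical, and clearing the `is-managed` meta attribute through `crm_resource` is one possible way to take individual resources out of Pacemaker's control, not necessarily what the task would do. With no existing resources the plan is empty, which matches the "new deployment" case.

```python
def plan_unmanage(existing, selected=None):
    """Return crm_resource commands that mark resources as unmanaged.

    existing: names of resources currently defined in the cluster.
    selected: optional subset to touch; None means all existing resources.
    """
    targets = [r for r in existing if selected is None or r in selected]
    return [
        ["crm_resource", "--resource", name, "--meta",
         "--set-parameter", "is-managed", "--parameter-value", "false"]
        for name in targets
    ]


# New deployment: nothing exists yet, so the task has nothing to run.
print(plan_unmanage([]))
# Existing cluster: only the selected resource is unmanaged.
# "p_other" is a placeholder resource name used for illustration.
print(plan_unmanage(["p_mysqld", "p_other"], selected={"p_mysqld"}))
```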

Changed in fuel:
importance: Medium → Wishlist
tags: added: feature
tags: removed: need-info
Bogdan Dobrelya (bogdando) wrote :

I believe it is the responsibility of the deployment graph builder (Nailgun+Astute) to decide whether to inject maintenance-related tasks into the graph, based on the data changes being applied. For example, insert the Pacemaker MM task into the graph if there are new or removed nodes with the controller role/cluster task assigned. The same applies to the Ceph cluster and the rest of the clusters.
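To illustrate the decision described here, a small sketch of a graph builder that brackets the task list with MM tasks only when a controller is being added or removed. This is not Nailgun or Astute code: `NodeChange`, `needs_pacemaker_mm`, `build_graph`, and the task names `pacemaker_mm_on` / `pacemaker_mm_off` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeChange:
    name: str
    roles: frozenset
    action: str  # "add", "remove", or "keep"


def needs_pacemaker_mm(changes):
    """True if any controller node is being added or removed."""
    return any(
        c.action in ("add", "remove") and "controller" in c.roles
        for c in changes
    )


def build_graph(base_tasks, changes):
    """Return the task list, bracketed by MM tasks only when required."""
    tasks = list(base_tasks)
    if needs_pacemaker_mm(changes):
        tasks = ["pacemaker_mm_on"] + tasks + ["pacemaker_mm_off"]
    return tasks


changes = [
    NodeChange("node-1", frozenset({"controller"}), "remove"),
    NodeChange("node-2", frozenset({"controller"}), "keep"),
    NodeChange("node-3", frozenset({"controller"}), "add"),
]
print(build_graph(["rsync_core_puppet"], changes))
```

The same predicate shape could be reused for other clustered services by swapping the role being checked.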

tags: added: area-python
removed: area-library corosync
Kyrylo Galanov (kgalanov) wrote :

As soon as this feature is ready in Nailgun, we will add the necessary code to the library.

Changed in fuel:
assignee: Kyrylo Galanov (kgalanov) → Fuel Library Team (fuel-library)
tags: removed: keep-in-9.0
Michael Polenchuk (mpolenchuk) wrote :

The following pseudo code will be used for implementation:
  _nodes = add/del(changed)
  (+) set <crm node maintenance> if _nodes && *this not in _nodes
  (-) finalize stage: <crm node ready>
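One possible reading of the pseudo code above, written out as a sketch: a node is put into maintenance only when some nodes are being added or removed and it is not one of them, and the finalize stage returns it to the ready state. The function names are hypothetical, and the merged change may differ in how it invokes `crm node maintenance` and `crm node ready`.

```python
def pre_deploy_cmd(this_node, changed_nodes):
    """Return the command for the (+) step, or None if nothing should run."""
    if changed_nodes and this_node not in changed_nodes:
        return ["crm", "node", "maintenance", this_node]
    return None


def finalize_cmd(this_node):
    """Return the command for the (-) step run at the finalize stage."""
    return ["crm", "node", "ready", this_node]


# Scenario from the earlier comment: node-1 removed, node-3 added, node-2 stays.
changed = {"node-1", "node-3"}
print(pre_deploy_cmd("node-2", changed))  # remaining node goes into maintenance
print(pre_deploy_cmd("node-3", changed))  # a changed node is left alone
print(finalize_cmd("node-2"))
```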

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319932

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/319932
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=cfc5a9bcca26d59b38fd72af2ed6be9e384c3b00
Submitter: Jenkins
Branch: master

commit cfc5a9bcca26d59b38fd72af2ed6be9e384c3b00
Author: Michael Polenchuk <email address hidden>
Date: Mon May 23 16:04:10 2016 +0300

    Scale controllers up/down using pacemaker m-mode

    Put pacemaker into maintenance mode for the critical ops
    (e.g. adding/removing nodes) being done to the cluster.
    Running in maintenance mode resources remain in unmanaged state
    and have no impact by unnecessary restarts.
    Plus excessive reqs (which caused loops) have been removed
    from the following tasks:
      - openstack-haproxy-mysqld
      - conntrackd
      - cluster_health

    Change-Id: Ibe00effa7c9b5c6d8209f977815272447819bc22
    Closes-Bug: #1555203

Changed in fuel:
status: In Progress → Fix Committed