Fuel for OpenStack

Use pacemaker maintenance mode when scaling controllers up/down

Bug #1555203 reported by Bogdan Dobrelya on 2016-03-09

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	Wishlist	Michael Polenchuk	Fuel for OpenStack 10.0
Mitaka	Won't Fix	Wishlist	Fuel Library (Deprecated)	Fuel for OpenStack 9.0
Newton	Fix Committed	Wishlist	Michael Polenchuk	Fuel for OpenStack 10.0

Bug Description

When deployment changes are applied (controllers added or removed), Pacemaker resources stay managed and may be restarted unnecessarily, causing avoidable downtime for cloud operations.

We should modify existing cluster deploy tasks, or introduce new ones, to put Corosync/Pacemaker into maintenance mode (MM) during critical operations on the Corosync cluster, namely adding or removing members. While the cluster runs in MM, resources remain unmanaged and will not suffer additional restarts.

Bogdan Dobrelya (bogdando) on 2016-03-09

Changed in fuel:
importance:	Undecided → High
milestone:	none → 9.0
tags:	added: area-library corosync ha life-cycle-management pacemaker

Oleksiy Molchanov (omolchanov) on 2016-03-09

Changed in fuel:
assignee:	nobody → Fuel Library Team (fuel-library)
status:	New → Confirmed

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2016-03-09:

fuel-snapshot-2016-03-09_16-08-47.tar.xz (45.5 MiB, application/octet-stream)

Example snapshot for removing the node-1 controller and adding the node-3 controller. There was undesired MySQL downtime on the remaining node-2; see the events after 2016-03-09 15:54:54:
Mar 9 15:55:46 notice: notice: process_lrm_event: Operation p_mysqld_monitor_60000: unknown error (node=node-2.test.domain.local, call=352, rc=1, cib-update=378, confirmed=false)

Matthew Mosesohn (raytrac3r) wrote on 2016-03-10:

This bug can't be high because deployment isn't broken.

Changed in fuel:
importance:	High → Medium

Matthew Mosesohn (raytrac3r) on 2016-03-10

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)

Kyrylo Galanov (kgalanov) on 2016-03-11

Changed in fuel:
status:	Confirmed → In Progress

Kyrylo Galanov (kgalanov) on 2016-03-15

Changed in fuel:
status:	In Progress → Confirmed

Bug Checker Bot (bug-checker) wrote on 2016-03-28: Autochecker

(This check was performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data for the bug you are describing:

actual result

version

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags:	added: need-info

Nastya Urlapova (aurlapova) on 2016-03-29

tags:	added: keep-in-9.0

Kyrylo Galanov (kgalanov) wrote on 2016-04-01:

I can propose a new task that runs after rsync_core_puppet. It would set maintenance mode on all (or selected) existing Pacemaker resources. If no Pacemaker resources exist on the system (a new deployment), it would do nothing.
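
A minimal sketch of what such a task might compute, under stated assumptions: the function name `plan_unmanage_commands` and the resource name `p_haproxy` are hypothetical, and crmsh's `crm resource unmanage` stands in for "setting maintenance mode" on a resource, since the bug description equates maintenance mode with resources being unmanaged. The sketch only builds the command list; it does not run anything.

```python
from typing import Iterable, List, Optional, Set


def plan_unmanage_commands(
    existing: Iterable[str], selected: Optional[Set[str]] = None
) -> List[List[str]]:
    """Build one `crm resource unmanage` command per targeted resource.

    existing: resource IDs currently defined in Pacemaker.
    selected: optional subset to restrict to; None means all resources.
    An empty `existing` (a new deployment) yields no commands.
    """
    targets = [r for r in existing if selected is None or r in selected]
    return [["crm", "resource", "unmanage", r] for r in targets]


if __name__ == "__main__":
    # p_mysqld appears in the log above; p_haproxy is a made-up example.
    for cmd in plan_unmanage_commands(["p_mysqld", "p_haproxy"]):
        print(" ".join(cmd))
```

Keeping the planning step separate from execution makes the "do nothing on a new deployment" case easy to check without a running cluster.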

Bogdan Dobrelya (bogdando) on 2016-04-01

Changed in fuel:
importance:	Medium → Wishlist
tags:	added: feature

Bug Checker Bot (bug-checker) on 2016-04-01

tags:	removed: need-info

Bogdan Dobrelya (bogdando) wrote on 2016-04-01:

I believe it is the responsibility of the deployment graph builder (Nailgun + Astute) to decide whether to inject maintenance-related tasks into the graph, based on the data changes being applied. For example, insert the Pacemaker MM task into the graph if there are new or removed nodes with the controller role or cluster task assigned. The same applies to the Ceph cluster and the other clusters.
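
The decision described here could be sketched roughly as follows. This is not Nailgun or Astute code: the node dictionaries, the function names, and the task names `pacemaker_maintenance_on` / `pacemaker_maintenance_off` are all hypothetical, chosen only to illustrate wrapping a task list when the set of controllers changes.

```python
from typing import Dict, List, Set


def controller_names(nodes: List[Dict], role: str = "controller") -> Set[str]:
    """Names of the nodes that carry the given role."""
    return {n["name"] for n in nodes if role in n["roles"]}


def changed_controllers(current: List[Dict], target: List[Dict]) -> Set[str]:
    """Controllers that are added or removed between the two states."""
    return controller_names(current) ^ controller_names(target)


def build_graph(tasks: List[str], current: List[Dict], target: List[Dict]) -> List[str]:
    """Wrap tasks with maintenance on/off only when an existing
    controller cluster gains or loses members."""
    if controller_names(current) and changed_controllers(current, target):
        return ["pacemaker_maintenance_on", *tasks, "pacemaker_maintenance_off"]
    return list(tasks)


if __name__ == "__main__":
    before = [{"name": "node-1", "roles": ["controller"]},
              {"name": "node-2", "roles": ["controller"]}]
    after = [{"name": "node-2", "roles": ["controller"]},
             {"name": "node-3", "roles": ["controller"]}]
    print(build_graph(["deploy"], before, after))
```

The same pattern would extend to other clusters by passing a different `role`, though how roles map to clusters in a real deployment is outside this sketch.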

tags:	added: area-python
	removed: area-library corosync

Kyrylo Galanov (kgalanov) wrote on 2016-04-04:

As soon as this feature is ready in Nailgun, we will add the necessary code to the library.

Changed in fuel:
assignee:	Kyrylo Galanov (kgalanov) → Fuel Library Team (fuel-library)
tags:	removed: keep-in-9.0

Michael Polenchuk (mpolenchuk) wrote on 2016-04-29:

The following pseudo code will be used for implementation:
  _nodes = add/del(changed)
  (+) set <crm node maintenance> if _nodes && *this not in _nodes
  (-) finalize stage: <crm node ready>
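
One possible reading of the pseudo code above, as a small Python sketch. The function names are hypothetical and this is not the merged fuel-library change; it assumes the crmsh subcommands `crm node maintenance <node>` and `crm node ready <node>`, and that each node knows its own name and the set of added/removed nodes. The functions only build commands and do not execute them.

```python
from typing import List, Optional, Set


def maintenance_command(this_node: str, changed_nodes: Set[str]) -> Optional[List[str]]:
    """(+) step: put this node into maintenance only when some nodes
    are being added/removed and this node is not one of them."""
    if changed_nodes and this_node not in changed_nodes:
        return ["crm", "node", "maintenance", this_node]
    return None


def finalize_command(this_node: str) -> List[str]:
    """(-) step, run at the finalize stage: bring the node back to ready."""
    return ["crm", "node", "ready", this_node]


if __name__ == "__main__":
    changed = {"node-1", "node-3"}  # node-1 removed, node-3 added
    for node in ("node-2", "node-3"):
        print(node, maintenance_command(node, changed))
    print(finalize_command("node-2"))
```

With the snapshot scenario from earlier in this report (node-1 removed, node-3 added), only the remaining node-2 would receive a maintenance command.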

OpenStack Infra (hudson-openstack) wrote on 2016-05-23: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319932

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-06-01: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/319932
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=cfc5a9bcca26d59b38fd72af2ed6be9e384c3b00
Submitter: Jenkins
Branch: master

commit cfc5a9bcca26d59b38fd72af2ed6be9e384c3b00
Author: Michael Polenchuk <email address hidden>
Date: Mon May 23 16:04:10 2016 +0300

Scale controllers up/down using pacemaker m-mode

    Put pacemaker into maintenance mode for the critical ops
    (e.g. adding/removing nodes) being done to the cluster.
    Running in maintenance mode resources remain in unmanaged state
    and have no impact by unnecessary restarts.
    Plus excessive reqs (which caused loops) have been removed
    from the following tasks:
      - openstack-haproxy-mysqld
      - conntrackd
      - cluster_health

Change-Id: Ibe00effa7c9b5c6d8209f977815272447819bc22
Closes-Bug: #1555203