Add cluster health task with disk monitor

Bug #1500422 reported by OpenStack Infra
This bug affects 5 people
Affects             Status         Importance  Assigned to              Milestone
Fuel for OpenStack  Fix Released   High        Bogdan Dobrelya
8.0.x               Won't Fix      High        Fuel Documentation Team
Mitaka              Fix Committed  High        Fuel Documentation Team

Bug Description

https://review.openstack.org/226062
commit 03e7683381d14c4a9d5da93481b2d5140e7896f0
Author: Alex Schultz <email address hidden>
Date: Mon Sep 21 16:29:56 2015 -0500

    Add cluster health task with disk monitor

    This change adds a monitor into corosync/pacemaker to migrate services
    if the monitored disks drop below 100M free.

    Once the operator has resolved the full disk, they must clear the
    alarm by running:

     crm node status-attr <hostname> delete "#health_disk"

    After the alarm has been cleared, the services should be automatically
    restarted.

    This change is not a replacement for proper monitoring, but it will
    properly shut down and migrate services if a controller runs out of disk
    space.

    DocImpact
    Closes-Bug: 1493520

    Change-Id: I8a2cb4bd8d0b6070400d13e25d2310f4777b9faf

Tags: area-docs
Changed in fuel:
assignee: nobody → Fuel Documentation Team (fuel-docs)
milestone: none → 8.0
importance: Undecided → Medium
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: area-docs
Timur Nurlygayanov (tnurlygayanov) wrote :

Please also see this one (about the same issue):

https://bugs.launchpad.net/fuel/+bug/1595146

no longer affects: fuel/newton
Timur Nurlygayanov (tnurlygayanov) wrote :

We need to document the recovery procedure for the following case:

1. A customer has an HA environment running MOS 8.x-9.x.
2. A partition on one controller becomes full (a "not enough free disk space" error).
3. Pacemaker automatically shuts down all services on this controller.
4. The operator should log in to the controller node, move or remove unneeded files from the disks, and then run the following command to recover Pacemaker:
crm node status-attr `hostname -f` delete "#health_disk"

Other possible workarounds:
1. Restart the pacemaker service:
service pacemaker restart
2. Reboot the controller node.

The documentation for OpenStack operators and the support team needs to describe the correct recovery workflow for this situation.

Please see comments from Vladimir Kuklin here for more detailed information:
https://bugs.launchpad.net/fuel/+bug/1595100
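
The recovery procedure above could be wrapped in a small pre-check so the alarm is cleared only after space has actually been freed. The following is a minimal sketch, not part of the Fuel change itself: the function names, the MOUNTS variable, and its default of "/" are hypothetical, and the 100 MB threshold is taken from the commit message. The script only prints the crm command rather than running it, so the operator stays in control of the final step.

```shell
#!/bin/sh
# Sketch: check free space before clearing the "#health_disk" alarm.
# THRESHOLD_MB mirrors the 100M limit named in the commit message.
# MOUNTS is a hypothetical list; the disks Fuel actually monitors may differ.
THRESHOLD_MB="${THRESHOLD_MB:-100}"
MOUNTS="${MOUNTS:-/}"

# Print the free space of a mount point in whole megabytes.
# df -Pk gives one POSIX-format line per file system, sizes in KB.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed only if every mount point in MOUNTS has at least THRESHOLD_MB free.
disks_ok() {
    for m in $MOUNTS; do
        [ "$(free_mb "$m")" -ge "$THRESHOLD_MB" ] || return 1
    done
    return 0
}

# Print the alarm-clearing command from the bug report for the given host.
clear_alarm_cmd() {
    printf 'crm node status-attr %s delete "#health_disk"\n' "$1"
}
```

An operator might source the file on the affected controller and run `disks_ok && clear_alarm_cmd "$(hostname -f)"`, then execute the printed command once satisfied with the result.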

Changed in fuel:
importance: Medium → High
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to mos/mos-docs (master)

Fix proposed to branch: master
Change author: Bogdan Dobrelya <email address hidden>
Review: https://review.fuel-infra.org/22506

Changed in fuel:
assignee: Fuel Documentation Team (fuel-docs) → Bogdan Dobrelya (bogdando)
status: Confirmed → In Progress
tags: removed: fuel-library
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to mos/mos-docs (master)

Reviewed: https://review.fuel-infra.org/22506
Submitter: Svetlana Karslioglu <email address hidden>
Branch: master

Commit: e9f2c8576f064dfd59409c1f59655bb1284077a4
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jul 11 09:02:14 2016

Add free space monitoring guide

Closes-bug: #1500422

Change-Id: Id31279c4dc5eb7e1102fc8d90d0defb265742d88
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Released
milestone: 10.0 → 9.1