OpenStack is down after filling the root filesystem

Bug #1493520 reported by Vladimir Khlyunev
This bug affects 4 people
Affects              Status        Importance  Assigned to      Milestone
Fuel for OpenStack   Fix Released  High        Alex Schultz
5.1.x                Won't Fix     Undecided   MOS Maintenance
6.0.x                Won't Fix     Undecided   MOS Maintenance
6.1.x                Won't Fix     High        MOS Maintenance
7.0.x                Won't Fix     High        MOS Maintenance

Bug Description

ISO #287

Steps:
1) Deploy a simple HA cluster (Ubuntu, Neutron VLAN, Cinder, no additional components - 3 controllers, 2 compute, 2 cinder)
2) Fill the root file system on the primary controller:
# dd if=/dev/zero of=/root/file bs=32KB
3) Wait until dd exits with a "No space left on device" message (see the quick check below)
4) Wait 1-2 minutes, then run OSTF
5) Observe timeouts and a crashed OpenStack
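
A quick way to confirm the root filesystem is really exhausted before continuing (a generic check, not part of the original report):

  df -h /    # "Avail" should show 0 for /
  df -i /    # inode usage, in case inodes are exhausted before blocks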

Snapshot https://drive.google.com/a/mirantis.com/file/d/0B_bjwT_xlxy1UldjTHVyVExlRGc/view?usp=sharing

tags: added: customer-found
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

What is the expected behavior for these steps to reproduce?

I mean, it is expected that everything will crash in this case (any service that uses the disk or the database). What do we expect from the OpenStack services?

tags: added: feature
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Expected behaviour: Pacemaker should mark this node offline and migrate resources.
This does not happen because the monitor scripts only check processes and open sockets. With a full root file system the sockets are still open and the processes are still running, but RabbitMQ no longer provides service and returns:
 =WARNING REPORT====
disk resource limit alarm set on node 'rabbit@node-2'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

It could be fixed in the monitor scripts (try to send a message and stop the resource after a few failures), or in oslo.messaging (switch to another RabbitMQ node when this warning appears).

We already have separate logical volumes for frequently changing files, and we can provide this improvement as a feature in 8.0.
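
For reference, the state described above can be observed directly on the affected node; a minimal shell check (not from the original comment):

  rabbitmqctl status | grep -A 3 alarms    # shows the disk resource alarm while it is raised
  df -h /                                  # confirms the root filesystem is full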

no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
Roman Rufanov (rrufanov)
tags: added: support
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Egor reproduced this today. Controllers recovered okay, but computes couldn't re-establish connectivity to RabbitMQ. Restarting nova-compute on compute nodes served as a workaround.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/226062

Changed in fuel:
assignee: Alexey Shtokolov (ashtokolov) → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/226062
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=03e7683381d14c4a9d5da93481b2d5140e7896f0
Submitter: Jenkins
Branch: master

commit 03e7683381d14c4a9d5da93481b2d5140e7896f0
Author: Alex Schultz <email address hidden>
Date: Mon Sep 21 16:29:56 2015 -0500

    Add cluster health task with disk monitor

    This change adds a monitor into corosync/pacemaker to migrate services
    if the monitored disks drop below 100M free.

    Once the operator has resolved the full disk, they must clear the
    alarm by running:

     crm node status-attr <hostname> delete "#health_disk"

    After the alarm has been cleared, the services should be automatically
    restarted.

    This change is not a replacement for proper monitoring, but it will
    properly shut down and migrate services if a controller runs out of disk
    space.

    DocImpact
    Closes-Bug: 1493520

    Change-Id: I8a2cb4bd8d0b6070400d13e25d2310f4777b9faf
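
For context, this style of disk health monitoring is normally built on the stock ocf:pacemaker:SysInfo agent together with a Pacemaker node-health strategy; a rough sketch of such a configuration (the parameter values here are illustrative and not necessarily the ones used by the fuel-library task):

  # SysInfo sets the "#health_disk" node attribute to red when free space is low
  crm configure primitive sysinfo ocf:pacemaker:SysInfo \
      params disks="/" disk_unit="M" min_disk_left="100" \
      op monitor interval="30s"
  crm configure clone clone_sysinfo sysinfo
  # Migrate everything off a node whose health attribute turns red
  crm configure property node-health-strategy=migrate-on-red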

Changed in fuel:
status: In Progress → Fix Committed
tags: added: feature-qa
tags: removed: feature-qa
Revision history for this message
Craig Peters (craig-l-peters) wrote :

Could this fix be included in a maintenance update for 6.1 and 7.0?

tags: added: on-verification
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Craig, I've assigned it to MOS Maintenance and nominated it for 7.0-updates.
Vitaly Sedelnik can move it to 7.0-mu-1.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

This issue is actually a new feature, which is why it was set to Won't Fix for all prior releases. The fix is fairly large (187 LOC) and does not look appropriate for backporting. I think the right way to deal with it on prior releases is to set up monitoring to make sure there is always some disk space available.
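
For such external monitoring, even a simple cron-driven check can cover the basics; a minimal sketch (the threshold and the logging destination are arbitrary examples):

  #!/bin/sh
  # Log a warning when free space on / drops below ~512 MB (example threshold)
  FREE_KB=$(df -Pk / | awk 'NR==2 {print $4}')
  if [ "$FREE_KB" -lt 524288 ]; then
      logger -t disk-monitor "Low free space on /: ${FREE_KB} KB left"
  fi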

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

{"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "release_versions": {"2015.1.0-8.0": {"VERSION": {"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}

Steps to reproduce:
1) Deploy a cluster with 3 controllers, 1 compute, 1 compute+cinder
2) Fill the root partition on the primary controller
3) Wait 5-10 minutes while Pacemaker stops all resources
4) Verify that all resources were really stopped
5) Run OSTF tests

Actual results:
A lot of OSTF tests failed; in addition, all Platform services functional tests and part of the Functional tests were not even started.

1) pcs status on the primary controller:
http://paste.openstack.org/show/475200/

I can provide credentials to the environment if needed.

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

{"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "release_versions": {"2015.1.0-8.0": {"VERSION": {"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}

Steps to reproduce:
1) Deploy a cluster with 3 controllers, 1 compute, 1 compute+cinder
2) Fill the root partition on the primary controller
3) Wait 5-10 minutes while Pacemaker stops all resources
4) Verify that all resources were really stopped
5) Execute crm node status-attr <hostname> delete "#health_disk"
6) Wait 5-10 minutes
7) Verify that all services restarted automatically

Actual result:
crm node status-attr node-1.test.domain.local delete /

http://paste.openstack.org/show/475236/

Revision history for this message
Alex Schultz (alex-schultz) wrote :

To comment on #11, the actual command to restore services is:

  crm node status-attr `hostname` delete "#health_disk"

"#health_disk" is not a variable, it's the string you need to provide to the command. Services will start up in ~5-10 minutes.

tags: added: long-haul-testing
tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
status: Triaged → Fix Committed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Verified on Kilo 154 ISO and Liberty 55 ISO

Steps to reproduce:
1) Deploy 3 controllers, 2 compute, 1 cinder
2) ssh to the primary controller
3) Fill the root filesystem with fallocate -l 12G /root/bigfile (after that, free space on root == 0)
4) Verify that crm_mon -1 --show-node-attributes prints #health_disk = red for the primary controller
5) Verify that pcs status shows all resources stopped on the primary controller
6) Run the OSTF Sanity and Functional tests

Actual result:

Failed OSTF tests:

Sanity - Check that required services are running
Functional - Create volume and boot instance from it
             Create volume and attach it to instance
             Check network connectivity from instance via floating IP
             Create security group
             Launch instance
             Launch instance with file injection
             Launch instance, create snapshot, launch instance from snapshot

After freeing disk space and running crm node status-attr node_hostname delete "#health_disk", the resources started again, but the same OSTF tests failed with the same errors (only the HA tests did not fail).

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Important note

If the root filesystem is not filled completely and 50-80 MB are left free, everything works fine:
1) Resources are stopped
2) #health_disk = red
3) OSTF Sanity and Functional pass, HA fails
4) After freeing space and deleting the #health_disk alarm, all OSTF tests pass

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/240951

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/240951
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=19b63d48c779a4f48aa0b6d29464f4821669a9ac
Submitter: Jenkins
Branch: master

commit 19b63d48c779a4f48aa0b6d29464f4821669a9ac
Author: Alex Schultz <email address hidden>
Date: Mon Nov 2 11:36:08 2015 -0600

    Adjust disk monitoring limits

    This change increases the threshold limit for the corosync disk monitor
    from 100M to 512M and decreases the disk monitoring interval from 30s to
    15s. These changes are to try and provide additional time for pacemaker
    to realize a full disk and trigger a failover prior to the disks
    actually being filled.

    With this change we are also reducing the disk_free_limit configuration
    for rabbitmq from the default of 50M to 5M. When rabbitmq hits the
    disk_free_limit, it will trigger flow control which will block the
    clients. The reduction of the limit from 50M to 5M is to try and still
    have the disk limit in place for an emergency situation but provide
    additional time for corosync to try and shutdown rabbitmq before it
    blocks clients.

    DocImpact: Minimum space free set to 512M for a controller. If any
    parition on the controller (besides /boot) drops below 512M, the
    controller will migrate services off of the affected node. If space
    drops below 5M prior to services being migrated, rabbitmq consumers may
    be blocked and might have to be restarted before services function
    correctly.
    Closes-Bug: #1493520

    Change-Id: I537543ce04c4f229939fd7836daf1d62c920974b
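
For reference, the current free-disk reading and the configured disk_free_limit can be inspected on a running RabbitMQ node (a quick shell check, not part of the commit):

  rabbitmqctl status | grep disk_free    # shows both disk_free and disk_free_limit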

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :
Changed in fuel:
status: Fix Committed → Fix Released
tags: added: wontfix-feature