OpenStack is down after filling the root filesystem

Bug #1493520 reported by Vladimir Khlyunev
This bug affects 4 people
Affects              Status        Importance  Assigned to      Milestone
Fuel for OpenStack   Fix Released  High        Alex Schultz
5.1.x                Won't Fix     Undecided   MOS Maintenance
6.0.x                Won't Fix     Undecided   MOS Maintenance
6.1.x                Won't Fix     High        MOS Maintenance
7.0.x                Won't Fix     High        MOS Maintenance

Bug Description

ISO #287

Steps:
1) Deploy a simple HA cluster (Ubuntu, Neutron VLAN, Cinder, no additional components - 3 controllers, 2 compute, 2 cinder)
2) Fill the root file system on the primary controller:
# dd if=/dev/zero of=/root/file bs=32KB
3) Wait until dd exits with a "No space left on device" message (see the quick check below)
4) Wait 1-2 minutes, then run OSTF
5) Observe timeouts and a crashed OpenStack
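
A quick way to confirm the root filesystem is really exhausted before continuing (a generic check, not part of the original report):

  df -h /    # "Avail" should show 0 for /
  df -i /    # inode usage, in case inodes are exhausted before blocks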

Snapshot https://drive.google.com/a/mirantis.com/file/d/0B_bjwT_xlxy1UldjTHVyVExlRGc/view?usp=sharing

tags: added: customer-found
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

What is the expected behavior for these steps to reproduce?

I mean, it is expected that everything will crash in this case (any service that uses the disk or the database). What do we expect from the OpenStack services?

tags: added: feature
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Expected behaviour: Pacemaker should mark this node offline and migrate resources.
This does not happen because the monitor scripts only check processes and open sockets. With a full root file system the sockets are still open and the processes are still running, but RabbitMQ no longer provides service and returns:
 =WARNING REPORT====
disk resource limit alarm set on node 'rabbit@node-2'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

It could be fixed in the monitor scripts (try to send a message and stop the resource after a few failures), or in oslo.messaging (switch to another RabbitMQ node when this warning appears).

We already have separate logical volumes for frequently changing files, and we can provide this improvement as a feature in 8.0.
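
For reference, the state described above can be observed directly on the affected node; a minimal shell check (not from the original comment):

  rabbitmqctl status | grep -A 3 alarms    # shows the disk resource alarm while it is raised
  df -h /                                  # confirms the root filesystem is full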

no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
Roman Rufanov (rrufanov)
tags: added: support
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Egor reproduced this today. Controllers recovered okay, but computes couldn't re-establish connectivity to RabbitMQ. Restarting nova-compute on compute nodes served as a workaround.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/226062

Changed in fuel:
assignee: Alexey Shtokolov (ashtokolov) → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/226062
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=03e7683381d14c4a9d5da93481b2d5140e7896f0
Submitter: Jenkins
Branch: master

commit 03e7683381d14c4a9d5da93481b2d5140e7896f0
Author: Alex Schultz <email address hidden>
Date: Mon Sep 21 16:29:56 2015 -0500

    Add cluster health task with disk monitor

    This change adds a monitor into corosync/pacemaker to migrate services
    if the monitored disks drop below 100M free.

    Once the operator has resolved the full disk, they must clear the
    alarm by running:

     crm node status-attr <hostname> delete "#health_disk"

    After the alarm has been cleared, the services should be automatically
    restarted.

    This change is not a replacement for proper monitoring, but it will
    properly shut down and migrate services if a controller runs out of disk
    space.

    DocImpact
    Closes-Bug: 1493520

    Change-Id: I8a2cb4bd8d0b6070400d13e25d2310f4777b9faf
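
For context, this style of disk health monitoring is normally built on the stock ocf:pacemaker:SysInfo agent together with a Pacemaker node-health strategy; a rough sketch of such a configuration (the parameter values here are illustrative and not necessarily the ones used by the fuel-library task):

  # SysInfo sets the "#health_disk" node attribute to red when free space is low
  crm configure primitive sysinfo ocf:pacemaker:SysInfo \
      params disks="/" disk_unit="M" min_disk_left="100" \
      op monitor interval="30s"
  crm configure clone clone_sysinfo sysinfo
  # Migrate everything off a node whose health attribute turns red
  crm configure property node-health-strategy=migrate-on-red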

Changed in fuel:
status: In Progress → Fix Committed
tags: added: feature-qa
tags: removed: feature-qa
Revision history for this message
Craig Peters (craig-l-peters) wrote :

Could this fix be included in a maintenance update for 6.1 and 7.0?

tags: added: on-verification
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Craig, I've assigned it to MOS Maintenance and nominated it for 7.0-updates.
Vitaly Sedelnik can move it to 7.0-mu-1.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

This issue is actually a new feature, which is why it was set to Won't Fix for all prior releases. The fix is fairly large (187 LOC) and does not look appropriate for backporting. I think the right way to deal with it on prior releases is to set up monitoring to make sure there is always some disk space available.
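
For such external monitoring, even a simple cron-driven check can cover the basics; a minimal sketch (the threshold and the logging destination are arbitrary examples):

  #!/bin/sh
  # Log a warning when free space on / drops below ~512 MB (example threshold)
  FREE_KB=$(df -Pk / | awk 'NR==2 {print $4}')
  if [ "$FREE_KB" -lt 524288 ]; then
      logger -t disk-monitor "Low free space on /: ${FREE_KB} KB left"
  fi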

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

{"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "release_versions": {"2015.1.0-8.0": {"VERSION": {"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}

Steps to reproduce:
1) Deploy a cluster with 3 controllers, 1 compute, 1 compute+cinder
2) Fill the root partition on the primary controller
3) Wait 5-10 minutes while Pacemaker stops all resources
4) Verify that all resources were really stopped
5) Run OSTF tests

Actual results:
A lot of OSTF tests failed; in addition, all Platform services functional tests and part of the Functional tests were not even started.

1) pcs status on the primary controller:
http://paste.openstack.org/show/475200/

I can provide credentials to the environment if needed.

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

{"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "release_versions": {"2015.1.0-8.0": {"VERSION": {"build_id": "107", "openstack_version": "2015.1.0-8.0", "build_number": "107", "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "acfcfd289ca454585687b6ff9651b53e4ffaf0cd", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d66f188a1832a9c23b04884a14ef00fc5605ec6d", "fuel-nailgun_sha": "a95a0bc965c11b0d412a00c4cb088888b919e054", "fuel-agent_sha": "e881f0dabd09af4be4f3e22768b02fe76278e20e", "production": "docker", "python-fuelclient_sha": "286939d3be220828f52e73b65928ed39662e1853", "astute_sha": "0f753467a3f16e4d46e7e9f1979905fb178e4d5b", "fuel-ostf_sha": "37c5d6113408a29cabe0f416fe99cf20e2bca318", "release": "8.0", "fuelmain_sha": "8e5e75302b2534fd38e4b41b795957111ac75543"}

Steps to reproduce:
1) Deploy a cluster with 3 controllers, 1 compute, 1 compute+cinder
2) Fill the root partition on the primary controller
3) Wait 5-10 minutes while Pacemaker stops all resources
4) Verify that all resources were really stopped
5) Execute crm node status-attr <hostname> delete "#health_disk"
6) Wait 5-10 minutes
7) Verify that all services restarted automatically

Actual result:
crm node status-attr node-1.test.domain.local delete /

http://paste.openstack.org/show/475236/

Revision history for this message
Alex Schultz (alex-schultz) wrote :

To comment on #11, the actual command to restore services is:

  crm node status-attr `hostname` delete "#health_disk"

"#health_disk" is not a variable, it's the string you need to provide to the command. Services will start up in ~5-10 minutes.

tags: added: long-haul-testing
tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
status: Triaged → Fix Committed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Verified on Kilo 154 ISO and Liberty 55 ISO

Steps to reproduce:
1) Deploy 3 controllers, 2 compute, 1 cinder
2) ssh to the primary controller
3) Fill the root filesystem with fallocate -l 12G /root/bigfile (after that, free space on root == 0)
4) Verify that crm_mon -1 --show-node-attributes prints #health_disk = red for the primary controller
5) Verify that pcs status shows all resources stopped on the primary controller
6) Run the OSTF Sanity and Functional tests

Actual result:

Failed OSTF tests:

Sanity - Check that required services are running
Functional - Create volume and boot instance from it
             Create volume and attach it to instance
             Check network connectivity from instance via floating IP
             Create security group
             Launch instance
             Launch instance with file injection
             Launch instance, create snapshot, launch instance from snapshot

After freeing disk space and running crm node status-attr node_hostname delete "#health_disk", the resources started again, but the same OSTF tests failed with the same errors (only the HA tests did not fail).

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Important note

If the root filesystem is not filled completely and 50-80 MB are left free, everything works fine:
1) Resources are stopped
2) #health_disk = red
3) OSTF Sanity and Functional pass, HA fails
4) After freeing space and deleting the #health_disk alarm, all OSTF tests pass

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/240951

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/240951
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=19b63d48c779a4f48aa0b6d29464f4821669a9ac
Submitter: Jenkins
Branch: master

commit 19b63d48c779a4f48aa0b6d29464f4821669a9ac
Author: Alex Schultz <email address hidden>
Date: Mon Nov 2 11:36:08 2015 -0600

    Adjust disk monitoring limits

    This change increases the threshold limit for the corosync disk monitor
    from 100M to 512M and decreases the disk monitoring interval from 30s to
    15s. These changes are to try and provide additional time for pacemaker
    to realize a full disk and trigger a failover prior to the disks
    actually being filled.

    With this change we are also reducing the disk_free_limit configuration
    for rabbitmq from the default of 50M to 5M. When rabbitmq hits the
    disk_free_limit, it will trigger flow control which will block the
    clients. The reduction of the limit from 50M to 5M is to try and still
    have the disk limit in place for an emergency situation but provide
    additional time for corosync to try and shutdown rabbitmq before it
    blocks clients.

    DocImpact: Minimum space free set to 512M for a controller. If any
    parition on the controller (besides /boot) drops below 512M, the
    controller will migrate services off of the affected node. If space
    drops below 5M prior to services being migrated, rabbitmq consumers may
    be blocked and might have to be restarted before services function
    correctly.
    Closes-Bug: #1493520

    Change-Id: I537543ce04c4f229939fd7836daf1d62c920974b
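
For reference, the current free-disk reading and the configured disk_free_limit can be inspected on a running RabbitMQ node (a quick shell check, not part of the commit):

  rabbitmqctl status | grep disk_free    # shows both disk_free and disk_free_limit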

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :
Changed in fuel:
status: Fix Committed → Fix Released
tags: added: wontfix-feature