File system corruption in containers on master node

Bug #1383741 reported by Evgeny Kozhemyakin
This bug affects 15 people
Affects             Status        Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Released  High        Matthew Mosesohn
5.1.x               Fix Released  High        Matthew Mosesohn
6.0.x               Fix Released  High        Matthew Mosesohn
6.1.x               Fix Released  High        Matthew Mosesohn

Bug Description

Several customers reported file system corruption in docker containers.
It could be related to https://github.com/docker/docker/issues/7229.

I managed to reproduce it by filling the free space on the master node with
  "fallocate -l <big number> /var/BIG_FILE"
and then running
  "dockerctl backup".

log:
 EXT4-fs error (device dm-10): ext4_put_super: Couldn't clean up the journal
...
Buffer I/O error on device dm-10, logical block 1678800
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1678801
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1678802
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1678803
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647360
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647361
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647362
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647363
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647364
lost page write due to I/O error on dm-10
Buffer I/O error on device dm-10, logical block 1647365
lost page write due to I/O error on dm-10
JBD2: Detected IO errors while flushing file data on dm-10-8
__ratelimit: 512 callbacks suppressed
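
A minimal sketch for checking whether a kernel log shows the same symptoms, assuming the messages look like the excerpt above. The helper name count_fs_errors is hypothetical; it reads log text on standard input and prints the number of matching lines.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines that match the error
# signatures quoted in this report. Reads log text on standard input.
count_fs_errors() {
    # grep -c prints the count; "|| true" keeps the exit status at 0
    # when nothing matches (grep itself returns 1 in that case).
    grep -c -E 'EXT4-fs error|Buffer I/O error on device|lost page write due to I/O error|JBD2: Detected IO errors' || true
}

# Example (not run here): dmesg | count_fs_errors
```

A non-zero count only says that similar messages are present; it does not prove that a container file system is damaged.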

The other report is here:
http://irclog.perlgeek.de/fuel/2014-10-17#i_9524646

Changed in fuel:
assignee: nobody → Matthew Mosesohn (raytrac3r)
importance: Undecided → Medium
status: New → Confirmed
milestone: none → 6.0
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Evgeniy, did you try these steps:
1 - clear space on the disk
2 - reboot
3 - dockerctl check all

I can't seem to get it to fail after that. I was hoping you had a scenario where a container FS was broken or a single container broke so much it couldn't be found.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

This issue can be reproduced as Evgeny reported.
 "fallocate -l <big number> /var/BIG_FILE" - to fill all the space on /var
After running the command try different actions in fuel: provision new nodes, change configs, etc...
After that reboot, clear the space, reboot again.

I have seen several outcomes after these actions:

1) Corrupted index in a PostgreSQL database. It can be fixed with something like this:
PGPASSWORD=nailgun dockerctl shell postgres su - postgres -c "psql keystone -c 'reindex DATABASE keystone;'"

2) A container that can't start. If it is a stateless container, it can be fixed with:
dockerctl destroy keystone
dockerctl start keystone

3) A stateful container that can't start. I don't know how to fix this.
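
The stateless-container case (option 2) can be sketched as a small wrapper with a dry-run switch, so the commands can be reviewed before they touch anything. The function names run and recreate_container and the DRY_RUN variable are hypothetical; only the two dockerctl commands come from the list above.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "$*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Recreate a stateless container by destroying and starting it again.
# The container name (for example, keystone) is passed by the caller.
recreate_container() {
    run dockerctl destroy "$1" && run dockerctl start "$1"
}

# Example (prints the commands without running them):
# DRY_RUN=1; recreate_container keystone
```

This is not suitable for stateful containers, since destroying them would discard their data.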

In Fuel 5.1 (and the 6.0 preview) the Docker version is:
docker-io-0.10.0-2.mira2.x86_64
but a newer one is available in the EPEL repo:
docker-io-1.2.0-3.el6.x86_64

Given these comments from a Docker developer
https://github.com/docker/docker/issues/6368#issuecomment-46201330
https://github.com/docker/docker/issues/7229#issuecomment-60939980
do we have a plan to update the kernel to the recommended 2.6.32-504 (shipped with CentOS 6.6) and to update the docker package?

/var usually fills up quickly with logs and Fuel snapshots.
As a workaround, can we put /var/lib/docker/ on a separate partition? This would protect the Docker images from corruption.
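
A minimal sketch for checking whether two paths share a file system, which is the condition this workaround is meant to remove. The function names are hypothetical, and the parsing assumes the POSIX "df -P" output format with mount points that contain no spaces.

```shell
#!/bin/sh
# Print the mount point of the file system that holds the given path.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed (exit 0) when both paths live on the same file system.
# Fails when either path cannot be resolved by df.
same_filesystem() {
    a=$(mount_point_of "$1")
    b=$(mount_point_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example: report whether the Docker store shares a file system with /var/log.
# if same_filesystem /var/lib/docker /var/log; then echo "shared"; fi
```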

Changed in fuel:
importance: Medium → High
tags: added: customer-found
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Although the most frequent reason for /var filling up is a wrong logrotate config, for which fixes have already been committed (https://bugs.launchpad.net/fuel/+bug/1378327 and https://bugs.launchpad.net/fuel/+bug/1382658),
I think this should be escalated to "Critical", because there are many other ways /var can fill up to 100%.
For example, a user running these commands can fill the partition quite quickly:
dockerctl backup
fuel snapshot
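
One way to reduce that risk is to check free space before starting a space-hungry command. The following is only a sketch: the function names are hypothetical, and the threshold in the example is an arbitrary placeholder, not a recommended value.

```shell
#!/bin/sh
# Print the free space, in KiB, on the file system that holds the given path.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed when the path has at least the requested number of KiB free.
has_free_kib() {
    [ "$(free_kib "$1")" -ge "$2" ]
}

# Example (10485760 KiB = 10 GiB, chosen only for illustration):
# has_free_kib /var 10485760 && dockerctl backup
```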

Revision history for this message
Evgeny Kozhemyakin (ekozhemyakin) wrote :

Matthew,
unfortunately I didn't manage to get file corruption, only fs errors. So after cleaning up space my env worked well.

Alexander,
thanks for another reproduction.

I am not sure kernel updates will cure this (there is no info about fixed bugs in those comments),
but let me try rebuilding with kernel 2.6.32-504 or, say, 3.14.xx

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-docs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135638

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Meg McRoberts (dreidellhasa)
Changed in fuel:
assignee: Meg McRoberts (dreidellhasa) → Matthew Mosesohn (raytrac3r)
Changed in fuel:
milestone: 6.0 → 6.1
status: In Progress → Triaged
tags: added: release-notes
no longer affects: fuel/6.0.x
Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Baboune (seyvet) wrote :

Any chance of retrofitting this to version 5.1? This problem exists in 5.x.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Baboune, there's nothing code-wise we can do to stop people from running out of space. I did backport several patches related to log rotation and MongoDB logging that should reduce the likelihood of running out of space. If you need specific patches so you can manually apply to an existing 5.1 installation, let me know. Otherwise, just wait a bit for 5.1.1 to be released.

Revision history for this message
Baboune (seyvet) wrote :

So far the only patch I applied is: https://bugs.launchpad.net/fuel/+bug/1378327. Despite this, /var filled up and one of the indexes in keystone got corrupted. I had missed the change in puppet/anacron/files/logrotate-hourly

What would be needed is a fix to use different partitions for logging and containers in 5.1.1. Can that be done?

Changed in fuel:
milestone: 6.0 → 6.1
no longer affects: fuel/6.1.x
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/135638
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=c69cb2ce441b74e09a11afe70a4d733837d753f4
Submitter: Jenkins
Branch: master

commit c69cb2ce441b74e09a11afe70a4d733837d753f4
Author: Matthew Mosesohn <email address hidden>
Date: Wed Nov 19 19:23:39 2014 +0400

    Add Docker disk space troubleshooting guide

    Provides diagnosis and troubleshooting steps to assist
    users who encounter Docker container failures following
    the event where /var partition fills up.

    Change-Id: I53bcb6e447568f52fb0a16865a4b7292e7cedb50
    Closes-Bug: #1383741

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-docs (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/138484

tags: added: docs
no longer affects: fuel/6.1.x
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The documented set of workarounds (assuming it is enough to provide a recovery path for all affected users) allows us to keep this bug at High instead of Critical priority, but it's not enough to close this bug.

I don't agree that there's nothing more that we can do. If running /var out of space has such drastic consequences, and the biggest contributor to that risk by far is logging, we have to move logs to a separate partition.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

There is a blueprint
https://blueprints.launchpad.net/fuel/+spec/isolate-var-log-on-master
But can we put /var/lib/docker/ on a separate partition instead of /var/log?
There are many other sources that can fill up /var...

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Related problem on target nodes is bug #1394864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/138679
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=46d454d292025a8755313d802dcb07f0c566b6c5
Submitter: Jenkins
Branch: master

commit 46d454d292025a8755313d802dcb07f0c566b6c5
Author: Meg McRoberts <email address hidden>
Date: Wed Dec 3 02:18:10 2014 -0800

    Minor fixes for Docker troubleshooting

    Change-Id: I0f6aa43530cadc34e2b48f289dc0d951b2bbb8b0
    Related-Bug: #1383741

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Documentation for 6.0 is out. We aren't fixing any partitioning in 6.0 after HCF.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-docs (stable/5.1)

Reviewed: https://review.openstack.org/138484
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=a62fc5f363043def792da3d75a02a24398c89436
Submitter: Jenkins
Branch: stable/5.1

commit a62fc5f363043def792da3d75a02a24398c89436
Author: Matthew Mosesohn <email address hidden>
Date: Wed Nov 19 19:23:39 2014 +0400

    Add Docker disk space troubleshooting guide

    Provides diagnosis and troubleshooting steps to assist
    users who encounter Docker container failures following
    the event where /var partition fills up.

    Change-Id: I53bcb6e447568f52fb0a16865a4b7292e7cedb50
    Closes-Bug: #1383741
    (cherry picked from commit c69cb2ce441b74e09a11afe70a4d733837d753f4)

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please elaborate on the status of this issue for 6.1. The docs for the master branch were updated by https://review.openstack.org/#/c/135638/, so I am updating its status to Fix Committed as well.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Bogdan, I don't agree that a fix has been committed.
Only documentation for dealing with the consequences of corruption was released, and the procedure is usually quite complicated for customers.
The root cause (using an old docker version) is still there, even for 6.1.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The bug was closed without explanation, even though in comment #19 it's correctly stated that all that was merged was documentation of a workaround. Reopened.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Confirmed → Won't Fix
Revision history for this message
Mike Scherbakov (mihgen) wrote :

The blueprint provides functionality that prevents the system from suffering from this particular bug.
So, let's never close original issues/defects/bugs filed by people. If we implement the blueprint, we can verify afterwards that this defect is gone.
If we close this as Won't Fix, and the blueprint for some reason doesn't actually protect us from this defect, then we are in a pretty bad situation: a bug which hit so many people will still be present in Fuel. So, I'm moving all to Confirmed, and let's triage it over again.

Changed in fuel:
status: Won't Fix → Confirmed
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Mike, what is the next step for this bug? Will it sit in the Confirmed state until someone finds a regression in the feature tracked only by the blueprint?

Revision history for this message
Andrey Nikitin (heos) wrote :

I changed the milestone to debug my script. I'll come back to it later.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/157851

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/161692

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/5.1)

Reviewed: https://review.openstack.org/157851
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=e4cc81614615396c673272b5055d4dda0c5dff0a
Submitter: Jenkins
Branch: stable/5.1

commit e4cc81614615396c673272b5055d4dda0c5dff0a
Author: Matthew Mosesohn <email address hidden>
Date: Fri Feb 20 19:01:52 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ifbe57b99949742df676cac9f6662d2234287c2ab
    Partial-Bug: 1383741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/162111

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/162935

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/162111
Reason: moved to https://review.openstack.org/#/c/162935/ due to wrong Change-Id for a 5.1 backport which was merged ahead by an accident

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/162935
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=5042ef765afb793cc05f791776d7229f2f3b1068
Submitter: Jenkins
Branch: master

commit 5042ef765afb793cc05f791776d7229f2f3b1068
Author: Matthew Mosesohn <email address hidden>
Date: Fri Feb 20 19:01:52 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ifbe57b99949742df676cac9f6662d2234287c2ab
    Partial-Bug: 1383741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/6.0)

Reviewed: https://review.openstack.org/161692
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=030ab0b17675454b3278a313e8f9b2c919c0c9a1
Submitter: Jenkins
Branch: stable/6.0

commit 030ab0b17675454b3278a313e8f9b2c919c0c9a1
Author: Matthew Mosesohn <email address hidden>
Date: Fri Feb 20 19:01:52 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ifbe57b99949742df676cac9f6662d2234287c2ab
    Partial-Bug: 1383741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/162962

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/162968

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Matthew Mosesohn (raytrac3r)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/162111
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=293208dc3a08e7cc120197df7c37bf3e2b00bfd9
Submitter: Jenkins
Branch: master

commit 293208dc3a08e7cc120197df7c37bf3e2b00bfd9
Author: Matthew Mosesohn <email address hidden>
Date: Fri Mar 6 14:30:45 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ia2f71d657f620ea543889d95ea2786006c423aa2
    Partial-Bug: 1383741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/6.0)

Reviewed: https://review.openstack.org/162968
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=9193d92da6398dd63c1229a954845715cff0254a
Submitter: Jenkins
Branch: stable/6.0

commit 9193d92da6398dd63c1229a954845715cff0254a
Author: Matthew Mosesohn <email address hidden>
Date: Fri Mar 6 14:30:45 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ia2f71d657f620ea543889d95ea2786006c423aa2
    Partial-Bug: 1383741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/5.1)

Reviewed: https://review.openstack.org/162962
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=4b0753ad64816301139747c2fb584bab021ad43f
Submitter: Jenkins
Branch: stable/5.1

commit 4b0753ad64816301139747c2fb584bab021ad43f
Author: Matthew Mosesohn <email address hidden>
Date: Fri Mar 6 14:30:45 2015 +0300

    Separate /var/log from /var

    Allocates 60/40 to /var and /var/log, respectively,
    preventing incoming logs from saturating the disk
    and impacting Docker container store.

    Change-Id: Ia2f71d657f620ea543889d95ea2786006c423aa2
    Partial-Bug: 1383741

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Splitting /var/log from /var is merged in all 3. Further advancements are tracked here: https://blueprints.launchpad.net/fuel/+spec/fuel-master-separate-logs which likely will not make 6.1.
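
To illustrate the 60/40 allocation described in the merged commits, here is the arithmetic only, not the merged change itself. The function name and the use of MiB as the unit are assumptions made for the example.

```shell
#!/bin/sh
# Split a total size (in MiB) 60/40 between /var and /var/log.
# Integer division rounds the /var share down; /var/log gets the rest.
split_60_40() {
    total=$1
    var_mib=$(( total * 60 / 100 ))
    log_mib=$(( total - var_mib ))
    echo "var=${var_mib} log=${log_mib}"
}

# Example: split_60_40 1000 prints "var=600 log=400"
```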

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

@Matthew.
I think this solution won't work in the following situation:
1) /var/log/ occupies 100G
2) /var has 50G of free space available
3) Patient starts snapshot creation
4) The snapshot copies all of /var/log/ to /var/www/nailgun/dump
5) /var is full, filesystem corrupted

Isn't it better to create a separate partition for /var/lib/docker/ to fence off docker itself?
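
The scenario above comes down to a single comparison, sketched here with the figures from the list. The function name is hypothetical, and the check assumes the snapshot needs as much space as the logs it copies, ignoring any compression.

```shell
#!/bin/sh
# Succeed when a copy of log_kib (first argument) would fit
# into free_kib (second argument).
snapshot_fits() {
    [ "$1" -lt "$2" ]
}

GIB=$(( 1024 * 1024 ))   # KiB per GiB

# Figures from the scenario: 100G of logs, 50G free on /var.
if snapshot_fits $(( 100 * GIB )) $(( 50 * GIB )); then
    echo "snapshot fits"
else
    echo "snapshot would fill /var"
fi

# On a real system the inputs could come from, for example:
#   du -sk /var/log | cut -f1               (size of the logs in KiB)
#   df -Pk /var | awk 'NR == 2 {print $4}'  (free KiB on /var)
```

With these figures the script takes the second branch, which is the case where /var fills up even though /var/log is on its own partition.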

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-docs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/177088

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/177088
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=6e53191cf5b89341011fcb4cc571aa03951fdbfc
Submitter: Jenkins
Branch: master

commit 6e53191cf5b89341011fcb4cc571aa03951fdbfc
Author: Alexander Bozhenko <email address hidden>
Date: Thu Apr 23 23:10:14 2015 -0700

    Fixed docs on how to restore docker containers.

    Change-Id: I71bc6f442f6a853b26ffb12769b003ea85cdb118
    Partial-Bug: #1383741

tags: added: release-notes-done
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (stable/6.1)

Related fix proposed to branch: stable/6.1
Review: https://review.openstack.org/194961

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (stable/6.1)

Reviewed: https://review.openstack.org/194961
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=0e26e7d7cc153d179ec34985645dd23cdd239ddb
Submitter: Jenkins
Branch: stable/6.1

commit 5cc5f0c643aebecaf3bf4580535a3ea7c3334a6c
Author: Mike Scherbakov <email address hidden>
Date: Tue Jun 23 13:43:35 2015 -0700

    Removed streamlined patching backend pieces

    Change-Id: I955e76ccdbd12a9145f4e9b689f80bdf9fcaf929

commit 563c4b5c78ebfcb1f4f91047c2919f6270f9a1d4
Author: Mike Scherbakov <email address hidden>
Date: Tue Jun 23 13:30:30 2015 -0700

    Removed outdated patching guide

    Change-Id: I76180c277789ade9c5ebedd19fe2092847c0b7d9

commit 8d120c14bec1ab41d448683ad146a3053a57c4ee
Author: Irina Povolotskaya <email address hidden>
Date: Tue Jun 23 19:59:11 2015 +0300

    Add dual hypervisor ref arch into 6.1 docs

    Change-Id: I900c24c9de878eafadbfc995aa879b7f55737fac

commit feebd1592d3305b64bbdfd0bc5fe108190aef120
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 18:38:17 2015 +0300

    [OPs guide] Running Ceilometer section edits

    1. conf file extract is updated
    2. note is updated

    Closes-bug: 1467817
    Change-Id: I0217e164108e0ba6c1397045a5e57d13ff429223

commit 44a93f9dead7511a3461ec35248dbb689c81eafd
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 18:04:40 2015 +0300

    [RN6_1] Final changes

    1. capitalization
    2. 2014.2 to 2014.2.2
    3. general improvements

    Change-Id: I45057e90c90550559f66bc67ccdf97a559fd9000

commit bb41389cae58084285688853281516b659686422
Author: evkonstantinov <email address hidden>
Date: Tue Jun 23 16:45:35 2015 +0300

    Update patching decription

    Update patching description with
    the standard Linux commands.

    Change-Id: Ia1a8346639c468fdfce15a11d2430bf3a4731244

commit bf3018fae3f2e564413d33aba6cdebf8868f0b4e
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 15:55:49 2015 +0300

    [RN6_1] Clean up

    1. Rearranges sections
    2. Improves RST
    3. Changes titles order

    Change-Id: I6110bf515667d3d6ba08ad35ff5d593dbc96641e

commit 1c7e4457808e8f2d6c56fdf31252170972e444b9
Author: Maria Zlatkova <email address hidden>
Date: Tue Jun 23 15:26:28 2015 +0300

    Replaces VBOX screenshots

    This patch:
    - replaces VBOX screenshots
    - changes the link for Download Mirantis VirtualBox scripts
     to https://docs.mirantis.com/openstack/fuel/fuel-master/#downloads

    Change-Id: I58dede960c5c3355d39b07ff44b757403f6af02c
    Closes-Bug: #1467872

commit 0a568bf53fc0e25d1d692d5d74b4a7b4d983bbcc
Author: evkonstantinov <email address hidden>
Date: Tue Jun 23 14:01:55 2015 +0300

    6.1 --separate repos

    change wording and add links to the
    separate repos feature.

    Change-Id: Ib5d0778a0d8f1534f79ed2f553574cb69a3150b0

commit 95a188b21cbdd064d92696b7920e6a0105fe0c56
Author: Maria Zlatkova <email address hidden>
Date: Tue Jun 23 12:07:28 2015 +0300

    Corrects the output 'pcs status'

    Changes the example outputs to appropriate ones.

    Change-Id: Ib6d83...

Roman Rufanov (rrufanov)
tags: added: support