Bug #1247530 “Critical: fix non-convergent scenario” : Bugs : Dmedia

Jason Gerard DeRose (jderose) wrote on 2013-11-04:

#1

So after having a day to think about this, I do feel a bit silly for not catching this sooner, but that's how it goes.

Although surprising at first glance, this scenario is more straightforward than I first thought. However, it does seem like fixing this really needs a 5th behavior. Dmedia has had 4 automatic background behaviors for a long time:

1) Verification - Dmedia makes sure the metadata matches reality, and that the files have perfect file integrity; this includes MetaStore.scan(), MetaStore.relink(), MetaStore.verify_by_downgraded(), MetaStore.verify_by_mtime(), and MetaStore.verify_by_verified()

2) Downgrading - Dmedia automatically lowers its confidence in the aspects of reality it hasn't been able to verify for a certain amount of time; this includes MetaStore.purge_or_downgrade_by_store_atime(), MetaStore.downgrade_by_mtime(), and MetaStore.downgrade_by_verified()

3) Copy increasing - when there are user files with a durability of less than 3, Dmedia will create new copies on any FileStore (drive) on which at least MIN_FREE_SPACE will still remain after creating the new copy, either by copying the file from one locally connected drive to one or more other locally connected drives, or by downloading a copy from a peer on the local network; this is performed by the vigilance worker, driven by MetaStore.iter_actionable_fragile()

4) Copy decreasing - when there is a locally connected drive with less than RECLAIM_BYTES free space available, and when there are copies on that drive such that after deleting a copy, the file will still have a durability of 3, those copies are automatically deleted (reclaimed) on that drive, starting with the least recently used file (based on doc.atime)
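
To make the space constraints in behaviors (3) and (4) concrete, here is a minimal sketch. It is not Dmedia's code: the `Drive` class, the helper names, and the constant values are all hypothetical, and durability is simplified to "number of drives holding a copy", which is cruder than what the MetaStore actually tracks.

```python
from dataclasses import dataclass, field

# Placeholder values for illustration only; not Dmedia's real constants.
MIN_FREE_SPACE = 10
RECLAIM_BYTES = 20
TARGET_DURABILITY = 3


@dataclass
class Drive:
    """Hypothetical stand-in for a FileStore: a capacity plus {file_id: size}."""
    name: str
    capacity: int
    files: dict = field(default_factory=dict)

    @property
    def free(self):
        return self.capacity - sum(self.files.values())


def durability(file_id, drives):
    """Simplification: durability is the number of drives holding a copy."""
    return sum(1 for d in drives if file_id in d.files)


def can_receive(drive, file_id, size):
    """Behavior 3 constraint: MIN_FREE_SPACE must remain after the new copy."""
    return file_id not in drive.files and drive.free - size >= MIN_FREE_SPACE


def reclaimable(drive, drives, atime):
    """Behavior 4: copies deletable from `drive`, least recently used first."""
    if drive.free >= RECLAIM_BYTES:
        return []  # No space pressure, so nothing is reclaimed.
    ids = [
        f for f in drive.files
        if durability(f, drives) - 1 >= TARGET_DURABILITY
    ]
    return sorted(ids, key=lambda f: atime[f])
```

In this model, a copy is only reclaimable when the file has at least one copy more than the target, which is why a library where every file sits at exactly 3 gives behavior (4) nothing to do.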

I'd describe the 5th behavior, which I think is probably the best fix for this problem, something like this:

5) Shuffling - when a drive in the library has less than RECLAIM_BYTES available and contains files with a durability of 3, and when there is a locally connected drive upon which at least MIN_FREE_SPACE would remain after creating a 4th copy of a file, Dmedia will create new copies of these files by copying them from locally connected drives or downloading them from peers on the local network; this behavior creates the needed scenario under which behavior (4) can reclaim space on the first drive
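
A rough sketch of how the proposed shuffling step could pick its work, under stated assumptions: drives are plain dicts, durability is just the number of drives holding a copy, the constant values are placeholders, and `shuffle_candidates()` and `shuffle_once()` are hypothetical names, not part of the MetaStore API.

```python
# Placeholder values for illustration only; not Dmedia's real constants.
MIN_FREE_SPACE = 10
RECLAIM_BYTES = 20
TARGET_DURABILITY = 3


def free(drive):
    return drive['capacity'] - sum(drive['files'].values())


def durability(file_id, drives):
    # Simplification: one copy per drive, every copy counts equally.
    return sum(1 for d in drives if file_id in d['files'])


def shuffle_candidates(full, drives):
    """Yield (file_id, destination) pairs for the proposed behavior 5."""
    if free(full) >= RECLAIM_BYTES:
        return  # Drive is not under space pressure.
    for file_id, size in full['files'].items():
        if durability(file_id, drives) != TARGET_DURABILITY:
            continue  # Behaviors 3 and 4 already cover these files.
        for dst in drives:
            if file_id in dst['files']:
                continue  # Already holds a copy (this also skips `full`).
            if free(dst) - size >= MIN_FREE_SPACE:
                yield (file_id, dst)
                break


def shuffle_once(full, drives):
    """Create one 4th copy; return the file id, or None if nothing applies."""
    for file_id, dst in shuffle_candidates(full, drives):
        dst['files'][file_id] = full['files'][file_id]
        return file_id
    return None


# The stuck scenario: drive `a` is nearly full and every file is at exactly 3.
a = {'capacity': 30, 'files': {'x': 10, 'y': 10}}
b = {'capacity': 100, 'files': {'x': 10, 'y': 10}}
c = {'capacity': 100, 'files': {'x': 10, 'y': 10}}
d = {'capacity': 100, 'files': {}}
library = [a, b, c, d]
moved = shuffle_once(a, library)
```

After the call, the shuffled file has 4 copies in this toy library, so deleting its copy from `a` would still leave a durability of 3, which is the condition behavior (4) needs in order to reclaim space there.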

David Jordan brought up the very good point that we need to consider how file "pinning" affects this. Currently, files are never reclaimed when they're pinned. David also suggested that pinning be a convergence behavior: pinning doesn't necessarily reflect the current state Dmedia is in; it reflects the state Dmedia should be moving toward, and if needed for data safety reasons, Dmedia might temporarily ignore the user's pinning requests.

Still more thought/experimentation/testing needed on this, but I think we're making progress.
