Dmedia validation harness

Bug #1094544 reported by Jason Gerard DeRose on 2012-12-29

Bug Description

Here's a brain dump on my current thoughts of how to validate data safety of Dmedia. For any interested folks with a solid background in statistics (and especially reliability engineering), we could really use help developing better statistical models both for validating Dmedia and for use in Dmedia itself (to drive its behaviors).

Okay, brain dump:

After a lot of musing, I think rather than the validation harness testing that Dmedia "works" (prevents data loss), the harness should test that Dmedia fails at a reasonable threshold. So the harness will always push Dmedia to the point of failure, and then the test pass/fail is based on whether that failure occurred under conditions that we consider acceptable.

One good reason to always push Dmedia to failure is that it makes sure the validation harness actually detects failure in the first place. If we were gentler with Dmedia and never detected failures, would this be because Dmedia is working perfectly, or because the harness is broken and isn't detecting failures that in fact are occurring?

Dmedia isn't magical or omnipresent, and it can't always prevent data loss. For example, say you import some files onto a single removable hard drive, disconnect that drive, and then immediately chuck it under a bus where the drive and all the files it contains are destroyed, never to be recovered. Dmedia can't fix that problem, nor can any other software.

Dmedia tracks the number of distinct physical copies (on different physical storage devices) that exist for the user's files. When there are fewer than 3, Dmedia will create additional copies to try to maintain it at 3. One of the trickiest problems Dmedia tackles is removable drives, because there is no way for Dmedia to know what happens when that drive isn't connected. So Dmedia automatically downgrades its confidence in those copies if the drive hasn't been connected for a certain amount of time (currently 1 week).
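A minimal sketch of this accounting, assuming illustrative names (`Copy`, `effective_copies`, `copies_needed` are hypothetical, not Dmedia's actual API):

```python
import time

# One week, matching the current Dmedia default mentioned above.
DOWNGRADE_AFTER = 7 * 24 * 3600

class Copy:
    """A physical copy of a file on some storage device."""
    def __init__(self, drive_id, last_seen):
        self.drive_id = drive_id
        self.last_seen = last_seen  # unix timestamp when the drive was last connected

def effective_copies(copies, now=None):
    """Count copies, excluding those whose drives haven't been seen recently.

    A copy on a removable drive that hasn't connected within DOWNGRADE_AFTER
    no longer counts toward durability, since we can't know its state.
    """
    now = time.time() if now is None else now
    return sum(1 for c in copies if now - c.last_seen <= DOWNGRADE_AFTER)

def copies_needed(copies, target=3, now=None):
    """How many additional copies Dmedia should create to reach the target."""
    return max(0, target - effective_copies(copies, now))
```

So a file with one copy on an always-connected internal drive and one copy on a drive unseen for two weeks effectively has a single trusted copy, and two more would be created.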

There are two aspects here. The first is whether Dmedia correctly detects low-durability events. In other words, does Dmedia do its internal accounting correctly, and does it correctly downgrade its confidence in a copy when appropriate? This accounting is the trickier problem, because it deals with the variables introduced by human behavior, especially around removable drives. All this accounting has been turned on as of Dmedia 12.12 and is basically feature complete, although it certainly needs further tuning and testing.

The second aspect is the reactive behaviors, which are fairly simple code-wise but rely on the accounting being correct. We plan to turn on the copy increasing behaviors in 13.01, and the copy decreasing behaviors in 13.02.

The copy increasing behaviors are safe to turn on as far as data-safety goes, although they can certainly annoy the user if they aren't working properly (say, they could fill up a hard drive). The copy decreasing behaviors are the scariest thing Dmedia does, but they are also critical for preventing data loss: if the user must be the one to delete copies to free space, it will be far, far more error prone than what Dmedia can manage. Think of Dmedia like a really smart local file cache that constantly adjusts the files it contains based on what files are being used. The tricky part is that local files are used as more than just a cache; they are also figured into the total file durability.

Increase copies, or decrease copies, that's basically all Dmedia does. When more copies are needed because of low durability (say, a drive failed), the limiting factor is IO bandwidth. Dmedia will fail at the point when it can't create additional copies fast enough. So when tuning our model, drive to drive IO bandwidth and computer to computer network bandwidth need to be considered.
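As a back-of-envelope model (the numbers below are illustrative assumptions, not measured Dmedia figures): Dmedia stays ahead only while data needing re-replication arrives slower than the available bandwidth can copy it.

```python
def time_to_restore(bytes_to_copy, bandwidth_bytes_per_s):
    """Seconds needed to recreate lost copies at the given bandwidth."""
    return bytes_to_copy / bandwidth_bytes_per_s

def can_keep_up(loss_rate_bytes_per_s, bandwidth_bytes_per_s):
    """True while losses arrive slower than copies can be created."""
    return loss_rate_bytes_per_s < bandwidth_bytes_per_s

# Example: a failed 2 TB drive re-replicated over a ~100 MB/s drive-to-drive
# link takes about 20,000 seconds (roughly 5.6 hours). During that window,
# the affected files are below their 3-copy target.
window = time_to_restore(2 * 10**12, 100 * 10**6)
```

This is the kind of relationship the validation harness probes: it raises the loss rate until `can_keep_up` is effectively false.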

In terms of decreasing copies, Dmedia will fail when it hasn't sampled reality recently enough and is making decisions on data that is too stale. Part of the complexity here is that multiple computers are involved, and removable drives are very tricky because when they aren't connected to one of your computers, we have no idea what's happening with the drive.

The validation harness will involve at least 2 Dmedia peers (nodes). The harness will randomly introduce file corruption, delete files, etc, at an increasing rate up to the point where Dmedia can't keep up and reaches that failure condition we're looking for. We can certainly run this validation in the cloud, but I also feel it's important to run it on physical hardware so that you're accounting for how physical hardware behaves. Currently Dmedia can only run in a desktop session, which makes testing in the cloud a bit trickier, so a small physical test harness will probably be my first move (others are welcome to do the same, that would be a great help).
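The escalating push-to-failure loop might look something like this minimal sketch, where `inject_fault` and `dmedia_recovered` stand in for the real harness hooks (both names are hypothetical):

```python
def run_until_failure(inject_fault, dmedia_recovered, start_rate=1.0,
                      growth=1.5, max_rounds=50):
    """Escalate the fault-injection rate until Dmedia can no longer keep up.

    inject_fault(rate): corrupt/delete files at `rate` events per interval.
    dmedia_recovered(): True if all files got back to their 3-copy target.

    Returns the rate at which failure occurred, so the test's pass/fail
    judgment can compare it against the acceptable threshold. Returns None
    if no failure was ever observed, which itself is suspicious: it may
    mean the harness is broken and isn't detecting real failures.
    """
    rate = start_rate
    for _ in range(max_rounds):
        inject_fault(rate)
        if not dmedia_recovered():
            return rate  # found the failure point
        rate *= growth
    return None
```

This embodies the idea above: a run that never fails is treated as a red flag for the harness, not as proof that Dmedia is perfect.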

I also think this validation needs to be fairly long running, because there is complexity introduced in a long running process that we can't account for otherwise. Part of the reason I set up the week-long pre-release window is for long running integration testing. I'd like every stable Dmedia release to run for about a week in the validation harness before we release it. We can also do as much validation as possible on the daily builds, but I think we really need to put the exact stable release through a deep validation before it goes out to the public.

Changed in dmedia:
milestone: 13.01 → 13.04
Changed in dmedia:
milestone: 13.04 → 13.03
Changed in dmedia:
milestone: 13.03 → 13.04
Changed in dmedia:
milestone: 13.04 → 13.05
Changed in dmedia:
milestone: 13.05 → 13.06
Changed in dmedia:
milestone: 13.06 → 13.07
Changed in dmedia:
milestone: 13.07 → 13.08
Changed in dmedia:
milestone: 13.08 → 13.10
Changed in dmedia:
milestone: 13.10 → 13.12
Changed in dmedia:
milestone: 13.12 → 14.02