Encryption doesn't play well with processes that copy cleartext data while preserving timestamps
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Fix Released | Undecided | Unassigned |
Bug Description
There are at least two processes that use internal-clients to pull user data out of a cluster and re-upload it with the same (or very nearly the same) timestamp:
* Container-sync has a special carve-out to allow client-settable X-Timestamp values so that the modification times between different clusters will match. It also needs to decrypt user data, as the remote may not have encryption enabled and, even if it does, there's no guarantee that the two clusters share knowledge of root secrets.
* The reconciler deterministically adds an offset to the timestamp when moving data between policies. It may not need to decrypt client data, but it is often configured to use a shared internal-client config that may need to have encryption enabled for other use-cases.
When both of those were originally written, there was an assumption that preserving timestamps like that would be safe, because the write should be idempotent -- if two processes tried to do the same work, the data going out to disk should be the same.
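The idempotency assumption can be sketched as follows. This is only an illustration: `reconciled_timestamp`, `simulated_write`, the offset value, and the timestamp string format are hypothetical stand-ins, not Swift's actual code.

```python
import hashlib


def reconciled_timestamp(source_ts, offset=2):
    """Hypothetical helper: derive the new timestamp purely from the source.

    The offset value and the string format are assumptions for illustration.
    """
    return "%016.5f_%016x" % (source_ts, offset)


def simulated_write(body, source_ts):
    """Return (timestamp, content hash) that one worker would send to disk."""
    return reconciled_timestamp(source_ts), hashlib.sha256(body).hexdigest()


# Two independent workers process the same source object without encryption.
worker_a = simulated_write(b"user data", 1600000000.12345)
worker_b = simulated_write(b"user data", 1600000000.12345)

# Same timestamp and same bytes: whichever write lands, the result is the same.
same_write = worker_a == worker_b
```

With cleartext bodies, it does not matter which worker wins for any given replica or fragment, which is why reusing the timestamp looked safe.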
With encryption enabled, however, writes become non-deterministic -- we have to choose random values for the body key, body key iv, body iv, and various metadata ivs. As a result, concurrent writers almost certainly *do not* try to write the same data to disk.
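A minimal sketch of why concurrent encrypted writes diverge, assuming each writer picks a fresh random key and IV. The `toy_encrypt` and `toy_decrypt` helpers and their SHA-256 keystream are hypothetical stand-ins for a real cipher, not what Swift's encryption middleware does.

```python
import hashlib
import os


def _keystream(key, iv, length):
    """Toy keystream from SHA-256 in counter mode (illustration only)."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]


def toy_encrypt(plaintext):
    """Encrypt with a freshly chosen random key and IV."""
    key, iv = os.urandom(32), os.urandom(16)
    stream = _keystream(key, iv, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return {"key": key, "iv": iv, "ciphertext": ciphertext}


def toy_decrypt(blob):
    """Invert toy_encrypt using the key and IV stored alongside the data."""
    stream = _keystream(blob["key"], blob["iv"], len(blob["ciphertext"]))
    return bytes(c ^ s for c, s in zip(blob["ciphertext"], stream))


plaintext = b"the same user data, read by two concurrent writers"
writer_a = toy_encrypt(plaintext)
writer_b = toy_encrypt(plaintext)

# Each write decrypts correctly on its own, but the bytes on disk differ.
writes_match = writer_a["ciphertext"] == writer_b["ciphertext"]
```

Both writers produce a valid encryption of the same object, yet the ciphertexts and crypto metadata are not interchangeable, so writes that share a timestamp are no longer idempotent.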
When writing to a replicated policy, this isn't too much of a problem. Each replica is self-contained, and any one of them can service reads. It'll likely cause some confusion at some point when manually comparing replicas, but it shouldn't lead to data loss.
Erasure-coded policies are a problem, though: we've observed objects moved by the reconciler with three distinct fragment sets in an 8+4 policy:
* seven fragments were encrypted with some set-of-crypto-meta A,
* three fragments were encrypted with set B, and
* one fragment was encrypted with set C.
As a result, we don't have enough fragments to decode: an 8+4 policy needs eight fragments from a single set, and the largest set has only seven. *Maybe* we once could, but since we can only find 11 frags now... looks like we're out of luck.
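The fragment arithmetic above can be checked with a short sketch. The `decodable_sets` helper and the labels are hypothetical. The only rule assumed is that fragments are interchangeable only within the same crypto-meta set.

```python
from collections import Counter

NDATA, NPARITY = 8, 4  # the 8+4 policy from the report

# One label per fragment found on disk, naming the crypto-meta set used.
found_fragments = ["A"] * 7 + ["B"] * 3 + ["C"] * 1


def decodable_sets(fragments, ndata):
    """Return the crypto-meta sets that have at least ndata fragments."""
    counts = Counter(fragments)
    return [label for label, n in counts.items() if n >= ndata]


total_found = len(found_fragments)
usable = decodable_sets(found_fragments, NDATA)
```

Here `total_found` is 11 of the 12 expected fragments, and `usable` is empty because no single set reaches the eight fragments needed to decode.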
Wow, sounds like a horrible bug.
Doesn't the container reconciler work on one object at a time, though? So even if it was distributed, a reconciler should go and make a whole new set of EC fragments throughout the cluster. Or maybe the problem is that two different "reconcilers" are trying to do the reconcile at the same time, and because the "new" timestamp is source TS + 2 offset, they always come out the same. So could it be a split-brain thing?
I guess when I have a bit more brain power I'll go take a look at the reconciler code, and also I guess it's time to build up an enc SAIO to test things out.