EC: Swift can return corrupted Data and be able to go data lost at isa_l_rs_vand policy with >=5 parities
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Object Storage (swift) |
Fix Released
|
Medium
|
Unassigned | ||
OpenStack Security Advisory |
Invalid
|
Undecided
|
Unassigned | ||
PyECLib |
Fix Released
|
Undecided
|
Unassigned | ||
liberasurecode |
Fix Released
|
Undecided
|
Unassigned |
Bug Description
This is what I have been fighting in the last week and here I'd like to summarize/describe for the relations and impact to Swift/PyECLib/
I DO NOT think this is a security risk because this doesn't cause an information leak or significant resource attack from outside, but this includes significant impact for users because of the risk of data lost. This is why I set this as private-security temporary. We can make this as open after notmyname confirm the impact and according to his decision.
To Swift API:
Using isa_l_rs_vand EC policy, pyeclib <= 1.3.1, liberasurecode <=1.2.1 with >= 5 ec_n_parity setting, you can see corrupted data *RARELY* *WITHOUT ANY ERRORS* at downloading. If you are using 'swift' command to download, the error appears most likely,
ubuntu@saio:~$ swift download ectest vagrant/
Error downloading object 'ectest/
And one more significant point is NOTHING QUARANTINED from the fact so swift is now regarding the corrupted one as a correct object.
The reason, I said "RARELY" above is that occurs due to the fragments combinations. You can see such buggy combinations with following script, that asserts isa_l_rs_vand ec_ndata=10, ec_n_parity=5 case:
from pyeclib.ec_iface import ECDriver
from itertools import combinations
if __name__ == '__main__':
driver = ECDriver(
orig_data = 'a' * 1024 * 1024
encoded = driver.
# make all fragment like (index, frag_data) format to feed to combinations
frags = [(i, frag) for i, frag in enumerate(encoded)]
bad_
check_
for check_frags in check_frags_
decoded = driver.
# No No don't kidding me
try:
assert decoded == orig_data
except AssertionError:
if bad_combinations:
ratio = float(len(
print 'Bad combinations found: %s/%s (ratio: %s)' % (
print bad_combinations
The result of the script should be:
Bad combinations found: 10/3003 (ratio: 0.00333000333)
[[0, 1, 2, 3, 5, 6, 8, 10, 11, 14], [0, 1, 2, 3, 5, 7, 8, 10, 13, 14], [0, 1, 2, 4, 5, 7, 9, 10, 11, 14], [0, 1, 2, 4, 6, 7, 9, 10, 13, 14], [0, 1, 3, 4, 6, 8, 9, 10, 11, 14], [0, 1, 3, 5, 6, 8, 9, 10, 13, 14], [0, 2, 3, 5, 7, 8, 9, 10, 11, 14], [0, 2, 4, 5, 7, 8, 9, 10, 13, 14], [1, 2, 4, 6, 7, 8, 9, 10, 11, 14], [1, 3, 4, 6, 7, 8, 9, 10, 13, 14]]
So the bad combinations appears approximately in 1/300 of the time.
The bug reason:
The bug exists in liberasurecode isa-l backend implementation. I've already asked to the isa-l maintainer about the possiblity that isa-l can return such a corrupted data without errors[1]. He honestly point out the issue that gf_gen_rs_matrix (which is used in isa_l_rs_vand encode/
I made a patch[2] that fixes to enable liberasurecode to be able to return Error instead of corrupted data. With [2], Swift will be able to return Error (sort of 5xx) in the bad pattern case.
Still impact to existing clusters:
If you have existing cluster with >=5 parities isa-l policy, it is still problematic even if you patch the fix [2]. That is probably because you are running the *object-
from pyeclib.ec_iface import ECDriver
from itertools import combinations
def test_ec_driver(k, m):
driver = ECDriver(
orig_data = 'a' * 1000
encoded = driver.
# make all fragment like (index, frag_data) format to feed to combinations
frags = [(i, frag) for i, frag in enumerate(encoded)]
an_ok_pattern = None
for check_frags in combinations(frags, k):
decoded = driver.
# No No don't kidding me
try:
assert decoded == orig_data
if not an_ok_pattern:
# save this ok pattern
except AssertionError:
# find missing index
for x in range(k + m):
if x not in check_frags_dict:
else:
print 'try reconstruct index %s with %s' % (
x, check_frags_
# try to reconstruct
reconed = driver.
try:
except AssertionError:
if x not in an_ok_pattern:
# sanity check
# try to decode again including the garbage reconstructed one
if __name__ == '__main__':
params = [(10, 5)]
for k, m in params:
As you can see, the bad pattern can infect the corrupted frags to a good pattern. And then, right now, we have no way to find a good pattern, including the infected garbage as a bad pattern unless we calculate the original object etag with decoding whole fragments in the fragment archive. Even if once we can find the bad pattern, I don't think the way to detect which one or both is (are) bad one(s) exist so that it means, once infected, we cannot stop the pandemic anymore and the infected object will go to the garbage in all fragment archives step by step. This is the worst problem.
I think, to the new incoming objects and brand new cluster deployment, the fix [2] is enough if you can accept to see the errors in 1/300 of the time.
What we can do for this pandemic:
For the possible patches perspective, we should land [2] which can prevent *future pandemic* as soon as possible.
For existing cluster users, sorry, what I can do is just saying "Please plan to migrate to a safer policy as soon as possible before the pandemic goes to data lost". And I'm trying to add the greedy testing [5] to ensure the result consistency in decode/reconstruct that we can know which parameters can be safe. Otherwise, we may be able to leave same pandemic possiblity to another policy.
For isa-l perspective, we know isa-l is the most high-performance open sourced engine in the available backends so the constraint will be a problem to choose. To solve the bad combination, the isa-l maintainer suggested to use another matrix (gf_gen_
PyECLib/
In my opinion, we should fix [2] as soon as possible and set it as the Swift requirement to prevent the pandemic in the future. However, right now we don't have the way to set the c lang binary version constraint. To enable the requirement, my idea is as following steps:
1. [2] land at liberasurecode master
2: bump liberasurecode version
3: make hard-coded constraint for PyECLib to require the liberasurecode >= (step2) (this can be painful becuase xenial or yakkety is =< 1.2.1)
4: bump pyeclib version
5: set the pyeclib requirement in Swift requirements.txt (<- this can have a tag #closes-bug for this bug report in the commit message)
Or any better idea is always welcome to me.
1: https:/
2: https:/
3: https:/
4: https:/
5: https:/
Changed in liberasurecode: | |
status: | New → Fix Released |
Changed in pyeclib: | |
status: | New → In Progress |
Changed in ossa: | |
status: | Incomplete → Won't Fix |
status: | Won't Fix → Invalid |
Set this as private because it seems sensitive but i don't think this is security issue. Just to leave the decision how we can handle this to notmyname.