OpenStack Object Storage (swift)

Bug #1639691
Activity log

Activity log for bug #1639691

Date	Who	What changed	Old value	New value	Message
2016-11-07 04:35:01	Kota Tsuyuzaki	bug			added bug
2016-11-07 04:38:01	Kota Tsuyuzaki	bug task added		pyeclib
2016-11-07 04:38:28	Kota Tsuyuzaki	bug task added		liberasurecode
2016-11-07 04:38:44	Kota Tsuyuzaki	swift: importance	Undecided	Critical
2016-11-07 04:38:52	Kota Tsuyuzaki	swift: status	New	Confirmed
2016-11-07 05:10:17	Tristan Cacqueray	bug task added		ossa
2016-11-07 05:10:36	Tristan Cacqueray	ossa: status	New	Incomplete
2016-11-07 05:10:49	Tristan Cacqueray	description	This is what I have been fighting in the last week and here I'd like to summarize/describe for the relations and impact to Swift/PyECLib/liberasurecode. I DO NOT think this is a security risk because this doesn't cause an information leak or significant resource attack from outside, but this includes significant impact for users because of the risk of data lost. This is why I set this as private-security temporary. We can make this as open after notmyname confirm the impact and according to his decision. To Swift API: Using isa_l_rs_vand EC policy, pyeclib <= 1.3.1, liberasurecode <=1.2.1 with >= 5 ec_n_parity setting, you can see corrupted data RARELY WITHOUT ANY ERRORS at downloading. If you are using 'swift' command to download, the error appears most likely, ubuntu@saio:~$ swift download ectest vagrant/.scratch/swift.tar.gz -o delete.me Error downloading object 'ectest/vagrant/.scratch/swift.tar.gz': u'Error downloading vagrant/.scratch/swift.tar.gz: md5sum != etag, 4f327eed4b9b9210d13a30c9441fefa1 != 5b60a22fcf65a356391cbdb08b9140d8' And one more significant point is NOTHING QUARANTINED from the fact so swift is now regarding the corrupted one as a correct object. The reason, I said "RARELY" above is that occurs due to the fragments combinations. You can see such buggy combinations with following script, that asserts isa_l_rs_vand ec_ndata=10, ec_n_parity=5 case: from pyeclib.ec_iface import ECDriver from itertools import combinations if __name__ == '__main__': driver = ECDriver(ec_type='isa_l_rs_vand', k=10, m=5) orig_data = 'a' * 1024 * 1024 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] bad_combinations = [] check_frags_combinations = list(combinations(frags, 10)) for check_frags in check_frags_combinations: check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data except AssertionError: bad_combinations.append(list(check_frags_dict)) if bad_combinations: ratio = float(len(bad_combinations)) / len(check_frags_combinations) print 'Bad combinations found: %s/%s (ratio: %s)' % ( len(bad_combinations), len(check_frags_combinations), ratio) print bad_combinations The result of the script should be: Bad combinations found: 10/3003 (ratio: 0.00333000333) [[0, 1, 2, 3, 5, 6, 8, 10, 11, 14], [0, 1, 2, 3, 5, 7, 8, 10, 13, 14], [0, 1, 2, 4, 5, 7, 9, 10, 11, 14], [0, 1, 2, 4, 6, 7, 9, 10, 13, 14], [0, 1, 3, 4, 6, 8, 9, 10, 11, 14], [0, 1, 3, 5, 6, 8, 9, 10, 13, 14], [0, 2, 3, 5, 7, 8, 9, 10, 11, 14], [0, 2, 4, 5, 7, 8, 9, 10, 13, 14], [1, 2, 4, 6, 7, 8, 9, 10, 11, 14], [1, 3, 4, 6, 7, 8, 9, 10, 13, 14]] So the bad combinations appears approximately in 1/300 of the time. The bug reason: The bug exists in liberasurecode isa-l backend implementation. I've already asked to the isa-l maintainer about the possiblity that isa-l can return such a corrupted data without errors[1]. He honestly point out the issue that gf_gen_rs_matrix (which is used in isa_l_rs_vand encode/decode/reconstruct process) could make non-invertable matrix that means "the matrix can have the combination to be unrecoverable". So the bad pattern should exist for the isa-l constraint. However, he points out another issue that "we can find the case if you asserts the return value of gf_invert_matrix" and liberasurecode doesn't. That is the main reason, Swift/PyECLib/liberasurecode can make such a garbage decoding result. I made a patch[2] that fixes to enable liberasurecode to be able to return Error instead of corrupted data. With [2], Swift will be able to return Error (sort of 5xx) in the bad pattern case. Still impact to existing clusters: If you have existing cluster with >=5 parities isa-l policy, it is still problematic even if you patch the fix [2]. That is probably because you are running the object-reconstructor in the cluster. As well as Swift API impact, object-reconstructor is also impacted the issue. The following script is an extent from the previous script that asserts what happen if a fragment was replaced in a good pattern with the fragment reconstructed from a bad pattern. from pyeclib.ec_iface import ECDriver from itertools import combinations def test_ec_driver(k, m): driver = ECDriver(ec_type='isa_l_rs_vand', k=k, m=m) orig_data = 'a' * 1000 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] an_ok_pattern = None for check_frags in combinations(frags, k): check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data if not an_ok_pattern: # save this ok pattern an_ok_pattern = dict(check_frags_dict) except AssertionError: # find missing index for x in range(k + m): if x not in check_frags_dict: break else: raise Exception('something wrong') print 'try reconstruct index %s with %s' % ( x, check_frags_dict.keys()) # try to reconstruct reconed = driver.reconstruct(check_frags_dict.values(), [x])[0] try: assert reconed == frags[x] except AssertionError: if x not in an_ok_pattern: raise Exception('need to make this script better') # sanity check decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok' # try to decode again including the garbage reconstructed one an_ok_pattern[x] = reconed decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok if the assertion above passed' if __name__ == '__main__': params = [(10, 5)] for k, m in params: test_ec_driver(k, m) As you can see, the bad pattern can infect the corrupted frags to a good pattern. And then, right now, we have no way to find a good pattern, including the infected garbage as a bad pattern unless we calculate the original object etag with decoding whole fragments in the fragment archive. Even if once we can find the bad pattern, I don't think the way to detect which one or both is (are) bad one(s) exist so that it means, once infected, we cannot stop the pandemic anymore and the infected object will go to the garbage in all fragment archives step by step. This is the worst problem. I think, to the new incoming objects and brand new cluster deployment, the fix [2] is enough if you can accept to see the errors in 1/300 of the time. What we can do for this pandemic: For the possible patches perspective, we should land [2] which can prevent future pandemic as soon as possible. For existing cluster users, sorry, what I can do is just saying "Please plan to migrate to a safer policy as soon as possible before the pandemic goes to data lost". And I'm trying to add the greedy testing [5] to ensure the result consistency in decode/reconstruct that we can know which parameters can be safe. Otherwise, we may be able to leave same pandemic possiblity to another policy. For isa-l perspective, we know isa-l is the most high-performance open sourced engine in the available backends so the constraint will be a problem to choose. To solve the bad combination, the isa-l maintainer suggested to use another matrix (gf_gen_caucy1_matrix) for the backend which can be invertable at 5>=parities case. I had worked at [3][4] as the draft. And in my experiment, it shows as similar high-perforemance as isa_r_rs_vand without the constraint we hit. PyECLib/liberasurecode Version Consideration: In my opinion, we should fix [2] as soon as possible and set it as the Swift requirement to prevent the pandemic in the future. However, right now we don't have the way to set the c lang binary version constraint. To enable the requirement, my idea is as following steps: 1. [2] land at liberasurecode master 2: bump liberasurecode version 3: make hard-coded constraint for PyECLib to require the liberasurecode >= (step2) (this can be painful becuase xenial or yakkety is =< 1.2.1) 4: bump pyeclib version 5: set the pyeclib requirement in Swift requirements.txt (<- this can have a tag #closes-bug for this bug report in the commit message) Or any better idea is always welcome to me. 1: https://github.com/01org/isa-l/issues/10 2: https://review.openstack.org/#/c/393595/ 3: https://review.openstack.org/#/c/393263/ 4: https://review.openstack.org/#/c/393276/ 5: https://review.openstack.org/#/c/393656/	This issue is being treated as a potential security risk under embargo. Please do not make any public mention of embargoed (private) security vulnerabilities before their coordinated publication by the OpenStack Vulnerability Management Team in the form of an official OpenStack Security Advisory. This includes discussion of the bug or associated fixes in public forums such as mailing lists, code review systems and bug trackers. Please also avoid private disclosure to other individuals not already approved for access to this information, and provide this same reminder to those who are made aware of the issue prior to publication. All discussion should remain confined to this private bug report, and any proposed fixes should be added to the bug as attachments. -- This is what I have been fighting in the last week and here I'd like to summarize/describe for the relations and impact to Swift/PyECLib/liberasurecode. I DO NOT think this is a security risk because this doesn't cause an information leak or significant resource attack from outside, but this includes significant impact for users because of the risk of data lost. This is why I set this as private-security temporary. We can make this as open after notmyname confirm the impact and according to his decision. To Swift API: Using isa_l_rs_vand EC policy, pyeclib <= 1.3.1, liberasurecode <=1.2.1 with >= 5 ec_n_parity setting, you can see corrupted data RARELY WITHOUT ANY ERRORS at downloading. If you are using 'swift' command to download, the error appears most likely, ubuntu@saio:~$ swift download ectest vagrant/.scratch/swift.tar.gz -o delete.me Error downloading object 'ectest/vagrant/.scratch/swift.tar.gz': u'Error downloading vagrant/.scratch/swift.tar.gz: md5sum != etag, 4f327eed4b9b9210d13a30c9441fefa1 != 5b60a22fcf65a356391cbdb08b9140d8' And one more significant point is NOTHING QUARANTINED from the fact so swift is now regarding the corrupted one as a correct object. The reason, I said "RARELY" above is that occurs due to the fragments combinations. You can see such buggy combinations with following script, that asserts isa_l_rs_vand ec_ndata=10, ec_n_parity=5 case: from pyeclib.ec_iface import ECDriver from itertools import combinations if __name__ == '__main__': driver = ECDriver(ec_type='isa_l_rs_vand', k=10, m=5) orig_data = 'a' * 1024 * 1024 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] bad_combinations = [] check_frags_combinations = list(combinations(frags, 10)) for check_frags in check_frags_combinations: check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data except AssertionError: bad_combinations.append(list(check_frags_dict)) if bad_combinations: ratio = float(len(bad_combinations)) / len(check_frags_combinations) print 'Bad combinations found: %s/%s (ratio: %s)' % ( len(bad_combinations), len(check_frags_combinations), ratio) print bad_combinations The result of the script should be: Bad combinations found: 10/3003 (ratio: 0.00333000333) [[0, 1, 2, 3, 5, 6, 8, 10, 11, 14], [0, 1, 2, 3, 5, 7, 8, 10, 13, 14], [0, 1, 2, 4, 5, 7, 9, 10, 11, 14], [0, 1, 2, 4, 6, 7, 9, 10, 13, 14], [0, 1, 3, 4, 6, 8, 9, 10, 11, 14], [0, 1, 3, 5, 6, 8, 9, 10, 13, 14], [0, 2, 3, 5, 7, 8, 9, 10, 11, 14], [0, 2, 4, 5, 7, 8, 9, 10, 13, 14], [1, 2, 4, 6, 7, 8, 9, 10, 11, 14], [1, 3, 4, 6, 7, 8, 9, 10, 13, 14]] So the bad combinations appears approximately in 1/300 of the time. The bug reason: The bug exists in liberasurecode isa-l backend implementation. I've already asked to the isa-l maintainer about the possiblity that isa-l can return such a corrupted data without errors[1]. He honestly point out the issue that gf_gen_rs_matrix (which is used in isa_l_rs_vand encode/decode/reconstruct process) could make non-invertable matrix that means "the matrix can have the combination to be unrecoverable". So the bad pattern should exist for the isa-l constraint. However, he points out another issue that "we can find the case if you asserts the return value of gf_invert_matrix" and liberasurecode doesn't. That is the main reason, Swift/PyECLib/liberasurecode can make such a garbage decoding result. I made a patch[2] that fixes to enable liberasurecode to be able to return Error instead of corrupted data. With [2], Swift will be able to return Error (sort of 5xx) in the bad pattern case. Still impact to existing clusters: If you have existing cluster with >=5 parities isa-l policy, it is still problematic even if you patch the fix [2]. That is probably because you are running the object-reconstructor in the cluster. As well as Swift API impact, object-reconstructor is also impacted the issue. The following script is an extent from the previous script that asserts what happen if a fragment was replaced in a good pattern with the fragment reconstructed from a bad pattern. from pyeclib.ec_iface import ECDriver from itertools import combinations def test_ec_driver(k, m): driver = ECDriver(ec_type='isa_l_rs_vand', k=k, m=m) orig_data = 'a' * 1000 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] an_ok_pattern = None for check_frags in combinations(frags, k): check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data if not an_ok_pattern: # save this ok pattern an_ok_pattern = dict(check_frags_dict) except AssertionError: # find missing index for x in range(k + m): if x not in check_frags_dict: break else: raise Exception('something wrong') print 'try reconstruct index %s with %s' % ( x, check_frags_dict.keys()) # try to reconstruct reconed = driver.reconstruct(check_frags_dict.values(), [x])[0] try: assert reconed == frags[x] except AssertionError: if x not in an_ok_pattern: raise Exception('need to make this script better') # sanity check decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok' # try to decode again including the garbage reconstructed one an_ok_pattern[x] = reconed decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok if the assertion above passed' if __name__ == '__main__': params = [(10, 5)] for k, m in params: test_ec_driver(k, m) As you can see, the bad pattern can infect the corrupted frags to a good pattern. And then, right now, we have no way to find a good pattern, including the infected garbage as a bad pattern unless we calculate the original object etag with decoding whole fragments in the fragment archive. Even if once we can find the bad pattern, I don't think the way to detect which one or both is (are) bad one(s) exist so that it means, once infected, we cannot stop the pandemic anymore and the infected object will go to the garbage in all fragment archives step by step. This is the worst problem. I think, to the new incoming objects and brand new cluster deployment, the fix [2] is enough if you can accept to see the errors in 1/300 of the time. What we can do for this pandemic: For the possible patches perspective, we should land [2] which can prevent future pandemic as soon as possible. For existing cluster users, sorry, what I can do is just saying "Please plan to migrate to a safer policy as soon as possible before the pandemic goes to data lost". And I'm trying to add the greedy testing [5] to ensure the result consistency in decode/reconstruct that we can know which parameters can be safe. Otherwise, we may be able to leave same pandemic possiblity to another policy. For isa-l perspective, we know isa-l is the most high-performance open sourced engine in the available backends so the constraint will be a problem to choose. To solve the bad combination, the isa-l maintainer suggested to use another matrix (gf_gen_caucy1_matrix) for the backend which can be invertable at 5>=parities case. I had worked at [3][4] as the draft. And in my experiment, it shows as similar high-perforemance as isa_r_rs_vand without the constraint we hit. PyECLib/liberasurecode Version Consideration: In my opinion, we should fix [2] as soon as possible and set it as the Swift requirement to prevent the pandemic in the future. However, right now we don't have the way to set the c lang binary version constraint. To enable the requirement, my idea is as following steps: 1. [2] land at liberasurecode master 2: bump liberasurecode version 3: make hard-coded constraint for PyECLib to require the liberasurecode >= (step2) (this can be painful becuase xenial or yakkety is =< 1.2.1) 4: bump pyeclib version 5: set the pyeclib requirement in Swift requirements.txt (<- this can have a tag #closes-bug for this bug report in the commit message) Or any better idea is always welcome to me. 1: https://github.com/01org/isa-l/issues/10 2: https://review.openstack.org/#/c/393595/ 3: https://review.openstack.org/#/c/393263/ 4: https://review.openstack.org/#/c/393276/ 5: https://review.openstack.org/#/c/393656/
2016-11-07 15:19:09	Jeremy Stanley	bug			added subscriber Swift Core security contacts
2016-11-09 11:48:27	Kota Tsuyuzaki	attachment added		reputter script https://bugs.launchpad.net/swift/+bug/1639691/+attachment/4774846/+files/reputter.py
2016-11-09 22:01:49	clayg	bug			added subscriber Tushar Gohad
2016-11-09 22:04:00	clayg	bug			added subscriber Kevin Greenan
2016-11-16 23:01:55	John Dickinson	information type	Private Security	Public
2016-11-16 23:03:42	John Dickinson	description	This issue is being treated as a potential security risk under embargo. Please do not make any public mention of embargoed (private) security vulnerabilities before their coordinated publication by the OpenStack Vulnerability Management Team in the form of an official OpenStack Security Advisory. This includes discussion of the bug or associated fixes in public forums such as mailing lists, code review systems and bug trackers. Please also avoid private disclosure to other individuals not already approved for access to this information, and provide this same reminder to those who are made aware of the issue prior to publication. All discussion should remain confined to this private bug report, and any proposed fixes should be added to the bug as attachments. -- This is what I have been fighting in the last week and here I'd like to summarize/describe for the relations and impact to Swift/PyECLib/liberasurecode. I DO NOT think this is a security risk because this doesn't cause an information leak or significant resource attack from outside, but this includes significant impact for users because of the risk of data lost. This is why I set this as private-security temporary. We can make this as open after notmyname confirm the impact and according to his decision. To Swift API: Using isa_l_rs_vand EC policy, pyeclib <= 1.3.1, liberasurecode <=1.2.1 with >= 5 ec_n_parity setting, you can see corrupted data RARELY WITHOUT ANY ERRORS at downloading. If you are using 'swift' command to download, the error appears most likely, ubuntu@saio:~$ swift download ectest vagrant/.scratch/swift.tar.gz -o delete.me Error downloading object 'ectest/vagrant/.scratch/swift.tar.gz': u'Error downloading vagrant/.scratch/swift.tar.gz: md5sum != etag, 4f327eed4b9b9210d13a30c9441fefa1 != 5b60a22fcf65a356391cbdb08b9140d8' And one more significant point is NOTHING QUARANTINED from the fact so swift is now regarding the corrupted one as a correct object. The reason, I said "RARELY" above is that occurs due to the fragments combinations. You can see such buggy combinations with following script, that asserts isa_l_rs_vand ec_ndata=10, ec_n_parity=5 case: from pyeclib.ec_iface import ECDriver from itertools import combinations if __name__ == '__main__': driver = ECDriver(ec_type='isa_l_rs_vand', k=10, m=5) orig_data = 'a' * 1024 * 1024 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] bad_combinations = [] check_frags_combinations = list(combinations(frags, 10)) for check_frags in check_frags_combinations: check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data except AssertionError: bad_combinations.append(list(check_frags_dict)) if bad_combinations: ratio = float(len(bad_combinations)) / len(check_frags_combinations) print 'Bad combinations found: %s/%s (ratio: %s)' % ( len(bad_combinations), len(check_frags_combinations), ratio) print bad_combinations The result of the script should be: Bad combinations found: 10/3003 (ratio: 0.00333000333) [[0, 1, 2, 3, 5, 6, 8, 10, 11, 14], [0, 1, 2, 3, 5, 7, 8, 10, 13, 14], [0, 1, 2, 4, 5, 7, 9, 10, 11, 14], [0, 1, 2, 4, 6, 7, 9, 10, 13, 14], [0, 1, 3, 4, 6, 8, 9, 10, 11, 14], [0, 1, 3, 5, 6, 8, 9, 10, 13, 14], [0, 2, 3, 5, 7, 8, 9, 10, 11, 14], [0, 2, 4, 5, 7, 8, 9, 10, 13, 14], [1, 2, 4, 6, 7, 8, 9, 10, 11, 14], [1, 3, 4, 6, 7, 8, 9, 10, 13, 14]] So the bad combinations appears approximately in 1/300 of the time. The bug reason: The bug exists in liberasurecode isa-l backend implementation. I've already asked to the isa-l maintainer about the possiblity that isa-l can return such a corrupted data without errors[1]. He honestly point out the issue that gf_gen_rs_matrix (which is used in isa_l_rs_vand encode/decode/reconstruct process) could make non-invertable matrix that means "the matrix can have the combination to be unrecoverable". So the bad pattern should exist for the isa-l constraint. However, he points out another issue that "we can find the case if you asserts the return value of gf_invert_matrix" and liberasurecode doesn't. That is the main reason, Swift/PyECLib/liberasurecode can make such a garbage decoding result. I made a patch[2] that fixes to enable liberasurecode to be able to return Error instead of corrupted data. With [2], Swift will be able to return Error (sort of 5xx) in the bad pattern case. Still impact to existing clusters: If you have existing cluster with >=5 parities isa-l policy, it is still problematic even if you patch the fix [2]. That is probably because you are running the object-reconstructor in the cluster. As well as Swift API impact, object-reconstructor is also impacted the issue. The following script is an extent from the previous script that asserts what happen if a fragment was replaced in a good pattern with the fragment reconstructed from a bad pattern. from pyeclib.ec_iface import ECDriver from itertools import combinations def test_ec_driver(k, m): driver = ECDriver(ec_type='isa_l_rs_vand', k=k, m=m) orig_data = 'a' * 1000 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] an_ok_pattern = None for check_frags in combinations(frags, k): check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data if not an_ok_pattern: # save this ok pattern an_ok_pattern = dict(check_frags_dict) except AssertionError: # find missing index for x in range(k + m): if x not in check_frags_dict: break else: raise Exception('something wrong') print 'try reconstruct index %s with %s' % ( x, check_frags_dict.keys()) # try to reconstruct reconed = driver.reconstruct(check_frags_dict.values(), [x])[0] try: assert reconed == frags[x] except AssertionError: if x not in an_ok_pattern: raise Exception('need to make this script better') # sanity check decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok' # try to decode again including the garbage reconstructed one an_ok_pattern[x] = reconed decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok if the assertion above passed' if __name__ == '__main__': params = [(10, 5)] for k, m in params: test_ec_driver(k, m) As you can see, the bad pattern can infect the corrupted frags to a good pattern. And then, right now, we have no way to find a good pattern, including the infected garbage as a bad pattern unless we calculate the original object etag with decoding whole fragments in the fragment archive. Even if once we can find the bad pattern, I don't think the way to detect which one or both is (are) bad one(s) exist so that it means, once infected, we cannot stop the pandemic anymore and the infected object will go to the garbage in all fragment archives step by step. This is the worst problem. I think, to the new incoming objects and brand new cluster deployment, the fix [2] is enough if you can accept to see the errors in 1/300 of the time. What we can do for this pandemic: For the possible patches perspective, we should land [2] which can prevent future pandemic as soon as possible. For existing cluster users, sorry, what I can do is just saying "Please plan to migrate to a safer policy as soon as possible before the pandemic goes to data lost". And I'm trying to add the greedy testing [5] to ensure the result consistency in decode/reconstruct that we can know which parameters can be safe. Otherwise, we may be able to leave same pandemic possiblity to another policy. For isa-l perspective, we know isa-l is the most high-performance open sourced engine in the available backends so the constraint will be a problem to choose. To solve the bad combination, the isa-l maintainer suggested to use another matrix (gf_gen_caucy1_matrix) for the backend which can be invertable at 5>=parities case. I had worked at [3][4] as the draft. And in my experiment, it shows as similar high-perforemance as isa_r_rs_vand without the constraint we hit. PyECLib/liberasurecode Version Consideration: In my opinion, we should fix [2] as soon as possible and set it as the Swift requirement to prevent the pandemic in the future. However, right now we don't have the way to set the c lang binary version constraint. To enable the requirement, my idea is as following steps: 1. [2] land at liberasurecode master 2: bump liberasurecode version 3: make hard-coded constraint for PyECLib to require the liberasurecode >= (step2) (this can be painful becuase xenial or yakkety is =< 1.2.1) 4: bump pyeclib version 5: set the pyeclib requirement in Swift requirements.txt (<- this can have a tag #closes-bug for this bug report in the commit message) Or any better idea is always welcome to me. 1: https://github.com/01org/isa-l/issues/10 2: https://review.openstack.org/#/c/393595/ 3: https://review.openstack.org/#/c/393263/ 4: https://review.openstack.org/#/c/393276/ 5: https://review.openstack.org/#/c/393656/	This is what I have been fighting in the last week and here I'd like to summarize/describe for the relations and impact to Swift/PyECLib/liberasurecode. I DO NOT think this is a security risk because this doesn't cause an information leak or significant resource attack from outside, but this includes significant impact for users because of the risk of data lost. This is why I set this as private-security temporary. We can make this as open after notmyname confirm the impact and according to his decision. To Swift API: Using isa_l_rs_vand EC policy, pyeclib <= 1.3.1, liberasurecode <=1.2.1 with >= 5 ec_n_parity setting, you can see corrupted data RARELY WITHOUT ANY ERRORS at downloading. If you are using 'swift' command to download, the error appears most likely, ubuntu@saio:~$ swift download ectest vagrant/.scratch/swift.tar.gz -o delete.me Error downloading object 'ectest/vagrant/.scratch/swift.tar.gz': u'Error downloading vagrant/.scratch/swift.tar.gz: md5sum != etag, 4f327eed4b9b9210d13a30c9441fefa1 != 5b60a22fcf65a356391cbdb08b9140d8' And one more significant point is NOTHING QUARANTINED from the fact so swift is now regarding the corrupted one as a correct object. The reason, I said "RARELY" above is that occurs due to the fragments combinations. You can see such buggy combinations with following script, that asserts isa_l_rs_vand ec_ndata=10, ec_n_parity=5 case: from pyeclib.ec_iface import ECDriver from itertools import combinations if __name__ == '__main__': driver = ECDriver(ec_type='isa_l_rs_vand', k=10, m=5) orig_data = 'a' * 1024 * 1024 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] bad_combinations = [] check_frags_combinations = list(combinations(frags, 10)) for check_frags in check_frags_combinations: check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data except AssertionError: bad_combinations.append(list(check_frags_dict)) if bad_combinations: ratio = float(len(bad_combinations)) / len(check_frags_combinations) print 'Bad combinations found: %s/%s (ratio: %s)' % ( len(bad_combinations), len(check_frags_combinations), ratio) print bad_combinations The result of the script should be: Bad combinations found: 10/3003 (ratio: 0.00333000333) [[0, 1, 2, 3, 5, 6, 8, 10, 11, 14], [0, 1, 2, 3, 5, 7, 8, 10, 13, 14], [0, 1, 2, 4, 5, 7, 9, 10, 11, 14], [0, 1, 2, 4, 6, 7, 9, 10, 13, 14], [0, 1, 3, 4, 6, 8, 9, 10, 11, 14], [0, 1, 3, 5, 6, 8, 9, 10, 13, 14], [0, 2, 3, 5, 7, 8, 9, 10, 11, 14], [0, 2, 4, 5, 7, 8, 9, 10, 13, 14], [1, 2, 4, 6, 7, 8, 9, 10, 11, 14], [1, 3, 4, 6, 7, 8, 9, 10, 13, 14]] So the bad combinations appears approximately in 1/300 of the time. The bug reason: The bug exists in liberasurecode isa-l backend implementation. I've already asked to the isa-l maintainer about the possiblity that isa-l can return such a corrupted data without errors[1]. He honestly point out the issue that gf_gen_rs_matrix (which is used in isa_l_rs_vand encode/decode/reconstruct process) could make non-invertable matrix that means "the matrix can have the combination to be unrecoverable". So the bad pattern should exist for the isa-l constraint. However, he points out another issue that "we can find the case if you asserts the return value of gf_invert_matrix" and liberasurecode doesn't. That is the main reason, Swift/PyECLib/liberasurecode can make such a garbage decoding result. I made a patch[2] that fixes to enable liberasurecode to be able to return Error instead of corrupted data. With [2], Swift will be able to return Error (sort of 5xx) in the bad pattern case. Still impact to existing clusters: If you have existing cluster with >=5 parities isa-l policy, it is still problematic even if you patch the fix [2]. That is probably because you are running the object-reconstructor in the cluster. As well as Swift API impact, object-reconstructor is also impacted the issue. The following script is an extent from the previous script that asserts what happen if a fragment was replaced in a good pattern with the fragment reconstructed from a bad pattern. from pyeclib.ec_iface import ECDriver from itertools import combinations def test_ec_driver(k, m): driver = ECDriver(ec_type='isa_l_rs_vand', k=k, m=m) orig_data = 'a' * 1000 encoded = driver.encode(orig_data) # make all fragment like (index, frag_data) format to feed to combinations frags = [(i, frag) for i, frag in enumerate(encoded)] an_ok_pattern = None for check_frags in combinations(frags, k): check_frags_dict = dict(check_frags) decoded = driver.decode(check_frags_dict.values()) # No No don't kidding me try: assert decoded == orig_data if not an_ok_pattern: # save this ok pattern an_ok_pattern = dict(check_frags_dict) except AssertionError: # find missing index for x in range(k + m): if x not in check_frags_dict: break else: raise Exception('something wrong') print 'try reconstruct index %s with %s' % ( x, check_frags_dict.keys()) # try to reconstruct reconed = driver.reconstruct(check_frags_dict.values(), [x])[0] try: assert reconed == frags[x] except AssertionError: if x not in an_ok_pattern: raise Exception('need to make this script better') # sanity check decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok' # try to decode again including the garbage reconstructed one an_ok_pattern[x] = reconed decoded = driver.decode(an_ok_pattern.values()) assert decoded == orig_data print 'this is ok if the assertion above passed' if __name__ == '__main__': params = [(10, 5)] for k, m in params: test_ec_driver(k, m) As you can see, the bad pattern can infect the corrupted frags to a good pattern. And then, right now, we have no way to find a good pattern, including the infected garbage as a bad pattern unless we calculate the original object etag with decoding whole fragments in the fragment archive. Even if once we can find the bad pattern, I don't think the way to detect which one or both is (are) bad one(s) exist so that it means, once infected, we cannot stop the pandemic anymore and the infected object will go to the garbage in all fragment archives step by step. This is the worst problem. I think, to the new incoming objects and brand new cluster deployment, the fix [2] is enough if you can accept to see the errors in 1/300 of the time. What we can do for this pandemic: For the possible patches perspective, we should land [2] which can prevent future pandemic as soon as possible. For existing cluster users, sorry, what I can do is just saying "Please plan to migrate to a safer policy as soon as possible before the pandemic goes to data lost". And I'm trying to add the greedy testing [5] to ensure the result consistency in decode/reconstruct that we can know which parameters can be safe. Otherwise, we may be able to leave same pandemic possiblity to another policy. For isa-l perspective, we know isa-l is the most high-performance open sourced engine in the available backends so the constraint will be a problem to choose. To solve the bad combination, the isa-l maintainer suggested to use another matrix (gf_gen_caucy1_matrix) for the backend which can be invertable at 5>=parities case. I had worked at [3][4] as the draft. And in my experiment, it shows as similar high-perforemance as isa_r_rs_vand without the constraint we hit. PyECLib/liberasurecode Version Consideration: In my opinion, we should fix [2] as soon as possible and set it as the Swift requirement to prevent the pandemic in the future. However, right now we don't have the way to set the c lang binary version constraint. To enable the requirement, my idea is as following steps: 1. [2] land at liberasurecode master 2: bump liberasurecode version 3: make hard-coded constraint for PyECLib to require the liberasurecode >= (step2) (this can be painful becuase xenial or yakkety is =< 1.2.1) 4: bump pyeclib version 5: set the pyeclib requirement in Swift requirements.txt (<- this can have a tag #closes-bug for this bug report in the commit message) Or any better idea is always welcome to me. 1: https://github.com/01org/isa-l/issues/10 2: https://review.openstack.org/#/c/393595/ 3: https://review.openstack.org/#/c/393263/ 4: https://review.openstack.org/#/c/393276/ 5: https://review.openstack.org/#/c/393656/
2016-11-16 23:08:00	John Dickinson	liberasurecode: status	New	Fix Released
2016-11-16 23:08:05	John Dickinson	pyeclib: status	New	In Progress
2016-11-17 14:21:51	Jeremy Stanley	ossa: status	Incomplete	Won't Fix
2016-11-17 14:22:12	Jeremy Stanley	ossa: status	Won't Fix	Invalid
2016-12-03 00:07:42	clayg	swift: importance	Critical	High
2017-01-11 20:28:32	clayg	swift: importance	High	Critical
2017-02-01 19:30:47	clayg	swift: importance	Critical	High
2017-02-01 19:30:52	clayg	swift: importance	High	Medium
2017-02-24 01:01:41	OpenStack Infra	tags	ec	ec in-stable-newton
2017-06-22 01:11:07	OpenStack Infra	swift: status	Confirmed	Fix Released
2024-04-20 00:36:42	OpenStack Infra	pyeclib: status	In Progress	Fix Released