Metadata corrupted

Bug #1718955 reported by DilipSimha
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Invalid
Undecided
Unassigned

Bug Description

On one of our openstack-swift deployments, we observed several metadata errors from swift error log:

Type1: TypeError: 'NoneType' object has no attribute '_getitem_
2017-09-19T00:18:52.696180+00:00 ealecb2 object-server: ERROR __call__ error with GET /r1/27/AUTH_ktvzcv/mag-1453071101-yverxe/7906a582-168452472 : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 722, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 2535, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1108, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages /swift/obj/server.py", line 523, in GET#012 with disk_file.open():#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1322, in open#012 data_file, meta_file)#012 File "/us r/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1557, in _construct_from_data_file#012 self._verify_data_file(data_file, fp)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1467, in _verify_data_file#012 mname = self._metadata['name']#012TypeError: 'NoneType' object has no attribute '__getitem__' (txn: tx9612d580becb445f93869-0059c0626c)

Type2: TypeError: tuple indices must be integers, not str
2017-09-19T00:18:52.707319+00:00 ealecb2 swift-proxy-server: ERROR 500 Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 722, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 2535, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1108, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 523, in GET#012 with disk_file.open():#012 File " /usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1322, in open#012 data_file, meta_file)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1557, in _construct_from_data_file#012 self._verify_data_file(data_file, fp)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1467, in _verify_data_file#012 mname = self._metadata['name']#012TypeError: tuple indices must be integers, not str#012 From Object Server 127.0.0.1:6000/r1 (txn: txe732f3edec224588bfd96-0059c0626c) (client_ip: 10.85.3.89)

Type3: TypeError: 'bool' object has no attribute '_getitem_
2017-09-19T00:18:52.712802+00:00 ealecb2 swift-proxy-server: ERROR 500 Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 722, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/s wift/common/utils.py", line 2535, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1108, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 523, in GET#012 with disk_file.open():#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1322, in open#012 data_file, meta_file)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1557, in _construct_from_data_file#012 self._verify_data_file(data_file, fp)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1467, in _verify_data_file#012 mname = self._metadata['name']#012TypeError: 'bool' object has no attribute '__getitem__'#012 From Object Server 127.0.0.1:6000/r1 (txn: tx3719f3229c0943afab817-0059c0626c) (client_ip: 10.85.3.89)

Type4: TypeError: string indices must be integers, not str
 18 2017-09-19T00:18:52.714421+00:00 ealecb2 object-server: ERROR __call__ error with GET /r1/30/AUTH_ktvzcv/mag-1453071101-yverxe/c2624aee-168395212 : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 722, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 2535, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1108, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/server.py", line 523, in GET#012 with disk_file.open():#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1322, in open#012 data_file, meta_file)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1557, in _construct_from_data_file#012 self._verify_data_file(data_file, fp)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/diskfile.py", line 1467, in _verify_data_file#012 mname = self._metadata['name']#012TypeError: string indices must be integers, not str (txn: txeab2fe7cd4664f748f66e-0059c0626c)

Upon reporting this error, data is quarantined. Unfortunately, we do not have any replicas (just 1 disk)

As can be observed, all these errors are stemming from file: /usr/lib/python2.7/dist-packages/swift/obj/diskfile.py at line 1467.
Looking at source code in our repo (which has slight unrelated(in this context) modifications from opensource version):

1528 def _construct_from_data_file(self, data_file, meta_file):
1529 """
1530 Open the `.data` file to fetch its metadata, and fetch the metadata
1531 from the fast-POST `.meta` file as well if it exists, merging them
1532 properly.
1533
1534 :param data_file: on-disk `.data` file being considered
1535 :param meta_file: on-disk fast-POST `.meta` file being considered
1536 :returns: an opened data file pointer
1537 :raises DiskFileError: various exceptions from
1538 :func:`swift.obj.diskfile.DiskFile._verify_data_file`
1539 """
1540 fp = open(data_file, 'rb')
1541 datafile_metadata = self._failsafe_read_metadata(fp, data_file)
1542 if meta_file:
1543 self._metadata = self._failsafe_read_metadata(meta_file, meta_file)
1544 sys_metadata = dict(
1545 [(key, val) for key, val in datafile_metadata.iteritems()
1546 if key.lower() in DATAFILE_SYSTEM_META
1547 or is_sys_meta('object', key)])
1548 self._metadata.update(sys_metadata)
1549 else:
1550 self._metadata = datafile_metadata
1551 if self._name is None:
1552 # If we don't know our name, we were just given a hash dir at
1553 # instantiation, so we'd better validate that the name hashes back
1554 # to us
1555 self._name = self._metadata['name']
1556 self._verify_name_matches_hash(data_file)
1557 self._verify_data_file(data_file, fp)
1558 return fp

....
....
....

1450 def _verify_data_file(self, data_file, fp):
1451 """
1452 Verify the metadata's name value matches what we think the object is
1453 named.
1454
1455 :param data_file: data file name being consider, used when quarantines
1456 occur
1457 :param fp: open file pointer so that we can `fstat()` the file to
1458 verify the on-disk size with Content-Length metadata value
1459 :raises DiskFileCollision: if the metadata stored name does not match
1460 the referenced name of the file
1461 :raises DiskFileExpired: if the object has expired
1462 :raises DiskFileQuarantined: if data inconsistencies were detected
1463 between the metadata and the file-system
1464 metadata
1465 """
1466 try:
1467 mname = self._metadata['name']
1468 except KeyError:
1469 raise self._quarantine(data_file, "missing name metadata")
1470 else:
1471 if mname != self._name:
1472 self._logger.error(
1473 _('Client path %(client)s does not match '
1474 'path stored in object metadata %(meta)s'),
1475 {'client': self._name, 'meta': mname})
1476 raise DiskFileCollision('Client path does not match path '
1477 'stored in object metadata')
1478 try:
1479 x_delete_at = int(self._metadata['X-Delete-At'])
1480 except KeyError:
1481 pass
1482 except ValueError:
1483 # Quarantine, the x-delete-at key is present but not an
1484 # integer.
1485 raise self._quarantine(
1486 data_file, "bad metadata x-delete-at value %s" % (
1487 self._metadata['X-Delete-At']))
1488 else:
1489 if x_delete_at <= time.time():
1490 raise DiskFileExpired(metadata=self._metadata)
1491 try:
1492 metadata_size = int(self._metadata['Content-Length'])
1493 except KeyError:
1494 raise self._quarantine(
1495 data_file, "missing content-length in metadata")
1496 except ValueError:
1497 # Quarantine, the content-length key is present but not an
1498 # integer.
1499 raise self._quarantine(
1500 data_file, "bad metadata content-length value %s" % (
1501 self._metadata['Content-Length']))
1502 fd = fp.fileno()
1503 try:
1504 statbuf = os.fstat(fd)
1505 except OSError as err:
1506 # Quarantine, we can't successfully stat the file.
1507 raise self._quarantine(data_file, "not stat-able: %s" % err)
1508 else:
1509 obj_size = statbuf.st_size - METADATA_FOOTER_LEN
1510 if obj_size != metadata_size:
1511 raise self._quarantine(
1512 data_file, "metadata content-length %s does"
1513 " not match actual object size %s" % (
1514 metadata_size, statbuf.st_size))
1515 self._content_length = obj_size
1516 return obj_size

Definitely looks like metadata area is corrupted. I don't see any IO errors from local disk.
Can someone please help in understanding what exactly went wrong.
Is it a software or hardware bug ?
Is it possible to recover data in quarantined region.

Changed in swift:
status: New → Incomplete
Revision history for this message
Alistair Coles (alistair-coles) wrote :

Based on the diskile code posted above, this looks like a relatively old version of Swift. Can you update the bug report with the swift version?

You say that "source code in our repo (which has slight unrelated(in this context) modifications from opensource version)" - maybe you can post the _failsafe_read_metadata and read_metadata functions from diskfile.py. In fact, post a diff of your diskfile.py against the opensource version you have forked.

Revision history for this message
DilipSimha (nmdilipsimha) wrote :

Yes the deployed swift-version is not the latest.
Swift-version: 2.2.2

Have also attached the diff.

Revision history for this message
Alistair Coles (alistair-coles) wrote :

From the diff it appears you have modified the read/write_metadata functions. From the traceback it appears there is a problem with reading metadata. I suspect the two might be related.

I'm leaving this bug as incomplete because I don't think there is sufficient information to suggest that this is a bug in upstream Swift code.

Revision history for this message
Alistair Coles (alistair-coles) wrote :

Changing to Invalid, does not appear to be a bug in upstream swift. Please restore the bug report with more information if you disagree.

Changed in swift:
status: Incomplete → Invalid
Revision history for this message
DilipSimha (nmdilipsimha) wrote :

I understand that metadata path has deviated from upstream version, but its a working piece of code that has been tested for a long time now on several active deployments.

Can you please help debug the issue ? As in, are there any swift tools that can be used to:
1: check metadata portions of the .data file ?
2: remove data from quarantined region ?
3: Whether this bug (metadata read failures) was seen before and if latest swift code has fix for them ?

Any help is greatly appreciated.

Revision history for this message
clayg (clay-gerrard) wrote :

It looks like the code to deal with metadata overflow was removed - so that's probably what you're looking at. Probably there's more than 256 bytes of serialized metadata at the end of the file.

Just open it up in hexdump and try to spot the beginning of the pickled data. Can you share the last ~1MiB of the datafile?

You are correct that the newest Swift code doesn't have this bug - but it's also going to be incompatible with all data in your cluster because Swift doesn't store metadata they way that your code does. You'd have to do a migration to get back on a community supported version of Swift - or seek commercial services/support. GL!

Revision history for this message
Samuel Merritt (torgomatic) wrote :

There's no Swift tools to check your metadata storage because Swift doesn't store metadata like your cluster does.

Swift stores metadata in one or more extended attributes (xattrs) on the filesystem. Your cluster stores metadata in the file data in the last N bytes.

As for debugging help: I took a look at the provided diff. Your write_metadata() function takes the metadata, pickles it, and pads the pickled bytes with zeros up to a size of 512 [1]. However, when the metadata is large enough that pickle.dumps() returns more than 512 bytes, write_metadata() does not detect that and writes it anyway. This is probably why you've got broken objects.

[1] the comments say 256, but the constant METADATA_FOOTER_LEN is 512

Revision history for this message
DilipSimha (nmdilipsimha) wrote :

Thank you Samuel and Clay for helping locate the buggy code. The bug really looks nasty and will work on patching it.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.