relinker errors due to existing similar tombstones

Bug #1934142 reported by Alistair Coles
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

We've seen many relink errors caused by the reconciler concurrently DELETEing objects during relink, which leaves tombstones with different inodes in the old and new part power locations.
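For readers unfamiliar with the "old" and "new" part power locations, the sketch below shows why each object has exactly one location per part power. It assumes a partition is the top `part_power` bits of the first 32 bits of the object's md5 hash, modelled loosely on Swift's ring (Swift's hash path prefix/suffix salting is omitted), and assumes a `<root>/objects/<part>/<suffix>/<hash>` directory layout. `partition_for_hash` and `object_dir` are hypothetical helpers, not Swift APIs.

```python
import hashlib
import struct


def partition_for_hash(hex_hash: str, part_power: int) -> int:
    """Top `part_power` bits of the first 32 bits of the hash (assumed scheme)."""
    raw = bytes.fromhex(hex_hash)
    return struct.unpack_from(">I", raw)[0] >> (32 - part_power)


def object_dir(device_root: str, hex_hash: str, part_power: int) -> str:
    """Hypothetical on-disk directory for an object hash at a given part power."""
    part = partition_for_hash(hex_hash, part_power)
    suffix = hex_hash[-3:]  # assumed: suffix dir is the last 3 hex chars
    return f"{device_root}/objects/{part}/{suffix}/{hex_hash}"


hex_hash = hashlib.md5(b"/AUTH_test/c/o").hexdigest()
old_part = partition_for_hash(hex_hash, 10)
new_part = partition_for_hash(hex_hash, 11)
# One more bit of part power splits each partition in two.
assert new_part in (2 * old_part, 2 * old_part + 1)
print(object_dir("/srv/node/sda", hex_hash, 10))
print(object_dir("/srv/node/sda", hex_hash, 11))
```

Because the new partition is always `2 * old` or `2 * old + 1`, a relink only ever has one candidate destination, so any file already sitting there with the same name is a genuine conflict.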

- The reconciler daemons can make concurrent DELETEs of the same object with the same timestamp.

- Concurrent DELETEs (or PUTs) with the same X-Timestamp during the relink step *can* result in the object server creating files with different inodes in the old and new part power locations. This is likely to result in tracebacks similar to:

    object-server: err Relinking X to Y failed: [Errno 17] File exists:
    Traceback (most recent call last):
      File "/opt/ss/lib/python2.7/site-packages/swift/obj/diskfile.py", line 1877, in _finalize_put
        relink_paths(target_path, new_target_path)
      File "/opt/ss/lib/python2.7/site-packages/swift/obj/diskfile.py", line 502, in relink_paths
        raise err
    OSError: [Errno 17] File exists (txn: tx097009b1a17c49409f965-0060dc22f0)

Despite the traceback, a diskfile exists in both the old and new part power locations.

- The relinker will log an error if it fails to relink a file because there is an existing file in the next part power location with a different inode.

For example:
Error relinking: failed to relink X to Y: [Errno 17] File exists

So when the relinker visits tombstones created by the reconciler, it can log errors and appear to have failed to relink.
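The failure mode described above can be reproduced in miniature with plain files. This is a sketch under stated assumptions: two independent writers each create an empty tombstone named `<timestamp>.ts`, one in a stand-in "old" directory and one in a stand-in "new" directory, and a subsequent hard link (what a relink does) then fails with EEXIST even though both files represent the same DELETE. The directory names and `simulate_conflict` helper are hypothetical.

```python
import errno
import os
import tempfile


def write_tombstone(directory: str, timestamp: str) -> str:
    """Create an empty <timestamp>.ts file, standing in for one DELETE's tombstone."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, timestamp + ".ts")
    with open(path, "wb"):
        pass
    return path


def simulate_conflict(timestamp: str = "1624000000.00000") -> tuple:
    """Return (got_eexist, same_inode) after two writers race with one timestamp."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-ins for the old and new part power object dirs.
        old_path = write_tombstone(os.path.join(root, "old"), timestamp)
        new_path = write_tombstone(os.path.join(root, "new"), timestamp)
        same_inode = os.stat(old_path).st_ino == os.stat(new_path).st_ino
        try:
            os.link(old_path, new_path)  # the relink attempt
        except OSError as err:
            return err.errno == errno.EEXIST, same_inode
        return False, same_inode


print(simulate_conflict())  # on a POSIX filesystem: (True, False)
```

The two tombstones carry identical names, and hence identical timestamps, yet they are distinct inodes, which is exactly the state the relinker then reports as an error.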

Note: the symptoms are similar to https://bugs.launchpad.net/swift/+bug/1921718 but the root cause is different. The fix for https://bugs.launchpad.net/swift/+bug/1921718 was to tolerate conflicting-inode tombstones when the new part power tombstone was linked to a tombstone in an older part power location (older than the current part power). The fix for this bug is likely to be to tolerate conflicting-inode tombstones unconditionally, provided the timestamps match.
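The proposed tolerance rule can be sketched as follows. This is a hypothetical helper, not Swift's actual relinker code: it assumes that a tombstone's filename encodes its timestamp, so two tombstones with the same basename are equivalent even when their inodes differ. Any other EEXIST (for example a conflicting `.data` file) is still re-raised.

```python
import errno
import os
import tempfile


def relink_tolerant(src: str, dst: str) -> None:
    """Hard-link src to dst, tolerating an equivalent existing tombstone.

    Hypothetical helper (not Swift's API). An existing dst is accepted if it
    is already the same inode as src, or if both are tombstones with the same
    filename (i.e. the same timestamp). Other EEXIST errors are re-raised.
    """
    try:
        os.link(src, dst)
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise
        if os.path.samefile(src, dst):
            return  # already relinked by an earlier pass
        same_name = os.path.basename(src) == os.path.basename(dst)
        if same_name and src.endswith(".ts"):
            return  # same-timestamp tombstone: equivalent, so tolerate it
        raise  # re-raise the original EEXIST


def demo() -> bool:
    """Two independent same-named tombstones: the relink should not raise."""
    with tempfile.TemporaryDirectory() as root:
        old_dir, new_dir = os.path.join(root, "old"), os.path.join(root, "new")
        os.makedirs(old_dir)
        os.makedirs(new_dir)
        name = "1624000000.00000.ts"
        for d in (old_dir, new_dir):
            open(os.path.join(d, name), "wb").close()  # distinct inodes
        relink_tolerant(os.path.join(old_dir, name), os.path.join(new_dir, name))
        return True
```

Keying the check on the filename keeps it cheap: no search of other part power locations is needed, only a comparison of the two paths already in hand.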

OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/swift/+/798849

Changed in swift:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/c/openstack/swift/+/798849
Committed: https://opendev.org/openstack/swift/commit/574897ae275ce94257096c56a9bdc494bc0a39ba
Submitter: "Zuul (22348)"
Branch: master

commit 574897ae275ce94257096c56a9bdc494bc0a39ba
Author: Alistair Coles <email address hidden>
Date: Wed Jun 30 14:05:23 2021 +0100

    relinker: tolerate existing tombstone with same timestamp

    It is possible for the current and next part power locations to
    both have existing tombstones with different inodes when the
    relinker tries to relink. This can be caused, for example, by
    concurrent reconciler DELETEs that specify the same timestamp.

    The relinker previously failed to relink and reported an error when
    encountering this situation. With this patch the relinker will
    tolerate an existing tombstone with the same filename but different
    inode in the next part power location.

    Since [1] the relinker had special case handling for EEXIST errors
    caused by a different inode tombstone already existing in the next
    partition power location: the relinker would check to see if the
    existing next part power tombstone linked to a tombstone in a previous
    part power (i.e. < current part power) location, and if so tolerate
    the EEXIST.

    This special case handling is no longer necessary because the relinker
    will now tolerate an EEXIST when linking a tombstone provided the two
    files have the same timestamp. There is therefore no need to search
    previous part power locations for a tombstone that does link with the
    next part power location.

    The link_check_limit is no longer used but the --link-check-limit
    command line option is still allowed (although ignored) for backwards
    compatibility.

    [1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c

    Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
    Closes-Bug: #1934142
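The backwards-compatibility note about `--link-check-limit` follows a common CLI pattern: keep the option parseable so that existing invocations do not break, but never read its value. The sketch below is a hypothetical argparse illustration of that pattern only; the program name and help text are assumptions, not the relinker's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="relinker-sketch")  # hypothetical name
    # Kept only so existing command lines still parse; the value is never used.
    parser.add_argument(
        "--link-check-limit",
        type=int,
        default=None,
        help="deprecated: accepted for backwards compatibility but ignored",
    )
    return parser


args = build_parser().parse_args(["--link-check-limit", "2"])
```

Old scripts that pass the option keep working, while the code that previously consumed the limit can be deleted outright.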

Changed in swift:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.28.0

This issue was fixed in the openstack/swift 2.28.0 release.
