SSYNC: Race condition in replication/reconstruction can lead to loss of datafile

Bug #1897177 reported by Romain LE DISEZ
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

We discovered that, after a rebalance, some replicas or fragments were missing from the cluster.
We enabled debug logging and added extra logging around all calls to rmdir/rmtree.

Here is an example of a fragment that disappeared during a rebalance.

partition: 222495
from: 172.16.0.38/disk-00-019 (OLD primary)
to: 172.16.0.126/disk-00-019 (NEW primary)
object: 222495/30e/d947def17788ea288999ec0ae76b330e/1574430074.25434#4#d.data
            AUTH_xxx/container/object.ext (obfuscated)

OLD 2020-09-24T15:18:16 obj-reconstructor Ring change detected. Aborting current reconstruction pass.
NEW 2020-09-24T15:18:37 obj-server 172.16.0.38 - - [24/Sep/2020:15:18:37 +0000] "SSYNC /disk-00-019/222495" 200 - "-" "-" "-" 0.0003 "-" 10058 0
NEW 2020-09-24T15:20:32 obj-server - - - [24/Sep/2020:15:20:32 +0000] "PUT /disk-00-019/222495/AUTH_xxx/container/object.ext" 201 - "-" "-" "-" 1.4060 "-" 10058 0

OLD 2020-09-24T15:22:09 obj-server 172.16.0.126 - - [24/Sep/2020:15:22:09 +0000] "SSYNC /disk-00-019/222495" 200 - "-" "-" "-" 0.0003 "-" 15433 0
OLD 2020-09-24T15:22:09 obj-server 172.16.0.126 - - [24/Sep/2020:15:22:09 +0000] "REPLICATE /disk-00-019/222495/028-[...]-30e-[...]-fff" 200 10538 "-" "-" "obj-reconstructor 10197" 0.0927 "-" 15433 0
NEW 2020-09-24T15:22:10 obj-reconstructor rmdir(/srv/node/disk-00-019/objects/222495/30e/d947def17788ea288999ec0ae76b330e)
NEW 2020-09-24T15:24:13 obj-reconstructor Ring change detected. Aborting current reconstruction pass.

OLD 2020-09-24T15:24:45 obj-reconstructor rmdir(/srv/node/disk-00-019/objects/222495/30e/d947def17788ea288999ec0ae76b330e)
NEW 2020-09-24T15:24:45 obj-server 172.16.0.38 - - [24/Sep/2020:15:24:45 +0000] "REPLICATE /disk-00-019/222495/028-[...]-30e-[...]-fff" 200 4547 "-" "-" "obj-reconstructor 15498" 0.7376 "-" 10058 0

In this cluster, the distribution of a new ring can take up to 30 minutes.

The log extract shows that, while the old primary is reverting the partition to the new primary, the new primary still has the old ring, so it also tries to revert the partition (to the old primary).

This leads to the following (sketched in the simulation below):
- the new primary deletes the fragments of the partition it just received from the old primary, because the old primary appears to be in sync
- the old primary deletes all of its fragments, because the new primary confirmed the PUT succeeded
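
The double revert can be reproduced with plain directories. Below is a minimal, self-contained simulation (illustrative only: real Swift moves data with SSYNC over HTTP, and the revert() helper and file names here are invented for the sketch):

    import os
    import shutil
    import tempfile

    def revert(src, dst):
        # Crude stand-in for "SSYNC the partition to its primary, then
        # delete the local copy once the remote is in sync".
        os.makedirs(dst, exist_ok=True)
        for name in os.listdir(src):
            shutil.copy(os.path.join(src, name), os.path.join(dst, name))
        shutil.rmtree(src)

    root = tempfile.mkdtemp()
    old = os.path.join(root, 'old_primary', '222495')  # has the new ring
    new = os.path.join(root, 'new_primary', '222495')  # still on the old ring
    os.makedirs(old)
    open(os.path.join(old, 'fragment.data'), 'w').close()

    revert(old, new)    # OLD reverts to NEW: fragment now lives only on NEW
    revert(new, old)    # NEW, trusting the old ring, reverts right back
    shutil.rmtree(old)  # OLD finishes its pass and removes the partition it
                        # believes it already handed off

    print(os.path.exists(old), os.path.exists(new))  # False False: data lost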

Revision history for this message
Romain LE DISEZ (rledisez) wrote :

I'm thinking that the replicator/reconstructor should lock the partition so that an incoming SSYNC would fail. It would avoid such a race condition.
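
A minimal sketch of that idea, assuming a per-partition lock file (the partition_lock helper and the '.lock' file name are invented here, not Swift's actual implementation): the replicator/reconstructor takes the lock non-blocking for the whole revert, and an incoming SSYNC that cannot get it fails fast.

    import errno
    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def partition_lock(partition_path):
        """Yield True if the lock was taken, False if someone else holds it."""
        fd = os.open(os.path.join(partition_path, '.lock'),
                     os.O_CREAT | os.O_WRONLY)
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as err:
                if err.errno in (errno.EACCES, errno.EAGAIN):
                    yield False   # busy: the caller backs off (e.g. 503)
                    return
                raise
            yield True            # exclusive: safe to replicate/revert
        finally:
            os.close(fd)          # closing the fd drops the flock

Because flock is released when the fd is closed (or the process dies), a crashed replicator cannot leave the partition locked forever.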

Revision history for this message
clayg (clay-gerrard) wrote :

yeah a lock, or even just something with their ring version

I don't think old primaries need to be doing a suffix rehash of a part they're about to ship off regardless!

Changed in swift:
importance: Undecided → High
summary: - Race condition in replication/reconstruction can lead to loss of datafile
+ SSYNC: Race condition in replication/reconstruction can lead to loss of datafile
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/754242
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=8c0a1abf744a11b5c289239e3ac830786a9de4e9
Submitter: Zuul
Branch: master

commit 8c0a1abf744a11b5c289239e3ac830786a9de4e9
Author: Romain LE DISEZ <email address hidden>
Date: Thu Sep 24 20:36:36 2020 -0400

    Fix a race condition in case of cross-replication

    In a situation where two nodes do not have the same version of the ring
    and they both think the other node is the primary of a partition, a race
    condition can lead to the loss of some of the objects of the partition.

    The following sequence leads to the loss of some of the objects:

      1. A gets and reloads the new ring
      2. A starts to replicate/revert the partition P to node B
      3. B (with the old ring) starts to replicate/revert the (partial)
         partition P to node A
         => replication should be fast as all objects are already on node A
      4. B finishes replication of (partial) partition P to node A
      5. B removes the (partial) partition P after replication succeeded
      6. A finishes replication of partition P to node B
      7. A removes the partition P
      8. B gets and reloads the new ring

    All data transferred between steps 2 and 5 is lost: it is no longer on
    node B, and it has also been removed from node A.

    This commit makes the replicator/reconstructor hold a replication_lock
    on partition P so that the remote node cannot start an opposite
    replication.

    Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
    Closes-Bug: #1897177
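
For illustration, a hedged sketch of how both sides could use such a lock, reusing the partition_lock helper sketched in the comments above (ssync_to_remote, remove_partition and the handler are stand-ins, not Swift's actual call sites):

    PART_PATH = '/srv/node/disk-00-019/objects/222495'

    def ssync_to_remote(path):   # stand-in for the SSYNC sender
        pass

    def remove_partition(path):  # stand-in for the post-revert cleanup
        pass

    def revert_pass():
        # Sender side: hold the lock across transfer *and* cleanup, so a
        # remote revert cannot interleave between "in sync" and "removed".
        with partition_lock(PART_PATH) as locked:
            if not locked:
                return            # partition busy; retry on the next pass
            ssync_to_remote(PART_PATH)
            remove_partition(PART_PATH)

    def handle_incoming_ssync():
        # Receiver side: a held lock means the local daemon owns this
        # partition right now, so the remote's revert is refused.
        with partition_lock(PART_PATH) as locked:
            if not locked:
                return 503        # remote aborts; no cross-replication
            return 200            # otherwise accept the SSYNC stream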

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.27.0

This issue was fixed in the openstack/swift 2.27.0 release.
