SSYNC: Race condition in replication/reconstruction can lead to loss of datafile

Bug #1897177 reported by Romain LE DISEZ
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

We discovered that, after a rebalance, some replicas or fragments were missing from the cluster.
We enabled debug logging and added extra logging around all calls to rmdir/rmtree.

Here is an example of a fragment that disappeared during a rebalance.

partition: 222495
from: 172.16.0.38/disk-00-019 (OLD primary)
to: 172.16.0.126/disk-00-019 (NEW primary)
object: 222495/30e/d947def17788ea288999ec0ae76b330e/1574430074.25434#4#d.data
            AUTH_xxx/container/object.ext (obfuscated)

OLD 2020-09-24T15:18:16 obj-reconstructor Ring change detected. Aborting current reconstruction pass.
NEW 2020-09-24T15:18:37 obj-server 172.16.0.38 - - [24/Sep/2020:15:18:37 +0000] "SSYNC /disk-00-019/222495" 200 - "-" "-" "-" 0.0003 "-" 10058 0
NEW 2020-09-24T15:20:32 obj-server - - - [24/Sep/2020:15:20:32 +0000] "PUT /disk-00-019/222495/AUTH_xxx/container/object.ext" 201 - "-" "-" "-" 1.4060 "-" 10058 0

OLD 2020-09-24T15:22:09 obj-server 172.16.0.126 - - [24/Sep/2020:15:22:09 +0000] "SSYNC /disk-00-019/222495" 200 - "-" "-" "-" 0.0003 "-" 15433 0
OLD 2020-09-24T15:22:09 obj-server 172.16.0.126 - - [24/Sep/2020:15:22:09 +0000] "REPLICATE /disk-00-019/222495/028-[...]-30e-[...]-fff" 200 10538 "-" "-" "obj-reconstructor 10197" 0.0927 "-" 15433 0
NEW 2020-09-24T15:22:10 obj-reconstructor rmdir(/srv/node/disk-00-019/objects/222495/30e/d947def17788ea288999ec0ae76b330e)
NEW 2020-09-24T15:24:13 obj-reconstructor Ring change detected. Aborting current reconstruction pass.

OLD 2020-09-24T15:24:45 obj-reconstructor rmdir(/srv/node/disk-00-019/objects/222495/30e/d947def17788ea288999ec0ae76b330e)
NEW 2020-09-24T15:24:45 obj-server 172.16.0.38 - - [24/Sep/2020:15:24:45 +0000] "REPLICATE /disk-00-019/222495/028-[...]-30e-[...]-fff" 200 4547 "-" "-" "obj-reconstructor 15498" 0.7376 "-" 10058 0

In this cluster, the distribution of a new ring can take up to 30 minutes.

The log extract shows that, while the old primary is reverting the partition to the new primary, the new primary still has the old ring, so it also tries to revert the partition (to the old primary).

This leads to the following (sketched in the simulation below):
- the new primary deletes the fragments of the partition it just received from the old primary, because the old primary appears to be in sync
- the old primary deletes all of its fragments, because the new primary confirmed the PUT succeeded
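
The double revert can be reproduced with plain directories. Below is a minimal, self-contained simulation (illustrative only: real Swift moves data with SSYNC over HTTP, and the revert() helper and file names here are invented for the sketch):

    import os
    import shutil
    import tempfile

    def revert(src, dst):
        # Crude stand-in for "SSYNC the partition to its primary, then
        # delete the local copy once the remote is in sync".
        os.makedirs(dst, exist_ok=True)
        for name in os.listdir(src):
            shutil.copy(os.path.join(src, name), os.path.join(dst, name))
        shutil.rmtree(src)

    root = tempfile.mkdtemp()
    old = os.path.join(root, 'old_primary', '222495')  # has the new ring
    new = os.path.join(root, 'new_primary', '222495')  # still on the old ring
    os.makedirs(old)
    open(os.path.join(old, 'fragment.data'), 'w').close()

    revert(old, new)    # OLD reverts to NEW: fragment now lives only on NEW
    revert(new, old)    # NEW, trusting the old ring, reverts right back
    shutil.rmtree(old)  # OLD finishes its pass and removes the partition it
                        # believes it already handed off

    print(os.path.exists(old), os.path.exists(new))  # False False: data lost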

Revision history for this message
Romain LE DISEZ (rledisez) wrote :

I'm thinking that the replicator/reconstructor should lock the partition so that an incoming SSYNC would fail. It would avoid such a race condition.
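
A minimal sketch of that idea, assuming a per-partition lock file (the partition_lock helper and the '.lock' file name are invented here, not Swift's actual implementation): the replicator/reconstructor takes the lock non-blocking for the whole revert, and an incoming SSYNC that cannot get it fails fast.

    import errno
    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def partition_lock(partition_path):
        """Yield True if the lock was taken, False if someone else holds it."""
        fd = os.open(os.path.join(partition_path, '.lock'),
                     os.O_CREAT | os.O_WRONLY)
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as err:
                if err.errno in (errno.EACCES, errno.EAGAIN):
                    yield False   # busy: the caller backs off (e.g. 503)
                    return
                raise
            yield True            # exclusive: safe to replicate/revert
        finally:
            os.close(fd)          # closing the fd drops the flock

Because flock is released when the fd is closed (or the process dies), a crashed replicator cannot leave the partition locked forever.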

Revision history for this message
clayg (clay-gerrard) wrote :

yeah a lock, or even just something with their ring version

I don't think old primaries need to be doing a suffix rehash of a part they're about to ship off regardless!

Changed in swift:
importance: Undecided → High
summary: - Race condition in replication/reconstruction can lead to loss of datafile
+ SSYNC: Race condition in replication/reconstruction can lead to loss of datafile
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/754242
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=8c0a1abf744a11b5c289239e3ac830786a9de4e9
Submitter: Zuul
Branch: master

commit 8c0a1abf744a11b5c289239e3ac830786a9de4e9
Author: Romain LE DISEZ <email address hidden>
Date: Thu Sep 24 20:36:36 2020 -0400

    Fix a race condition in case of cross-replication

    In a situation where two nodes do not have the same version of the ring
    and they both think the other node is the primary of a partition, a race
    condition can lead to the loss of some of the objects of the partition.

    The following sequence leads to the loss of some of the objects:

      1. A gets and reloads the new ring
      2. A starts to replicate/revert the partition P to node B
      3. B (with the old ring) starts to replicate/revert the (partial)
         partition P to node A
         => replication should be fast as all objects are already on node A
      4. B finishes replication of (partial) partition P to node A
      5. B removes the (partial) partition P after replication succeeded
      6. A finishes replication of partition P to node B
      7. A removes the partition P
      8. B gets and reloads the new ring

    All data transferred between steps 2 and 5 is lost: it is no longer on
    node B, and it has also been removed from node A.

    This commit makes the replicator/reconstructor hold a replication_lock
    on partition P so that the remote node cannot start an opposite
    replication.

    Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
    Closes-Bug: #1897177
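
For illustration, a hedged sketch of how both sides could use such a lock, reusing the partition_lock helper sketched in the comments above (ssync_to_remote, remove_partition and the handler are stand-ins, not Swift's actual call sites):

    PART_PATH = '/srv/node/disk-00-019/objects/222495'

    def ssync_to_remote(path):   # stand-in for the SSYNC sender
        pass

    def remove_partition(path):  # stand-in for the post-revert cleanup
        pass

    def revert_pass():
        # Sender side: hold the lock across transfer *and* cleanup, so a
        # remote revert cannot interleave between "in sync" and "removed".
        with partition_lock(PART_PATH) as locked:
            if not locked:
                return            # partition busy; retry on the next pass
            ssync_to_remote(PART_PATH)
            remove_partition(PART_PATH)

    def handle_incoming_ssync():
        # Receiver side: a held lock means the local daemon owns this
        # partition right now, so the remote's revert is refused.
        with partition_lock(PART_PATH) as locked:
            if not locked:
                return 503        # remote aborts; no cross-replication
            return 200            # otherwise accept the SSYNC stream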

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.27.0

This issue was fixed in the openstack/swift 2.27.0 release.
