container-sync needs to be more parallel

Bug #1068426 reported by Faidon Liambotis on 2012-10-19
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Assigned to: Eran Rom

Bug Description

The way container sync works right now, it syncs one object at a time, which is hugely inefficient in large containers. Moreover, running multiple sync processes from multiple container servers actually makes it *worse*, since they try to sync the very same files that their sibling just synced (because the state remains in the not-yet-rsynced sqlite databases).

It'd be nice if the sync process had some way to parallelize the problem and sync multiple files at the same time for efficiency reasons.

Samuel Merritt (torgomatic) wrote :

I'm confused about which container-sync processes are doing duplicate work. Are you saying

(a) when running multiple swift-container-sync processes on a single node, they operate on the containers in the same order and therefore do duplicate work,

(b) when running two swift-container-sync processes on two separate nodes, they duplicate each others' work, or

(c) all of the above?

Faidon Liambotis (paravoid) wrote :

It's (b), and I observed it when running the sync processes shortly after one another. My understanding is that the synchronization state is all within the container databases, and these are synced via the normal replication processes, so the timing is probably affecting this.

Samuel Merritt (torgomatic) wrote :

This bug confuses me. (This is not a difficult task.)

I've spent some time reading the container-sync code, and here's how I think it works:

Each container DB has two sync-points SP1 and SP2. They're stored as "x_container_sync_point1" and "x_container_sync_point2", but I'm going to call them SP1 and SP2 to save keystrokes.

Initially, SP1 = SP2 = -1.

When container sync runs, the first thing it does is figure out which replica of the container it has, and names that "ordinal". If there are 3 replicas, then ordinal is either 0, 1, or 2. Each container-sync process will run on a different machine, so no two container-sync processes will ever have the same container *and* the same ordinal together (i.e. the same (container, ordinal) pair).
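A rough sketch of the ordinal idea in Python (names are mine, not Swift's; the real code looks the node up in the container ring):

```python
def find_ordinal(replica_nodes, my_node_id):
    """replica_nodes: the ring's ordered node list for this container's
    partition. Our node's index in that list is the ordinal, so each
    (container, ordinal) pair exists on exactly one machine."""
    for i, node in enumerate(replica_nodes):
        if node == my_node_id:
            return i
    return None  # this node holds no replica of the container
```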

Then, for every row where SP2 < row-id < SP1 and hash(object) % 3 != ordinal, it syncs that row to the remote server. That not-equals is kind of weird at first glance, but the effect is that it syncs up the rows that it didn't do last time. After it's done here, SP2 = SP1.

The third and final thing the container-sync process does is sync rows where SP1 < row-id and hash(object) % 3 == ordinal. After it's done, SP1 is advanced to the newest row-id.

So essentially there are 3 categories of row: new rows, old rows, and ancient rows.

                 SP2          SP1
                  |            |
  0 --------------+------------+--------- newest-row
        ancient        old        new

The first thing that happens is to sync old rows that this process is *not* responsible for (hash(object) % 3 != ordinal), after which SP2 = SP1:

                              SP2
                              SP1
                               |
  0 ---------------------------+--------- newest-row
            ancient               new

Then we sync new rows that this process *is* responsible for (hash(object) % 3 == ordinal):

                              SP2      SP1
                               |        |
  0 ---------------------------+--------+ newest-row
            ancient               old

And of course, as container contents change, new rows get appended on the right:

                              SP2      SP1
                               |        |
  0 ---------------------------+--------+----- newest-row
            ancient               old     new
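The two-pass scheme above can be sketched in Python (a simplified model with hypothetical names; real Swift hashes the row's object name via md5/unpack_from, and the sync points live in the container DB):

```python
from hashlib import md5

REPLICAS = 3

def responsible(name, ordinal):
    """True if the replica with this ordinal 'owns' this object name."""
    return int(md5(name.encode()).hexdigest(), 16) % REPLICAS == ordinal

def sync_cycle(rows, sp1, sp2, ordinal, send):
    """One container-sync cycle. rows: (row_id, object_name) pairs
    ordered by row_id; send: callback that pushes a row remotely."""
    # Pass 1: old rows (SP2 < row-id <= SP1) this replica is NOT
    # responsible for, i.e. catch up on what siblings did last cycle.
    for row_id, name in rows:
        if sp2 < row_id <= sp1 and not responsible(name, ordinal):
            send(name)
    sp2 = sp1  # SP2 catches up to SP1
    # Pass 2: new rows (row-id > SP1) this replica IS responsible for.
    for row_id, name in rows:
        if row_id > sp1 and responsible(name, ordinal):
            send(name)
    if rows:
        sp1 = rows[-1][0]  # SP1 advances to the newest row-id
    return sp1, sp2
```

Run on all three replicas, every row is pushed by exactly one replica in the new-row pass, and each replica back-fills its siblings' shares one cycle later.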

I wrote all that mainly so you or anyone else can tell me where I've misunderstood things.

Without knowing much about your setup, I suspect you've got a small number of containers with sync enabled. If there were a large number of sync-enabled containers, then different sync processes would tend to end up working on different containers at a given time by virtue of how the containers are distributed across machines.

When you get collisions, I'm guessing that the duplicate work you see is the old-row syncing. If the new-row syncing takes a long time for sync-process 1 but not for sync-process 2, then sync-process 2 will run through its new work quickly, and then on the next cycle will start syncing old rows that processes 1 and 3 should have done. It'll race through the already-synced stuff pretty quickly and then reach the rows that sync-process 1 is working on, and that's...


Samuel Merritt (torgomatic) wrote :

Ugh, formatting fail. The diagrams are actually readable here:

Faidon Liambotis (paravoid) wrote :

Thanks a lot for the very detailed analysis! It is confusing indeed, and your explanations help a lot.

This was on my trials with container sync, where I only had a single container set up for syncing... So, your guess was spot on, apologies for not mentioning earlier what turned out to be an important fact.

Chuck Thier (cthier) on 2012-12-04
Changed in swift:
importance: Undecided → Wishlist
Chuck Thier (cthier) on 2012-12-04
Changed in swift:
status: New → Triaged
David Hadas (david-hadas) wrote :

Unless I missed something here, I suggest removing this from the wish list.

It seems parallelism across multiple containers is reasonably supported.
Faidon, is there a justification for single-container parallelism?

Filippo Giunchedi (filippo) wrote :

(I'm currently working with Faidon, specifically on container-sync for our swift deployment)

I think more parallelism would help regardless when processing many containers. Something like this (not an actual patch, just a sketch to explain the idea):

--- a/swift/container/
+++ b/swift/container/
@@ -21,6 +21,7 @@ from random import choice, random, shuffle
 from struct import unpack_from

 from eventlet import sleep, Timeout
+from eventlet.greenpool import GreenPool

 import swift.common.db
 from swift.container.backend import ContainerBroker, DATADIR
@@ -157,6 +158,7 @@ class ContainerSync(Daemon):
         self._myport = int(conf.get('bind_port', 6001))
         swift.common.db.DB_PREALLOCATION = \
             config_true_value(conf.get('db_preallocation', 'f'))
+        self.pool = GreenPool(int(conf.get('concurrency', 1)))

     def get_object_ring(self, policy_idx):
@@ -178,9 +180,10 @@ class ContainerSync(Daemon):
             for path, device, partition in all_locs:
-                self.container_sync(path)
+                self.pool.spawn_n(self.container_sync, path)
                 if time() - self.reported >= 3600:  # once an hour
+            self.pool.waitall()
             elapsed = time() - begin
             if elapsed < self.interval:
                 sleep(self.interval - elapsed)
@@ -195,9 +198,10 @@ class ContainerSync(Daemon):
         for path, device, partition in all_locs:
-            self.container_sync(path)
+            self.pool.spawn_n(self.container_sync, path)
             if time() - self.reported >= 3600:  # once an hour
+        self.pool.waitall()
         elapsed = time() - begin
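The spawn-and-wait pattern in the sketch (one GreenPool.spawn_n per container, waitall at the end of the sweep) can be mimicked with the stdlib for anyone without eventlet handy; OS threads here stand in for green threads, and the names are mine:

```python
from concurrent.futures import ThreadPoolExecutor

def sync_sweep(all_locs, container_sync, concurrency=4):
    """Fan each container path out to a worker, then wait for the
    whole batch -- the moral equivalent of pool.spawn_n(...) followed
    by pool.waitall() in the diff above."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for path in all_locs:
            pool.submit(container_sync, path)
    # leaving the `with` block joins all workers (the waitall step)
```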

John Dickinson (notmyname) wrote :

Basically, the exact same comment as on

marking this as invalid because of age. It's true that container sync should be "better", but there's a lot that's happened in swift over the past 2 years since this was filed that we can learn from (like the container reconciler).

Changed in swift:
status: Triaged → Invalid
Eran Rom (eranr) on 2015-07-08
Changed in swift:
assignee: nobody → Eran Rom (eranr)
Eran Rom (eranr) wrote :

Moving to in progress following previous IRC meeting.

Changed in swift:
status: Invalid → In Progress

Fix proposed to branch: master

Change abandoned by Alistair Coles (<email address hidden>) on branch: feature/crypto-review
Reason: ok, I will abandon this one in favour of the patches on master
