Bug #1068426 “container-sync needs to be more parallel” : Bugs : OpenStack Object Storage (swift)

Revision history for this message

Samuel Merritt (torgomatic) wrote on 2012-11-20:

#1

I'm confused about which container-sync processes are doing duplicate work. Are you saying

(a) when running multiple swift-container-sync processes on a single node, they operate on the containers in the same order and therefore do duplicate work,

(b) when running two swift-container-sync processes on two separate nodes, they duplicate each others' work, or

(c) all of the above?

Revision history for this message

Faidon Liambotis (paravoid) wrote on 2012-11-20:

#2

It's (b), and I observed it with running sync processes short after each other. My understanding is that the synchronization state is all within the container databases, and these are synced via the normal replication processes, so the timing is probably affecting this.

Revision history for this message

Samuel Merritt (torgomatic) wrote on 2012-11-21:

#3

Download full text (3.7 KiB)

This bug confuses me. (This is not a difficult task.)

I've spent some time reading the container-sync code, and here's how I think it works:

Each container DB has two sync-points SP1 and SP2. They're stored as "x_container_sync_point1" and "x_container_sync_point2", but I'm going to call them SP1 and SP2 to save keystrokes.

Initially, SP1 = SP2 = -1.

When container sync runs, the first thing it does is figure out which replica of the container it has, and names that "ordinal". If there are 3 replicas, then ordinal is either 0, 1, or 2. Each container-sync process will run on a different machine, so no two container-sync processes will ever have the same container *and* the same ordinal together (i.e. the same (container, ordinal)) pair.

Then, for every row where SP2 < row-id < SP1 and hash(container) % 3 != ordinal, it syncs that row to the remote server. That not-equals is kind of weird at first glance, but the effect is that it syncs up the rows that it didn't do last time. After it's done here, SP2 = SP1.

The third and final thing the container-server does is sync rows where SP1 < row-id and hash(object) % 3 == ordinal. After it's done, SP1 is advanced to be the newest row-id.

So essentially there are 3 categories of row: new rows, old rows, and ancient rows.

                 SP2 SP1
                  | |
  0 --------------+------------+--------- newest-row
       ancient old new

The first thing that happens is to sync old rows that this process is *not* responsible for (hash(object) != ordinal):

                              SP2
                              SP1
                               |
  0 ---------------------------+--------- newest-row
            ancient new

Then we sync new rows that this process *is* responsible for (hash(object) == ordinal):

                              SP2 SP1
                               | |
  0 ---------------------------+--------+ newest-row
            ancient old

And of course, as container contents change, new rows get appended on the right:

                              SP2 SP1
                               | |
  0 ---------------------------+--------+----- newest-row
            ancient old new

I wrote all that mainly so you or anyone else can tell me where I've misunderstood things.

Without knowing much about your setup, I suspect you've got a small number of containers with sync enabled. If there were a large number of sync-enabled containers, then different sync processes would tend to end up working on different containers at a given time by virtue of how the containers are distributed across machines.

When you get collisions, I'm guessing that the duplicate work you see is the old-row syncing. If the new-row syncing takes a long time for sync-process 1 but not for sync-process 2, then sync-process 2 will run through its new work quickly, and then on the next cycle will start syncing old rows that processes 1 and 3 should have done. It'll race through the already-synced stuff pretty quickly and then reach the rows that sync-process 1 is working on, and that's...

This bug confuses me. (This is not a difficult task.)

I've spent some time reading the container-sync code, and here's how I think it works:

Each container DB has two sync-points SP1 and SP2. They're stored as "x_container_sync_point1" and "x_container_sync_point2", but I'm going to call them SP1 and SP2 to save keystrokes.

Initially, SP1 = SP2 = -1.

When container sync runs, the first thing it does is figure out which replica of the container it has, and names that "ordinal". If there are 3 replicas, then ordinal is either 0, 1, or 2. Each container-sync process will run on a different machine, so no two container-sync processes will ever have the same container *and* the same ordinal together (i.e. the same (container, ordinal)) pair.

Then, for every row where SP2 < row-id < SP1 and hash(container) % 3 != ordinal, it syncs that row to the remote server. That not-equals is kind of weird at first glance, but the effect is that it syncs up the rows that it didn't do last time. After it's done here, SP2 = SP1.

The third and final thing the container-server does is sync rows where SP1 < row-id and hash(object) % 3 == ordinal. After it's done, SP1 is advanced to be the newest row-id.

So essentially there are 3 categories of row: new rows, old rows, and ancient rows.

SP2          SP1
                  |            |
  0 --------------+------------+--------- newest-row
       ancient         old          new

The first thing that happens is to sync old rows that this process is *not* responsible for (hash(object) != ordinal):

SP2
                              SP1
                               |
  0 ---------------------------+--------- newest-row
            ancient                new

Then we sync new rows that this process *is* responsible for (hash(object) == ordinal):

SP2      SP1
                               |        |
  0 ---------------------------+--------+ newest-row
            ancient                old

And of course, as container contents change, new rows get appended on the right:

SP2      SP1
                               |        |
  0 ---------------------------+--------+----- newest-row
            ancient                old    new

I wrote all that mainly so you or anyone else can tell me where I've misunderstood things.

Without knowing much about your setup, I suspect you've got a small number of containers with sync enabled. If there were a large number of sync-enabled containers, then different sync processes would tend to end up working on different containers at a given time by virtue of how the containers are distributed across machines.

When you get collisions, I'm guessing that the duplicate work you see is the old-row syncing. If the new-row syncing takes a long time for sync-process 1 but not for sync-process 2, then sync-process 2 will run through its new work quickly, and then on the next cycle will start syncing old rows that processes 1 and 3 should have done. It'll race through the already-synced stuff pretty quickly and then reach the rows that sync-process 1 is working on, and that's when you get duplicate work.

At large scale, the wide distribution of containers and relatively large numbers of synced containers will reduce the amount of duplicate work to a very small fraction of the overall work. At small scale, there'll be more collisions. In the limit as scale gets really small, if there's only one sync-enabled container in the whole cluster, then you'll end up with duplicate work all the time.

I don't have any good ideas on how to fix this right now. Maybe someone who understands the problem better can chime in.

Revision history for this message

Samuel Merritt (torgomatic) wrote on 2012-11-21:

#4

Ugh, formatting fail. The diagrams are actually readable here: http://paste.openstack.org/raw/26244/

Revision history for this message

Faidon Liambotis (paravoid) wrote on 2012-11-21:

#5

Thanks a lot for the very detailed analysis! It is confusing indeed, and your explanations help a lot.

This was on my trials with container sync, where I only had a single container set up for syncing... So, your guess was spot on, apologies for not mentioning earlier what turned out to be an important fact.

Chuck Thier (cthier) on 2012-12-04

Changed in swift:
importance:	Undecided → Wishlist

Chuck Thier (cthier) on 2012-12-04

Changed in swift:
status:	New → Triaged

Revision history for this message

David Hadas (david-hadas) wrote on 2013-02-26:

#6

Unless I missed something here, I suggest to remove this from the wish list.

Seems like parallelism is reasonably supported for multiple containers.
Faidon, is there justification for achieving a single container parallelism ?

Revision history for this message

Filippo Giunchedi (filippo) wrote on 2014-10-24:

#7

(I'm currently working with Faidon, specifically on container-sync for our swift deployment)

I think more parallelism would help regardless when processing many containers, something like this (not an actual patch, just a sketch to explain the idea)

--- a/swift/container/sync.py
+++ b/swift/container/sync.py
@@ -21,6 +21,7 @@ from random import choice, random, shuffle
from struct import unpack_from

from eventlet import sleep, Timeout
+from eventlet.greenpool import GreenPool

import swift.common.db
from swift.container.backend import ContainerBroker, DATADIR
@@ -157,6 +158,7 @@ class ContainerSync(Daemon):
         self._myport = int(conf.get('bind_port', 6001))
         swift.common.db.DB_PREALLOCATION = \
             config_true_value(conf.get('db_preallocation', 'f'))
+ self.pool = GreenPool(int(conf.get('concurrency', 1)))

     def get_object_ring(self, policy_idx):
         """
@@ -178,9 +180,10 @@ class ContainerSync(Daemon):
                                                 mount_check=self.mount_check,
                                                 logger=self.logger)
             for path, device, partition in all_locs:
- self.container_sync(path)
+ self.pool.spawn_n(self.container_sync, path)
                 if time() - self.reported >= 3600: # once an hour
                     self.report()
+ self.pool.waitall()
             elapsed = time() - begin
             if elapsed < self.interval:
                 sleep(self.interval - elapsed)
@@ -195,9 +198,10 @@ class ContainerSync(Daemon):
                                             mount_check=self.mount_check,
                                             logger=self.logger)
         for path, device, partition in all_locs:
- self.container_sync(path)
+ self.pool.spawn_n(self.container_sync, path)
             if time() - self.reported >= 3600: # once an hour
                 self.report()
+ self.pool.waitall()
         self.report()
         elapsed = time() - begin
         self.logger.info(
--
2.1.1

(I'm currently working with Faidon, specifically on container-sync for our swift deployment)

I think more parallelism would help regardless when processing many containers, something like this (not an actual patch, just a sketch to explain the idea)

--- a/swift/container/sync.py
+++ b/swift/container/sync.py
@@ -21,6 +21,7 @@ from random import choice, random, shuffle
 from struct import unpack_from
 
 from eventlet import sleep, Timeout
+from eventlet.greenpool import GreenPool
 
 import swift.common.db
 from swift.container.backend import ContainerBroker, DATADIR
@@ -157,6 +158,7 @@ class ContainerSync(Daemon):
         self._myport = int(conf.get('bind_port', 6001))
         swift.common.db.DB_PREALLOCATION = \
             config_true_value(conf.get('db_preallocation', 'f'))
+        self.pool = GreenPool(int(conf.get('concurrency', 1)))
 
     def get_object_ring(self, policy_idx):
         """
@@ -178,9 +180,10 @@ class ContainerSync(Daemon):
                                                 mount_check=self.mount_check,
                                                 logger=self.logger)
             for path, device, partition in all_locs:
-                self.container_sync(path)
+                self.pool.spawn_n(self.container_sync, path)
                 if time() - self.reported >= 3600:  # once an hour
                     self.report()
+            self.pool.waitall()
             elapsed = time() - begin
             if elapsed < self.interval:
                 sleep(self.interval - elapsed)
@@ -195,9 +198,10 @@ class ContainerSync(Daemon):
                                             mount_check=self.mount_check,
                                             logger=self.logger)
         for path, device, partition in all_locs:
-            self.container_sync(path)
+            self.pool.spawn_n(self.container_sync, path)
             if time() - self.reported >= 3600:  # once an hour
                 self.report()
+        self.pool.waitall()
         self.report()
         elapsed = time() - begin
         self.logger.info(
-- 
2.1.1

Revision history for this message

John Dickinson (notmyname) wrote on 2015-06-03:

#8

Basically, the exact same comment as on https://bugs.launchpad.net/swift/+bug/1068425

marking this as invalid because of age. It's true that container sync should be "better", but there's a lot that's happened in swift over the pas 2 years since this was filed that we can learn from (like the container reconciler).

Changed in swift:
status:	Triaged → Invalid

Eran Rom (eranr) on 2015-07-08

Changed in swift:
assignee:	nobody → Eran Rom (eranr)

Revision history for this message

Eran Rom (eranr) wrote on 2015-07-21:

#9

Moving to in progress following previous IRC meeting.

Changed in swift:
status:	Invalid → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-08-06: Fix proposed to swift (master)

#10

Fix proposed to branch: master
Review: https://review.openstack.org/210099

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-09-18:

#11

Fix proposed to branch: master
Review: https://review.openstack.org/225338

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-22: Fix proposed to swift (feature/crypto-review)

#12

Fix proposed to branch: feature/crypto-review
Review: https://review.openstack.org/332985

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-28: Change abandoned on swift (feature/crypto-review)

#13

Change abandoned by Alistair Coles (<email address hidden>) on branch: feature/crypto-review
Review: https://review.openstack.org/332985
Reason: ok, I will abandon this one in favour of the patches on master

OpenStack Object Storage (swift)

container-sync needs to be more parallel

Bug Description

Other bug subscribers

Remote bug watches