OpenStack Object Storage (swift)

Container sync does not replicate container metadata

Bug #1464022 reported by Eran Rom on 2015-06-10

This bug affects 1 person

Affects                            Status     Importance  Assigned to  Milestone
OpenStack Object Storage (swift)   Confirmed  Wishlist    Eran Rom

Bug Description

Feature Request:
The container sync functionality does not include syncing the container's metadata.
Proposed solution below:

Configuration
---------------------
Add the following under [container-sync] in container_server.conf:
# sync all x-container-meta-* items
sync_metadata = true / false
# A comma-separated list of other metadata items to sync, e.g.
sync_system_metadata_items = x-container-read, x-container-write
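
To make the proposed options concrete, here is a minimal sketch of how they might be read. It assumes the list is written as plain comma-separated values, and it uses Python's standard configparser as a stand-in for Swift's own config loading. The function name load_sync_options is hypothetical.

```python
import configparser

# Sample config text using the option names proposed in this report.
SAMPLE_CONF = """
[container-sync]
sync_metadata = true
sync_system_metadata_items = x-container-read, x-container-write
"""


def load_sync_options(conf_text):
    """Return (sync_metadata, system_items) from a [container-sync] section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["container-sync"]
    # Default to not syncing metadata when the option is absent.
    sync_metadata = section.getboolean("sync_metadata", fallback=False)
    raw_items = section.get("sync_system_metadata_items", fallback="")
    # Split on commas, normalise case, and drop empty entries.
    system_items = [
        item.strip().lower() for item in raw_items.split(",") if item.strip()
    ]
    return sync_metadata, system_items


sync_metadata, system_items = load_sync_options(SAMPLE_CONF)
```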

Sync Process
-------------------
1. In ContainerSync.container_sync(), attempt to sync the metadata just before the loop that calls container_sync_row().
2. Given the metadata JSON kept in the container info table, proceed as follows:
  a. Group the metadata items according to their timestamp (items are: <key,(timestamp, value)>)
  b. For each group, issue a POST to the remote cluster carrying the group's metadata items, with the group's timestamp as the x-timestamp header.
  c. As an optimization, we can keep a 'last_metadata_sync_timestamp' variable in the container info table and use it as follows:
     i. Sort the groups by timestamp and send the POSTs in that order
     ii. After each POST, update 'last_metadata_sync_timestamp' to reflect the sent POST
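
The grouping in step 2 could be sketched roughly as follows. This is only an illustration: it takes the <key,(timestamp, value)> layout as described above, assumes timestamps are fixed-width strings so they sort correctly as text, and uses hypothetical helper names (group_by_timestamp, build_posts). It builds the header sets for the POSTs without sending anything.

```python
from collections import defaultdict


def group_by_timestamp(metadata):
    """Group {key: (timestamp, value)} into [(timestamp, {key: value})], oldest first."""
    groups = defaultdict(dict)
    for key, (timestamp, value) in metadata.items():
        groups[timestamp][key] = value
    # Timestamps are unique dict keys, so sorting never compares the dicts.
    return sorted(groups.items())


def build_posts(metadata):
    """Return one header dict per timestamp group, in increasing timestamp order."""
    posts = []
    for timestamp, items in group_by_timestamp(metadata):
        headers = dict(items)
        headers["x-timestamp"] = timestamp
        posts.append(headers)
    return posts


sample = {
    "x-container-meta-color": ("1434000000.00000", "blue"),
    "x-container-meta-owner": ("1434000000.00000", "alice"),
    "x-container-read": ("1434000500.00000", ".r:*"),
}
posts = build_posts(sample)
```

With this sample, the two user metadata items share a timestamp and travel in one POST, while the ACL goes in a second, later POST.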

Suggested steps / patches
---------------------------------------
1. Adding last_metadata_sync_timestamp to container info table
swift/container/backend.py
test/unit/container/test_backend.py

2. Resetting last_metadata_sync_timestamp together with the x_container_sync_points
swift/container/server.py
test/unit/container/test_server.py

3. Add a post_container wrapper to SimpleClient in swift/common/internal_client.py

4. Adding the actual sync functionality
   swift/container/sync.py
   test/unit/container/test_sync.py
   functional / probe tests

Editor's note: steps 1 and 2 account for by far most of the code (container info table changes, schema migration, and tests), so the optimization is perhaps not worth the effort.

Revision history for this message

Eran Rom (eranr) wrote on 2015-06-17:

Two Comments:

1. Missing from the optimization above: if we do maintain 'last_metadata_sync_timestamp', then we should check all metadata item timestamps against it before issuing any POST to the remote container. Thus, the sync process proceeds as follows:
  1. Get the 'last_metadata_sync_timestamp' from the info table
  2. Given the metadata JSON kept in the container info table, proceed as follows:
    a. Keep only the metadata items whose timestamp > last_metadata_sync_timestamp (older items have already been synced)
    b. Group the remaining metadata items according to their timestamp (items are: <key,(timestamp, value)>), and sort the groups in increasing order of timestamp
    c. For each group, issue a POST to the remote cluster carrying the group's metadata items, with the group's timestamp as the x-timestamp header.
    d. After each successful POST, update 'last_metadata_sync_timestamp' to reflect the sent POST
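
A rough sketch of this revised process is shown below, under a few assumptions: timestamps are plain floats for readability, only items newer than the sync point are sent, and send_post is a hypothetical callable that returns True when the remote POST succeeds. Persisting the returned sync point in the info table is left out.

```python
from collections import defaultdict


def sync_metadata(metadata, last_sync_ts, send_post):
    """Send unsynced metadata groups oldest first and return the new sync point.

    metadata: {key: (timestamp, value)}
    send_post: hypothetical callable taking a headers dict, True on success
    """
    groups = defaultdict(dict)
    for key, (timestamp, value) in metadata.items():
        if timestamp > last_sync_ts:  # items at or before the sync point are skipped
            groups[timestamp][key] = value
    for timestamp in sorted(groups):
        headers = dict(groups[timestamp])
        headers["x-timestamp"] = str(timestamp)
        if not send_post(headers):
            break  # stop here so later groups are retried on the next pass
        last_sync_ts = timestamp  # advance only after a successful POST
    return last_sync_ts


sent = []


def record_post(headers):
    """Stand-in for the remote POST: record the headers and report success."""
    sent.append(headers)
    return True


metadata = {
    "x-container-meta-a": (10.0, "old"),
    "x-container-meta-b": (20.0, "new"),
    "x-container-read": (30.0, ".r:*"),
}
new_point = sync_metadata(metadata, 10.0, record_post)
```

Stopping at the first failed POST keeps the sync point from moving past a group that never reached the remote cluster.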

2. The above suggests controlling which metadata is replicated through the config file. It seems this should instead be done through container metadata, e.g.:
x-container-sync-meta: true/false
x-container-sync-sysmeta: comma-separated list of other metadata items to sync

clayg (clay-gerrard) wrote on 2015-06-17:

Eran,

Acoles & I talked about how we might sync persistent object metadata via ssync and came up with a header scheme like

X-Object-Metadata-[Key]: [Value]
X-Timestamp-Object-Metadata-[Key]: [Timestamp]

So you can just send everything in one post:

    for key, (value, timestamp) in metadata.items():
        headers.add(object_metadata(key, value))
        headers.add(object_metadata_timestamp(key, timestamp))

... you might think about doing that instead of a post for every x-timestamp (which could add up over time)
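
As a self-contained illustration of the header scheme above, the sketch below flattens a metadata dict into one set of headers and rebuilds it on the other side. The function names are hypothetical, and the decoding step is an assumption about how a receiver might read the headers back; neither is taken from the ssync code.

```python
VALUE_PREFIX = "X-Object-Metadata-"
TIMESTAMP_PREFIX = "X-Timestamp-Object-Metadata-"


def encode_metadata_headers(metadata):
    """Flatten {key: (value, timestamp)} into a single header dict."""
    headers = {}
    for key, (value, timestamp) in metadata.items():
        headers[VALUE_PREFIX + key] = value
        headers[TIMESTAMP_PREFIX + key] = timestamp
    return headers


def decode_metadata_headers(headers):
    """Rebuild {key: (value, timestamp)} from the flattened headers."""
    metadata = {}
    for name, value in headers.items():
        if name.startswith(VALUE_PREFIX):
            key = name[len(VALUE_PREFIX):]
            metadata[key] = (value, headers[TIMESTAMP_PREFIX + key])
    return metadata


original = {
    "Color": ("blue", "1434000000.00000"),
    "Owner": ("alice", "1434000500.00000"),
}
headers = encode_metadata_headers(original)
round_trip = decode_metadata_headers(headers)
```

Because every key carries its own timestamp, a single POST can deliver items that were last modified at different times.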

Eran Rom (eranr) wrote on 2015-06-18:

Clay,
That makes sense. Thanks! I will use that scheme.

Eran Rom (eranr) wrote on 2015-06-18:

Use case for replicating container metadata
------------------------------------------------------------------
Consider the case where container sync is used in master/master mode only.
In other words, containers are being mirrored.
In this case it clearly makes sense to replicate user metadata, as well as ACLs and quotas, between the mirrored containers.
It is also clear that system metadata such as policy index, version control and encryption stuff might not need to be replicated.

Eran Rom (eranr) wrote on 2015-06-18:

After going back and forth with myself and IRC I suggest the following:
X-Container-Sync-Metadata: A comma separated list of headers to replicate, where the values can be:
- x-container-meta-*
- any header from the container's pass_through_headers as defined in the proxy's ContainerController
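
One way the proposed X-Container-Sync-Metadata value could be applied is sketched below. The PASS_THROUGH_HEADERS list is an assumed, illustrative subset rather than the proxy's actual list, and select_headers_to_sync is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Assumed subset of the proxy's pass-through headers, for illustration only.
PASS_THROUGH_HEADERS = ["x-container-read", "x-container-write"]


def select_headers_to_sync(sync_spec, container_headers):
    """Return the container headers permitted by a comma-separated sync spec."""
    patterns = [p.strip().lower() for p in sync_spec.split(",") if p.strip()]
    # Accept only the user-metadata wildcard or an exact pass-through header name.
    allowed = [
        p for p in patterns
        if p == "x-container-meta-*" or p in PASS_THROUGH_HEADERS
    ]
    return {
        name: value
        for name, value in container_headers.items()
        if any(fnmatchcase(name.lower(), p) for p in allowed)
    }


container_headers = {
    "X-Container-Meta-Color": "blue",
    "X-Container-Read": ".r:*",
    "X-Container-Sysmeta-Foo": "bar",
}
selected = select_headers_to_sync(
    "x-container-meta-*, x-container-read, x-container-sysmeta-foo",
    container_headers,
)
```

Entries in the spec that are neither the wildcard nor a pass-through header are ignored, so the sysmeta item in this example is not selected.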

clayg (clay-gerrard) wrote on 2015-06-18:

I think container mirroring is an interesting use-case

Currently one option that jumps out is global clusters - if you want all of the objects of a container in region A to also be accessible with all the same metadata and behaviors in region B - you create that container with a storage policy that will house multiple replicas in each region. There might be some deficiencies with how container requests are currently propagated, maybe some inefficiencies with how data uploaded in region A is synced to region B, but I think I'd rather enumerate those and fix them than try to do "container mirroring" via an extension to container-sync that also requires the user to enumerate all the things they want mirrored?

Idk, i suppose I'm currently skeptical mainly because container-sync's ability to scale is tied to the database replicas and container servers. I have a notion that the queue based approach used by the object expirer and reconciler - despite the i/o overhead of duplicating the rows into another container for staging/sequencing - actually has a better chance to scale because the heavy lifting of moving the data can be spread across any N nodes that can reach the queue. But that's somewhat orthogonal to container-metadata syncing.

Do you have another use case for this feature in mind besides mirroring a container between two sites? Because that use-case sounds like "I really want a multi-site storage policy" - but I'm guessing you have some operator requirement that makes the existing implementation there less suitable somehow?

Eran Rom (eranr) wrote on 2015-06-21:

Indeed the intention of container mirroring is mirroring between regions.

A multi-site storage policy has been considered, and was ruled out due to isolation requirements: the concern is that, with a ring shared between regions, problems happening in one region can affect the others.

Two other concerns have to do with policy management and resource utilization:
Suppose that, to reduce the isolation problem mentioned above, there is a policy for each pair of sites (or for each 3 sites if you want 3X 'multi-region replication'). This leads to O(N^2) policies or, more generally, O(N^m) policies for m replicas over N sites. Having so many policies, or groups of policies, also raises the question of resource utilization. With one ring, resource usage is pretty much evenly distributed; the more rings we have, the harder we have to work to share the resources globally.
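
To put rough numbers on this, the sketch below counts one policy per distinct set of m sites, which is one reading of the argument above. That count is the binomial coefficient C(N, m), which grows on the order of N^m for a fixed m. The function name policies_needed is hypothetical.

```python
from math import comb


def policies_needed(sites, replicas):
    """Number of distinct site subsets of size `replicas` among `sites` sites."""
    return comb(sites, replicas)


pairs_for_10_sites = policies_needed(10, 2)    # 45
triples_for_10_sites = policies_needed(10, 3)  # 120
```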

Other than that, I do believe that replicating the metadata is a natural extension of the existing generic container sync feature.

The scale problem is a concern we also share. While we try to measure the 'limits' of container sync, we can think of short term and long term solutions for this:
1. Short term solution - Add more md servers to be able to deal with larger sync BW.
2. Long term solution, which you pretty much suggest above - Rely on the notification mechanism to 'decouple' the container db from the container 'sync' node so as to distribute the work amongst more nodes. BTW From a more theoretical P.O.V. I wonder if this can be done today using a 'remote' broker.

John Dickinson (notmyname) on 2015-07-06

Changed in swift:
importance:	Undecided → Wishlist
status:	New → Confirmed

Eran Rom (eranr) on 2015-07-06

Changed in swift:
assignee:	nobody → Eran Rom (eranr)

Eran Rom (eranr) wrote on 2016-09-20:

I believe that the metadata updates taken care of by Fast-POST are for objects' metadata. The bug in question is about the container's metadata.
