Container replication does not scale for very large container databases

Bug #1260460 reported by Alexander Saint Croix on 2013-12-12
This bug affects 5 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)

Bug Description

Containers with very large numbers of files result in very large container databases, and the current container replication mechanism does not scale well for these use cases.

For example, one of our containers has nearly ten million objects in it. The resulting container database is nearly 2.5GB. Container replication in our environment happens very frequently, in times of heavy usage it can happen every 30 seconds or so. Since the container database is very large, these constant read/writes end up busying out the devices upon which the container database is stored, which results in a large number of "Lock timeout" and other PUT errors on that particular container-server process. Second, because so many objects are stored in that container, during periods of heavy load, the contents of the container change constantly, virtually guaranteeing that the entire container database will need to be synced during each attempted container-replication run.

A quick and dirty workaround for this problem is to create hundreds or thousands of small containers. This accomplishes two things: first, each of these small container databases will be spread around the storage devices more evenly, resulting in less impact on disk I/O on average. Second, because the entire object space is spread across more containers, the odds of any particular container being updated from one minute to the next will be smaller, and this should result in fewer container databases having to be copied at any given time, further reducing disk I/O and network load.

However, the above situation is surprising for a system that advertises itself to be as scalable as Swift, and the solution I give here does not feel like the sort of thing that an end user should have to concern themselves with. The abstraction between a "container" in the swift API and how the information about the objects stored in that container on the back end should be opaque to the end user. Knowledge of the replication process, how container databases are stored on disk, how much time they take to replicate, what size they are, whether they contend with regular PUT operations, and so on and so forth should not be end-user concerns.

We are surprised that our intuitive use case (put lots of objects in one container) becomes pathological to the extent that the user must create thousands of containers in order to continue to scale the storage solution to "large" numbers of files. Copying a 2.5GB sqlite database over the network just because one object was written into a container seems, at least to me, to be an astonishing misuse of disk and network resources.

Although my team is capable of adapting our use of the tool as described above, we consider this to be a major flaw in the design of Swift and an impediment to its scalability for intuitive use cases. At the very least, putting large numbers of files in a container should be warned against in the official documentation. In the long term, we'd very much like to see the container replication process be overhauled. For example, perhaps deltas containing changes to the container database could be shipped around, instead of the entire container database itself.

John Dickinson (notmyname) wrote :

You've run in to an issue that does exist in Swift: containers with high cardinality can increase object write times and may be difficult to keep in sync with its replicas.

There are a few different ways to mitigate this:

1) Use many containers. This is the normal recommended path of "best practices" for using Swift clusters. I'm sorry if you were not able to come across this recommendation earlier, and I'd welcome your help in updating the docs to be clear that this is something to take into consideration when writing an application to store data in Swift.

2) Use faster hardware. As you noticed, the main limitation on container servers is the disk IO. Deploying account and container serves provisioned with SSDs goes a long way to removing bottlenecks in container cardinality. I know this is simply a "throw more hardware at it" solution, but it is a relatively cheap and easy way to massively improve performance, even for containers with 10s of millions of objects in them.

3) Update Swift's code. This is the best long-term solution, but it is by far the most expensive and complicated option. Many attempts to do this have been made and abandoned. Basically, what needs to happen is that a container is sharded across the cluster as it grows. The tricky parts involve preserving ordered listings, knowing when to split and join container segments (while keeping aggregated metadata), and doing all this while keeping the eventual consistency window low enough to keep the system usable.

There's been a placeholder blueprint for this feature for a long time.

If you are interested in tackling this problem, we'd all welcome your help. I believe it's something that we in the community must solve eventually. Please join us on freenode IRC in #openstack-swift if you'd like to talk about contributing container sharding into the project.

Finally, I'd like to correct one slight mistake in your report. if possible, Swift does sync containers by sending deltas. Entire container databases are only sent over the network when a replica is completely missing.

Changed in swift:
importance: Undecided → Wishlist
Changed in swift:
status: New → Confirmed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers