tmp directory causing Swift slowdown

Bug #1450656 reported by Shri Javadekar on 2015-04-30
This bug affects 3 people
Affects: OpenStack Object Storage (swift)
Importance: Undecided
Assigned to: Shri Javadekar

Bug Description

Swift's object server creates a temp directory under the mount point /srv/node/rN (where N is some integer). It first creates a temp file under this directory (say /srv/node/r0/tmp/tmpASDF) and eventually renames that file to its final destination:

rename /srv/node/r0/tmp/tmpASDF ->
/srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data.
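
For reference, a minimal sketch of this write path (illustrative paths and helper names, not Swift's actual DiskFile code):

    import os
    import tempfile

    def put_object(tmp_dir, final_path, chunks):
        """Write to a tempfile under the device-level tmp dir, fsync,
        then rename into the final object location."""
        fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)   # e.g. /srv/node/r0/tmp
        try:
            for chunk in chunks:
                os.write(fd, chunk)
            os.fsync(fd)                               # durability before the rename
        finally:
            os.close(fd)
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        # rename is metadata-only on XFS; the inode keeps the allocation
        # group it was given under tmp/
        os.rename(tmp_path, final_path)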

XFS creates an inode in the same allocation group as its parent. So, when the temp file tmpASDF is created, it goes into the same allocation group as "tmp". When the rename happens, only the filesystem metadata gets modified; the allocation groups of the inodes don't change.

Since all object PUTs start off in the tmp directory, all inodes get created in the same allocation group. The B-tree used to track inodes in that allocation group grows bigger and bigger as more files are written, and traversing it for existence checks or for allocating new inodes becomes more and more expensive.

See the discussion I had on the XFS mailing list where this issue was brought to light [1], and an older thread where the problem was identical [2].

I validated this theory by periodically deleting the temp directory. Earlier, my tests would start at ~375 obj/s, and by the time I had 600GB of data they would be crawling at ~100 obj/s. When I consistently deleted the tmp directory, the object rate did not drop at the same pace: starting at ~375 obj/s, after 600GB of data in Swift I was still getting ~340 obj/s.

[1] http://www.spinics.net/lists/xfs/msg32868.html
[2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html

Shri Javadekar (shrinand) wrote :

One option would be to put the temp directory somewhere deeper in the filesystem rather than immediately under the mount point, e.g. create one temp directory under each of the 3-byte hash directories and use the temp directory corresponding to the object's hash.

But it's unclear what other repercussions this would have. Will the replicator start replicating these temp directories?

Another option is to actually delete the tmp directory periodically. The problem is that we don't know when, and whenever we decide to do it, the temp directory may have files in it, making it impossible to delete the directory.

Alan Jiang (ajiang) wrote :

Shri

This sheds some light on the high-load issue I am facing when the replicator's update_delete is running.
When there are a lot of partitions that need to be removed, and each partition has a lot of suffix subdirectories, the partition delete operation causes load to spike on the data node. This in turn causes object server requests from the proxy to time out (I have a 20s node timeout).
Does this sound related to the temp file issue you described?

Thanks.

Shri Javadekar (shrinand) wrote :

The pattern I mentioned in the original bug happens when a PUT request is processed. AFAIK, the replicator process uses rsync directly on the partition directories, which wouldn't use the tmp directory directly under the mount point. Therefore, I don't think this is the root cause of your problem.

paul luse (paul-e-luse) wrote :

I thought you mentioned on the ML that you had a tested patch for this. Are you going to post/link it?

Shri Javadekar (shrinand) wrote :

I have a patch for moving the tmp directory from directly under the mount point to further down in the filesystem hierarchy (in the 3-byte hash directory). I can send out a patch for it if that's the best option.

Samuel Merritt (torgomatic) wrote :

O_TMPFILE + linkat() is going to be the best answer since this is exactly the use case those were designed for.
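
A minimal sketch of that approach, assuming a kernel and filesystem with O_TMPFILE support (illustrative only, not the code that eventually landed):

    import os

    def put_object_o_tmpfile(obj_dir, final_name, chunks):
        """Create an unnamed file directly in the object's own directory, so
        its inode lands in that directory's allocation group; give it a name
        only after the write has succeeded."""
        fd = os.open(obj_dir, os.O_TMPFILE | os.O_WRONLY, 0o600)
        try:
            for chunk in chunks:
                os.write(fd, chunk)
            os.fsync(fd)
            # linkat() the anonymous inode into place via /proc/self/fd; a
            # failed PUT never gets a name and simply vanishes on close().
            os.link('/proc/self/fd/%d' % fd, os.path.join(obj_dir, final_name))
        finally:
            os.close(fd)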

However, on older kernels, we can certainly improve things.

Right now, the big thing that we gain by having a single temp dir is easy cleanup of stale temporary files. You can see the object replicator do this on line 489: it calls unlink_older_than() and clears out any old temp files left over by object server crashes or whatever. By pushing named temporary files down further into the directory structure, we lose that easy cleanup.

That's not to say we shouldn't do such a thing. What I'd do, if I wanted to spread out my tempfile creation but couldn't use O_TMPFILE, is create my tempfiles right in the hash dir with some suffix like ".objtmp". I'd add logic to the auditor to clean up any old tempfiles that it finds while it's crawling the disk, and I'd make sure we were passing the right flags to rsync so that .objtmp files don't get replicated. That gets us even lower in the hierarchy than the suffix dir.
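
A rough sketch of the cleanup side of that idea (the .objtmp suffix and the reclaim age are illustrative, and the rsync exclusion would be something like --exclude='*.objtmp'):

    import os
    import time

    RECLAIM_AGE = 7 * 24 * 3600   # hypothetical age before a tempfile counts as stale

    def reap_stale_objtmp(hash_dir, now=None):
        """While crawling a hash dir (e.g. from the auditor), unlink any
        *.objtmp file older than the reclaim age."""
        now = now or time.time()
        for name in os.listdir(hash_dir):
            if not name.endswith('.objtmp'):
                continue
            path = os.path.join(hash_dir, name)
            try:
                if now - os.path.getmtime(path) > RECLAIM_AGE:
                    os.unlink(path)
            except OSError:
                pass   # raced with a concurrent PUT or cleanup; ignore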

Shri Javadekar (shrinand) wrote :

IMHO, overloading the auditor (or even the replicator) to clean up old temp files is not a good choice. First, it is very unintuitive, and second, those processes might not even be running. If someone is running with a replica count of 1, they may not run the replicator. Unfortunately, they have to today, because the replicator also does deletions.

Creating the tempfile in the hash directory itself works great when the PUT succeeds. But in case of failure it requires more work: whoever does the deletion will have to look into the hash directory, find a .objtmp file, and then delete both the file and the hash directory itself.

By contrast, if there is a tmp directory in each 3-byte hash directory, the only thing to clean up on failure would be the files under that tmp directory. The drawback with this approach, though, is that there will be at least 4096 tmp directories.

Mike Barton (redbo) wrote :

I'm confused - if everything under a directory is in the same allocation group as its parent, wouldn't everything under "objects" be in the same allocation group?

Or if subdirectories get their own allocation group, could we just create 00-FF directories under tmp?

Alan Jiang (ajiang) wrote :

Shri

Thanks for your feedback on my questions. I still think it is related, since the replicator deletes a partition when that partition doesn't belong to the node the replicator is running on.
Since all the objects were created in the same allocation group and then moved to their current locations, won't removing a partition and its contents be bottlenecked by that same allocation group?

Shri Javadekar (shrinand) wrote :

@Mike: Yes, all partitions under "objects" will go to the same allocation group. But presumably the number of partitions shouldn't be that large (isn't the recommendation roughly 100 partitions per device? 100 dirs inside "objects" should be fine).

I believe the intent is to keep a file's inode and its parent dir in the same allocation group. We could create subdirs 00-FF (or maybe more) inside tmp and fix it that way. The benefit of this approach is that it would be easy and minimally disruptive.

@Alan, you may be right. A simple way to test it would be to manually delete the "tmp" directory under the mount point periodically *while ingesting the data*. That way the inodes will get spread across allocation groups.

Shri Javadekar (shrinand) wrote :

One of my co-workers came up with another interesting solution.

1. We create the tmpfile using the .objtmp extension in the same directory where the file will eventually reside.
2. But we also create a symlink to this tmpfile from /srv/node/r0/objects/tmp/.
3. If the PUT succeeds, we remove the symlink from the tmp directory.
4. The cleanup routine goes through the tmp directory and deletes aged files. That routine should be made symlink-aware, so that if a symlink has aged, it removes both the symlink and its target.

There is a rare possibility that #1 succeeds but #2 fails, leaving dangling .objtmp files. These could be deleted similarly to the way .ts files are cleaned up today (hash_cleanup_listdir -> cleanup_ondisk_files).
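
A sketch of how steps 1-4 might fit together (names, paths, and ages are illustrative):

    import os
    import uuid

    def put_with_tmp_symlink(hash_dir, final_name, tmp_index_dir, chunks):
        """Steps 1-3: write the .objtmp file next to its final location,
        register it via a symlink in the shared tmp dir, then rename and
        drop the symlink once the PUT succeeds."""
        tmp_name = '%s.objtmp' % uuid.uuid4().hex
        tmp_path = os.path.join(hash_dir, tmp_name)
        link_path = os.path.join(tmp_index_dir, tmp_name)
        with open(tmp_path, 'wb') as f:            # step 1
            os.symlink(tmp_path, link_path)        # step 2: register for cleanup
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, os.path.join(hash_dir, final_name))
        os.unlink(link_path)                       # step 3: PUT succeeded

    def reap_aged_symlinks(tmp_index_dir, max_age, now):
        """Step 4: symlink-aware cleanup that removes both the aged link
        and its .objtmp target."""
        for name in os.listdir(tmp_index_dir):
            link_path = os.path.join(tmp_index_dir, name)
            if now - os.lstat(link_path).st_mtime > max_age:
                try:
                    os.unlink(os.readlink(link_path))   # the target .objtmp file
                except OSError:
                    pass                                # target already gone
                os.unlink(link_path)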

Changed in swift:
assignee: nobody → Shri Javadekar (shrinand)
status: New → In Progress

Samuel Merritt (torgomatic) wrote :

I want to go on the record here: if your Swift cluster isn't running all the consistency processes, you're doing it wrong. The replicator is responsible for stale tempfile cleanup, data movement in case of rebalance, and (of course) replication.

This statement is not targeted at any particular individual; rather, it is intended to help out anyone who reads this bug and wonders about shutting down replicators or auditors.

Samuel Merritt (torgomatic) wrote :

@redbo: files go in parent directory's AG, directories go all over without regard for parent. See links in http://www.gossamer-threads.com/lists/openstack/dev/46301?do=post_view_threaded#46301

I think I'd rather see higher-in-the-tree tempfiles than this symlink thing. What we have now is basically (open, write*N, fsetxattr, fsync, rename), and that's pretty much the minimal set of syscalls needed.

Adding in all this symlink stuff turns that into (open, symlink, write*N, fsetxattr, fsync, rename, unlink).

Since XFS isn't usually deployed with thousands of allocation groups (e.g. I have 32 AGs on my 8T disks in one cluster here (CentOS 6.6), and 6 AGs on my 6T disks elsewhere (Ubuntu 12.04)), it's probably better to go with partition-level or suffix-level tempdirs to get the necessary splaying across AGs. At 4096 suffixes per partition, suffix-level tempdirs should keep us happy for a good long time. Really, partition-level tempdirs will probably do the same thing; if I have 100 partitions/disk and only 32 AGs, that's good enough too.

By the time some future filesystem supports millions of AG-like things per hyper-disk-o-tron, we'll all be using O_TMPFILE anyway. :)

Mike Barton (redbo) wrote :

If you have 200 partitions on a drive, suffix level tmp dirs could mean there are >800K tmp dirs, which seems excessive.

Partition-level tmp dirs would be fine, but I also think if we just do /tmp/00 through /tmp/FF or something, that'd make sure we get a good distribution over allocation groups.
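
A sketch of the 00-FF idea (the shard choice and layout are illustrative):

    import os

    def sharded_tmp_dir(device_root, object_hash):
        """Pick one of 256 tmp subdirectories based on the object's hash, so
        new tempfile inodes spread out instead of piling into the single
        allocation group that holds a flat tmp/."""
        shard = object_hash[-2:]   # '00' .. 'ff'
        # on XFS, directories tend to be placed across allocation groups,
        # so each shard's tempfiles land in different AGs
        path = os.path.join(device_root, 'tmp', shard)
        os.makedirs(path, exist_ok=True)
        return path

    # a PUT would then do something like:
    #   fd, tmp_path = tempfile.mkstemp(dir=sharded_tmp_dir('/srv/node/r0', obj_hash))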

I guess it's been hinted that cross-directory renames accelerate some aspect of filesystem aging other than allocation group distribution, but if that's the case, I'd like to understand it a bit better.

Samuel Merritt (torgomatic) wrote :

@redbo: Good call on subdirs of tmp; that gives us a nice level of control over things, so we could add more bits if it ever becomes necessary.

(Note to implementer: please don't start out with it configurable; just do 00-FF. Unless we discover a use case that needs more than 255, it's better not to have Yet Another Knob(tm) in the configuration.)

I know nothing about anything other than cross-AG distribution, but if you find out more, please let me know.

Looking back, I guess I just got that impression from:

> That's pretty braindead behaviour. That will screw performance and
> locality on any filesystem you do that on, not to mention age it
> extremely quickly.

I'm not entirely sure what that means, or if the aging is somehow
independent of the performance and locality screwing. Maybe we should find
Dave Chinner and verify that this new scheme is no longer braindead.

Shri Javadekar (shrinand) wrote :

Sharding the tmp directory should solve the problem of files only getting created in a single AG. But the parent directory (datadir) and the actual file may still not be in the same AG.

Nonetheless, I have written about this proposed scheme in the same thread on the XFS mailing list [1]. Feel free to weigh in there.

[1] http://www.spinics.net/lists/xfs/msg32868.html

Mike Barton (redbo) wrote :

After doing some reading, it seems like the problem with just doing a bunch
of tmp dirs (no matter where they live) is that for most operations we have
to listdir an object dir and then stat/read a file or two inside of it.
Since the directory and files are likely all in different allocation
groups, we'd be incurring long seeks across the drive for most operations.

Right now, we have kind of the same problem (no locality in data layout) in
addition to limiting file concurrency to what one large, bloated AG can do.

I really don't like putting tmp files under the object dirs - it's going to
jam its fingers in all sorts of places. But that may be the best option
until we're living in an O_TMPFILE world. I'll definitely continue
researching and pondering this.

For auditing and replicating, it seems like ideally we might want
everything (directories and files) under a partition or suffix dir to be in
the same allocation group. Then we'd get better locality when walking the
whole tree. But that doesn't seem possible with the file system interface
(or at least it wouldn't be pretty).

Prashanth Pai (ppai) wrote :

This should fix things for newer Linux versions: https://review.openstack.org/162243

Samuel Merritt (torgomatic) wrote :

Fixed in commit 773edb4a5dd22b22f5fcab57820bcbdd4176dfcd (Thu Mar 5 18:18:25 2015 +0530).

Changed in swift:
status: In Progress → Fix Released