Jamie McCracken wrote:

> I dont know about other indexers

Someone should see what Beagle's like, I guess.

> Trackers indexer is a hash table so words are written at random
> locations - its not possible to write more than one word at a time nor
> do we know whether certain words are stored sequentially as a result.
>
> We rely on the kernel to order the writes elevator fashion so they can
> be written in a contiguous fashion - sadly that does not happen on ext3
> but does if ~/.cache/tracker is mounted on XFS
>
> We call fdatasync after every 1000-5000 words written to prevent pdflush
> starving the disk from other apps (this starvation appears to be a
> recent problem in kernels since 2.6.20)

Ew. Those both look like nasty kernel limitations when writing to a file in a scattered fashion.

I guess this is also why producing multiple smaller index files, then having a merge-fest to make the large index file when it's all done, is faster than writing everything to one hash index as you go. That would naturally decrease the _size_ of seeks and hence seek time, as smaller files span less of the disk.

Coming back to this:

> Trackers indexer is a hash table so words are written at random
> locations - its not possible to write more than one word at a time nor
> do we know whether certain words are stored sequentially as a result.

That doesn't seem like a good way to write an index. I'll try to put together a different idea. (I've thought a lot about indexes for databases: I'm thinking of writing yet another database engine.)

You've found that SQLite doesn't perform too well without index-merging, and neither does the other db you tried. But a _good_ database implementation obviously should perform better writing everything to one big file than writing to multiple smaller files in sequence and then merging them.

Think about it: a good database would do the optimisation you're doing automatically, because loading large amounts of data with randomly ordered index keys is quite a common requirement (and benchmark) for databases. The only difference would be that it would hide the "smaller files" as allocated zones inside the single big file it's apparently using. That bigger file's size would grow in steps. The disk seeking and I/O would still have similar properties.

There are quite a few different algorithms to make the index (in a general database engine) always accessible while it grows, despite the internal periodic reorganisation, and to keep the reorganisation efficient with disk seeks.

One way is to store the index as two tables internally: one B-tree, and one sequential table (because reads can easily check both), and write updates (in your case, each new "word") fairly sequentially up to a certain amount, as write-ahead logging, then periodically merge the log into the main B-tree using a batched method. If two logs are permitted, one being merged and one being written, writing doesn't have to stall during merging.
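Here's a rough C sketch of that two-log scheme, just to make it concrete. Everything in it is hypothetical - the names, the 4 MB rotation threshold, the record layout - and the merge itself is only a stub; in a real engine it would run on another thread so appends never stall:

    /* Rough sketch of the two-log write-ahead scheme: appends go to the
     * active log; when it fills, the logs swap roles and the full one is
     * merged into the main B-tree.  All names and the 4 MB threshold are
     * made up; the merge is only a stub.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_LIMIT (4 * 1024 * 1024)   /* rotate after ~4 MB of appends */

    struct wal {
        int   fd[2];      /* two log files: one active, one being merged */
        int   active;     /* which log currently takes writes */
        off_t written;    /* bytes appended to the active log so far */
    };

    int wal_open(struct wal *w)
    {
        w->fd[0] = open("index-log.0", O_WRONLY | O_CREAT | O_APPEND, 0644);
        w->fd[1] = open("index-log.1", O_WRONLY | O_CREAT | O_APPEND, 0644);
        w->active = 0;
        w->written = 0;
        return (w->fd[0] < 0 || w->fd[1] < 0) ? -1 : 0;
    }

    /* Stub for the batched merge: read the full log back, sort its entries
     * in index order, apply them to the main B-tree in one pass, then
     * truncate the log so it can take writes again. */
    void merge_log_into_btree(int log_fd)
    {
        /* ... sort, walk the B-tree once, then: */
        ftruncate(log_fd, 0);
    }

    /* Append one word + its posting data: purely sequential I/O. */
    int wal_append(struct wal *w, const char *word, const void *data,
                   size_t len)
    {
        int fd = w->fd[w->active];
        ssize_t n = write(fd, word, strlen(word) + 1);

        if (n < 0 || write(fd, data, len) < 0)
            return -1;
        w->written += n + len;

        if (w->written >= LOG_LIMIT) {
            fdatasync(fd);            /* bound unwritten state in the kernel */
            w->active = !w->active;   /* new words go to the other log... */
            w->written = 0;
            merge_log_into_btree(fd); /* ...while this one is merged (in a
                                         real engine, on another thread) */
        }
        return 0;
    }

The point is that wal_append() only ever does sequential appends, so the kernel's elevator has nothing to get wrong; all the random I/O is deferred to the batched merge.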
(Aside: I know for a fact that several databases do this, retaining a separate sequential index and a B-tree index when there's a constant stream of updates. You might find PostgreSQL, some MySQL backend, Firebird, or Oracle's version of Berkeley DB (see oracle.com - it's still the BSD license) actually performs better than anything you've tried so far, for these reasons. Don't assume a separate SQL server process means slow, as disk seeking is slower :-) Have you tried an SQL DB, or Oracle/non-Oracle Berkeley DB? I can't offer advice, as I don't actually use these things myself much, other than to suggest trying them, trying Oracle's Berkeley DB first because the API is so similar to what you've already tried.)

Another algorithm is to have concurrent logs in a tree-of-logs-and-B-trees structure, with merge-sorting interleaved into the index accesses: either lazily on demand, or when the tree-of-trees becomes unbalanced, or when seek time measurements indicate it is worthwhile. That's more complex, but algorithmically it adapts better to the size of the data and the approximate seek-time-to-data-size ratio.

Your index-merging strategy is similar to tree-of-logs-and-B-trees, but without the logs ;-) Because you don't have the logs part, all your disk writes are random, suffering from these kernel issues and needing the fdatasync() calls.

To improve this, I can think of a couple of simple things:

1. Instead of writing small hash indexes and merging them later, write small sequential indexes: i.e. simply append words (index keys) and the attached data until the file is large enough, and it's time to move to the next. Then merge the sequential indexes, much as you merge tree ones now. You might want a two-level strategy, with relatively small sequential index files, merging those to make medium size B-tree indexes, then merging those to make a large hash or B-tree index. The merging would be slow, but you've said that's much less of a problem than the main index building time. (A two-way merge sketch follows this list.)

2. Don't write 1000-5000 random words, in the order you found them, to a hash index, followed by fdatasync(). Collect the words in memory first into a batch, then _sort_ them, then write them to a B-tree index (not hash) in the sorted order (the B-tree must use exactly the same order; beware of case/locale issues), then do fdatasync(). The reason is that writing a sorted batch of updates to a B-tree does less seeking (and also less I/O under memory pressure) than writing the same data unsorted, to a B-tree or hash. (Sorting doesn't help with a hash, as the hash mixes up the sorting, of course :-) The benefit of storing sorted memory batches into a file B-tree shrinks as the tree gets larger, because the I/O will eventually be one update per block, with a huge tree and sparsely distributed keys, but it still reduces seek times. The heuristic for how often to call fdatasync() to avoid pdflush problems is a function of tree size, but if you're limiting the tree size, for index merging multiple trees later, it may be irrelevant. (A sorted-batch sketch also follows this list.)
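To make the merging step of suggestion 1 concrete, here's a sketch of a two-way merge of sequential index runs. It assumes each run has already been sorted (for instance in memory before it was written, combining this with suggestion 2), and a made-up one-record-per-line format:

    /* Two-way merge of two sorted sequential index runs into one, using
     * only sequential I/O.  The one-record-per-line "word\tdata\n" format
     * is a made-up example, as are the file names passed by the caller.
     */
    #include <stdio.h>
    #include <string.h>

    int merge_runs(const char *in_a, const char *in_b, const char *out)
    {
        FILE *a = fopen(in_a, "r");
        FILE *b = fopen(in_b, "r");
        FILE *o = fopen(out, "w");
        char la[4096], lb[4096];
        int have_a, have_b;

        if (!a || !b || !o)
            return -1;

        have_a = fgets(la, sizeof la, a) != NULL;
        have_b = fgets(lb, sizeof lb, b) != NULL;

        /* Both inputs are already sorted, so one front-to-back pass over
         * each produces sorted output with no seeking at all. */
        while (have_a && have_b) {
            if (strcmp(la, lb) <= 0) {
                fputs(la, o);
                have_a = fgets(la, sizeof la, a) != NULL;
            } else {
                fputs(lb, o);
                have_b = fgets(lb, sizeof lb, b) != NULL;
            }
        }
        while (have_a) { fputs(la, o); have_a = fgets(la, sizeof la, a) != NULL; }
        while (have_b) { fputs(lb, o); have_b = fgets(lb, sizeof lb, b) != NULL; }

        fclose(a); fclose(b); fclose(o);
        return 0;
    }

Each input is read exactly once, front to back, and the output is appended sequentially, so the merge costs almost no seeks even though the result is fully sorted. A k-way version generalises this with a small heap over the run heads.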
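And here's roughly what suggestion 2 might look like against the Berkeley DB C API (db_create(), DB->open(), DB->put() and DB->sync() are real calls; the file name, batch sizes and records are invented). It relies on Berkeley DB's default B-tree comparison being plain byte order, which strcmp on the key bytes matches - exactly the case/locale trap mentioned above:

    /* Sketch of the sorted-batch idea: collect entries in memory, sort
     * them in the B-tree's key order, then write them in that order,
     * syncing periodically. */
    #include <db.h>
    #include <stdlib.h>
    #include <string.h>

    #define SYNC_EVERY 2000   /* flush every few thousand puts, as now */

    struct entry { char *word; char *postings; size_t plen; };

    /* Batch sort order must match the B-tree's key order.  Berkeley DB's
     * default B-tree comparison is plain byte order, which strcmp on the
     * key bytes reproduces. */
    int cmp_entry(const void *a, const void *b)
    {
        return strcmp(((const struct entry *)a)->word,
                      ((const struct entry *)b)->word);
    }

    int flush_batch(DB *db, struct entry *batch, size_t n)
    {
        qsort(batch, n, sizeof batch[0], cmp_entry);

        for (size_t i = 0; i < n; i++) {
            DBT key, data;

            memset(&key, 0, sizeof key);
            memset(&data, 0, sizeof data);
            key.data  = batch[i].word;
            key.size  = strlen(batch[i].word) + 1;
            data.data = batch[i].postings;
            data.size = batch[i].plen;

            if (db->put(db, NULL, &key, &data, 0) != 0)
                return -1;
            if ((i + 1) % SYNC_EVERY == 0)
                db->sync(db, 0);   /* the fdatasync() analogue: keep the
                                      kernel's unwritten backlog bounded */
        }
        return db->sync(db, 0);
    }

    int main(void)
    {
        /* A real batch would be megabytes of words from the crawl; three
         * hard-coded entries stand in for it here. */
        struct entry batch[] = {
            { "zebra", "doc:3", 6 },
            { "apple", "doc:1", 6 },
            { "mango", "doc:2", 6 },
        };
        DB *db;

        if (db_create(&db, NULL, 0) != 0 ||
            db->open(db, NULL, "words.db", NULL, DB_BTREE, DB_CREATE, 0644) != 0)
            return 1;
        if (flush_batch(db, batch, 3) != 0)
            return 1;
        return db->close(db, 0);
    }

Here db->sync() plays the role of the fdatasync() calls: it bounds how much unwritten index data the kernel is sitting on at any moment.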
Seeing all these performance problems with Tracker - the index writing issues, plus the inotify issues and the disk-battering initial inotify registration - they aren't Tracker's fault at all. Tracker should be able to use all those facilities, and there is no _good_ reason why they don't work well for this application.

They are weaknesses in the database engines you've used: a good database engine should do what you're having to do rather "manually", but with even better heuristics (multi-level batching and sorting in memory and on disk, delayed reorganisation, seek clustering) and better I/O methods (O_DIRECT and an application-controlled elevator, for a start). And they are weaknesses in the kernel: its incomplete attempt at I/O priorities, especially for writing; its lack of seek awareness to give space to other applications; and its lack of a file-change detection mechanism which propagates up directories and persists across reboots.

This has really added to my urge to write my vision of a database engine (I've been thinking a lot about techniques), now that I see there's so much room for improvement in performance for some applications. But that's not going to happen for quite a while. (Tracker would be a great test case for it :-)

I'd be very happy if the ideas I've mentioned made a big difference to index writing:

1. Try Oracle Berkeley DB; use the latest stable source from their site, just to be sure of the best performance test.

2. Try the separate-process SQL servers: Firebird, PostgreSQL, MySQL with various backends, or anything else which is free enough and claims speed.

3. Try batching and sorting (by index order) writes in memory, and writing them to a B-tree format file (ensuring the B-tree order is identical to the batch sort order, including locale/case issues), not a hash format file. (E.g. I know Berkeley DB offers both options.) Don't be afraid to use a large batch in memory: a few megabytes or tens of megabytes, especially to give the test a chance. Still call fdatasync() every 1000-5000 words during the batch write (i.e. after it's sorted), to minimise ext3 elevator screwups: the amount of unwritten index entries in the _kernel_ should always remain limited.

4. Try writing sequential indexes (i.e. just appending each word+data to the small index files), not using a hash/B-tree, and index-merging those. Use a two-level merging scheme, so the sequential indexes are limited in size. This can be combined with suggestion 3, with memory batch sorting to a B-tree being used by the merging step.

Some food for thought. Please let me know yours :-)

Sorry if it seems like there's always more to do. I do think you might be pleasantly surprised by the writing improvements, but I don't have enough experience to guarantee it (just lots of theory...).

-- Jamie