Jamie McCracken wrote:

> I dont know about other indexers

Someone should see what Beagle's like, I guess.

> Trackers indexer is a hash table so words are written at random
> locations - its not possible to write more than one word at a time nor
> do we know whether certain words are stored sequentially as a result.
>
> We rely on the kernel to order the writes elevator fashion so they can
> be written in a contiguous fashion - sadly that does not happen on ext3
> but does if ~/.cache/tracker is mounted on XFS
>
> We call fdatasync after every 1000-5000 words written to prevent pdflush
> starving the disk from other apps (this starvation appears to be a
> recent problem in kernels since 2.6.20)

Ew. Those both look like nasty kernel limitations when writing to a file in a scattered fashion.

I guess this is also why producing multiple smaller index files, then having a merge-fest to make the large index file when it's all done, is faster than writing everything to one hash index as you go. That would naturally decrease the _size_ of seeks and hence seek time, as smaller files span less of the disk.

Coming back to this:

> Trackers indexer is a hash table so words are written at random
> locations - its not possible to write more than one word at a time nor
> do we know whether certain words are stored sequentially as a result.

That doesn't seem like a good way to write an index. I'll try to put together a different idea. (I've thought a lot about indexes for databases: I'm thinking of writing yet another database engine.)

You've found that SQLite doesn't perform too well without index-merging, and neither does the other db you tried. But a _good_ database implementation obviously should perform better writing everything to one big file than writing to multiple smaller files in sequence and then merging them.

Think about it: a good database would do the optimisation you're doing automatically, because loading large amounts of data with randomly ordered index keys is quite a common requirement (and benchmark) for databases. The only difference would be that it would hide the "smaller files" as allocated zones inside the single big file it's apparently using. That bigger file's size would grow in steps. The disk seeking and I/O would still have similar properties.

There are quite a few different algorithms to make the index (in a general database engine) always accessible while it grows, despite the internal periodic reorganisation, and to keep the reorganisation efficient with disk seeks.

One way is to store the index as two tables internally: one B-tree, and one sequential table (because reads can easily check both), and write updates (in your case, each new "word") fairly sequentially up to a certain amount, as write-ahead logging, then periodically merge the log into the main B-tree using a batched method. If two logs are permitted, one being merged and one being written, writing doesn't have to stall during merging.
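Here's a rough C sketch of that two-log scheme, just to make it concrete. Everything in it is hypothetical - the names, the 4 MB rotation threshold, the record layout - and the merge itself is only a stub; in a real engine it would run on another thread so appends never stall:

    /* Rough sketch of the two-log write-ahead scheme: appends go to the
     * active log; when it fills, the logs swap roles and the full one is
     * merged into the main B-tree.  All names and the 4 MB threshold are
     * made up; the merge is only a stub.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define LOG_LIMIT (4 * 1024 * 1024)   /* rotate after ~4 MB of appends */

    struct wal {
        int   fd[2];      /* two log files: one active, one being merged */
        int   active;     /* which log currently takes writes */
        off_t written;    /* bytes appended to the active log so far */
    };

    int wal_open(struct wal *w)
    {
        w->fd[0] = open("index-log.0", O_WRONLY | O_CREAT | O_APPEND, 0644);
        w->fd[1] = open("index-log.1", O_WRONLY | O_CREAT | O_APPEND, 0644);
        w->active = 0;
        w->written = 0;
        return (w->fd[0] < 0 || w->fd[1] < 0) ? -1 : 0;
    }

    /* Stub for the batched merge: read the full log back, sort its entries
     * in index order, apply them to the main B-tree in one pass, then
     * truncate the log so it can take writes again. */
    void merge_log_into_btree(int log_fd)
    {
        /* ... sort, walk the B-tree once, then: */
        ftruncate(log_fd, 0);
    }

    /* Append one word + its posting data: purely sequential I/O. */
    int wal_append(struct wal *w, const char *word, const void *data,
                   size_t len)
    {
        int fd = w->fd[w->active];
        ssize_t n = write(fd, word, strlen(word) + 1);

        if (n < 0 || write(fd, data, len) < 0)
            return -1;
        w->written += n + len;

        if (w->written >= LOG_LIMIT) {
            fdatasync(fd);            /* bound unwritten state in the kernel */
            w->active = !w->active;   /* new words go to the other log... */
            w->written = 0;
            merge_log_into_btree(fd); /* ...while this one is merged (in a
                                         real engine, on another thread) */
        }
        return 0;
    }

The point is that wal_append() only ever does sequential appends, so the kernel's elevator has nothing to get wrong; all the random I/O is deferred to the batched merge.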
(Aside: I know for a fact that several databases do this, retaining a separate sequential index and a B-tree index when there's a constant stream of updates. You might find PostgreSQL, some MySQL backend, Firebird, or Oracle's version of Berkeley DB (see oracle.com - it's still the BSD license) actually performs better than anything you've tried so far, for these reasons. Don't assume a separate SQL server process means slow, as disk seeking is slower :-) Have you tried an SQL DB, or Oracle/non-Oracle Berkeley DB? I can't offer advice, as I don't actually use these things myself much, other than to suggest trying them, trying Oracle's Berkeley DB first because the API is so similar to what you've already tried.)

Another algorithm is to have concurrent logs in a tree-of-logs-and-B-trees structure, with merge-sorting interleaved into the index accesses: either lazily on demand, or when the tree-of-trees becomes unbalanced, or when seek time measurements indicate it is worthwhile. That's more complex, but algorithmically it adapts better to the size of the data and the approximate seek-time-to-data-size ratio.

Your index-merging strategy is similar to tree-of-logs-and-B-trees, but without the logs ;-) Because you don't have the logs part, all your disk writes are random, suffering from these kernel issues and needing the fdatasync() calls.

To improve this, I can think of a couple of simple things:

1. Instead of writing small hash indexes and merging them later, write small sequential indexes: i.e. simply append words (index keys) and the attached data until the file is large enough, and it's time to move to the next. Then merge the sequential indexes, much as you merge tree ones now. You might want a two-level strategy, with relatively small sequential index files, merging those to make medium size B-tree indexes, then merging those to make a large hash or B-tree index. The merging would be slow, but you've said that's much less of a problem than the main index building time. (A two-way merge sketch follows this list.)

2. Don't write 1000-5000 random words, in the order you found them, to a hash index, followed by fdatasync(). Collect the words in memory first into a batch, then _sort_ them, then write them to a B-tree index (not hash) in the sorted order (the B-tree must use exactly the same order; beware of case/locale issues), then do fdatasync(). The reason is that writing a sorted batch of updates to a B-tree does less seeking (and also less I/O under memory pressure) than writing the same data unsorted, to a B-tree or hash. (Sorting doesn't help with a hash, as the hash mixes up the sorting, of course :-) The benefit of storing sorted memory batches into a file B-tree shrinks as the tree gets larger, because the I/O will eventually be one update per block, with a huge tree and sparsely distributed keys, but it still reduces seek times. The heuristic for how often to call fdatasync() to avoid pdflush problems is a function of tree size, but if you're limiting the tree size, for index merging multiple trees later, it may be irrelevant. (A sorted-batch sketch also follows this list.)
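To make the merging step of suggestion 1 concrete, here's a sketch of a two-way merge of sequential index runs. It assumes each run has already been sorted (for instance in memory before it was written, combining this with suggestion 2), and a made-up one-record-per-line format:

    /* Two-way merge of two sorted sequential index runs into one, using
     * only sequential I/O.  The one-record-per-line "word\tdata\n" format
     * is a made-up example, as are the file names passed by the caller.
     */
    #include <stdio.h>
    #include <string.h>

    int merge_runs(const char *in_a, const char *in_b, const char *out)
    {
        FILE *a = fopen(in_a, "r");
        FILE *b = fopen(in_b, "r");
        FILE *o = fopen(out, "w");
        char la[4096], lb[4096];
        int have_a, have_b;

        if (!a || !b || !o)
            return -1;

        have_a = fgets(la, sizeof la, a) != NULL;
        have_b = fgets(lb, sizeof lb, b) != NULL;

        /* Both inputs are already sorted, so one front-to-back pass over
         * each produces sorted output with no seeking at all. */
        while (have_a && have_b) {
            if (strcmp(la, lb) <= 0) {
                fputs(la, o);
                have_a = fgets(la, sizeof la, a) != NULL;
            } else {
                fputs(lb, o);
                have_b = fgets(lb, sizeof lb, b) != NULL;
            }
        }
        while (have_a) { fputs(la, o); have_a = fgets(la, sizeof la, a) != NULL; }
        while (have_b) { fputs(lb, o); have_b = fgets(lb, sizeof lb, b) != NULL; }

        fclose(a); fclose(b); fclose(o);
        return 0;
    }

Each input is read exactly once, front to back, and the output is appended sequentially, so the merge costs almost no seeks even though the result is fully sorted. A k-way version generalises this with a small heap over the run heads.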
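And here's roughly what suggestion 2 might look like against the Berkeley DB C API (db_create(), DB->open(), DB->put() and DB->sync() are real calls; the file name, batch sizes and records are invented). It relies on Berkeley DB's default B-tree comparison being plain byte order, which strcmp on the key bytes matches - exactly the case/locale trap mentioned above:

    /* Sketch of the sorted-batch idea: collect entries in memory, sort
     * them in the B-tree's key order, then write them in that order,
     * syncing periodically. */
    #include <db.h>
    #include <stdlib.h>
    #include <string.h>

    #define SYNC_EVERY 2000   /* flush every few thousand puts, as now */

    struct entry { char *word; char *postings; size_t plen; };

    /* Batch sort order must match the B-tree's key order.  Berkeley DB's
     * default B-tree comparison is plain byte order, which strcmp on the
     * key bytes reproduces. */
    int cmp_entry(const void *a, const void *b)
    {
        return strcmp(((const struct entry *)a)->word,
                      ((const struct entry *)b)->word);
    }

    int flush_batch(DB *db, struct entry *batch, size_t n)
    {
        qsort(batch, n, sizeof batch[0], cmp_entry);

        for (size_t i = 0; i < n; i++) {
            DBT key, data;

            memset(&key, 0, sizeof key);
            memset(&data, 0, sizeof data);
            key.data  = batch[i].word;
            key.size  = strlen(batch[i].word) + 1;
            data.data = batch[i].postings;
            data.size = batch[i].plen;

            if (db->put(db, NULL, &key, &data, 0) != 0)
                return -1;
            if ((i + 1) % SYNC_EVERY == 0)
                db->sync(db, 0);   /* the fdatasync() analogue: keep the
                                      kernel's unwritten backlog bounded */
        }
        return db->sync(db, 0);
    }

    int main(void)
    {
        /* A real batch would be megabytes of words from the crawl; three
         * hard-coded entries stand in for it here. */
        struct entry batch[] = {
            { "zebra", "doc:3", 6 },
            { "apple", "doc:1", 6 },
            { "mango", "doc:2", 6 },
        };
        DB *db;

        if (db_create(&db, NULL, 0) != 0 ||
            db->open(db, NULL, "words.db", NULL, DB_BTREE, DB_CREATE, 0644) != 0)
            return 1;
        if (flush_batch(db, batch, 3) != 0)
            return 1;
        return db->close(db, 0);
    }

Here db->sync() plays the role of the fdatasync() calls: it bounds how much unwritten index data the kernel is sitting on at any moment.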
Seeing all these performance problems with Tracker - the index writing issues, plus the inotify issues and the disk-battering initial inotify registration - they aren't Tracker's fault at all. Tracker should be able to use all those facilities, and there is no _good_ reason why they don't work well for this application.

They are weaknesses in the database engines you've used: a good database engine should do what you're having to do rather "manually", but with even better heuristics (multi-level batching and sorting in memory and on disk, delayed reorganisation, seek clustering) and better I/O methods (O_DIRECT and an application-controlled elevator, for a start). And they are weaknesses in the kernel: its incomplete attempt at I/O priorities, especially for writing; its lack of seek awareness to give space to other applications; and its lack of a file-change detection mechanism which propagates up directories and persists across reboots.

This has really added to my urge to write my vision of a database engine (I've been thinking a lot about techniques), now that I see there's so much room for improvement in performance for some applications. But that's not going to happen for quite a while. (Tracker would be a great test case for it :-)

I'd be very happy if the ideas I've mentioned made a big difference to index writing:

1. Try Oracle Berkeley DB; use the latest stable source from their site, just to be sure of the best performance test.

2. Try the separate-process SQL servers: Firebird, PostgreSQL, MySQL with various backends, or anything else which is free enough and claims speed.

3. Try batching and sorting (by index order) writes in memory, and writing them to a B-tree format file (ensuring the B-tree order is identical to the batch sort order, including locale/case issues), not a hash format file. (E.g. I know Berkeley DB offers both options.) Don't be afraid to use a large batch in memory: a few megabytes or tens of megabytes, especially to give the test a chance. Still call fdatasync() every 1000-5000 words during the batch write (i.e. after it's sorted), to minimise ext3 elevator screwups: the amount of unwritten index entries in the _kernel_ should always remain limited.

4. Try writing sequential indexes (i.e. just appending each word+data to the small index files), not using a hash/B-tree, and index-merging those. Use a two-level merging scheme, so the sequential indexes are limited in size. This can be combined with suggestion 3, with memory batch sorting to a B-tree being used by the merging step.

Some food for thought. Please let me know yours :-)

Sorry if it seems like there's always more to do. I do think you might be pleasantly surprised by the writing improvements, but I don't have enough experience to guarantee it (just lots of theory...).

-- Jamie