Comment 4 for bug 720853

John A Meinel (jameinel) wrote:

I think the original fix was approximately OK; we just need to find a better place to put the check.

It would be nice if we could understand why, in bug #889872, we were succeeding in writing content but not producing enough output to make the code think anything had been written.

1) Could line be an empty string? No. It comes from _flatten_node, so at a minimum there are some \x00 bytes in there, even if the key and the references were all empty.
2) I can imagine that it could be exactly 2 bytes, though. Specifically, bytes_out_len is just what zlib.compressobj.compress() has decided it needs to write so far, and it usually buffers quite a bit (a few KB).
When I look at it:
  >>> import zlib
  >>> o = zlib.compressobj()
  >>> o.compress('abcdefghijklmnop')
  'x\x9c'
Those are the same bytes you get if you compress a different string there, so I'm guessing that is the 2-byte "this is a zlib compressed structure" header.
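If I extend that experiment a little (plain zlib, nothing bzr-specific; just a quick sanity check), two different inputs both produce only those same 2 bytes, and the real data only comes out once the compressor is flushed:

  import zlib

  for data in (b'abcdefghijklmnop', b'some completely different bytes'):
      c = zlib.compressobj()
      header = c.compress(data)
      # With CPython's defaults this is 'x\x9c' both times: just the header.
      print(repr(header))
      # Only a flush forces the buffered data out of the compressor.
      print(len(c.flush(zlib.Z_SYNC_FLUSH)))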

3) During normal commits, etc., we use a "soft" packing scheme, but during 'bzr pack' we do a "hard" packing scheme. The specific differences are:
    # Tuple of (num_repack_attempts, num_zsync_attempts)
    # num_zsync_attempts only has meaning if num_repack_attempts is 0.
    _repack_opts_for_speed = (0, 8)
    _repack_opts_for_size = (20, 0)
  a) That means during normal commit, we do:
     - If these bytes will fit without compression, accept them.
     - Once we fill the buffer, call zobj.flush(Z_SYNC_FLUSH) and re-estimate how many more bytes we can fit.
     - But only use Z_SYNC_FLUSH 8 times; after that, just consider the page FULL.
  b) However, during 'pack' we are much more aggressive:
     - Use a similar strategy to add content that we know will fit, then Z_SYNC_FLUSH to get a good estimate of how much compression we expect to get.
     - Once the buffer is 'almost full', take all the data so far and write it to a new zlib compressobj, then try to add just the new content and Z_SYNC_FLUSH it. That way, instead of having N Z_SYNC_FLUSH points consuming space in our stream, we only have the one at the end. (A rough sketch of this repack step is below.)
     - At that point, the new content may not fit. So we fall back: write the existing data to a compressed stream, mark it finalized, and return that we could not add the new content.
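To make that repack step concrete, here is a very rough sketch of the idea; PAGE_SIZE and fits_after_repack() are made-up names for illustration, and the real writer code is more involved than this:

  import zlib

  PAGE_SIZE = 4096   # hypothetical page size, purely for illustration

  def fits_after_repack(accepted_lines, new_line):
      # Repack everything into a brand-new compressobj and measure the result
      # with a single Z_SYNC_FLUSH at the very end, instead of leaving N
      # sync-flush points scattered through the stream.
      comp = zlib.compressobj()
      out_len = 0
      for line in accepted_lines:
          out_len += len(comp.compress(line))
      out_len += len(comp.compress(new_line))
      out_len += len(comp.flush(zlib.Z_SYNC_FLUSH))
      return out_len <= PAGE_SIZE

If that check fails, the writer keeps the page as it was, finalizes it, and reports that the new line did not fit, which is the "write() returned True" case that _add_key reacts to.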

So actually 3b) sounds like during 'bzr pack' we'll pretty much always get bytes_out_len == 2. Because we write content to the PAGE until we're absolutely sure we can't fit any more, and then we create a new zlib object to make it as small as possible. However, because we aren't calling Z_SYNC_FLUSH and the zlib buffer is probably bigger than the number of bytes we've written, the only bytes that have been output are the zlib header, and all the other bytes are sitting in the zlib buffer waiting for .finish() to be called and for us to flush() the compressor.
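Again nothing authoritative, but a tiny illustration of that state with made-up page content: repack into a fresh compressobj, don't flush, and the only output so far is the header.

  import zlib

  existing_page_data = b'\x00some\x00flattened\x00node\n' * 150   # made-up content
  fresh = zlib.compressobj()
  bytes_out_len = len(fresh.compress(existing_page_data))
  # With CPython's defaults this prints 2: only the zlib header has been
  # emitted; everything else stays buffered until we flush()/Z_SYNC_FLUSH.
  print(bytes_out_len)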

So what do I think would be better...

The code we are changing is in _add_key. It has an earlier check "rows[-1].writer is None". If that is True, then we are creating a new PAGE and the content is empty to start. So just do this:
=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py 2011-08-19 22:34:02 +0000
+++ bzrlib/btree_index.py 2011-11-14 09:54:20 +0000
@@ -294,7 +294,9 @@
             flag when writing out. This is used by the _spill_mem_keys_to_disk
             functionality.
         """
+        new_leaf = False
         if rows[-1].writer is None:
+            new_leaf = True
             # opening a new leaf chunk;
             for pos, internal_row in enumerate(rows[:-1]):
                 # flesh out any internal nodes that are needed to
@@ -320,6 +322,10 @@
                 optimize_for_size=self._optimize_for_size)
             rows[-1].writer.write(_LEAF_FLAG)
         if rows[-1].writer.write(line):
+            if new_leaf:
+                # We just created this leaf, and now the line doesn't fit.
+                # Clearly it will never fit, so punt.
+                raise errors.BadIndexKey(...)
             # this key did not fit in the node:
             rows[-1].finish_node()
             key_line = string_key + "\n"