Overlapping autopacks can lead to unreachable revisions causing NoSuchRevision

Bug #507566 reported by Gareth White on 2010-01-14
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Bazaar
Critical
John A Meinel
2.0
Critical
John A Meinel

Bug Description

Occasionally when autopacks overlap you can get a branch into a state where subsequent commands will fail on that branch and "bzr check" reports errors in the repository.

I discovered this while trying to reproduce 495000, 507553 and 507557. At first I thought it only happened while I was doing stuff in the debugger but then I started getting it in normal usage too. The attached shell script will eventually reproduce it if you wait a while (you'll start seeing "NoSuchId" errors).

The method to reproduce is very similar to those other bugs:
1. At the very top of _create_pack_from_packs() in bzrlib/repofmt/groupcompress_repo.py, add the following:
        import pdb
        pdb.set_trace()
2. Create a dummy repository:
    bzr init-repo testrepo
    bzr init testrepo/branch1
    bzr branch testrepo/branch1 testrepo/branch2
3. Commit until it breaks into the debugger (should be the 10th time):
    bzr commit -m "test" --unchanged testrepo/branch1
4. In a separate window, commit on the other branch:
    bzr commit -m "test" --unchanged testrepo/branch2
5. Both windows should now be in the debugger.
6. In the first window type "c" to continue. The commit should finish successfully.
7. In the second window type "c". The commit will also succeed, but testrepo/branch2 is now broken.

If you try "bzr log testrepo/branch2" you get:
bzr: ERROR: Revision {<email address hidden>} not present in "CHKInventoryRepository('file:///Users/garethw/Programming/Bazaar/autopack2/testrepo/.bzr/repository/')".

If you do "bzr check testrepo" you get:
Checking working tree at '/Users/garethw/Programming/Bazaar/autopack2/testrepo/branch1'.
Checking branch at 'file:///Users/garethw/Programming/Bazaar/autopack2/testrepo/branch1/'.
Checking working tree at '/Users/garethw/Programming/Bazaar/autopack2/testrepo/branch2'.
Checking branch at 'file:///Users/garethw/Programming/Bazaar/autopack2/testrepo/branch2/'.
Checking repository at 'file:///Users/garethw/Programming/Bazaar/autopack2/testrepo/'.
bzr: ERROR: bzrlib.errors.NoSuchRevision: CHKInventoryRepository('file:///Users/garethw/Programming/Bazaar/autopack2/testrepo/.bzr/repository/') has no revision ('<email address hidden>',)

Also, if you try this procedure on a branch that has files in it and use non-empty commits they will fail with:
aborting commit write group: NoSuchId(The file id "tree_root-20100114171311-f4misndhza0v4tlm-1" is not present in the tree <bzrlib.inventory.CHKInventory object at 0x12d9f10>.)
bzr: ERROR: The file id "tree_root-20100114171311-f4misndhza0v4tlm-1" is not present in the tree <bzrlib.inventory.CHKInventory object at 0x12d9f10>.

I'm guessing that one or more pack files have been "orphaned" somehow.

Tags: 2a Edit Tag help

Related branches

I just realised "Repository corruption" may sound a bit alarmist; feel free to change it to a more appropriate term.

Martin Pool (mbp) on 2010-01-15
Changed in bzr:
status: New → Confirmed
importance: Undecided → Critical
Martin Pool (mbp) on 2010-01-15
summary: - Overlapping autopacks can lead to repository corruption
+ Overlapping autopacks can lead to unreachable revisions causing
+ NoSuchRevision
Martin Pool (mbp) on 2010-01-15
tags: added: 2a
John A Meinel (jameinel) on 2010-01-15
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
John A Meinel (jameinel) wrote :

I can confirm this behavior with bzr/2.0 tip and with bzr.dev tip.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John A Meinel wrote:
> I can confirm this behavior with bzr/2.0 tip and with bzr.dev tip.
>

I think I've tracked down the cause, I haven't tracked down an
appropriate fix.

Specifically it has something to do with us having written the pack file
down and gotten it 'into the mix' (that may just be in self._names).

The issue is that if we trigger "reload_pack_names()" as a side-effect
it sets "self._packs_at_load".

Let me try a walk through...

1) branch 1 starts a commit, creates a new pack file on disk renames it
into final position, writes the indices, etc. At this point
self._names[] includes that named file, but .bzr/repository/pack-names
does not.

2) branch 2 does the same
3) branch 1 takes out the disk lock to update the pack-names file, it
then reads the pack-names file to see if anyone has inserted something
since it last checked, and does a 3-way diff comparing:
  BASE = self._packs_at_load
  THIS = self._names
  OTHER = the newly read pack-names file

4) branch 1 then writes out a new pack-names file, (after autopacking,
and deleting some files, etc.)

5) At this point branch 2 goes to autopack, but finds that a file is
missing, so it calls self.reload_pack_names(). As a side effect, this
sets self._packs_at_load to the current 'disk_nodes' but this *includes*
new_nodes.

6) the new diff doesn't think the new nodes are still new, so it doesn't
try to write them to the index file.

...

I'm missing something. The state I see right now is that:

self._names => 7 items
but
self._diff_pack_names returns

disk_nodes => 6 items (missing 75b...)
new_nodes => empty
deleted_nodes => empty

I really don't understand how we can read from disk, and then add
everything in self._names, and come up with less than 7 items...

Anyway more investigation needed, but current signs point to an issue
with
bzrlib.repofmt.RepositoryPackCollection.{_diff_pack_names,_save_pack_names}

I'm also curious if we have a bug based on the new unlocking exception
supression. This is less likely, but maybe the commit isn't supposed to
succeed, but it is going through an unlock phase which suppresses that
exception, and tells the branch that it is ok to use the new revision_id....

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktRLjgACgkQJdeBCYSNAAOhrACdEnhm+RsAD+kDIgZNSQb/05U1
idgAn2QtU6PjUKiC5KwQLHS+iC25wYHV
=QvFI
-----END PGP SIGNATURE-----

John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> I'm missing something. The state I see right now is that:
>
> self._names => 7 items
> but
> self._diff_pack_names returns
>
> disk_nodes => 6 items (missing 75b...)
> new_nodes => empty
> deleted_nodes => empty
>
> I really don't understand how we can read from disk, and then add
> everything in self._names, and come up with less than 7 items...
>

The issue is in the comparisons in _diff_pack_names

It reads the actual disk index {which does not include the new file}
It creates 'new_nodes' from self._names {which does}
It then computes
        deleted_nodes = self._packs_at_load - current_nodes
        new_nodes = current_nodes - self._packs_at_load

However self._packs_at_load *also* contains the new file.
And thus _packs_at_load - current_nodes == empty() and current - at_load
= empty().

And thus nothing gets added to, or removed from, the disk index.

A potential fix is this, I think:
=== modified file 'bzrlib/repofmt/pack_repo.py'
- --- bzrlib/repofmt/pack_repo.py 2010-01-04 06:56:46 +0000
+++ bzrlib/repofmt/pack_repo.py 2010-01-16 03:20:50 +0000
@@ -1991,8 +1991,8 @@
         if first_read:
             return True
         # out the new value.
- - disk_nodes, _, _ = self._diff_pack_names()
- - self._packs_at_load = disk_nodes
+ disk_nodes, _, new_nodes = self._diff_pack_names()
+ self._packs_at_load = disk_nodes - new_nodes
         (removed, added,
          modified) =
self._syncronize_pack_names_from_disk_nodes(disk_nodes)
         if removed or added or modified:

Namely, when we are reloading the pack names, *only* consider names that
are actually in the on-disk pack-names file to be part of
"_packs_at_load", don't include packs that are on disk but not yet
referenced.

On the plus side, if anyone actually hit this, the pack file is still on
disk, and just needs to be referenced from the pack-names file.

Robert, do you think you can follow this and confirm the logic? I'll try
to get a test written, etc.

Note that this isn't specific to 2a, as this is generic packing logic.
So it is true for --pack-0.92 => --2a. (Older versions of bzr probably
didn't suffer this because they didn't support concurrent autopacking
anyway. :)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktRMT0ACgkQJdeBCYSNAAMnlACfSy5DuVcsmT/wFWzczAYnLty+
fR0AoKLyaJExEx8lnV8DQYu27TzTskxG
=arss
-----END PGP SIGNATURE-----

Robert Collins (lifeless) wrote :

Yes, I think that your analysis is fine.

John A Meinel (jameinel) wrote :
Changed in bzr:
milestone: none → 2.1.0rc1
status: Confirmed → In Progress

I can confirm that I can no longer reproduce the bug with John's fix.

Thanks very much for filing such careful bug reports about these issues, Gareth!

John A Meinel (jameinel) on 2010-01-21
Changed in bzr:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments