Overlapping autopacks can fail with NoSuchFile when obsoleting packs

Bug #507557 reported by Gareth White
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
Low
John A Meinel
2.0
Fix Released
Low
John A Meinel

Bug Description

If two autopacks overlap while obsoleting the pack files, one of them will fail with a NoSuchFile error. This affects both bzr 2.0.4dev and 2.1.0dev5 on both Windows XP and OS X.

The attached shell script reproduces it intermittently, but you may also run into bug 507553 as well.

I can reproduce it every time as follows:
1. In bzrlib/repofmt/groupcompress_repo.py make these changes:
    (a) add "import pdb" at the top of the file
    (b) add "pdb.set_trace()" at the top of _create_pack_from_packs()
    (c) add "pdb.set_trace()" just before the call to self._obsolete_packs() near the end of _execute_pack_operations()
2. Create a dummy repository:
    bzr init-repo testrepo
    bzr init testrepo/branch1
    bzr branch testrepo/branch1 testrepo/branch2
3. Commit until it breaks into the debugger (should be the 10th time):
    bzr commit -m "test" --unchanged testrepo/branch1
4. In a separate window, commit on the other branch:
    bzr commit -m "test" --unchanged testrepo/branch2
5. Both windows should now be in the debugger and stopped at _create_pack_from_packs
6. In the first window type "c" to continue.
7. Repeat in the second window.
8. Both windows should now be at the breakpoint just before "obsolete_packs"
9. In the first window type "c" to continue. The commit should succeed.
10. In the second window type "c". The commit will fail with an error such as:
bzr: ERROR: No such file: u'/Users/garethw/Programming/Bazaar/autopack2/testrepo/.bzr/repository/packs/f53c7f81e94222323d6f7ccbc2506e62.pack': [Errno 2] No such file or directory

Full traceback:
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commands.py", line 853, in exception_to_return_code
    return the_callable(*args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commands.py", line 1055, in run_bzr
    ret = run(*run_argv)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commands.py", line 661, in run_argv_aliases
    return self.run_direct(**all_cmd_args)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commands.py", line 665, in run_direct
    return self._operation.run_simple(*args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/cleanup.py", line 122, in run_simple
    self.cleanups, self.func, *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/cleanup.py", line 156, in _do_with_cleanups
    result = func(*args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/builtins.py", line 3120, in run
    exclude=safe_relpath_files(tree, exclude))
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/decorators.py", line 194, in write_locked
    result = unbound(self, *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/workingtree_4.py", line 197, in commit
    result = WorkingTree3.commit(self, message, revprops, *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/decorators.py", line 194, in write_locked
    result = unbound(self, *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/mutabletree.py", line 225, in commit
    *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commit.py", line 257, in commit
    possible_master_transports=possible_master_transports)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/cleanup.py", line 118, in run
    self.cleanups, self.func, self, *args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/cleanup.py", line 156, in _do_with_cleanups
    result = func(*args, **kwargs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/commit.py", line 402, in _commit
    self.rev_id = self.builder.commit(self.message)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repository.py", line 179, in commit
    self.repository.commit_write_group()
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repository.py", line 1563, in commit_write_group
    result = self._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/pack_repo.py", line 2266, in _commit_write_group
    hint = self._pack_collection._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/pack_repo.py", line 2126, in _commit_write_group
    result = self.autopack()
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/pack_repo.py", line 1476, in autopack
    return self._do_autopack()
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/pack_repo.py", line 1516, in _do_autopack
    reload_func=self._restart_autopack)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/groupcompress_repo.py", line 714, in _execute_pack_operations
    self._obsolete_packs(packs)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/repofmt/pack_repo.py", line 1792, in _obsolete_packs
    '../obsolete_packs/' + pack.file_name())
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/transport/local.py", line 408, in rename
    self._translate_error(e, path_from)
  File "/Users/garethw/Programming/Bazaar/bzr-repo/bzr/bzrlib/transport/__init__.py", line 308, in _translate_error
    raise errors.NoSuchFile(path, extra=e)
NoSuchFile: No such file: u'/Users/garethw/Programming/Bazaar/autopack2/testrepo/.bzr/repository/packs/f53c7f81e94222323d6f7ccbc2506e62.pack': [Errno 2] No such file or directory

Related branches

Revision history for this message
Gareth White (gwhite-deactivatedaccount) wrote :
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 507557] [NEW] Overlapping autopacks can fail with NoSuchFile when obsoleting packs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> Public bug reported:
>
> If two autopacks overlap while obsoleting the pack files, one of them
> will fail with a NoSuchFile error. This affects both bzr 2.0.4dev and
> 2.1.0dev5 on both Windows XP and OS X.
>
> The attached shell script reproduces it intermittently, but you may also
> run into bug 507553 as well.
>

As mentioned, similar in form to bug #507553, and thus getting the same
basic triaging.

 status: confirmed
 importance: low

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktPTyAACgkQJdeBCYSNAANtQgCgxbHq1rbxTZ5KXfjCr/Bf449N
K6UAoIMcjVflUxjSxgMd59sQ7GCfT5fs
=IZbC
-----END PGP SIGNATURE-----

Changed in bzr:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Gareth White (gwhite-deactivatedaccount) wrote :

I've done some more investigation into this issue. Here's what I've found so far:

1. The first 9 commits each create a single-revision pack (call them p1-p9) and add them to pack-names.
2. The 10th commit creates p10 but does not yet add it to pack-names.
3. The 11th commit (on a different branch) creates p11 but does not yet add it to pack-names.
4. We're now at step 5 in the description above, where both processes are stopped at _create_pack_from_packs(). There are 11 single-revision packs on disk (p1-p11), but only p1-p9 are in pack-names at this stage.
5. After the first process has continued up to _obsolete_packs(), there are 12 packs on disk. These are p1-p11 plus a new 10-revision pack (call it q1) which contains the revisions in p1-p10. At this stage pack-names has been updated to contain just q1.
6a. If the first process now continues, p1-p10 will be moved into obsolete_packs, leaving just q1 and p11. The second process will then try to create a new 10-rev pack, fail because it can't find p1-p9, retry the autopack, determine nothing needs doing and instead just add p11 to pack-names. Both commits succeed and everyone is happy.
6b. If the second process continues *before* the first process calls obsolete_packs() (as in the example at the start), things go a bit differently. A second 10-rev pack (q2) is created on disk containing p1-p9 plus p11, and pack-names now references q1 and q2.
7. At this stage on disk there are p1-p11, q1 and q2.
8. The first process obsoletes p1-p10, leaving just p11, q1 and q2.
9. The second process tries to obsolete p1-p9 plus p11 but fails because p1-p9 aren't there.
10. The second commit fails, leaving p11 "orphaned" on disk. The revisions in p1-p9 are now in two separate pack files (q1 & q2), both of which are referenced in pack-names.

Some possible solutions:
1. If it's OK to have some revisions duplicated in different packs then we could:
    (a) make obsolete_pack() ignore NoSuchFile and just assume someone else had already obsoleted the file, or
    (b) be smarter about which packs we try to obsolete. We may be able to determine this in _diff_pack_names() by calculating the intersection of deleted_nodes and disk_nodes.
2. If it's not OK then we could somehow detect that revisions were being duplicated and back out the second autopack. Perhaps _save_pack_names() could fail if packs it was expecting to remove had already been removed from pack-names.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 507557] Re: Overlapping autopacks can fail with NoSuchFile when obsoleting packs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...

> Some possible solutions:
> 1. If it's OK to have some revisions duplicated in different packs then we could:
> (a) make obsolete_pack() ignore NoSuchFile and just assume someone else had already obsoleted the file, or
> (b) be smarter about which packs we try to obsolete. We may be able to determine this in _diff_pack_names() by calculating the intersection of deleted_nodes and disk_nodes.

It is ok. So we can just skip the removal. One other alternative...
before we delete the files, I believe we check the 'obsolete_packs'
directory for any files and delete them. I would recommend that if we
have a file that is queued for obsoleting, but is present in
obsolete_packs, then we should

1) remove it from the 'to-obsolete' queue
2) remove it from the 'to-delete' queue

The reason we wait to delete is to make sure the os has time to actually
sync changes to disk. Since we would have marked it obsolete
'right-now', we can presume that someone else also made that decision
'right now', and we should put off deleting the file until later.

We should still handle NoSuchFile, but it should make the window of
things we try to obsolete smaller.

> 2. If it's not OK then we could somehow detect that revisions were being duplicated and back out the second autopack. Perhaps _save_pack_names() could fail if packs it was expecting to remove had already been removed from pack-names.
>

The repository is designed to scale to multiple concurrent writers. (We
only have one small pack-names file whose writes must be serialized,
rather than all pack files.) As such, the design has to be tolerant of
data being present multiple times. (If I push a branch with revs 1 and
2, and you push a branch with revs 1 and 3, 1 can easily get pushed 2
times.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktUqBIACgkQJdeBCYSNAAOTewCfUxcAjafoKmqmuWQx2ErxJe7Q
ehAAni1G1JufpiwTwf0kXlnNoN72KMNx
=WQxY
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote :

On Mon, 2010-01-18 at 18:27 +0000, John A Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> ...
>
> > Some possible solutions:
> > 1. If it's OK to have some revisions duplicated in different packs then we could:
> > (a) make obsolete_pack() ignore NoSuchFile and just assume someone else had already obsoleted the file, or
> > (b) be smarter about which packs we try to obsolete. We may be able to determine this in _diff_pack_names() by calculating the intersection of deleted_nodes and disk_nodes.
>
> It is ok. So we can just skip the removal. One other alternative...
> before we delete the files, I believe we check the 'obsolete_packs'
> directory for any files and delete them. I would recommend that if we
> have a file that is queued for obsoleting, but is present in
> obsolete_packs, then we should
>
> 1) remove it from the 'to-obsolete' queue
> 2) remove it from the 'to-delete' queue
>
> The reason we wait to delete is to make sure the os has time to actually
> sync changes to disk. Since we would have marked it obsolete
> 'right-now', we can presume that someone else also made that decision
> 'right now', and we should put off deleting the file until later.
>
> We should still handle NoSuchFile, but it should make the window of
> things we try to obsolete smaller.

I like your logic here.

-Rob

Revision history for this message
Gareth White (gwhite-deactivatedaccount) wrote :

That does sound like a good way to go as it would actually solve two issues (the original bug, plus prematurely deleting obsoleted files).

Would you recommend adding a "packs_to_obsolete" member to this class? That would make it easy for _clear_obsolete_packs() to know which ones to skip deletion of, while at the same time allowing it to remove them from the list so they don't try to be obsoleted twice. An alternative is to try to pass in and return a list via _save_pack_names() but that seems a little messy.

As an aside, I'm a bit unclear why _clear_obsolete_packs() is called from within _save_pack_names(). This does mean that files in "obsolete_packs" are deleted only while pack-names is locked, yet _obsolete_packs() moves files into there while it's not locked. I'm sure there must be a good reason - if not, perhaps we could do the deletions closer to where we obsolete the files. That would make it easier to implement this bugfix.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> That does sound like a good way to go as it would actually solve two
> issues (the original bug, plus prematurely deleting obsoleted files).
>
> Would you recommend adding a "packs_to_obsolete" member to this class?
> That would make it easy for _clear_obsolete_packs() to know which ones
> to skip deletion of, while at the same time allowing it to remove them
> from the list so they don't try to be obsoleted twice. An alternative is
> to try to pass in and return a list via _save_pack_names() but that
> seems a little messy.
>
> As an aside, I'm a bit unclear why _clear_obsolete_packs() is called
> from within _save_pack_names(). This does mean that files in
> "obsolete_packs" are deleted only while pack-names is locked, yet
> _obsolete_packs() moves files into there while it's not locked. I'm sure
> there must be a good reason - if not, perhaps we could do the deletions
> closer to where we obsolete the files. That would make it easier to
> implement this bugfix.
>

Once pack-names has been written, those files are no longer referenced,
so we can rename them from 'packs' to 'obsolete_packs' at our leisure
(we don't need the lock).

We probably don't need the lock when deleting content from
obsolete_packs either, as it doesn't effect the correctness of the
operation.

As for a member variable... you also could just mutate the list that is
passed in.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktWHkYACgkQJdeBCYSNAAMJDQCfaMC1TA/3ZeXhLC4emZlwcjxL
QCoAn0tdtVFN3CocIb30BmVoOjP0sl8M
=beA9
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote :

On Tue, 2010-01-19 at 21:04 +0000, John A Meinel wrote:
>
> Once pack-names has been written, those files are no longer
> referenced,
> so we can rename them from 'packs' to 'obsolete_packs' at our leisure
> (we don't need the lock).
>
> We probably don't need the lock when deleting content from
> obsolete_packs either, as it doesn't effect the correctness of the
> operation.

We need it when deleting content so that two processes don't try to
delete the same file at the same time.

-Rob

Revision history for this message
Gareth White (gwhite-deactivatedaccount) wrote :

Robert Collins wrote:
> We need it when deleting content so that two processes don't try to
> delete the same file at the same time.

In that case, I wonder if we should also be doing the renaming within the lock too. Even if we skip renaming files that we find are already in "obsolete_packs", there's still a chance that two processes will try to rename a file at the same time. It will have to either ignore NoSuchFile when renaming or do it within a lock too.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> Robert Collins wrote:
>> We need it when deleting content so that two processes don't try to
>> delete the same file at the same time.
>
> In that case, I wonder if we should also be doing the renaming within
> the lock too. Even if we skip renaming files that we find are already
> in "obsolete_packs", there's still a chance that two processes will try
> to rename a file at the same time. It will have to either ignore
> NoSuchFile when renaming or do it within a lock too.
>

So actually, we could probably do a bit better.

Specifically, I think that at the time we go to obsolete the existing
packs, we would actually know that the packs are no longer referenced by
the repository. If I break down the timing...

1) Both repo A & B start an autopack, affecting files 1,2,3, trying to
   create a new file 4A and 4B respectively.
2) They finish reading all of 1,2,3 and write out 4A and 4B.
3) Repo A grabs the lock, and writes out that the new referenced list is
4A. Repo A succeeds, and starts moving 1, 2, 3 into the obsolete folder.
4) Repo B then grabs the lock and rereads pack-names. Finding that 1,2,3
are no longer referenced. However it still remembers that 1,2,3 became
4B. It writes out that the new pack-names should be 4A and 4B, and then
proceeds to try to rename 1,2,3.

I'm trying to decide between

5a) Repo B sees that 1,2,3 are no longer referenced, and thus removes
them from the 'to-be-rename-into-obsolete' list.

5b) Repo B tries to obsolete them anyway, but does not delete them from
obsolete packs.

The reason I question 5a is what the failure modes are. (If two clients
both try to obsolete certain files, but one of them gets interrupted,
then those files get left there indefinitely.)

Then again, there is a race condition with trusting
"list_dir('obsolete_packs')". Specifically, if Repo A has renamed 1 and
2, but not yet 3, then Repo B will list the directory, and find that it
still thinks 3 should be renamed. But before Repo B gets to it, Repo A
will have renamed it.

Anyway, I'm still sifting through it, but I think I've got some stuff
put together for this.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAktXRRgACgkQJdeBCYSNAANFSgCcC0kKwGOGqHAeSjRIpz7IzXE0
VXgAnA7fUJh6zagxK905nzV0RFSxGcwW
=vrQ+
-----END PGP SIGNATURE-----

Vincent Ladeuil (vila)
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
status: Confirmed → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

This has landed on trunk, and can also be landed into a 2.0 series release.

Changed in bzr:
milestone: none → 2.1.0rc1
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.