Autopack fails with NoSuchFile error when committing concurrently

Bug #495000 reported by Gareth White on 2009-12-10
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Bazaar
Medium
John A Meinel
2.0
Medium
John A Meinel

Bug Description

When committing at the same time to different branches in a shared repository, the commit will occasionally fail with an error such as:
aborting commit write group: NoSuchFile(No such file: 'C:/code/testrepo/.bzr/repository/packs/e5c831b49ebe63551656348d6b6ae931.pack')

See attached log for the full output.

It looks like both commits are trying to do the autopack and one is failing because the other has already done it. I've reproduced this with and without a smart server, but it seems easier to reproduce with a smart server. I've tested it with both Bazaar 2.0.1 and 2.1.0b3 using Python 2.5 on Windows XP SP3.

To reproduce:
bzr init-repo --no-trees testrepo
bzr init testrepo\branch1
bzr init testrepo\branch2
start bzr serve --allow-writes
bzr checkout --lightweight bzr://localhost/testrepo/branch1
bzr checkout --lightweight bzr://localhost/testrepo/branch2

In one window, "cd branch1" and repeat the following command many times:
commit -m "test" --unchanged

In another window, do the same thing in "branch2" at the same time.

(Note: This also happens even if committing real changes. We've noticed it just by having 3 developers using the same repository at the same time. It makes us worried about how it would scale to 30+.)

Related branches

description: updated

Has anyone been able to reproduce this issue yet? It's going to be a serious issue for us if we decide to switch to Bazaar.

summary: - NoSuchFile error when committing concurrently
+ Autopack fails with NoSuchFile error when committing concurrently

Another way to reproduce this error:
* Import a large tree into a new repository (so that there is only 1 revision)
* Create two branches
* Create a lightweight checkout of each branch
* Commit 8 times in one checkout
* Commit once in one checkout - this will trigger an autopack
* While the autopack is still in progress, commit in the other checkout
* One of them will then fail with the "NoSuchFile" error

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> Another way to reproduce this error:
> * Import a large tree into a new repository (so that there is only 1 revision)
> * Create two branches
> * Create a lightweight checkout of each branch
> * Commit 8 times in one checkout
> * Commit once in one checkout - this will trigger an autopack
> * While the autopack is still in progress, commit in the other checkout
> * One of them will then fail with the "NoSuchFile" error
>

Can you include the actual traceback? We certainly have code to handle
concurrent autopacking. It is possible that this regressed, or it is
possible that a platform is raising an unexpected exception which we
aren't handling.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksmxHsACgkQJdeBCYSNAAPq5QCfQgHbVmgJ30YR5NJ7n+jsveau
bPkAn1QTg+w/ZTx7h402Sir9f5MS+cj1
=F09+
-----END PGP SIGNATURE-----

I set up a new (treeless) shared repository and created a new branch. I then created a lightweight checkout and imported approx. 600 files, including some large ones so that the (uncompressed) tree size is about 250 MB. Note that this is not strictly necessary to reproduce the bug, but makes it easier as the autopack takes more time.

I then created a second branch from this first one and made another lightweight checkout. I committed 8 times in the first checkout (so that the total number of revisions in the repo is 9) and then committed in both checkouts simultaneously. The first one succeeded, but the second one failed with the NoSuchFile error.

The attached log is from the one that failed. The traceback is similar to the one I posted earlier using the smart server.

This is the log from the one that succeeded.

John A Meinel (jameinel) wrote :
Download full text (4.3 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gareth White wrote:
> I set up a new (treeless) shared repository and created a new branch. I
> then created a lightweight checkout and imported approx. 600 files,
> including some large ones so that the (uncompressed) tree size is about
> 250 MB. Note that this is not strictly necessary to reproduce the bug,
> but makes it easier as the autopack takes more time.
>
> I then created a second branch from this first one and made another
> lightweight checkout. I committed 8 times in the first checkout (so that
> the total number of revisions in the repo is 9) and then committed in
> both checkouts simultaneously. The first one succeeded, but the second
> one failed with the NoSuchFile error.
>
> The attached log is from the one that failed. The traceback is similar
> to the one I posted earlier using the smart server.
>
> ** Attachment added: "bzr.log"
> http://launchpadlibrarian.net/36812834/bzr.log
>

So your traceback seems a bit surprising. Namely, it looks like the
failure is occurring while renaming the pack files to the obsolete
directory. I have an idea what is going on, and why we only run into
this on Windows.

Specifically, the first autopack succeeds and finishes its operation. At
that point, though, the second autopack is running and is reading from
the files.

We try to rename the obsoleted pack files into another directory, but
the rename fails on Windows because the file is still open. (This
doesn't happen on POSIX systems because the inode abstraction means you
are allowed to do all sorts of things to open files, rename, delete,
etc, without bothering the process that has the file open.)

The code in question actually handles when a file it expected has been
renamed out of the way, but it doesn't handle when it fails to rename.
(So process 2 is perfectly fine if the file it was reading disappears,
process 1 doesn't not how to handle not being able to move the file.)

So we need to sort out what we want to do. We could

a) Leave the file alone. It has already become unreferenced garbage (we
don't try to rename the file until after we've removed the reference to
it.) It won't break anything to have it there, but it does waste disk space.

b) Try a short sleep and try-again. This won't fix all cases, and
eventually we should time out and stop trying. Actually, we probably
want to move on to renaming all the other files we can, and then come
back to trying to rename this one. That way the other process doesn't
hold open another pack file, and in fact, will probably trigger it to
stop what it is doing completely, freeing up this one.

We could do one loop, or 2, or 5, or X and then stop. I don't think we
want to try indefinitely.

It certainly can be identified after the fact that this is an
unreferenced file, though currently we don't have that action
specifically exposed.

c) Start referencing the file again. Which would help it get cleaned out
in a later autopack/pack cycle. I don't really like this, as the second
process is probably thinking about getting rid of it, and we are
preventing that process from realizing that it should stop trying.
(someone else already did the au...

Read more...

Martin Pool (mbp) on 2009-12-16
Changed in bzr:
status: New → Confirmed
importance: Undecided → Medium

2009/12/16 John A Meinel <email address hidden>:
> b) Try a short sleep and try-again. This won't fix all cases, and
> eventually we should time out and stop trying. Actually, we probably
> want to move on to renaming all the other files we can, and then come
> back to trying to rename this one. That way the other process doesn't
> hold open another pack file, and in fact, will probably trigger it to
> stop what it is doing completely, freeing up this one.
>
> We could do one loop, or 2, or 5, or X and then stop. I don't think we
> want to try indefinitely.
>
> It certainly can be identified after the fact that this is an
> unreferenced file, though currently we don't have that action
> specifically exposed.

I'd probably try that first, and also make it give a warning that it's
doing so, so that

1- people know why it's slowed down
2- it will come up in later bug reports, if there are any

--
Martin <http://launchpad.net/~mbp/>

Wouldn't the error be PermissionDenied if the file was in use during the rename? I thought what was happening was that the second autopack was attempting to move a file which had already been moved by the first autopack. Maybe I'm looking at the wrong bit of code. In any case, I guess any of these problems could happen if multiple autopacks were going on concurrently.

Is there some way it could detect an autopack was already in progress and then not attempt the second one? It seems like it would be difficult to handle all errors that could come up when two or more are running concurrently.

BTW, I was able to reproduce the problem on a larger repository on revisions other than the first 10, 100, 1000 etc. Even though those autopacks are the slowest by far (8 minutes+) the other multiples of 10 or 100 etc. also take a non-trivial amount of time. I suppose that in practice we would be pretty unlucky to get an overlap of two of these autopacks even if we were doing 100 commits a day, but it could happen.

(It would be great if you do change the autopack algorithm at some point to consider the size of the packs. Our first revision is almost 1 GB so we'd be pretty happy if it was never packed again at all! Alternatively if there was a way to disable autopacking altogether we could just schedule a manual pack overnight. Anyway, I'm getting a bit off topic now...)

On Wed, 2009-12-16 at 13:13 +0000, Gareth White wrote:
>
> autopack. Maybe I'm looking at the wrong bit of code. In any case, I
> guess any of these problems could happen if multiple autopacks were
> going on concurrently.

Uhm, all manipulation of the pack list file is done within a write lock.
all pack operations except 'reconcile' and data deletion are designed to
be safe to run concurrently. I realise there is obviously an issue, but
please take the time to get a trace of the actual syscalls causing the
issue.

-Rob

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> 2009/12/16 John A Meinel <email address hidden>:
>> b) Try a short sleep and try-again. This won't fix all cases, and
>> eventually we should time out and stop trying. Actually, we probably
>> want to move on to renaming all the other files we can, and then come
>> back to trying to rename this one. That way the other process doesn't
>> hold open another pack file, and in fact, will probably trigger it to
>> stop what it is doing completely, freeing up this one.
>>
>> We could do one loop, or 2, or 5, or X and then stop. I don't think we
>> want to try indefinitely.
>>
>> It certainly can be identified after the fact that this is an
>> unreferenced file, though currently we don't have that action
>> specifically exposed.
>
> I'd probably try that first, and also make it give a warning that it's
> doing so, so that
>
> 1- people know why it's slowed down
> 2- it will come up in later bug reports, if there are any
>

Yeah, I think that is a reasonable way to go.

I also wonder, though. I'm guessing we are getting "NoSuchFile" because
we are failing to properly translate an existing error to
PermissionDenied. But I'd have to dig further to confirm or deny that.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkso/KoACgkQJdeBCYSNAAN91QCg0X0ZR/b5LxvLA1hjJ43BjI0L
5sAAniVpHbfVR9QTvMAp5jREiafQ0oZu
=y742
-----END PGP SIGNATURE-----

John A Meinel (jameinel) wrote :
Download full text (3.4 KiB)

I just reproduced this locally, but got a very different traceback (and more the one I would expect / what I had hoped we had fixed).

Using lp:bzr/2.0 I inserted a 'import pdb; pdb.set_trace()' in _do_autopack just after it decides it needs to do one. I then ran 'bzr commit' until it stopped, once that happened I started another commit in the other window, let it stop at the same place, continued on the first, and the second failed with:
...
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\pack_repo.py", line 1472, in autopack
    return self._do_autopack()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\pack_repo.py", line 1513, in _do_autopack
    reload_func=self._restart_autopack)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\groupcompress_repo.py", line 691, in _execute_pack_operations
    result = packer.pack()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\pack_repo.py", line 760, in pack
    return self._create_pack_from_packs()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\groupcompress_repo.py", line 473, in _create_pack_from_packs
    self._copy_revision_texts()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\groupcompress_repo.py", line 411, in _copy_revision_texts
    'revisions', self._get_progress_stream, 1)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\groupcompress_repo.py", line 400, in _copy_stream
    reuse_blocks=False):
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\groupcompress.py", line 1663, in _insert_record_stream
    for record in stream:
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\repofmt\groupcompress_repo.py", line 207, in pb_stream
    for idx, record in enumerate(substream):
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\groupcompress.py", line 1423, in get_record_stream
    orig_keys, ordering, include_delta_closure):
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\groupcompress.py", line 1582, in _get_remaining_record_stream
    for factory in batcher.yield_factories(full_flush=True):
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\groupcompress.py", line 1146, in yield_factories
    block_read_memo, block = blocks.next()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\groupcompress.py", line 1379, in _get_blocks
    zdata = raw_records.next()
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\knit.py", line 3364, in get_raw_records
    reader = pack.make_readv_reader(transport, path, offsets)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\pack.py", line 221, in make_readv_reader
    transport.readv(filename, readv_blocks)))
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\transport\__init__.py", line 656, in readv
    return self._readv(relpath, offsets)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\transport\__init__.py", line 668, in _readv
    fp = self.get(relpath)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\transport\local.py", line 167, in get
    self._translate_error(e, path)
  File "C:\Users\jameinel\dev\bzr\swork\bzrlib\transport\__init__.py", line 306, in _translate_error
    raise errors.NoSuchFile(path, extra=e)
NoSuchFile: No such file: u'C:/Users/jameinel/dev/,tmp/remote-mem-test/.bzr/repository/packs/54548bdc2dfdbe29ff0a46e298e4fd38.pack...

Read more...

John A Meinel (jameinel) wrote :

So it looks like the issue is that NoSuchFile is getting propagated all the way to the top, rather than getting caught by _DirectPackAccess.get_raw_records() and being translated into: errors.RetryWithNewPacks.
And I'm guessing that is the case because:
                if self._reload_func is None:
                    raise

Which means that _DirectPackAccess isn't getting the reload function that _do_autopack is passing down. I think part of the reason for that is because of how we changed the packer code functionality in the 2.0 switchover. Namely it now goes via the get_record_stream() abstraction. Which does have its own retry functionality, but expects self._reload_func to be set before it raises RestartWithNewPacks.

Still investigating.

John A Meinel (jameinel) wrote :

The associated branch has a one-line fix for this, but I have to figure out how we can get it tested properly.

I think the issue is that testing this requires some amount of concurrency. The existing restart tests monkeypatch functions to make sure that the files are removed at appropriate times.

Also, the 'gc_repo.texts.get_record_stream()' code *does* do the right thing, because it gets a reload function from

self.revision_index = AggregateIndex(self.reload_pack_names, flush)

The failure was that GCCHKPacker creates temp versioned_file instances that only reference the pack files we are going to be repacking. And when doing so, it wasn't setting the reload function.

Still trying to sort out a tasteful way for how to test it.

John A Meinel (jameinel) wrote :

Robert mentions using a tracing/triggering transport that inspects the VFS actions to do X, and then have it trigger another repository to pack while X is occurring.

John A Meinel (jameinel) wrote :
Changed in bzr:
status: Confirmed → Fix Committed

Thanks for looking into this. I suspect there may be a couple more causes of the error though, as I can still reproduce it with the fix. I've been testing it on my OS X machine at home and get the same errors as I was seeing on Windows at work. The attached shell script does 100 commits concurrently on two branches, and the commits fail intermittently with the NoSuchFile error. There seem to be at least three different variants: the one that's now fixed (see traceback above), a similar one with the the index files, and a different one while renaming the pack files. I'll include the tracebacks below.

Download full text (11.9 KiB)

This is the error with one of the index files:

0.158 bzr arguments: [u'commit', u'-m', u'dummy revision', u'--unchanged', u'testrepo/branch1']
0.197 looking for plugins in /Users/garethw/.bazaar/plugins
0.197 looking for plugins in /Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/plugins
0.477 looking for plugins in /Library/Python/2.5/site-packages/bzrlib/plugins
0.478 Plugin name launchpad already loaded
0.478 Plugin name netrc_credential_store already loaded
0.807 encoding stdout as sys.stdout encoding 'UTF-8'
0.981 opening working tree '/Users/garethw/Programming/Bazaar/testrepo/branch1'
1.051 preparing to commit
[ 1593] 2009-12-19 12:40:48.278 INFO: Committing to: /Users/garethw/Programming/Bazaar/testrepo/branch1/
1.138 Selecting files for commit with filter None
1.551 Auto-packing repository <bzrlib.repofmt.groupcompress_repo.GCRepositoryPackCollection object at 0x1388950>, which has 11 pack files, containing 11 revisions. Packing 10 files into 1 affecting 10 revisions
1.563 repacking 10 revisions
1.611 repacking 10 inventories
1.676 repacking chk: 1 id_to_entry roots, 1 p_id_map roots, 336 total keys
6.132 repacking 1224 texts
226.291 aborting commit write group because of exception:
226.342 Traceback (most recent call last):
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/commit.py", line 375, in commit
    self.rev_id = self.builder.commit(self.message)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repository.py", line 175, in commit
    self.repository.commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repository.py", line 1554, in commit_write_group
    result = self._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 2268, in _commit_write_group
    hint = self._pack_collection._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 2122, in _commit_write_group
    result = self.autopack()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 1472, in autopack
    return self._do_autopack()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 1512, in _do_autopack
    reload_func=self._restart_autopack)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/groupcompress_repo.py", line 692, in _execute_pack_operations
    result = packer.pack()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 760, in pack
    return self._create_pack_from_packs()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/groupcompress_repo.py", line 478, in _create_pack_from_packs
    self._copy_signature_texts()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/groupcompress_repo.py", line 464, in _copy_signature_texts
    signature_keys = source_vf.keys()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win3...

Download full text (6.8 KiB)

This is the error while renaming (obsoleting) a pack file:

0.102 bzr arguments: [u'commit', u'-m', u'dummy revision', u'--unchanged', u'testrepo/branch1']
0.119 looking for plugins in /Users/garethw/.bazaar/plugins
0.119 looking for plugins in /Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/plugins
0.274 looking for plugins in /Library/Python/2.5/site-packages/bzrlib/plugins
0.275 Plugin name launchpad already loaded
0.275 Plugin name netrc_credential_store already loaded
0.479 encoding stdout as sys.stdout encoding 'UTF-8'
0.578 opening working tree '/Users/garethw/Programming/Bazaar/testrepo/branch1'
0.611 preparing to commit
[ 1783] 2009-12-19 12:57:01.012 INFO: Committing to: /Users/garethw/Programming/Bazaar/testrepo/branch1/
0.644 Selecting files for commit with filter None
0.954 Auto-packing repository <bzrlib.repofmt.groupcompress_repo.GCRepositoryPackCollection object at 0x1388950>, which has 15 pack files, containing 190 revisions. Packing 13 files into 1 affecting 13 revisions
0.957 repacking 13 revisions
0.997 repacking 13 inventories
1.029 repacking chk: 1 id_to_entry roots, 1 p_id_map roots, 0 total keys
1.037 repacking 0 texts
1.044 repacking 0 signatures
1.097 aborting commit write group because of exception:
1.110 Traceback (most recent call last):
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/commit.py", line 375, in commit
    self.rev_id = self.builder.commit(self.message)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repository.py", line 175, in commit
    self.repository.commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repository.py", line 1554, in commit_write_group
    result = self._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 2268, in _commit_write_group
    hint = self._pack_collection._commit_write_group()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 2122, in _commit_write_group
    result = self.autopack()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 1472, in autopack
    return self._do_autopack()
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 1512, in _do_autopack
    reload_func=self._restart_autopack)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/groupcompress_repo.py", line 710, in _execute_pack_operations
    self._obsolete_packs(packs)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/repofmt/pack_repo.py", line 1788, in _obsolete_packs
    '../obsolete_packs/' + pack.file_name())
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/transport/local.py", line 408, in rename
    self._translate_error(e, path_from)
  File "/Users/garethw/Programming/Bazaar/2.0.4-495000-win32-autopack/bzrlib/transport/__init__.py", line 306, in _translate_error
    raise errors.NoSuchFile(path, extra=e)
NoSuchFile: No such file: u'/Users...

Read more...

Although the main source of this bug has been fixed I can still reproduce the two errors above using the attached shell script. Should I reopen the bug or create a new one?

2010/1/13 Gareth White <email address hidden>:
> Although the main source of this bug has been fixed I can still
> reproduce the two errors above using the attached shell script. Should I
> reopen the bug or create a new one?

If you think it has a different cause (and it sounds like it does)
please open a new bug, and mention this one there and vice versa.
Thanks!

--
Martin <http://launchpad.net/~mbp/>

See 507553, 507557 and 507566 for the related bugs.

John A Meinel (jameinel) on 2010-01-21
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.1.0rc1
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers