Autopack fails with err 17 "File Exists" on windows

Bug #304023 reported by Adrian Wilkins
2
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
Undecided
Unassigned

Bug Description

Bazaar 1.9

Committing to a bound heavyweight checkout situated in a rich-root-pack repo, bound to a server branch in a rich-root-pack no-trees repo

Local repo is d:\workspace
Network repo is y:\repository

On autopack client returns

File exists: u'Y:/repository/.bzr/repository/pack-names': {Errno 17}File exists

Log reports
3113.989 Auto-packing repository <bzrlib.repofmt.pack_repo.RepositoryPackCollection object at 0x033F9550>, which has 25 pack files, containing 991 revisions into 19 packs.

The error is NOT reported in the log.

Manually packing repository :-

bzr pack y:\repository

Resolves this with no error.

Possible avenues of exlporation
 - opportunistic locks (on or off)
 - dangling file handles
 ? Who knows.

Revision history for this message
John A Meinel (jameinel) wrote :

So a question, are you directly bound to the Y:/ repository, or are you bound somehow through bzr+ssh:// or bzr://?

With bzr 1.9 as the client and server, I believe it will send an RPC request to do the auto-pack, rather than doing it from the client. Which would explain how you could see the error (for bzr+ssh it would be triggered by the remote bzr, and then shown via the ssh stderr connection), but it wouldn't be in the local log file, because it was actually written to the server's log file.

Then again, if that was happening, I wouldn't expect it to say "Y:/..." and I wouldn't expect to see the "Auto-packing ...." in the log file.

On windows, there can be some odd results if the process dies without flushing its files. Things seem to be buffered in userspace, so if you don't do ".flush()" it never actually writes anything to disk. (I believe on Linux the buffering is in kernel space, so when the process dies, the kernel does the final flush to disk.)

I will say that regardless, 3000s for an operation before it decides to auto-pack seems rather long.

Anyway, back to the bug. I would guess this is a dangling file handle which is triggered in the autopack code and not in the regular "pack" code (they share the bulk of the code, but it could be something outside of the actual packing.)

Alternatively, if it took 3000s to get to this point, is it possible that you were racing with some other user who pushed into the same repository? And then you are actually racing between 2 processes that are both trying to open/re-write the file?

Revision history for this message
John A Meinel (jameinel) wrote :

I should mention that we *do* take out a write lock before writing the file, so it is unlikely that 2 processes are writing the file at the same time. It would not be surprising to have 1 process *reading* the file, while 1 tries to replace it, though. As on POSIX that is legal, while on Windows it takes out an implicit path lock whenever a file is opened for reading.

The length of time that we spend reading the pack-names file should be really short, though. So it would need pretty precise timing to run into a problem.

If it is just a "one-client reading while another client writes", I wonder if we get an exception that we could take to mean "try again in 1s" rather than failing right away.

Revision history for this message
Adrian Wilkins (adrian-wilkins) wrote :

I think the "3000" is indicative of the process being kept alive to service a bzr-eclipse instance, although I've seen it occur at the command line also.

The checkout is directly bound to y:\ via the CIFS / SMB / Windows network share method.

A race is possible but the repository is only in use by ~5 people, and I get reports of this problem at approximately the intervals you might expect from autopacking ; anecdotally it seems far too frequent to be racing.

I would suspect opportunistic locking more strongly if we hadn't had a group policy update aimed at switching oplocks off, which I've verified manually by reading the registry on at least one affected machine.

Since all the checkouts are heavy, does the client even need to read pack-names on the server? The commonest usage scenario here is that each user is typically the sole user on a given branch, so they shouldn't need to be reading packs from the server very often, except when comitting - it should be possible to verify that there are no new revisions just by reading indices and revision lists (? is this true - I don't know enough to answer off the top of my head but it would seem to be intuitively sound).

Does the code move the original pack-names file before writing over it during a commit transaction? Would that work?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 304023] Re: Autopack fails with err 17 "File Exists" on windows

On Mon, 2008-12-01 at 22:18 +0000, Adrian Wilkins wrote:
>
> Since all the checkouts are heavy, does the client even need to read
> pack-names on the server? The commonest usage scenario here is that
> each
> user is typically the sole user on a given branch, so they shouldn't
> need to be reading packs from the server very often, except when
> comitting - it should be possible to verify that there are no new
> revisions just by reading indices and revision lists (? is this true -
> I
> don't know enough to answer off the top of my head but it would seem
> to
> be intuitively sound).

pack-names is the entry point to all the indices, it must be read before
reading any data from the repository. (Otherwise we don't know what
packs are present).

> Does the code move the original pack-names file before writing over it
> during a commit transaction? Would that work?

It should IIRC, but its worth checking.

My guess is a operation reading the pack-names file while another tries
to replace it.

-Rob

--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Adrian Wilkins (adrian-wilkins) wrote :

This happened for me again during a push, the file that provoked err 17 was "last-revision". The push failed and was left at revision 0. This was during a script, which then sucessfully bound the local branch to the empty branch on the server.

After pushing the branch manually, the "pack-names" problem then recurs. Packing manually resolves this again. This occurs at 10 packs on the server, it would seem.

Revision history for this message
Adrian Wilkins (adrian-wilkins) wrote :

Curiouser and curiouser .....

I rebranched most of the branches to a new repo to remove some cruft.

On my machine the following succeeds.

rm -r -for y:\repository\tools
bzr branch y:\repository_old\tools y:\repository\tools

On the other machine it fails with an ERR 17 (errno.EEXIST) on the target "last-revision" file, reliably.

I checked the oplocks settings and both machines are identical in that regard.

Comparing a filesystem trace from the two machines reveals that the name collision is not occuring on the actual last-revision file, but the temp filename used in AtomicFile. The name that AtomicFile uses is invariant - and it looks like on the machine that fails, one end of the client/server equation is convinced that this file still exists when a second AtomicFile creates it, even though it is renamed successfully in both traces.

row 459 in log - last-revision atomic file is renamed
row 629 in log - one machine succeeds at creating new atomic file, one machine fails with "NAME COLLISION"

I've added rand_chars(10) to the filename that AtomicFile generates, which seems to have no ill effects, and I'll test this change on the affected machine tomorrow to see if it cures the problem with no ill effects (the user concerned has gone home).

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 304023] Re: Autopack fails with err 17 "File Exists" on windows

On Tue, 2008-12-09 at 16:27 +0000, Adrian Wilkins wrote:
>
> I've added rand_chars(10) to the filename that AtomicFile generates,
> which seems to have no ill effects, and I'll test this change on the
> affected machine tomorrow to see if it cures the problem with no ill
> effects (the user concerned has gone home).
makes sense to me

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Adrian Wilkins (adrian-wilkins) wrote :

This change does cure the problem on affected machines.

I still have nothing but guesses why some machines are affected and some are not, but further diagnostic effort would I feel, ultimately be wasted - I'm fairly sure it would be a problem with something that I have no ability to change.

Patch sent to list.

Revision history for this message
Mark Hammond (mhammond) wrote :

A random thought is that between virus scanners and the MS indexing service, its not that uncommon these days to find a newly touched file can't be removed, as one of these background processes has detected the change and is scanning it. This could also account for differences in behaviour between machines (eg, one user may have disabled as much "real-time" scanning as she could find...)

Revision history for this message
Adrian Wilkins (adrian-wilkins) wrote :

This is a possibility, as the machine in question does typically have Indexing Services on, but as far as I'm aware, the network folder in question is not being catalogued.

The unaffected machine has the same virus scanner, but no indexing service. I'll try to reproduce the test with indexing services off on the affected machine if I get a chance.

Changed in bzr:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.