git: file ids are not very unique

Bug #351317 reported by Eric Anderson
This bug affects 3 people
Affects              Status    Importance   Assigned to   Milestone
Bazaar Git Plugin    Triaged   Wishlist     Unassigned
Breezy               Triaged   Medium       Unassigned

Bug Description

When trying to use subtree formats I cannot join in more than one git repository. To reproduce try the following:

$ mkdir test
$ cd test/
test$ bzr init --development-subtree
Created a standalone tree (format: development2-subtree)
test$ bzr branch git://github.com/harukizaemon/schema_validations.git schema_validations
Branched 4 revision(s).
test$ bzr join --reference schema_validations
test$ bzr branch git://github.com/harukizaemon/redhillonrails_core.git redhillonrails_core
Branched 1 revision(s).
test$ bzr join --reference redhillonrails_core
bzr: ERROR: Cannot join redhillonrails_core. Root id already present in tree

These git repositories are just used as an example because they are small and therefore quick to branch from. Any repo will show the same behavior. The problem is in the mapping: the root id is the same for all git repositories, namely the constant ROOT_ID. If I make the following change I can add a new repository as a subtree:

    def generate_file_id(self, path):
        # Git paths are just bytestrings
        # We must just hope they are valid UTF-8..
        assert isinstance(path, str)
        if path == "":
            # Append the suffix; ROOT_ID.join('-a') would yield '-' + ROOT_ID + 'a'
            return ROOT_ID + '-a'
        return escape_file_id(path)

    def parse_file_id(self, file_id):
        if file_id.startswith(ROOT_ID):
            return ""
        return unescape_file_id(file_id)

But then I am back to the same problem of not being able to join any more repositories. I can change the '-a' to '-b' (or anything not already used) and thereby work around the issue, but obviously this is not a solution.

I tried just appending a randomly generated string as the suffix, but within the joining process it seems generate_file_id needs to return the same value every time it is called. My next attempt was to append the current time, on the idea that the time is not likely to change within one joining operation but will differ between joining operations. This seems to work. My naive code for returning the root id is:

ROOT_ID + str(time.mktime(datetime.datetime.now().timetuple()))

I know nothing of Python and this was just borrowed from some site explaining how to get the current number of seconds since the unix epoch (I'm just a Ruby programmer so our stuff would just be ROOT_ID + Time.now.to_i.to_s). Anyway this obviously has two problems:

* It is possible that two repositories could be joined within the same second (via a script or something). Then we are back to our problem.
* It is also possible that joining a repo could span multiple seconds, meaning generate_file_id will not always return the same value within a joining operation, causing an error.
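
The two problems above can be illustrated with a small sketch. The clock value is passed in as a parameter so the behavior is reproducible; `timestamp_root_id` and the `ROOT_ID` value are hypothetical stand-ins, not the bzr-git code.

```python
ROOT_ID = "TREE_ROOT"  # assumed placeholder value, not taken from bzr-git


def timestamp_root_id(now_seconds):
    """Hypothetical root id: the constant plus whole seconds since the epoch."""
    return ROOT_ID + str(int(now_seconds))


# Problem 1: two different repositories joined within the same second
# receive the same root id, so the second join would still clash.
first_repo = timestamp_root_id(1000.2)
second_repo = timestamp_root_id(1000.9)
same_second_collision = first_repo == second_repo

# Problem 2: a single join that crosses a second boundary sees two
# different root ids for the same tree.
start_of_join = timestamp_root_id(1000.9)
end_of_join = timestamp_root_id(1001.1)
unstable_within_join = start_of_join != end_of_join
```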

But it seems to work well enough for my purposes until a real fix gets created. I would imagine the best thing to do would be to append a suffix based on the repo's URI (maybe hashed for fun). But from what I can tell the mapping object doesn't have any reference to the repo it is mapping, which makes that impossible unless we pass more info into the generate_file_id method.
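
A minimal sketch of the URI-based idea, assuming the repository URI could somehow be made available to the mapping. `root_id_for_uri` is a hypothetical helper and the `ROOT_ID` value is a placeholder; neither is part of bzr-git.

```python
import hashlib

ROOT_ID = "TREE_ROOT"  # assumed placeholder value, not taken from bzr-git


def root_id_for_uri(repo_uri):
    """Hypothetical helper: derive a stable root id from the repository URI."""
    digest = hashlib.sha1(repo_uri.encode("utf-8")).hexdigest()
    return ROOT_ID + "-" + digest


# The same URI always yields the same id; different URIs yield different ids.
id_a = root_id_for_uri("git://github.com/harukizaemon/schema_validations.git")
id_b = root_id_for_uri("git://github.com/harukizaemon/redhillonrails_core.git")
```

Unlike a timestamp, this value does not change between calls. It does tie the id to the URL, though, so the same repository fetched from a mirror or over a different protocol would get a different root id.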

description: updated
Jelmer Vernooij (jelmer) wrote :

The fundamental problem here is the way file ids in bzr-git are constructed at the moment; even if we fixed the tree root file id issue, the file ids for other paths in the tree would still clash (a README file would have the same file id if it existed in both trees).
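
As a rough illustration of that clash, assume a mapping that derives each file id from the path alone. `path_file_id` is a hypothetical stand-in for such a mapping and the `lib/` paths are made up for the example.

```python
def path_file_id(path):
    """Hypothetical mapping: the file id depends only on the path."""
    return path


tree_a = ["README", "MIT-LICENSE", "lib/schema_validations.rb"]
tree_b = ["README", "MIT-LICENSE", "lib/redhillonrails_core.rb"]

ids_a = {path_file_id(p) for p in tree_a}
ids_b = {path_file_id(p) for p in tree_b}

# File ids present in both trees even though the files are unrelated.
clashing = sorted(ids_a & ids_b)
```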

Generating proper file ids and following renames is something that we've delayed until the next mapping version, as it can be quite complex.

Changed in bzr-git:
importance: Undecided → Wishlist
status: New → Triaged
Jelmer Vernooij (jelmer) wrote :

Appending the current time does not work, as it means that if two people branch the same git repository they end up with different contents that conflict (but with the same revision ids).

Eric Anderson (eric-pixelwareinc) wrote :

You know a lot more about this than me so I'm going to trust you on all this.

But I did just want to note that with my "timestamp" hack on the tree root id it seems to work so far. I have different repositories joined in by reference. They both have a README file and an MIT-LICENSE file but I have not gotten any complaints from bzr. I am guessing this is because I am joining by reference. Maybe joining by reference only cares about the tree root id since each joined repo is still its own repository?

Anyway I'm sure I am doing this all wrong and am way out of my league. But that is what personal projects are for. Doing stupid stuff that you wouldn't do at work. :)

Thanks for all the hard work you have done on this so far. I'm basically hoping that the bzr-git bridge will be good enough that I can not be roped into using git like all the cool kids are doing in the Rails world. Bzr just fits my way of doing things so much better.

Jelmer Vernooij (jelmer) wrote :

It will become problematic when you create two independent imports of the git repository and then try to merge them; this will result in complaints from bzr.

Jelmer Vernooij (jelmer)
tags: added: next-mapping-format
Jelmer Vernooij (jelmer)
summary: - Cannot join by reference more than one repository
+ file ids are not very unique
Sergei Golubchik (sergii) wrote : Re: file ids are not very unique

I've got this issue too, trying to merge repositories and getting conflicts on COPYING, README, etc.

My solution was to append the revision id where the file was first seen (that is, the revision that has added the file). It's unique, and one can branch, uncommit, pull, fork git repositories, etc — and the file id will [supposedly] stay the same. Furthermore, it's probably even safe for roundtripping.

Here's a patch that implements it. I've only tested branch, uncommit, and pull; it is quite possibly incomplete and will break other use cases.
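
The patch itself is not reproduced here. As a rough sketch of the idea only, the following assigns each path an id built from the path and the revision that first added it. The function name, the `path@revision` format, and the revision ids are all hypothetical, and the sketch does not follow renames.

```python
def assign_file_ids(history):
    """Map each path in the final tree to a hypothetical 'path@revision' id.

    history is an ordered list of (revision_id, paths present in that revision).
    """
    ids = {}
    for revision_id, paths in history:
        present = set(paths)
        # Forget paths that were deleted in this revision.
        for path in list(ids):
            if path not in present:
                del ids[path]
        # Paths seen for the first time get an id tied to this revision.
        for path in present:
            if path not in ids:
                ids[path] = path + "@" + revision_id
    return ids


repo_a = [("rev-a1", ["README"]), ("rev-a2", ["README", "COPYING"])]
repo_b = [("rev-b1", ["README", "COPYING"])]

ids_a = assign_file_ids(repo_a)
ids_b = assign_file_ids(repo_b)
```

Because the id depends only on the history, two independent imports of the same repository would produce the same ids, while README in `repo_a` and README in `repo_b` get different ones.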

Sergei Golubchik (sergii) wrote :

patch

Sergei Golubchik (sergii) wrote :

for the record: "Thoughts on file ids" thread on the bazaar mailing list
https://lists.ubuntu.com/archives/bazaar/2011q2/072368.html

Jelmer Vernooij (jelmer)
Changed in brz-git:
status: New → Triaged
Jelmer Vernooij (jelmer)
Changed in brz-git:
importance: Undecided → Medium
Jelmer Vernooij (jelmer)
summary: - file ids are not very unique
+ git: file ids are not very unique
tags: added: git
affects: brz-git → brz