Aaron Bentley wrote:
> John A Meinel wrote:
>> Aaron Bentley wrote:
>>> This is hitting us any time we use bundles to create/update branches.
>>> It's a pretty high-frequency, high-irritation event.
>>> Aaron
>> so... is this happening when:
>
>> 1) applying a bundle to an existing stacked branch?
>
> After applying the bundle to the stacked branch, our branch puller is
> unable to pull the branch.
>

So if you follow the rest of my emails, it turns out that creating
bundles from --2a formats is just 100% broken, stacked or not (chk pages
are not being inserted into the bundle, and the raw text from
'repo.inventories' is not sufficient without them). But there are still
issues to be discussed with how "insert_revisions" is going to interact
with stacking.

>> If I was guessing, I could easily say that the bundle code:
>
>> 1) Doesn't contain the fulltext for inventories of parent revisions,
>
> Bundles aren't meant to contain fulltexts of any inventories.
>
>> nor does it contain the data for all texts that are referenced by the
>> inventories that it *does* have.
>
> It contains the data for the texts that are referenced by the
> inventories it does have, as long as the referenced texts are new. If
> they are present in the basis revision, they will not be included.
>
>> 2) Thus when creating a stacked branch from the bundle, we don't have
>> the parent inventories to *insert* into the new stacked repository.
>
> Why do we need the parent inventories?

Stacking requirements. The specific issues are:

1) When accessing a stacked branch via bzr+ssh, the *server process*
   does not have access to the backing repository. This was by design,
   and not just accidental fallout.

   a) Because of this choice, when *streaming* data from the smart
      server, the stacked branch needs enough information to generate a
      stream for all revisions that it has; the client will then ask the
      fallback directly for another stream for all the remaining data.

2) To stream the data for a set of revisions, we need the text keys that
   should be transmitted. This was changed recently from:

     set_of_file_keys_in_new.intersect(revision_ids_being_transmitted)

   to

     set_of_file_keys_seen_in_new.difference(
         set_of_file_keys_in_all_available_parents)

   This was done so that we could be more resilient in the face of
   ghosts, inventory inconsistencies, etc. (A small sketch of the two
   selections is below.)

   We have two things that we wanted to trust, and have found out we
   can't:

     file_key == (file_id, revision)   # always true
     ie = inventory(revision)[file_id]
     ie.revision == file_key[1]        # not always true

   This is the inventory inconsistency issue. The other is the ghosts
   issue: if file_key is in an inventory, then either

     file_revision == inventory.revision  # the text was in this revision

   or

     file_revision in set_of_revisions_being_transmitted

   The issue is with round-tripping and bzr-svn. Basically, if you do
   work in bzr and push that into svn, you end up with last-modified
   revision information stored in svn which points to a bzr revision_id
   that was *not* pushed to svn. (You merge a bzr change that modified
   foo in revision 'bar', creating revision 'dada'. You push that to
   svn, causing it to record a revision 'dada' that has a ghost merge
   parent 'bar' and a file_key of (file_id, 'bar'). Someone else pulls
   your svn revisions using bzr-svn, and they end up with a file_key
   mentioning a revision that they cannot have in their repository,
   since it wasn't in svn.)
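To make the change in text-key selection concrete, here is a minimal
sketch using plain Python sets instead of the real bzrlib data
structures; all the names and sample keys below are made up for
illustration, and the old behaviour is shown as I read it (keep a file
key only if its revision part is among the revisions being sent).

    # Hypothetical (file_id, revision) keys referenced by the inventories
    # being transmitted; ('foo-id', 'rev-bar') points at a ghost revision,
    # as in the bzr-svn round-tripping case above.
    file_keys_in_new = {
        ('foo-id', 'rev-bar'),
        ('baz-id', 'rev-dada'),
    }
    revision_ids_being_transmitted = {'rev-dada'}

    # Old selection: keep only keys whose revision is itself being sent.
    # The key pointing at the ghost 'rev-bar' is silently dropped, which
    # can leave the receiving repository without a text its inventories
    # reference.
    old_selection = {key for key in file_keys_in_new
                     if key[1] in revision_ids_being_transmitted}

    # New selection: send everything the new inventories reference that
    # the available parent inventories do not already reference.
    file_keys_in_all_available_parents = {('baz-id', 'rev-older')}
    new_selection = file_keys_in_new - file_keys_in_all_available_parents

    print(sorted(old_selection))  # [('baz-id', 'rev-dada')]
    print(sorted(new_selection))  # both keys, the ghost-referencing one too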
>> (Without going to the backing repository and pulling them out)
>
> add_mpdiffs reconstructs the fulltexts and calls
> VersionedFile.add_lines, so yes, it would hit the backing repository to
> regenerate the fulltexts. But it usually wouldn't try to insert
> inventories that were already present in the backing repository.
>

Sure. However, because of the stacking + streaming issues I just
mentioned, at some point we do need to insert the parent inventories.

Now, for xml formats we have a "race condition", which is:

1) If you insert a line-delta for the XML inventory into the stacked
   branch, then it only references newly introduced texts, and does not
   reference all possible texts. Thus even though we don't have the
   parent inventory, later when doing "streaming fetch" we don't try to
   send more data than we have available, because the rest just isn't
   referenced.

2) However, if you are at the end of a delta chain, etc., and you insert
   a "fulltext" for the XML inventory, then suddenly when doing
   streaming fetch from that stacked branch, it will see all these text
   keys that it thinks it needs to transmit but cannot find in the local
   repository. [and boom] (A small sketch of this difference is below,
   after the example.)

This was *the* major issue with the AbsentContentFactory tracebacks that
we saw with fetch in bzr 1.13. It seems that bundles are yet another
vector for injecting the same problems. (Lightweight checkouts and
commit are another one that I already know about that still isn't
fixed.)

...

>> This points the failure at the "insert data from a bundle" code,
>> causing it to generate invalid stacked repositories.
>
>> This would all be simplified quite a bit if we just made a bundle
>> effectively just a way to transfer the data for the stream that you
>> get out of StreamSource.
>
> I started work on that at the sprint.
>
>> There is a small caveat that the bundle *also* needs to think of
>> itself as not containing any data
>
> Pardon?

...

>> However at present we have a habit of setting submit_location to a
>> local mirror, and obviously that will generate the wrong information
>> about "what data do you have that I don't".
>
> If the head revision of the local mirror is accurate or stale, it may
> bundle more data than is strictly needed, but not too little.
>
> Aaron

My point was that the streaming code might not look at
Branch.last_revision() and might instead check "is revision in
Repository.all_revision_ids()". And if you have a local shared
repository, it *will* have the revisions you are trying to bundle.

As an example:

  bzr init-repo shared
  cd shared
  bzr init base
  bzr commit --unchanged base -m 1
  bzr commit --unchanged base -m 2
  bzr branch base new_work
  bzr commit --unchanged new_work -m 3
  cd new_work

  # outside the repository
  bzr push --stacked-on $shared/base ../../stacked
  ll ../../stacked/.bzr/repository/packs
  total 0

At this point, the new stacked branch has 0 data in it, because all of
the data was found in the repository that was available at
'shared/base'.

If we implemented bundles naively as just a stacked branch on top of the
local mirror branch (which shares a repository with the branch that we
are trying to bundle), then we would also get 0 data in the bundle (just
like there is 0 data in the stacked repository).

So we have to implement it as a stacked repository that is stacked on a
BackingRepository which only has access to data that is in the ancestry
of the branch we are stacking on. We do this directly in the current
bundle code (by using set ancestry differences, IIRC).
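Here is a minimal sketch of the delta-vs-fulltext issue from point 2)
above, using plain dicts of {file_id: last_modified_revision} to stand
in for inventories; the names and sample data are invented for
illustration and are not the real bzrlib API.

    parent_inventory = {'a-id': 'rev-1', 'b-id': 'rev-1'}
    child_inventory = {'a-id': 'rev-1', 'b-id': 'rev-2', 'c-id': 'rev-2'}

    def referenced_text_keys(inventory):
        # Every (file_id, revision) a fulltext inventory refers to.
        return {(file_id, rev) for file_id, rev in inventory.items()}

    def delta_text_keys(old, new):
        # Only the (file_id, revision) keys a line-delta would mention.
        return {(file_id, rev) for file_id, rev in new.items()
                if old.get(file_id) != rev}

    # A stacked repo that only holds the delta sees just the new texts:
    print(delta_text_keys(parent_inventory, child_inventory))
    # -> {('b-id', 'rev-2'), ('c-id', 'rev-2')}

    # If the child inventory was stored as a fulltext, streaming from the
    # stacked repo (with no access to the backing repo) appears to need
    # ('a-id', 'rev-1') as well, which it does not have -- hence the
    # AbsentContentFactory tracebacks.
    print(referenced_text_keys(child_inventory))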
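And a minimal sketch of the "set ancestry difference" restriction, with
a toy parent map instead of a real repository graph; everything here is
hypothetical and only meant to show which revisions end up in the
bundle.

    # Toy parent map: rev-3 is the new work in new_work, rev-2 is the tip
    # of the submit branch ($shared/base in the example above).
    parent_map = {
        'rev-1': (),
        'rev-2': ('rev-1',),
        'rev-3': ('rev-2',),
    }

    def whole_ancestry(tips):
        # Transitive closure of the given tips over the toy graph.
        seen, todo = set(), list(tips)
        while todo:
            rev = todo.pop()
            if rev not in seen:
                seen.add(rev)
                todo.extend(parent_map.get(rev, ()))
        return seen

    revisions_to_send = whole_ancestry(['rev-3']) - whole_ancestry(['rev-2'])
    print(revisions_to_send)  # {'rev-3'}, even though the shared repo
                              # physically contains rev-1 and rev-2 too

    # The idea in the next paragraph is then, roughly, to hand
    # revisions_to_send to the generic stream code (the get_stream() /
    # StreamSource machinery) so the bundle carries the same revisions,
    # inventories, chk pages and texts that a normal fetch of those
    # revisions would.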
Now, I *think* that if we use set_ancestry to find the revisions to
send, and then use the generic "get_stream()" for those revisions, it
will contain the right data. (It would contain the correct revisions,
and the correct inventories, etc.)

John
=:->