Bazaar

KnitRepository.insert_data_stream() copies data in improper order

Bug #205156 reported by John A Meinel on 2008-03-22

Affects		Status	Importance	Assigned to	Milestone
	Bazaar	Fix Released	Critical	Unassigned

Bug Description

Looking at the implementation of "Repository.insert_data_stream()" it is copying the inventory texts to disk as soon as they come in, rather than waiting for the texts to be written.
This is valid for pack repositories, because we will abort the whole pack if it is incomplete. But for Knit repositories if it is interrupted, then Bazaar believes everything was properly copied (because we used "if the Revision is present then the Inventory is, and if the Inventory is, then all Texts are.) We still assume that with packs, but when it is violated we don't add it to the list of pack-names.

This is critical as it makes ^C during a bzr+ssh:// push of knit repositories corrupt the target repo.

John A Meinel (jameinel) on 2008-03-22

Changed in bzr:
importance:	Undecided → Critical
status:	New → Triaged

Revision history for this message

John A Meinel (jameinel) wrote on 2008-03-23:

Someone should confirm the output stream sequence. My cursory double check seems to say that the data stream sent by the server should be reasonable.

Specifically, get_data_stream seems to be layered on top of Repository.item_keys_introduced_by() whose default implementation seems to stream the inventories after the file texts.

However, we still have some real-world cases where it seems to have streamed in the wrong order.

My quick test of creating a --knit with bzr.dev, and then committing a couple and manually stepping the 'get_data_stream()' iterator also says that we copy file texts first.

So we still need to figure out how ^C during a push could copy the inventories, but not the file texts.

Revision history for this message

Robert Collins (lifeless) wrote on 2008-03-24: Re: [Bug 205156] Re: KnitRepository.insert_data_stream() copies data in improper order

On Sun, 2008-03-23 at 17:32 +0000, John A Meinel wrote:
> Someone should confirm the output stream sequence. My cursory double
> check seems to say that the data stream sent by the server should be
> reasonable.
>
> Specifically, get_data_stream seems to be layered on top of
> Repository.item_keys_introduced_by() whose default implementation seems
> to stream the inventories after the file texts.
>
> However, we still have some real-world cases where it seems to have
> streamed in the wrong order.
>
> My quick test of creating a --knit with bzr.dev, and then committing a
> couple and manually stepping the 'get_data_stream()' iterator also says
> that we copy file texts first.
>
> So we still need to figure out how ^C during a push could copy the
> inventories, but not the file texts.

I suggest for the short term we assert in the smart server that the
output is correctly ordered. However in the long term we want pack to
pack operations to stream by packs not by fileid so this is
inappropriate; instead we need a capability flag from the client to say
that its ok not to be ordered.
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message

John A Meinel (jameinel) wrote on 2008-06-03:

What I do see is that we copy the 'revisions' data in non topological order. Which can create ghosts in the revisions.kndx if you ^C in the middle. It still doesn't explain what I saw 'in the wild' with inventories present but not file texts, but I do know there is at least one bug here.

Revision history for this message

John A Meinel (jameinel) wrote on 2008-06-05: