export could take unmodified files from wt rather than repository

Bug #343218 reported by Marek Aaron Sapota on 2009-03-15
Affects                      Importance   Assigned to
Bazaar                       Low          John A Meinel
Gentoo Overlay for Bazaar    Undecided    Mark Lee

Bug Description

It seems that bzr.eclass, instead of copying the local directory to the workdir, fetches a new one from upstream. I'm not sure, but it may be related to the lightweight checkout - some time ago I think a branch was used (I used to do bzr pull, now it is bzr up to see if anything changed) and it performed much better.

Related branches

Mark Lee (malept) wrote :

Right, I changed it to use lightweight checkouts to save on disk space, not network fetch time.

René 'Necoro' Neumann (necoro) wrote :

But normally disk space is not the issue - install time is. I'd vote for reverting it to use branches. (Perhaps one could create branches without working trees to save a little space, if really wanted.)

Marek Aaron Sapota (maarons) wrote :

Can't 'bzr export' be changed to a simple 'cp -r'?
With lightweight checkouts the gnash compile time went up by about 15 minutes; I don't think the several MB of space saved is worth it. Without a working tree the saving would be even smaller.

René 'Necoro' Neumann (necoro) wrote :

I would not replace 'bzr export'. I would only change bzr to simply do a 'cp -r' when nothing else is needed. (I should open a bug for this, as I was already advised to in #bzr.)

And as you can see from my mail, export is no longer a problem when using branches.

Marek Aaron Sapota (maarons) wrote :

Sounds reasonable. If you file the bug report could you post a link here so others can track it?

Mark Lee (malept) wrote :

Pinging the Bazaar devs re: changing the `bzr export` behavior with regards to lightweight checkouts.

Changed in bzr-gentoo-overlay:
assignee: nobody → malept
status: New → Confirmed
pva (pva) wrote :

Heh, this issue is being discussed again on gentoo-dev:
http://archives.gentoo.org/gentoo-dev/msg_69e16fcf1b760849695165de80e22b1d.xml

Quote:
"In my test (GNU Emacs BZR repo, 2 Mbit/s connection) bzr export took 54 minutes, whereas the initial checkout took only 11 minutes."

This is really weird.

pva wrote:
> Heh, this issue again is discussed in gentoo-dev:
> http://archives.gentoo.org/gentoo-dev/msg_69e16fcf1b760849695165de80e22b1d.xml
>
> Site:
> "In my test (GNU Emacs BZR repo, 2 Mbit/s connection) bzr export took 54 minutes, whereas the initial checkout took only 11 minutes."
>
> This is really weird.
>

It isn't that weird. We focused on checkout performance because it is
something we do all the time. We haven't tried to optimize export
performance, because it is not nearly as frequent.

If on top of this you are exporting over a network connection, I'm
pretty sure the big factor is that we request file contents
one at a time rather than all at once. I don't think it would be hard to
address for someone who cared.

John
=:->

We have a live ebuild in Gentoo for Emacs. This means that the source is fetched from the SCM repository (the Bazaar switch-over is near) and built on the user's system. Nearly an hour of fetching before anything happens will definitely scare away new users. We are taking measures on our side to improve it, but I hope someone will care.

John A Meinel (jameinel) wrote :

Can you try the associated branch? It just changes the calling code to request all file content in one pass, rather than one file at a time.

Testing on my local network, it makes "bzr export foo http://..." drop from ~2m down to 25s.
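
The effect of John's change can be sketched like this. FakeTransport and the two export helpers below are hypothetical stand-ins for bzr's internals, not its real API; the point is only that per-file requests pay one network round trip per file, while a single batched request pays the latency roughly once.

```python
class FakeTransport:
    """Simulates a remote server; round_trips stands in for network latency."""

    def __init__(self, texts):
        self.texts = texts          # {file_id: bytes}
        self.round_trips = 0

    def get_file_text(self, file_id):
        # One network round trip per file requested.
        self.round_trips += 1
        return self.texts[file_id]

    def iter_files_bytes(self, file_ids):
        # A single round trip serving every requested file.
        self.round_trips += 1
        return [(fid, self.texts[fid]) for fid in file_ids]


def export_one_at_a_time(transport, file_ids):
    # Latency cost scales with the number of files.
    return {fid: transport.get_file_text(fid) for fid in file_ids}


def export_batched(transport, file_ids):
    # Latency is paid once for the whole export.
    return dict(transport.iter_files_bytes(file_ids))
```

With thousands of files and tens of milliseconds per round trip, the difference is minutes, which matches the ~2m to 25s drop reported above.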

> Can you try the associated branch? It just changes the calling code
> to request all file content in one pass, rather than one file at a
> time.

This brings the time down from 54 min to 9 min.

But it still accesses the remote repo during export (in fact, from the
transferred volume I would conclude that everything is re-fetched over
the network).

Ulrich

Robert Collins (lifeless) wrote :

On Tue, 2009-12-15 at 20:24 +0000, Ulrich Müller wrote:
> > Can you try the associated branch? It just changes the calling code
> > to request all file content in one pass, rather than one file at a
> > time.
>
> This brings the time down from 54 min to 9 min.
>
> But it still accesses the remote repo during export (in fact, from the
> transferred volume I would conclude that everything is re-fetched over
> the network).

Yes, that is the case. bzr doesn't have a 'local cache' when you have a
lightweight checkout. We could look at some complex logic to use files
you have locally when they are not modified, and/or to get streaming in
place for this operation, but it will always be slower than when you
have local history.

-Rob

John A Meinel (jameinel) wrote :

Ulrich Müller wrote:
>> Can you try the associated branch? It just changes the calling code
>> to request all file content in one pass, rather than one file at a
>> time.
>
> This brings the time down from 54 min to 9 min.
>
> But it still accesses the remote repo during export (in fact, from the
> transferred volume I would conclude that everything is re-fetched over
> the network).
>
> Ulrich
>

A lightweight checkout doesn't have any history locally, so it has to. I
suppose one option would be to have export see what files are locally
modified, and use the wt for unchanged files, and fall back to the
repository for the rest. That would complicate the code a fair amount,
though.

John
=:->

Robert Collins wrote:
> On Tue, 2009-12-15 at 20:24 +0000, Ulrich Müller wrote:
>>> Can you try the associated branch? It just changes the calling code
>>> to request all file content in one pass, rather than one file at a
>>> time.
>> This brings the time down from 54 min to 9 min.
>>
>> But it still accesses the remote repo during export (in fact, from the
>> transferred volume I would conclude that everything is re-fetched over
>> the network).
>
> Yes, that is the case. bzr doesn't have a 'local cache' when you have a
> lightweight checkout. We could look at some complex logic to use files
> you have locally when they are not modified, and/or to get streaming in
> place for this operation, but it will always be slower than when you
> have local history.
>
> -Rob
>

Note that the change I made (and proposed) does handle streaming. Hence
the 54 min => 9 min.

John
=:->

Robert Collins (lifeless) wrote :

John A Meinel wrote:
> Robert Collins wrote:
>> On Tue, 2009-12-15 at 20:24 +0000, Ulrich Müller wrote:
>>>> Can you try the associated branch? It just changes the calling code
>>>> to request all file content in one pass, rather than one file at a
>>>> time.
>>> This brings the time down from 54 min to 9 min.
>>>
>>> But it still accesses the remote repo during export (in fact, from the
>>> transferred volume I would conclude that everything is re-fetched over
>>> the network).
>> Yes, that is the case. bzr doesn't have a 'local cache' when you have a
>> lightweight checkout. We could look at some complex logic to use files
>> you have locally when they are not modified, and/or to get streaming in
>> place for this operation, but it will always be slower than when you
>> have local history.
>
>> -Rob
>
>
> Note that the change I made (and proposed) does handle streaming. Hence
> the 54 min => 9 min.

The repository iter_files_bytes API does not stream yet - it uses VFS
operations.

There will be some fat still to shave there if it does start streaming ;).

- -Rob

John A Meinel (jameinel) wrote :

Robert Collins wrote:
> John A Meinel wrote:
>> Robert Collins wrote:
>>> On Tue, 2009-12-15 at 20:24 +0000, Ulrich Müller wrote:
>>>>> Can you try the associated branch? It just changes the calling code
>>>>> to request all file content in one pass, rather than one file at a
>>>>> time.
>>>> This brings the time down from 54 min to 9 min.
>>>>
>>>> But it still accesses the remote repo during export (in fact, from the
>>>> transferred volume I would conclude that everything is re-fetched over
>>>> the network).
>>> Yes, that is the case. bzr doesn't have a 'local cache' when you have a
>>> lightweight checkout. We could look at some complex logic to use files
>>> you have locally when they are not modified, and/or to get streaming in
>>> place for this operation, but it will always be slower than when you
>>> have local history.
>>> -Rob
>
>> Note that the change I made (and proposed) does handle streaming. Hence
>> the 54 min => 9 min.
>
> The repository iter_files_bytes api does not stream yet - it uses VFS
> operations.
>
> There will be some fat still to shave there if it does start streaming
> ;).
>
> -Rob
>

Ah right. Though it 'streams' in the "get_record_stream()" sense, which
is still far better than repeated "get_file_lines()" calls. :)

John
=:->

2009/12/16 John A Meinel <email address hidden>:
> A lightweight checkout doesn't have any history locally, so it has to. I
> suppose one option would be to have export see what files are locally
> modified, and use the wt for unchanged files, and fall back to the
> repository for the rest. That would complicate the code a fair amount,
> though.

If we do this, I'd suggest not putting it into export, but rather into
the DirstateRevisionTree (or whatever it's called). Some care may be
needed to make sure there is no race between checking the file is
unmodified and actually reading it. But it's possible.

--
Martin <http://launchpad.net/~mbp/>

Martin Pool wrote:
> 2009/12/16 John A Meinel <email address hidden>:
>> A lightweight checkout doesn't have any history locally, so it has to. I
>> suppose one option would be to have export see what files are locally
>> modified, and use the wt for unchanged files, and fall back to the
>> repository for the rest. That would complicate the code a fair amount,
>> though.
>
> If we do this, I'd suggest not putting it into export, but rather into
> the DirstateRevisionTree (or whatever it's called). Some care may be
> needed to make sure there is no race between checking the file is
> unmodified and actually reading it. But it's possible.
>

Sure, something like:

1) Check the hashcache to see if there is a chance we are up-to-date
2) Read the contents, hold them in memory, compute the sha hash
3) If everything matched, return the text, else read from upstream.

John
=:->
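
A minimal sketch of those three steps, assuming hypothetical helper callables (hashcache_fresh, read_local, fetch_upstream) rather than bzr's real interfaces. Checking the hash after reading the contents also sidesteps the race Martin mentions, since the text returned is exactly the bytes that were hashed.

```python
import hashlib


def text_for_export(path, expected_sha1, hashcache_fresh,
                    read_local, fetch_upstream):
    # 1) Cheap hashcache check (mtime/size): if the file may have
    #    changed, skip hashing and go straight to the repository.
    if not hashcache_fresh(path):
        return fetch_upstream(path)
    # 2) Read the local contents into memory and compute the SHA-1.
    data = read_local(path)
    # 3) If it matches the recorded hash, the working-tree copy is
    #    unmodified: return it. Otherwise fall back to the repository.
    if hashlib.sha1(data).hexdigest() == expected_sha1:
        return data
    return fetch_upstream(path)
```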

Martin Pool (mbp) on 2009-12-21
summary: - Export with lightweight checkouts takes a lot of time
+ export could take unmodified files from wt rather than repository
Changed in bzr:
status: New → Confirmed
importance: Undecided → Low
John A Meinel (jameinel) wrote :

Sorry, the bug changed while I wasn't looking. We fixed the export to be a lot faster by changing the order of creating the new files, but we don't yet re-use the content from the existing WT.

Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.0.4
status: Confirmed → Fix Released
assignee: John A Meinel (jameinel) → nobody
status: Fix Released → Confirmed
Eric Siegerman (eric97) wrote :

John A. Meinel wrote:
> Sure, something like:
>
> 1) Check the hashcache to see if there is a chance we are up-to-date
> 2) Read the contents, hold them in memory, compute the sha hash
> 3) If everything matched, return the text, else read from upstream.

"Hold them in memory" will fail for sufficiently huge files -- and long before that, it'll cease to be an optimization, by inducing paging.

Safer, though less optimal, would be to provisionally copy the local file, hashing as you go; then blow it away and recopy from the server if the hash didn't match.

For the smart-server case, how about the rsync protocol, using, as its inputs, the WT copy and the known-good data that upstream wants to provide? That'll go one better, by yielding the "use local data when possible" optimization down below the whole-file level. (This assumes a (technically and legally) usable rsync library, of course. Reimplementing the protocol from scratch seems a bit extreme :-))

On Thu, 2009-12-24 at 15:54 +0000, Eric Siegerman wrote:
> John A. Meinel wrote:
> > Sure, something like:
> >
> > 1) Check the hashcache to see if there is a chance we are up-to-date
> > 2) Read the contents, hold them in memory, compute the sha hash
> > 3) If everything matched, return the text, else read from upstream.
>
> "Hold them in memory" will fail for sufficiently huge files -- and long
> before that, it'll cease to be an optimization, by inducing paging.

We can't return two contents for a file in a zip or tar though; and the
export internals are geared to work with directories *or* zip / tar and
so on.

Also, we always hold the entire file in memory today, so 2) is fine
under our current constraints.

-Rob

John A Meinel (jameinel) on 2010-01-21
Changed in bzr:
milestone: 2.0.4 → none
Vincent Ladeuil (vila) on 2010-09-20
Changed in bzr:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.3b1
status: Confirmed → Fix Released