Failure to saturate bandwidth downloading lp:bzr

Bug #423804 reported by John A Meinel on 2009-09-03
This bug affects 3 people
Affects   Importance   Assigned to
Bazaar    High         John A Meinel
Breezy    High         Unassigned

Bug Description

Now that we've upgraded to --2a, I decided to just start from scratch, and create a new repo, populate it from the official conversion, and then copy my extra revisions into it.

So:
  bzr init-repo --2a --no-trees bzr
  cd bzr
  bzr branch lp:bzr bzr.dev

I expected it to saturate my network link while it downloaded ~30-40MB of data, and then be done in about 2 minutes.

Instead, I'm seeing very choppy downloads, as shown in the attachment.

Note that I'm downloading via lp: which should be 'bzr+ssh', so it should be streaming from the remote side.

I'm using bzr.dev, though, so maybe we are running into a version compatibility thing?

If I do a quick breakin, I do see:
  c:\users\jameinel\dev\bzr-1.9\bzr.dev\bzrlib\remote.py(1906)missing_parents_chain()
-> for kind, stream in self._get_stream(sources[0], search):
  c:\users\jameinel\dev\bzr-1.9\bzr.dev\bzrlib\smart\repository.py(538)record_stream()
-> for bytes in byte_stream:
  c:\users\jameinel\dev\bzr-1.9\bzr.dev\bzrlib\smart\message.py(336)read_streamed_body()
-> self._read_more()

Which looks like it is streaming as expected.

John A Meinel (jameinel) wrote :

I brought this up a bit in Mooloolaba. As a guess, I'm thinking that the pack-on-the-fly code may be slowing down the fetch. Even worse, it may be interacting poorly with the Nagle algorithm, so that once the pack has completed, the TCP auto-negotiation has decided that it shouldn't send as much content.

It is also possible that we have some server side issues, where we aren't keeping the write buffer full. However, I would suspect the source side before I suspect the target side.
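One cheap way to rule out the Nagle interaction speculated about above is to disable the algorithm on the client socket and re-run the fetch. A minimal sketch of such an experiment (this is illustrative, not necessarily what bzr's smart transport actually does):

```python
import socket

def connect_nodelay(host, port):
    """Open a TCP connection with Nagle's algorithm disabled.

    With TCP_NODELAY set, small writes go onto the wire immediately
    instead of being held back while waiting for outstanding ACKs,
    which can avoid the stall-then-burst pattern described above.
    """
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

If throughput smooths out with this change, the choppiness is a send-coalescing interaction rather than a server-side stall.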

In contrast I'm uploading a picture of network throughput for iterating over 'get_stream()' from Babune to my machine. Note that it is *completely* flat. Something is throttling, though, as it transferred at a steady 100kB/s when I should have at least 300kB/s of download bandwidth.

I'll try a few more tests, including checking to see what the throttling issue is, what 'branch' looks like from babune, and whether 'get_stream()' from launchpad is also as flat.

John A Meinel (jameinel) wrote :

I also wanted to mention that the 'dips' that are shown there are specifically in the transition from signatures+revisions+inventories => the first chk_bytes pages => the second chk_bytes pages => texts.

I did test get_stream() against launchpad, and this is what I saw.

Note that the peak here is about 3x that of streaming from babune (so we are hitting the bandwidth cap on my system).

The major gaps are at the same point, though here the downtime is greater. I wonder if we are hitting CPU limits on launchpad server side. I also don't have any idea of launchpad code hosting's current load, or whether there are other limiting factors between here and launchpad.

What is also very strange is the 'ramp up' that is seen repeatedly. Where you get a severe drop, followed by a fast ramp, slow ramp, fast ramp, peak and then dropoff again.

I'm also not sure what bzr version launchpad is running, I'm fairly confident that I'm connecting to bzr-2.0.2 on Babune.

So it *could* be that LP, having so much more capacity, is saturating my ISP's bandwidth and causing all sorts of networking craziness. I'd like to test streaming from LP to babune, since that is a much lower latency connection, but I don't know how I would grab a graph of the bandwidth there.

John A Meinel (jameinel) wrote :

More graphs. This one is the bandwidth of downloading all of bzr from a freshly packed repository on chinstrap, rather than from bazaar.launchpad.net.
Note that while still a bit 'spikey' it stays much closer to peak bandwidth for the whole time.
The other difference here is that I'm using openssh as my ssh provider, rather than paramiko. I'll upload one with paramiko in a second. So there are still a fair amount of variables. I'll try a copy of production's exact repo (rather than freshly packed) and see if this changes anything.

John A Meinel (jameinel) wrote :

This is connecting to chinstrap via paramiko. It seems a bit spikier, but the total download time is about the same.

Note also that I think the X and Y axis should be identical for all the recent graphs.

John A Meinel (jameinel) wrote :

This next graph is 'get_stream()' from chinstrap using an 'lftp mirror' copy of bazaar.launchpad.net's bzr repository.
Unfortunately there was a tiny data spike, which caused the Y axis to resize, but the X axis is the same. The peak seen in this graph is the same as the peaks elsewhere. So we again manage to hit peak bandwidth, but it is much choppier.

Looking at chinstrap's .bzr.log I see *lots* of:
263.431 creating new compressed block on-the-fly in 0.000s 2092406 bytes => 10517 bytes

This indicates that the thrashing we noticed before really does decrease throughput. We fixed size-on-disk by having the client recompress the stream.

John A Meinel (jameinel) wrote :

This is another chinstrap => local using the mirror of production.
In this case, I have a small patch to the source codebase. Specifically:
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-10-23 17:27:45 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-11-17 21:41:18 +0000
@@ -1021,7 +1021,7 @@
         super(GroupCHKStreamSource, self).__init__(from_repository, to_format)
         self._revision_keys = None
         self._text_keys = None
-        self._text_fetch_order = 'groupcompress'
+        self._text_fetch_order = 'unordered'
         self._chk_id_roots = None
         self._chk_p_id_roots = None

In other words, it doesn't change the signatures/repository/inventory or chk streaming, but it changes the text streaming to be done 'unordered' rather than trying to recompute the 'groupcompress' "proper" order.

This is better for networking at least. The runtime dropped from 350s down to 260->290s. And most of the "creating new compressed block on-the-fly" lines are gone.

I don't have a good feel for what this will mean client-side for final disk size, as the client will be trying to collapse groups without having the data in an optimal order.

Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
> I'm also not sure what bzr version launchpad is running, I'm fairly

My bzr ping plugin says bzr+ssh://bazaar.launchpad.net/ is running 2.0.0.

-Andrew.

John A Meinel (jameinel) wrote :


Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> I'm also not sure what bzr version launchpad is running, I'm fairly
>
> My bzr ping plugin says bzr+ssh://bazaar.launchpad.net/ is running
> 2.0.0.
>
> -Andrew.
>

Yeah, I got the same info when I tried dumping the arg dict from the
protocol request parser.

staging is also running 2.0.0, and mwhudson confirmed that it is likely
to be true.

However, chinstrap is running 2.0.2, and some tests with bzr.dev on that
machine showed the same results.

sorting groupcompress on the source is part of the problem; the slowdown
from having a non-optimally packed source repo is another.

John=:->


Robert Collins (lifeless) wrote :

On Tue, 2009-11-17 at 23:20 +0000, John A Meinel wrote:

> sorting groupcompress on the source is part of the problem, slowdowns
> from having a non-optimally packed source repo is another.

I think it's all going to be tied into windowing. I encourage the reading
thread / circular buffer experiment ;)

-Rob
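The reading-thread / circular-buffer experiment Rob suggests can be sketched as a dedicated thread that drains the connection into a bounded queue, so slow client-side processing never leaves the socket idle. All names and sizes here are illustrative inventions, not bzr code:

```python
import io
import queue
import threading

def start_drain_thread(read_func, chunk=64 * 1024, maxsize=32):
    """Drain a stream as fast as it arrives into a bounded queue.

    The network is read by its own thread, so the consumer's pauses
    (decompression, disk writes) don't stall the socket. The bounded
    queue caps lookahead at roughly maxsize * chunk bytes.
    """
    buf = queue.Queue(maxsize=maxsize)

    def drain():
        while True:
            data = read_func(chunk)
            buf.put(data)
            if not data:  # b"" signals EOF to the consumer
                return

    t = threading.Thread(target=drain)
    t.daemon = True
    t.start()
    return buf

# usage: the consumer reads from the queue instead of the socket
source = io.BytesIO(b"x" * 200000)  # stand-in for the network stream
buf = start_drain_thread(source.read)
total = 0
while True:
    data = buf.get()
    if not data:
        break
    total += len(data)
```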

John A Meinel (jameinel) wrote :

I'm just attaching the script I'm using on the client for testing. This is the 'no-op' client so that we hopefully are only showing the server's ability to keep the pipe full.

Long term we need to optimize both sides. And make sure that the server keeps its end full, and the client is clearing out its end as fast as possible. But splitting the testing should make it easier to get there.

The script is adapted from my work on testing memory consumption, so it depends on Meliae, etc. But we can easily strip that out.

John A Meinel (jameinel) wrote :

I've determined that at least some of this is just my connectivity to Chinstrap. I'm attaching the network results for

ssh chinstrap cat .../.bzr/repository/xxx.pack > local.pack

Which is a 31MB file.

The whole transmission is very "peaky". I've also confirmed that it isn't just disk issues by running 2 times.

John A Meinel (jameinel) wrote :

I then tried a 'wget' from a bazaar.launchpad.net branch of bzr.dev, and I got similar results. Sorry the scale changed; something peaked above my actual bandwidth cap...

Anyway, you can still see that it cycles up and down near the bandwidth max, and occasionally drops off considerably.

John A Meinel (jameinel) wrote :

Note that I tried this on Babune, which has a significantly higher bandwidth and lower latency, and it seemed to generate a fairly stable 1.9+MB/s stream. So it might be my home network (I've had some flakiness with my wireless router, though I got the same results on the wired side of the router.)

I do wonder if this is a network configuration issue, though. As I've seen downloads from Steam, etc, that are just a rock-solid 300kB/s. Or maybe it is just because it is going transcontinentally?

Robert Collins (lifeless) wrote :

ssh can introduce artifacts itself, as it has its own windowing going
on.

Interacting better with this needs some study. Also stdio redirection
may not keep as clean a buffer as needed ;).

I do think we need to test the impact of focusing on having a window
for the tcp stack as a specific thing - and I'd do that before
worrying about the server side at all.

-Rob
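One way to experiment with the TCP window Rob mentions is to enlarge the kernel socket buffers before connecting: on a high-latency link, the amount of unacknowledged data in flight is bounded by them. A hypothetical sketch (the 1 MiB default is an arbitrary illustrative value; Linux may double or cap the requested size):

```python
import socket

def tune_socket_buffers(sock, size=1 << 20):
    """Enlarge the kernel send/receive buffers backing the TCP window.

    Small default buffers can cap throughput well below line rate on a
    long fat pipe. Returns what the kernel actually granted, which may
    differ from the request (Linux doubles it and caps at wmem_max).
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

# usage: tune before connecting, then check what the kernel granted
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
snd, rcv = tune_socket_buffers(sock, 64 * 1024)
sock.close()
```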

John A Meinel (jameinel) wrote :


Robert Collins wrote:
> ssh can introduce artifacts itself, as it has its own windowing going
> on.
>
> Interacting better with this needs some study. Also stdio redirection
> may not keep as clean a buffer as needed ;).
>
> I do think we need to test the impact of focusing on having a window
> window for the tcp stack as a specific thing - and I'd do that before
> worrying about the server side at all.
>
> -Rob
>

Sure. Though I'm seeing some of this via 'wget' as well, which would not
be using SSH. :)

For now, I'm not going to be spending too much more time on this. I may
try some small things here and there, but it looks like my network is
screwy enough that figuring out the signal vs noise is going to be a
pain. (I just saw the same 'cyclical' issues when watching a YouTube
video, so it is possibly an issue with how my ISP is doing throttling?)

John
=:->

John A Meinel (jameinel) wrote :

So I did a bit more debugging, since I was looking at this code a bit while trying to address the os.write code.

Anyway, it looks like the ssh process allows us about 1.6MB of in-transit data before it starts blocking.

This was detected by having a canned request return a lot of bulk data, and then having a trivial client that just reads the response 64kB at a time.

My network is still a bit strange, but I can see that the server-side claims to write as many as 30 64kB chunks that the client has not claimed to have read yet. This was done by breaking up a large request into 64kB chunks, and calling out.write() on each one, and timing how long it took, while at the same time, the local client reports when it reads 64kB.

This used ssh's stderr multiplexing to get the messages on stderr locally. Which I'm sure isn't perfectly aligned between processes, but it should be at least close.

Almost all of the write calls look like: wrote 65536 sub bytes in 0.000s
In that it takes <= 3ms for a write. Every so often you get:
wrote 65536 sub bytes in 2.143s

Which I assume means that ssh is blocking waiting for some of the buffer to clear up.

This means we have something less than 2MB of buffering that we get for bzr+ssh. (I assume this would be different with Launchpad's Twisted implementation.) Though it may also depend on the Window, etc settings.

Though according to Wireshark my peak "Win" is only about 260,000 bytes. (Wireshark also tells me I get a fair number of retransmissions, etc. Which I assume is because of something with my network, and not something bzr could do anything about.)

Anyway, it would appear that there isn't much to be done for bzr+ssh, since as near as I can tell, ssh is already handling the buffering for us. I wonder if it would help to decrease the 1MB internal buffer, so that we aren't too close to the limit of the ssh subprocess. Conceptually, that would allow us to multiplex a bit more. So that we don't wait and have the buffer empty before we then supply a huge content chunk that ends up blocking.

Something to consider, at least.
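The probe described above (timing 64 kB writes to see when the ssh buffer pushes back) can be reproduced locally against a pipe with a deliberately slow reader. Everything here is an illustrative stand-in for the real server/client pair:

```python
import os
import threading
import time

CHUNK = 64 * 1024

def write_all(fd, data):
    # os.write on a pipe may accept only part of a large buffer
    while data:
        n = os.write(fd, data)
        data = data[n:]

def timed_writes(fd, n_chunks):
    """Write n_chunks 64kB blocks, timing each one.

    Writes that finish in microseconds just landed in the kernel (or
    ssh) buffer; a write that suddenly takes much longer means the
    buffer filled and the writer blocked -- the behaviour observed
    over bzr+ssh above.
    """
    block = b"x" * CHUNK
    timings = []
    for _ in range(n_chunks):
        start = time.time()
        write_all(fd, block)
        timings.append(time.time() - start)
    return timings

r, w = os.pipe()

def slow_reader():
    # simulate a consumer that drains 64kB every 20ms
    remaining = 8 * CHUNK
    while remaining > 0:
        time.sleep(0.02)
        remaining -= len(os.read(r, CHUNK))

t = threading.Thread(target=slow_reader)
t.start()
timings = timed_writes(w, 8)
t.join()
os.close(r)
os.close(w)
```

The early writes fill the pipe buffer essentially for free; once it is full, each write blocks for roughly one reader cycle, which is the same shape as the "wrote 65536 sub bytes in 2.143s" lines above.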

Robert Collins (lifeless) wrote :

You may be measuring the size of the pipe buffer, not of ssh's buffer
itself.

-Rob

Martin Pool (mbp) wrote :

2009/12/23 Robert Collins <email address hidden>:
> You may be measuring the size of the pipe buffer, not of ssh's buffer
> itself.

In my tests, linux pipe buffers are about 128k - but ssh may be
changing it somehow.

--
Martin <http://launchpad.net/~mbp/>
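Martin's figure is easy to check directly: switch the write end of a fresh pipe to non-blocking mode and count how many bytes the kernel accepts before refusing. A sketch (the result is platform-dependent; commonly 64kB on modern Linux, and ssh may well resize its own pipes):

```python
import fcntl
import os

def pipe_buffer_capacity():
    """Count how many bytes a fresh pipe accepts before a write blocks.

    The write end is made non-blocking and filled 4kB at a time until
    the kernel refuses more data (EAGAIN); the total written is the
    pipe buffer size.
    """
    r, w = os.pipe()
    flags = fcntl.fcntl(w, fcntl.F_GETFL)
    fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    total = 0
    chunk = b"x" * 4096
    try:
        while True:
            total += os.write(w, chunk)
    except BlockingIOError:  # EAGAIN: the buffer is full
        pass
    os.close(r)
    os.close(w)
    return total

capacity = pipe_buffer_capacity()
```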

John A Meinel (jameinel) wrote :


Martin Pool wrote:
> 2009/12/23 Robert Collins <email address hidden>:
>> You may be measuring the size of the pipe buffer, not of ssh's buffer
>> itself.
>
> In my tests, linux pipe buffers are about 128k - but ssh may be
> changing it somehow.
>
>

I'm testing between chinstrap and my home machine. If I run the test
locally with:

wbzr serve --inet <request.txt | python read_and_sleep.py

It shows that the writer has 2 64k writes that get loaded up before the
reader blocks it. So Martin's 128kB comment fits.

However, even if ssh is simply increasing its pipe buffer, that is still
sufficient for *bzr*. Namely, it is ssh's job to poll the socket and
make sure it keeps putting the bytes out. And it tells us that we can
write 1-2MB to ssh before it blocks. Which to me means:

We should write when we can, and not try to buffer too much. We can
trust that ssh will do a decent amount of buffering. I don't think this
means we should write a single byte at a time, but a 64kB buffer
actually sounds reasonable. So when we get a bit, ssh can start putting
it on the wire, and we can keep producing more data.

John
=:->
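The 64kB-buffer policy John describes could look like a small write-combining wrapper. `BufferedWriter` and its threshold are illustrative inventions, not bzrlib's actual transport code:

```python
class BufferedWriter:
    """Gather small writes into ~64kB chunks before hitting the OS.

    Buffer just enough to amortize syscall overhead, then hand the
    data to ssh/the kernel promptly so the pipe stays full, rather
    than hoarding megabytes and emptying the ssh buffer in between.
    """

    def __init__(self, write_func, threshold=64 * 1024):
        self._write = write_func
        self._threshold = threshold
        self._pending = []
        self._pending_len = 0

    def write(self, data):
        self._pending.append(data)
        self._pending_len += len(data)
        if self._pending_len >= self._threshold:
            self.flush()

    def flush(self):
        if self._pending:
            self._write(b"".join(self._pending))
            self._pending = []
            self._pending_len = 0

# usage: collect everything handed to the "socket"
sent = []
w = BufferedWriter(sent.append, threshold=10)
for piece in (b"abc", b"defg", b"hij", b"k"):
    w.write(piece)
w.flush()
```

With the tiny 10-byte threshold used for demonstration, the four small writes reach the underlying write function as just two calls.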

Martin Pool (mbp) wrote :

2009/12/23 John A Meinel <email address hidden>:
> We should write when we can, and not try to buffer too much. We can
> trust that ssh will do a decent amount of buffering. I don't think this
> means we should write a single byte at a time, but a 64kB buffer
> actually sounds reasonable. So when we get a bit, ssh can start putting
> it on the wire, and we can keep producing more data.

I think in the absence of contrary evidence, we should only gather up
writes enough that the syscall overhead is sufficiently amortized. So
1byte would obviously be quite expensive but 64kB may even be past the
sweet spot.

--
Martin <http://launchpad.net/~mbp/>

Robert Collins (lifeless) wrote :

On Wed, 2009-12-23 at 04:06 +0000, Martin Pool wrote:
> 2009/12/23 John A Meinel <email address hidden>:
> > We should write when we can, and not try to buffer too much. We can
> > trust that ssh will do a decent amount of buffering. I don't think this
> > means we should write a single byte at a time, but a 64kB buffer
> > actually sounds reasonable. So when we get a bit, ssh can start putting
> > it on the wire, and we can keep producing more data.
>
> I think in the absence of contrary evidence, we should only gather up
> writes enough that the syscall overhead is sufficiently amortized. So
> 1byte would obviously be quite expensive but 64kB may even be past the
> sweet spot.

In both the non-ssh and ssh case we have evidence that gathering up
writes is important - we had performance bugs with the first packet
being sent off and then us waiting for acks to send the rest of a
request. I remind the bug of this because it's not an amortisation issue:
it's interactions up and down the stack that we need to beware of.

I agree that in the stream case buffering much more than is needed to
avoid syscall overhead is likely a problem. Our *goal* though, IMO, is
to saturate whatever we're writing to - whether we do that with small
writes or big ones doesn't really matter, as long as we're ahead of the
recipient it can optimise for us.

-Rob

John A Meinel (jameinel) wrote :

I still feel like there is probably something about the connection to the London data center that needs to be investigated. It might be something with going over the Atlantic Ocean, I don't really know. But I just tried a 'wget' of static content from kernel.org versus bazaar.launchpad.net, and the performance graphs are quite different.

I checked my ping times, and
ping www.kernel.org
Approximate round trip times in milli-seconds:
    Minimum = 68ms, Maximum = 169ms, Average = 119ms

ping bazaar.launchpad.net
Approximate round trip times in milli-seconds:
    Minimum = 110ms, Maximum = 203ms, Average = 157ms

So there is a difference, but it isn't all that dramatic (about a 30% increase).

I don't specifically know of a non-Canonical US site that I can go 'abuse' to test bandwidth results. (I tried uk.kernel.org, but I get terrible download rates from them, far worse than from bazaar.launchpad.net.)

John A Meinel (jameinel) wrote :

I did manage to find the 'free.fr' SourceForge mirror and tried evaluating it. I actually get performance roughly on par with what I saw going to bazaar.lp.net.

John A Meinel (jameinel) wrote :

As a point of comparison, on Vincent's machine, I'm able to get up to 1.9MB/s download from both free.fr and bazaar.launchpad.net. As such it doesn't really seem like a raw-bandwidth issue.

So while I think bzr probably *could* do something better, we really need to understand why *wget* isn't saturating the bandwidth first. I did take a Wireshark dump while using wget from bazaar.launchpad.net.

I do see a fair amount of out-of-order and retransmission sections. Though my wireshark-fu is weak. I have a larger dump (15MB), but I figured just sending the first 3MB would probably be enough to see what is going on.

Martin Pool (mbp) wrote :

tcptrace gives quite a nice summary complementary to the detailed view that wireshark gives you. Here's the -l output on this file:

wget_bazaar.lp.net-short.pcap: 3000 100% (0:00:35.485313)
3000 packets seen, 3000 TCP packets traced
elapsed wallclock time: 0:00:05.063794, 592 pkts/sec analyzed
trace file elapsed time: 0:00:35.485839
 first packet:  Fri Jun 18 07:44:03.868056 2010
 last packet:   Fri Jun 18 07:44:39.353895 2010
TCP connection info:
1 TCP connection traced:
TCP connection 1:
 host a:        192.168.2.16:57288
 host b:        crowberry.canonical.com:80
 complete conn: no (SYNs: 2) (FINs: 0)
 first packet:  Fri Jun 18 07:44:03.868056 2010
 last packet:   Fri Jun 18 07:44:39.353895 2010
 elapsed time:  0:00:35.485839
 total packets: 3000
 filename:      wget_bazaar.lp.net-short.pcap
   a->b:                                    b->a:
     total packets:          1085             total packets:          1915
     ack pkts sent:          1084             ack pkts sent:          1915
     pure acks sent:         1083             pure acks sent:            1
     sack pkts sent:          403             sack pkts sent:            0
     dsack pkts sent:           3             dsack pkts sent:           0
     max sack blks/ack:         2             max sack blks/ack:         0
     unique bytes sent:       188             unique bytes sent:   2745919
     actual data pkts:          1             actual data pkts:       1913
     actual data bytes:       188             actual data bytes:   2750275
     rexmt data pkts:           0             rexmt data pkts:           3
     rexmt data bytes:          0             rexmt data bytes:       4356
     zwnd probe pkts:           0             zwnd probe pkts:           0
     zwnd probe bytes:          0             zwnd probe bytes:          0
     outoforder pkts:           0             outoforder pkts:         390
     pushed data pkts:          1             pushed data pkts:         20
     SYN/FIN pkts sent:       1/0             SYN/FIN pkts sent:       1/0
     req 1323 ws/ts:          Y/N             req 1323 ws/ts:          Y/N
     adv wind scale:            2             adv wind scale:            7
     req sack:                  Y             req sack:                  Y
     sacks sent:              403             sacks sent:                0
     urgent data pkts:     0 pkts             urgent data pkts:     0 pkts
     urgent data bytes:   0 bytes             urgent data bytes:   0 bytes
     mss requested:    1460 bytes             mss requested:    1460 bytes
     max segm size:     188 bytes             max segm size:    1452 bytes
     min segm size:     188 bytes             min segm size:       8 bytes
     avg segm size:     187 bytes             avg segm size:    1437 bytes
     max win adv:    261340 bytes             max win adv:      6912 bytes
     min win adv:     16068 bytes             min win adv:      6912 bytes
     zero win adv:        0 times             zero win adv:        0 times
     avg win adv:    180585 bytes             avg win adv:     6912 byt...


Martin Pool (mbp) wrote :

Oh Launchpad, why must you mess up my formatting?

Anyhow here's the output in a more readable form.

John A Meinel (jameinel) wrote :

Something definitely just got a lot better, possibly related to bug #582157.

See attached image of wget managing to get a fairly consistent download rate. (Though my actual bandwidth is supposed to be right at 3Mbps)

Command was:
$ wget http://bazaar.launchpad.net/~bzr-pqm/bzr/bzr.dev/.bzr/repository/packs/a1bd96ce1421fe7d0b8d357b708a73c7.pack

Which showed 275KB/s overall, and generally sustained 300KB/s.

'bzr branch lp:bzr' is also better, not as good as wget in this case, but better than it used to be. I may poke at this again, since it seems the TCP/IP issues are generally sorted out.

Jelmer Vernooij (jelmer) on 2011-02-01
tags: added: 2a hpss
Jelmer Vernooij (jelmer) on 2017-06-12
Changed in brz:
status: New → Triaged
importance: Undecided → High