bzr to launchpad is slower than to devpad

Bug #680763 reported by Martin Pool
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Bazaar
Confirmed
High
Unassigned
Launchpad itself
Triaged
High
Unassigned

Bug Description

On Mon, 2010-11-15 at 11:27 -0600, Edwin Grubbs wrote:
> -------------------------------------------------------------------
> DVCS | Destination server | Elapsed time (sec)
> -------------------------------------------------------------------
> BZR | bazaar.launchpad.net in London | 13.29
> BZR | devpad.canonical.com in London | 5.84
> BZR | slicehost vm in Dallas datacenter | 2.10
> GIT | github in Virginia datacenter | 1.85

Those numbers were measured from Texas so it's not surprising that connecting to London is slower than both US destinations. But it is surprising and bad that connecting to Launchpad is so much worse than connecting to a machine in the same datacentre.

This is for pushing a new stacked branch; pushing into an existing stacked branch is reported to have similar issues.

We should reproduce this, and break it down into more detailed bugs. It may require multiple fixes on either the lp or bzr side. Please use this bug for overall analysis of the problem.

Martin Pool (mbp)
description: updated
Changed in bzr:
status: New → Confirmed
importance: Undecided → High
milestone: none → 2.3b4
Revision history for this message
John A Meinel (jameinel) wrote :

I'm timing to Chinstrap rather than Devpad because I can't get Paramiko to do the chinstrap song-and-dance to create an nc tunnel. However, the data center stuff should be the same.

$ time bzr log bzr+ssh://<email address hidden>/~bzr-pqm/bzr/bzr.dev -r -1
...
HPSS calls: 16 (9 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
real 0m9.203s

$ time bzr log bzr+ssh://<email address hidden>/home/jameinel/dev/bzr/bzr.dev -r -1
...
HPSS calls: 19 (9 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
real 0m9.471s

So at least that part is quite comparable from Chicago. Though chinstrap probably has more pack files.

'info' is also pretty comparable:
$ time bzr info bzr+ssh://<email address hidden>/~bzr-pqm/bzr/bzr.dev
HPSS calls: 12 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
real 0m7.704s

$ time bzr info bzr+ssh://<email address hidden>/home/jameinel/dev/bzr/bzr.dev
HPSS calls: 15 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
real 0m6.081s

A bit slower, but nothing like the 13s vs 5.8s we see in the initial report.

My initial guess is that stacked branches may have a specific impact here, and it may depend on the specific pack layout. Pushing up a stacked branch does at least some amount of data correctness check, which can be impacted by repo layout. So if devpad was more packed than Launchpad, that might impact this timing.

I think relevant to the discussion would be the time to push a new branch into a shared repository, since I think that is what the git code is doing.

Aaron Bentley (abentley)
Changed in launchpad-code:
status: New → Triaged
importance: Undecided → Medium
tags: added: codehosting-ssh
Vincent Ladeuil (vila)
Changed in bzr:
milestone: 2.3b4 → 2.3.0
Revision history for this message
Vincent Ladeuil (vila) wrote :

Any progress on this ? Should we re-target ?

Revision history for this message
Martin Pool (mbp) wrote :

We're not, at the moment, using milestone/target to indicate intention to fix, except for release blockers, so I'll take that off, but leave it high.

It's probably most likely john is right and it's connected to stacking.

I think this is worth leaving open at least to the extent of determining what the difference is, if any.

It may be this will be improved by forking lp-serve, though I doubt that accounted for 7.5s.

Changed in bzr:
milestone: 2.3.0 → none
Changed in launchpad:
importance: Medium → High
Revision history for this message
Martin Pool (mbp) wrote :

I wondered if some of the recent changes in Launchpad would have changed this. There is definitely still a difference:

mbp@joy% time bzr -Dhpss -Dhpssdetails push bzr+ssh://chinstrap.canonical.com//tmp/docdiff-test2
Created new branch.
HPSS calls: 14 (1 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
bzr -Dhpss -Dhpssdetails push 0.24s user 0.06s system 2% cpu 13.296 total

mbp@joy% time bzr -Dhpss -Dhpssdetails push lp:~mbp/+junk/docdiff-test2
Created new branch.
HPSS calls: 14 (1 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
bzr -Dhpss -Dhpssdetails push lp:~mbp/+junk/docdiff-test2 0.27s user 0.05s system 1% cpu 19.633 total

so 6.34s or nearly 50% slower. It's doing the same number of calls and the same amount of traffic; lp just has more lag.

The differences are:

* 2.4s extra delay before starting the SSH connection, I think because we're doing a separate xmlrpc to find the URL (bug 397739) (2.551s vs 0.168s)
* 1.1s extra delay opening the SSH connection and getting the first response back (6.7s vs 5.63s); this might be fixed by bug 660264 when that's live
* 2.5s extra delay in BzrDirFormat.initialize_ex_1.16 (3.0 vs 0.46), possibly because this is doing slow permission or other checks on the server side before proceeding
* 0.34s other extra delay extra

Revision history for this message
Martin Pool (mbp) wrote :
Revision history for this message
Martin Pool (mbp) wrote :
Revision history for this message
Martin Pool (mbp) wrote :

I filed bug 740759 for the call being unusually slow.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 680763] Re: bzr to launchpad is slower than to devpad

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/23/2011 8:11 AM, Martin Pool wrote:
> I filed bug 740759 for the call being unusually slow.
>

You could try using "lp://staging/..." or "lp://qastaging/..." to push
up to bazaar.staging.launchpad.net and bazaar.qastaging.launchpad.net

That would at least let you see the impact of the forking server code.

Note also my previous comments about SSH scaling poorly with concurrent
connection requests. I think it has to do with how SSH handshaking
works. That there are many messages and each involves some crypto. So if
you have 10 incoming requests *each one* takes 10x longer. (So if it
usually takes 1s to do an ssh handshake, if you get 10 concurrent
requests, all of them take 10s.)

So certainly Crowberry has a fairly high baseline load. Though the above
doesn't seem to apply to active ongoing connections, just the initial
handshake.

That said, since twisted is running a select loop in single-threaded
mode, a lot of concurrent activity probably will slow down ssh handshaking.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2JsI4ACgkQJdeBCYSNAAP3EQCgsRExv2JzJdp4BRrgwEZ25Bgd
UPgAn2RGyAevHHoWBqed+jpxMYDa4Nim
=hojf
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote :

Good idea. Staging is offline still, but qastaging works:

mbp@joy% time bzr -Dhpss -Dhpssdetails push lp://qastaging/~mbp/+junk/docdiff-test2
Created new branch.
HPSS calls: 14 (1 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
bzr -Dhpss -Dhpssdetails push lp://qastaging/~mbp/+junk/docdiff-test2 0.28s user 0.05s system 1% cpu 17.856 total

mbp@joy% time bzr -Dhpss -Dhpssdetails push lp://qastaging/~mbp/+junk/docdiff-test3
Created new branch.
HPSS calls: 14 (1 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)
bzr -Dhpss -Dhpssdetails push lp://qastaging/~mbp/+junk/docdiff-test3 0.27s user 0.06s system 2% cpu 16.103 total

So that's 2.8-4.5s slower than chinstrap, and 1.8-3.5s faster than lpnet. The time from starting to open the ssh connection until it gets the response is 5.69s, which is basically the same as chinstrap, and indicates that John's fix for bug 660264 should basically bring lp to parity with another machine in the dc as far as opening an ssh connection and starting bzr.

initialize_ex_1.16 took 1.4s on the second run and 1.7s on the first run, which is pretty interesting: right between the time for bzr running alone and bzr running on lpnet. It may be the qastaging database or internal xmlrpc is less loaded, but on the other hand it's a smaller database.

I guess we could write a script that just does this call repeatedly.

Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (4.1 KiB)

The core functionality that is interesting is:

Launchpad:
           1 0 13.8808 0.0002 bzrlib.push:58(_show_push_branch)
          +1 0 6.1671 0.0001 +bzrlib.branch:1400(create_clone_on_transport)
          +1 0 2.2326 0.0000 +bzrlib.transport:1563(get_transport)
          +1 0 0.2033 0.0000 +bzrlib.remote:2629(get_stacked_on_url)
          +1 0 5.2292 0.0000 +bzrlib.bzrdir:700(open_from_transport)
          +1 0 0.0450 0.0000 +bzrlib.branch:1214(get_push_location)
          +1 0 0.0035 0.0000 +bzrlib.push:46(report)
          +1 0 0.0000 0.0000 +bzrlib.push:40(__init__)
           1 0 6.1648 0.0001 bzrlib.bzrdir:153(clone_on_transport)
          +1 0 0.6122 0.0000 +bzrlib.remote:1618(fetch)
          +1 0 2.3935 0.0000 +bzrlib.bzrdir:1313(initialize_on_transport_ex)
          +1 0 3.0734 0.0000 +<<string>>:1(clone_read_locked)
          +1 0 0.0340 0.0000 +bzrlib.bzrdir:1168(open_branch)
          +1 0 0.0466 0.0000 +bzrlib.bzrdir:865(cloning_metadir)
          +1 0 0.0040 0.0000 +bzrlib.bzrdir:568(find_repository)

QAStaging:
           1 0 15.5928 0.0001 bzrlib.push:58(_show_push_branch)
          +1 0 1.8000 0.0000 +bzrlib.transport:1563(get_transport)
          +1 0 0.0650 0.0000 +bzrlib.remote:2629(get_stacked_on_url)
          +1 0 11.1057 0.0000 +bzrlib.branch:1400(create_clone_on_transport)
          +1 0 2.5834 0.0000 +bzrlib.bzrdir:700(open_from_transport)
          +1 0 0.0365 0.0000 +bzrlib.branch:1214(get_push_location)
          +1 0 0.0021 0.0000 +bzrlib.push:46(report)
          +1 0 0.0000 0.0000 +bzrlib.push:40(__init__)
           1 0 11.1040 0.0001 bzrlib.bzrdir:153(clone_on_transport)
          +1 0 6.7379 0.0000 +<<string>>:1(clone_read_locked)
          +1 0 3.4196 0.0000 +bzrlib.bzrdir:1313(initialize_on_transport_ex)
          +1 0 0.8771 0.0000 +bzrlib.remote:1618(fetch)
          +1 0 0.0294 0.0000 +bzrlib.bzrdir:1168(open_branch)
          +1 0 0.0004 0.0000 +bzrlib.decorators:234(wrapped)
          +1 0 0.0354 0.0000 +bzrlib.bzrdir:865(cloning_metadir)
          +1 0 0.0035 0.0000 +bzrlib.bzrdir:568(find_repository)

chinstrap:
           1 0 2.2664 0.0001 bzrlib.push:58(_show_push_branch)
          +1 0 0.0623 0.0000 +bzrlib.transport:1563(get_transport)
          +1 0 0.6539 0.0000 +bzrlib.branch:1400(create_clone_on_transport)
          +1 0 1.4800 0.0000 +bzrlib.bzrdir:700(open_from_transport)
          +1 0 ...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

@martin "initialize_on_transport_ex" times for me are:
0.0958 chinstrap
2.4302 lp
3.4193 lp://qastaging

so the speed saved by the forking server is lost in the initialize_on_transport_ex

Revision history for this message
Martin Pool (mbp) wrote :

On 24 March 2011 02:00, John A Meinel <email address hidden> wrote:
> @martin "initialize_on_transport_ex" times for me are:
> 0.0958 chinstrap
> 2.4302 lp
> 3.4193 lp://qastaging
>
> so the speed saved by the forking server is lost in the
> initialize_on_transport_ex

I think (unless my arithmetic was wrong) I saw qastaging being faster.
 But it's a smaller database server, though less loaded, so perhaps it
is reasonable to expect it be more variable. Anyhow, the next step is
in the other bug for someone to profile or look at the Launchpad code.

Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: removed: check-for-breezy
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.