codehosting connection time is load dependent

Bug #691165 reported by John A Meinel on 2010-12-16
This bug affects 1 person
Affects: lazr.sshserver
Importance: Low
Assigned to: Unassigned

Bug Description

It appears that the twisted SSH server takes significantly longer to ssh handshake when there are multiple simultaneous requests.

I don't fully understand what is going on here, but I'll give the symptoms.

Note that these results are independent of the lp-forking-service being available. So it appears to be a twisted issue.

I have a script that can spawn a variable number of 'ssh' connections simultaneously, and then issue a very simple bzr request onto that connection. If I set the load factor to 1 (only spawn 1 connection at a time) I get speeds like this:
Spawned 11 connections in 31.936s
average time: 2.898s

If I set the load factor to 2, the time changes to:
Spawned 16 connections in 32.092s
average time: 3.982s

If I set it up to 5, it looks like:
Spawned 43 connections in 63.078s
average time: 7.224s

And at load 10:
Spawned 52 connections in 68.676s
average time: 12.888s

It also tends to be *very* batchy. I wait 0.1s between each spawned connection, but they tend to hang for a while and then all respond at once.

For contrast, I did the same thing, but just to the loopback on carob.

Load 1:
Spawned 61 connections in 60.170s
average time: 0.982s

Load 10:
Spawned 562 connections in 60.821s
average time: 0.971s

which shows no load dependence at all.

So the contrast is that with 10 simultaneous connections, the Twisted server takes ~4x longer to complete each connection.

John A Meinel (jameinel) wrote :

I'm attaching the script I was using for load testing. It basically runs:
  echo hello | ssh $HOST bzr serve --inet --directory=/ --allow-writes

in as many parallel processes as you ask for (via --load=$X).
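The attached script itself isn't reproduced in this report, but its described behaviour (spawn up to --load parallel ssh processes, staggered 0.1s apart, and average their wall-clock times) can be sketched roughly as below. The `load_test` helper, its parameters, and the thread-based structure are assumptions for illustration, not the actual attachment:

```python
import subprocess, threading, time

def load_test(command, total, load, stagger=0.1):
    """Run `command` with at most `load` instances in flight, started
    `stagger` seconds apart; return the average per-run wall time.

    Hypothetical reconstruction; the real script ran something like:
        echo hello | ssh $HOST bzr serve --inet --directory=/ --allow-writes
    """
    durations = []
    lock = threading.Lock()
    slots = threading.Semaphore(load)   # cap on concurrent connections

    def one_connection():
        with slots:
            start = time.time()
            subprocess.run(command, shell=True, input=b'hello\n',
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
            with lock:
                durations.append(time.time() - start)

    threads = []
    for _ in range(total):
        t = threading.Thread(target=one_connection)
        t.start()
        threads.append(t)
        time.sleep(stagger)             # 0.1s between spawns, as described
    for t in threads:
        t.join()
    return sum(durations) / len(durations)
```

A call like load_test('ssh $HOST bzr serve --inet --directory=/ --allow-writes', total=50, load=10) would reproduce the shape of the numbers above.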

On 17 December 2010 03:31, John A Meinel <email address hidden> wrote:
> Public bug reported:
>
> It appears that the twisted SSH server takes significantly longer to ssh
> handshake when there are multiple simultaneous requests.
>
> I don't fully understand what is going on here, but I'll give the
> symptoms.

Given that this is twisted code, this sounds to me a lot like it's
doing some kind of blocking operation inline.

--
Martin

Andrew Bennetts (spiv) wrote :

What Martin said. My first guess is that Conch is almost certainly doing all the crypto in the same thread that runs the event loop, so if that takes non-negligible time then it will delay processing of the next connection. Some profiling of that code will probably find the culprit fairly easily.

I can rule out one possibility: I'm fairly sure that the XML-RPC call to the Launchpad database for SSH keys for that username is not blocking in that code; at least it wasn't when I wrote it :)


On 12/19/2010 7:45 PM, Andrew Bennetts wrote:
> What Martin said. My first guess is that Conch is almost certainly
> doing all the crypto in the same thread that runs the event loop, so if
> that takes non-negligible time then it will delay processing of the next
> connection. Some profiling of that code will probably find the culprit
> fairly easily.
>
> I can rule out one possibility: I'm fairly sure that the XML-RPC call to
> the Launchpad database for SSH keys for that username is not blocking in
> that code; at least it wasn't when I wrote it :)
>

How do you profile the twisted server, though?

John
=:->


Martin Pool (mbp) wrote :

Andrew may know a better way, but I would just use a poor man's profiler:

Look at whether the server is spending its time in user space or in system calls (and if so, which system calls). If spiv's guess is right and it's doing too much blocking crypto work, it will be in userspace. Then we could use strace or a gdb attachment to narrow it down.
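An in-process variant of that poor man's profiler, for the userspace side, is to sample the currently executing frame on a CPU timer and histogram the results. This is a generic sketch; the `busy()` workload is a stand-in for the server's work, not actual codehosting code:

```python
import collections, signal

samples = collections.Counter()

def sample(signum, frame):
    # Record which function is executing when the CPU timer fires;
    # the histogram of these samples is a crude statistical profile.
    samples[frame.f_code.co_name] += 1

signal.signal(signal.SIGPROF, sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample each 1ms of CPU

def busy():
    # Stand-in workload; blocking crypto would show up the same way.
    total = 0
    for i in range(2000000):
        total += i * i
    return total

busy()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
for name, count in samples.most_common(3):
    print(name, count)
```

Because ITIMER_PROF counts CPU time rather than wall time, time spent blocked in system calls simply produces no samples, which matches the user-space/system-call split suggested above.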

Andrew Bennetts (spiv) wrote :

John A Meinel wrote:
[...]
> How do you profile the twisted server, though?

It's just Python, which you know how to profile, so I guess you're
specifically asking how to profile something run via twistd(1)? Add
--profile=profile.stats (and maybe --profiler=cprofile) to the twistd
invocation.

(If you're invoking reactor.run() directly for some reason, then
obviously you can just use the usual profiling APIs.)
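For that direct case, the usual profiling APIs look like the following; `run()` here is just a stand-in workload for whatever the process would do (e.g. reactor.run()):

```python
import cProfile, io, pstats

def run():
    # Stand-in for reactor.run(): any workload to be profiled.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Render the top entries sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```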

Also, you could rig the process to enable/disable profiling when poked
by a signal or similar, regardless of how it was started. I think
something like this would work, but I haven't tested:

import cProfile, signal

profile = cProfile.Profile()
profiling = False

def toggle_profiler(*ignored):
    global profiling
    if not profiling:
        profile.enable()
    else:
        profile.disable()
        profile.dump_stats('/tmp/profile.stats')
    profiling = not profiling

signal.signal(signal.SIGUSR1, toggle_profiler)
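The toggle above can be exercised stand-alone by delivering SIGUSR1 to the current process instead of an external twistd. This self-contained sketch does exactly that; the stats path and the workload are illustrative:

```python
import cProfile, os, signal, tempfile

stats_path = os.path.join(tempfile.gettempdir(), 'profile.stats')
profile = cProfile.Profile()
profiling = False

def toggle_profiler(*ignored):
    # The flag must flip on every call; otherwise the profiler is
    # re-enabled each time and the stats are never dumped.
    global profiling
    if not profiling:
        profile.enable()
    else:
        profile.disable()
        profile.dump_stats(stats_path)
    profiling = not profiling

signal.signal(signal.SIGUSR1, toggle_profiler)

os.kill(os.getpid(), signal.SIGUSR1)  # first signal: start profiling
sum(i * i for i in range(100000))     # some work to record
os.kill(os.getpid(), signal.SIGUSR1)  # second signal: stop and dump
print(os.path.exists(stats_path))
```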

Also, along the lines of Martin's suggestion, I'd expect “perf record
twistd ...” followed by “perf report” would probably show if a
significant proportion of the time is spent in crypto libraries, generic
Python, or something else entirely. (Conch uses PyCrypto, which should
be doing the hard work in C functions, unless somehow the packaging is
broken, which seems unlikely.)

Tim Penhey (thumper) on 2011-01-07
Changed in launchpad:
status: New → Triaged
importance: Undecided → Low
tags: added: codehosting-ssh
removed: codehosting
John A Meinel (jameinel) wrote :

Andrew and I poked at this a bit, and found that installing python-gmpy is actually a good way to reduce startup overhead. (Public-key crypto does a lot of large-integer math, and gmpy is about 6x faster than Python's built-in pow().)
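The large-integer cost is easy to see directly: three-argument pow() (modular exponentiation) is the hot operation in a public-key handshake. A rough, self-contained timing sketch follows; the operand values are arbitrary RSA-sized numbers, and gmpy2 (the successor to the gmpy package discussed here) is treated as optional since it may not be installed:

```python
import time

# Operand sizes approximate 1024-bit RSA.
base = 0x10001
exponent = int('f' * 256, 16)   # a 1024-bit exponent
modulus = int('d' * 256, 16)    # a 1024-bit modulus

start = time.time()
builtin = pow(base, exponent, modulus)
builtin_time = time.time() - start
print('builtin pow: %.4fs' % builtin_time)

# gmpy2 is optional in this sketch; if present, compare its modular
# exponentiation against the built-in result.
try:
    import gmpy2
    start = time.time()
    fast = int(gmpy2.powmod(base, exponent, modulus))
    print('gmpy2 powmod: %.4fs (agrees: %s)'
          % (time.time() - start, fast == builtin))
except ImportError:
    pass
```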

It also turns out that the perceived scaling isn't as bad on production as it is on the other machines I tested. Specifically, on my local single-cpu instance:

--load=1
time 0.28s

--load=10
time 2.80s

(or about linear scaling)

On qa-staging I also see
5.4s => 13.5s

going from a load of 1 to a load of 10. So a 2.5x scaling for a 10x load factor.

On production, though, I see:
3.0s => 4.0s

So a 25% increase at a 10x load factor.

My guess is that some of the scaling is just being CPU bound on my local machine. The fact that bazaar.launchpad.net has more and better CPUs means it can finish one connection fast enough to avoid affecting another.

Note that the load dependence is also much more apparent when the lp-forking-service is active, because the total time to run becomes much lower. So a 1s overhead is a much larger fraction of total time.

Colin Watson (cjwatson) on 2015-01-14
affects: launchpad → lazr.sshserver