Comment 3 for bug 798115

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 798115] Re: Ensemble is too slow to startup

Excerpts from Ahmed Kamal's message of Thu Jun 16 10:11:55 UTC 2011:
> I was investigating why the open-tunnel functionality was not really
> helping cut-down the needed time to run ensemble commands. Here are some
> findings
>
> 1- Bootstrapping an initial environment, no open-tunnel used. Ran the
> command "time ensemble status", "real" time reported on average is 6.8
> seconds
>
> 2- Now running "ensemble open-tunnel" in another window, establishes a
> tunnel. Running "time ensemble status", "real" time reported on average
> was 5 seconds
>
> 3- This time is seemingly not for establishing the ssh connection. Measuring the pure ssh connection time with
> time ssh <email address hidden>:22 -S ~/.<email address hidden>\:22 w
>
> resulted in an average time of 0.3 seconds. Meaning the 5 seconds (or
> 4.7s) ensemble uses to run "ensemble status" are needed for local
> ensemble command line initialization
>
> Potential solutions might be to profile and cut-down the time needed for
> ensemble to initialize itself, or to provide an interactive ensemble
> mode where commands are entered serially, avoiding closing and rerunning
> ensemble for every command
>

As Gustavo said, its not SSH that is the issue, but Zookeeper
communication that slows things down the most. Without a bootstrapped
environment though, ensemble does take 1.7 seconds to tell me that there
is no bootstrapped environment. This sluggishness will not be popular
with users and most of it can be handled by printing a message quickly
saying "Thu Jun 16 10:08:03 - Communicating with ensemble control node..."

This 1.7 second thing seems pretty bad, what is it doing in that time?

Here are some interesting findings from 'strace -ttT -f' of ensemble status with
no bootstrap:

2192 10:13:10.664831 connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, 16) = 0 <0.000027>
2192 10:13:10.687296 recvfrom(6, "q\7\201\200\0\1\0\2\0\0\0\0\2s3\tamazonaws\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, [16]) = 69 <0.000027>

There's the DNS lookup for amazon's server to my "local" DNS ... 22ms

2187 10:13:10.688239 connect(5, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.185.200")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000091>
2187 10:13:10.730919 connect(5, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.185.200")}, 16) = 0 <0.000051>

Connecting to Amazon took 43ms.. Not exactly high latency, but not local network.

2187 10:13:10.887957 shutdown(5, 2 /* send and receive */ <unfinished ...>

There is the end of the HTTPS communication w/ amazon from what I can see. 156ms to negotiate a session and issue one request to Amazon over SSL.

So about 221ms to talk just to Amazon's server. Then we have to connect to the region server:

2192 10:13:10.888547 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, 16) = 0 <0.000020>
2192 10:13:10.908226 recvfrom(5, "jx\201\200\0\1\0\2\0\0\0\0\3ec2\tus-east-1\tamazo"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, [16]) = 79 <0.000036>

There's another 19ms. For some reason we do this twice .. maybe the answer was a CNAME?

2192 10:13:10.909036 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, 16) = 0 <0.000020>
2192 10:13:11.098121 recvfrom(5, "\20H\201\200\0\1\0\2\0\0\0\0\3ec2\tus-east-1\tamazo"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, [16]) = 79 <0.000064>

Ooops this one was way slower, 189ms!

2187 10:13:11.099200 connect(6, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.162.181")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000120>
2187 10:13:11.181703 connect(6, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.162.181")}, 16) = 0 <0.000051>

Connecting to this server took 83ms.. borderline high latency

2187 10:13:11.624505 shutdown(6, 2 /* send and receive */) = 0 <0.000082>

And the communication took *443ms*. I'm guessing server response time for this server is much slower.

So a total of 443+221 or 562ms of the 1700 or so ms of the run was spent just talking to Amazon.

I think its fair to say we need to have status messages *showing* that we're waiting on the machine provider.. like

$TIMESTAMP - Waiting on machine provider: ec2 for operation: FindSeed

Also, after all of that, a bunch of python stuff happens that isn't "free", and also ldconfig gets run 3 or 4 times.. not sure why that is.