Status is slow over high-latency connections

Bug #798115 reported by Ahmed Kamal
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
pyjuju
Confirmed
Low
Unassigned

Bug Description

Ensemble needs 5 seconds to initialize itself. This is too slow for a light weight cli tool

Ahmed Kamal (kim0)
Changed in ensemble:
milestone: none → dublin
Revision history for this message
Ahmed Kamal (kim0) wrote :

I was investigating why the open-tunnel functionality was not really helping cut-down the needed time to run ensemble commands. Here are some findings

1- Bootstrapping an initial environment, no open-tunnel used. Ran the command "time ensemble status", "real" time reported on average is 6.8 seconds

2- Now running "ensemble open-tunnel" in another window, establishes a tunnel. Running "time ensemble status", "real" time reported on average was 5 seconds

3- This time is seemingly not for establishing the ssh connection. Measuring the pure ssh connection time with
time ssh <email address hidden>:22 -S ~/.<email address hidden>\:22 w

resulted in an average time of 0.3 seconds. Meaning the 5 seconds (or 4.7s) ensemble uses to run "ensemble status" are needed for local ensemble command line initialization

Potential solutions might be to profile and cut-down the time needed for ensemble to initialize itself, or to provide an interactive ensemble mode where commands are entered serially, avoiding closing and rerunning ensemble for every command

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

This isn't startup time. This is simply Ensemble doing its job, communicating over Zookeeper to obtain
status information.

summary: - Ensemble is too slow to startup
+ Ensemble is slow over high-latency connections
summary: - Ensemble is slow over high-latency connections
+ Status is slow over high-latency connections
Changed in ensemble:
milestone: dublin → none
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 798115] Re: Ensemble is too slow to startup
Download full text (4.6 KiB)

Excerpts from Ahmed Kamal's message of Thu Jun 16 10:11:55 UTC 2011:
> I was investigating why the open-tunnel functionality was not really
> helping cut-down the needed time to run ensemble commands. Here are some
> findings
>
> 1- Bootstrapping an initial environment, no open-tunnel used. Ran the
> command "time ensemble status", "real" time reported on average is 6.8
> seconds
>
> 2- Now running "ensemble open-tunnel" in another window, establishes a
> tunnel. Running "time ensemble status", "real" time reported on average
> was 5 seconds
>
> 3- This time is seemingly not for establishing the ssh connection. Measuring the pure ssh connection time with
> time ssh <email address hidden>:22 -S ~/.<email address hidden>\:22 w
>
> resulted in an average time of 0.3 seconds. Meaning the 5 seconds (or
> 4.7s) ensemble uses to run "ensemble status" are needed for local
> ensemble command line initialization
>
> Potential solutions might be to profile and cut-down the time needed for
> ensemble to initialize itself, or to provide an interactive ensemble
> mode where commands are entered serially, avoiding closing and rerunning
> ensemble for every command
>

As Gustavo said, its not SSH that is the issue, but Zookeeper
communication that slows things down the most. Without a bootstrapped
environment though, ensemble does take 1.7 seconds to tell me that there
is no bootstrapped environment. This sluggishness will not be popular
with users and most of it can be handled by printing a message quickly
saying "Thu Jun 16 10:08:03 - Communicating with ensemble control node..."

This 1.7 second thing seems pretty bad, what is it doing in that time?

Here are some interesting findings from 'strace -ttT -f' of ensemble status with
no bootstrap:

2192 10:13:10.664831 connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, 16) = 0 <0.000027>
2192 10:13:10.687296 recvfrom(6, "q\7\201\200\0\1\0\2\0\0\0\0\2s3\tamazonaws\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, [16]) = 69 <0.000027>

There's the DNS lookup for amazon's server to my "local" DNS ... 22ms

2187 10:13:10.688239 connect(5, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.185.200")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000091>
2187 10:13:10.730919 connect(5, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("207.171.185.200")}, 16) = 0 <0.000051>

Connecting to Amazon took 43ms.. Not exactly high latency, but not local network.

2187 10:13:10.887957 shutdown(5, 2 /* send and receive */ <unfinished ...>

There is the end of the HTTPS communication w/ amazon from what I can see. 156ms to negotiate a session and issue one request to Amazon over SSL.

So about 221ms to talk just to Amazon's server. Then we have to connect to the region server:

2192 10:13:10.888547 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("209.18.47.61")}, 16) = 0 <0.000020>
2192 10:13:10.908226 recvfrom(5, "jx\201\200\0\1\0\2\0\0\0\0\3ec2\tus-east-1\tamazo"..., 1024, 0, {sa_family=AF_I...

Read more...

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

This is addressed in the statusd branch attached to bug 843539

Changed in juju:
milestone: none → florence
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.