Juju commands return SSL errors

Bug #985812 reported by Brad Crittenden
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
pyjuju
Invalid
High
Unassigned

Bug Description

While waiting for a service to start our scripts must repeatedly query 'juju status' waiting for the agent-state to transition to 'started'. Infrequently but in a repeatable manner, 'juju status' will encounter an SSL error and fail.

Here is the output of a demonstration script I wrote that calls 'juju status' repeatedly, sleeping 0.1 seconds between calls:

/home/bac> ./status_fail.py
....................................................................................................................................
Exception caught on iteration 133
Command '['juju', 'status']' returned non-zero exit status 1
2012-04-18 11:34:26,993 WARNING ssl-hostname-verification is disabled for this environment
2012-04-18 11:34:26,993 WARNING EC2 API calls encrypted but not authenticated
2012-04-18 11:34:26,993 WARNING S3 API calls encrypted but not authenticated
2012-04-18 11:34:26,993 WARNING Ubuntu Cloud Image lookups encrypted but not authenticated
2012-04-18 11:34:26,994 INFO Connecting to environment...
2012-04-18 11:34:28,317 INFO Connected to environment.
Connection to the other side was lost in a non-clean fashion: Connection lost.
2012-04-18 11:34:28,874 ERROR Connection to the other side was lost in a non-clean fashion: Connection lost.

Traceback (most recent call last):
  File "./status_fail.py", line 17, in <module>
    juju_status()
  File "/usr/lib/python2.7/dist-packages/shelltoolbox/__init__.py", line 133, in callable_command
    return run(*all_args)
CalledProcessError: Command '['juju', 'status']' returned non-zero exit status 1

(The script used can be seen at http://pastebin.ubuntu.com/937261/)

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Any way you can make your script use 'juju -v status' and give us the stderr?

Changed in juju:
importance: Undecided → High
Revision history for this message
Kapil Thangavelu (hazmat) wrote : Re: [Bug 985812] [NEW] Juju commands return SSL errors

Excerpts from Brad Crittenden's message of 2012-04-19 18:45:25 UTC:
> Public bug reported:
>
> While waiting for a service to start our scripts must repeatedly query
> 'juju status' waiting for the agent-state to transition to 'started'.
> Infrequently but in a repeatable manner, 'juju status' will encounter an
> SSL error and fail.
>
> Here is the output of a demonstration script I wrote that calls 'juju
> status' repeatedly, sleeping 0.1 seconds between calls:
>
> /home/bac> ./status_fail.py
> ....................................................................................................................................
> Exception caught on iteration 133
> Command '['juju', 'status']' returned non-zero exit status 1
> 2012-04-18 11:34:26,993 WARNING ssl-hostname-verification is disabled for this environment
> 2012-04-18 11:34:26,993 WARNING EC2 API calls encrypted but not authenticated
> 2012-04-18 11:34:26,993 WARNING S3 API calls encrypted but not authenticated
> 2012-04-18 11:34:26,993 WARNING Ubuntu Cloud Image lookups encrypted but not authenticated
> 2012-04-18 11:34:26,994 INFO Connecting to environment...
> 2012-04-18 11:34:28,317 INFO Connected to environment.
> Connection to the other side was lost in a non-clean fashion: Connection lost.
> 2012-04-18 11:34:28,874 ERROR Connection to the other side was lost in a non-clean fashion: Connection lost.
>
> Traceback (most recent call last):
> File "./status_fail.py", line 17, in <module>
> juju_status()
> File "/usr/lib/python2.7/dist-packages/shelltoolbox/__init__.py", line 133, in callable_command
> return run(*all_args)
> CalledProcessError: Command '['juju', 'status']' returned non-zero exit status 1
>
> (The script used can be seen at http://pastebin.ubuntu.com/937261/)
>
> ** Affects: juju
> Importance: Undecided
> Status: New
>

What's the shelltoolbox pkg in the traceback here? Also fwiw there's a
juju-watcher in charmrunner that does something similiar (wait on service and relations).

Also i'm curious about the use case? ... ie. bringing up services souldn't be
ordering dependent on the cli. charms can inspect relations and invoke other
relation hooks now.

Revision history for this message
Brad Crittenden (bac) wrote :
Download full text (5.2 KiB)

Hi Clint,

I changed the script to invoke 'juju -v status' and got this:

/home/bac> ./status_fail.py
.............................................
Exception caught on iteration 46
Command '['juju', '-v', 'status']' returned non-zero exit status 1
2012-04-23 07:50:30,342 DEBUG Initializing juju status runtime
2012-04-23 07:50:30,354 WARNING ssl-hostname-verification is disabled for this environment
2012-04-23 07:50:30,354 WARNING EC2 API calls encrypted but not authenticated
2012-04-23 07:50:30,355 WARNING S3 API calls encrypted but not authenticated
2012-04-23 07:50:30,355 WARNING Ubuntu Cloud Image lookups encrypted but not authenticated
2012-04-23 07:50:30,355 INFO Connecting to environment...
2012-04-23 07:50:30,696 DEBUG Connecting to environment using ec2-107-22-125-141.compute-1.amazonaws.com...
2012-04-23 07:50:30,696 DEBUG Spawning SSH process with remote_user="ubuntu" remote_host="ec2-107-22-125-141.compute-1.amazonaws.com" remote_port="2181" local_port="32810".
2012-04-23 07:50:31,200:5623(0xb73e8b00):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.5
2012-04-23 07:50:31,200:5623(0xb73e8b00):ZOO_INFO@log_env@662: Client environment:host.name=sapa
2012-04-23 07:50:31,200:5623(0xb73e8b00):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2012-04-23 07:50:31,200:5623(0xb73e8b00):ZOO_INFO@log_env@670: Client environment:os.arch=3.2.0-10-generic-pae
2012-04-23 07:50:31,200:5623(0xb73e8b00):ZOO_INFO@log_env@671: Client environment:os.version=#18-Ubuntu SMP Tue Jan 24 17:58:07 UTC 2012
2012-04-23 07:50:31,201:5623(0xb73e8b00):ZOO_INFO@log_env@679: Client environment:user.name=bac
2012-04-23 07:50:31,201:5623(0xb73e8b00):ZOO_INFO@log_env@687: Client environment:user.home=/home/bac
2012-04-23 07:50:31,201:5623(0xb73e8b00):ZOO_INFO@log_env@699: Client environment:user.dir=/home/bac
2012-04-23 07:50:31,201:5623(0xb73e8b00):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=localhost:32810 sessionTimeout=10000 watcher=0xb7821620 sessionId=0 sessionPasswd=<null> context=0x8fb4778 flags=0
2012-04-23 07:50:31,202:5623(0xb64ffb40):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:32810]
2012-04-23 07:50:32,097:5623(0xb64ffb40):ZOO_INFO@check_events@1632: session establishment complete on server [127.0.0.1:32810], sessionId=0x136df075bb80035, negotiated timeout=10000
2012-04-23 07:50:32,129 DEBUG Environment is initialized.
2012-04-23 07:50:32,130 INFO Connected to environment.
2012-04-23 07:50:32,369:5623(0xb64ffb40):ZOO_ERROR@handle_socket_error_msg@1603: Socket [127.0.0.1:32810] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2012-04-23 07:50:32,370:5623(0xb64ffb40):ZOO_ERROR@handle_socket_error_msg@1579: Socket [127.0.0.1:32810] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 391, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python2.7/dist-packages/twisted/inte...

Read more...

Revision history for this message
Brad Crittenden (bac) wrote :

Hi Kapil,

| What's the shelltoolbox pkg in the traceback here? Also fwiw there's a
| juju-watcher in charmrunner that does something similiar (wait on service and relations).

lp:python-shelltoolbox is a package that just provides some wrappers for interacting with shell commands. The one used here is just makes using subprocess.Popen a little cleaner.

I should have also noted that if you just run 'watch juju status' it will occasionally encounter the same error I'm reporting here so it is unrelated to the toolset I'm using.

| Also i'm curious about the use case? ... ie. bringing up services souldn't be
| ordering dependent on the cli. charms can inspect relations and invoke other
| relation hooks now.

Our buildbot-slave has to run commands to set up the lxc container with a launchpad tree. We're waiting on that process to complete (transition to 'started') so we can then automate the next set of steps post-juju. For instance, we want to actually kick off a buildbot build which we can't do until everything is up and running.

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 985812] Re: Juju commands return SSL errors

Excerpts from Brad Crittenden's message of Mon Apr 23 12:08:22 UTC 2012:
> | Also i'm curious about the use case? ... ie. bringing up services souldn't be
> | ordering dependent on the cli. charms can inspect relations and invoke other
> | relation hooks now.
>
> Our buildbot-slave has to run commands to set up the lxc container with
> a launchpad tree. We're waiting on that process to complete (transition
> to 'started') so we can then automate the next set of steps post-juju.
> For instance, we want to actually kick off a buildbot build which we
> can't do until everything is up and running.
>

Sounds reasonable. So it seems to me that juju status is just unreliable
and has a pretty predictable fail rate. I would hope that we can provide
a more fundamental "watch for this state" in juju that does not poll
status but instead installs a ZK watch.

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

i've attached a branch to this ticket, brad would you mind verifying if it works for you, its just a client side fix, doesn't need any changes to the deployed env. i'm just trying to verify if the error is coming from the zk client or ssh. The branch changes the zk client used for the cli to one that will tolerate transient packet loss.

Changed in juju:
status: New → In Progress
milestone: none → galapagos
Revision history for this message
Brad Crittenden (bac) wrote :

Kapil I tried your branch but got the following failure after 183 iterations:

/home/bac> ./status_fail.py
......................................................................................................................................................................................
Exception caught on iteration 183
Command '['/tmp/retry-client-for-cli/bin/juju', '-v', 'status']' returned non-zero exit status 1
2012-04-24 15:22:51,804 DEBUG Initializing juju status runtime
2012-04-24 15:22:51,817 WARNING ssl-hostname-verification is disabled for this environment
2012-04-24 15:22:51,817 WARNING EC2 API calls encrypted but not authenticated
2012-04-24 15:22:51,817 WARNING S3 API calls encrypted but not authenticated
2012-04-24 15:22:51,817 WARNING Ubuntu Cloud Image lookups encrypted but not authenticated
2012-04-24 15:22:51,818 INFO Connecting to environment...
2012-04-24 15:22:51,996 ERROR Cannot connect to environment: An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/juju/providers/common/connect.py", line 42, in run
    client = yield self._internal_connect(share)
SSLVerificationError: An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]
Traceback (most recent call last):
Failure: juju.errors.SSLVerificationError: An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]
2012-04-24 15:22:51,997 ERROR Traceback (most recent call last):
Failure: juju.errors.SSLVerificationError: An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]

An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]
2012-04-24 15:22:51,997 ERROR An SSL endpoint failed verification: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]

Traceback (most recent call last):
  File "./status_fail.py", line 18, in <module>
    juju_status()
  File "/usr/lib/python2.7/dist-packages/shelltoolbox/__init__.py", line 133, in callable_command
    return run(*all_args)
CalledProcessError: Command '['/tmp/retry-client-for-cli/bin/juju', '-v', 'status']' returned non-zero exit status 1

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Brad.. I think that looks like a legitimate SSL failure when talking to S3 or EC2, most likely S3 since it is the part where we find the zookeeper machines. Perhaps SSL hiccups a lot for Amazon, and we should build in a few retries?

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I'm setting the status of this back to "New", as I don't think this was ever a zookeeper error. Perhaps our SSL timeouts are just too low for S3/EC2.

I'm also dropping it from galapagos since its not really well understood and time is short.

Changed in juju:
status: In Progress → New
milestone: galapagos → none
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I've looked through the EC2 provider code and it seems that the s3 calls (the only SSL calls that would happen in status) are wrapped by an error handler which reports this message. Its totally legitimate failure to communicate with S3, and I'm not sure juju should burden itself with retrying such an operation.

As such, I don't think this is a bug in juju, txaws, or even twisted.. its just a regular error condition.

Changed in juju:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.