Error on allwatcher api

Bug #1269519 reported by Chris Johnston
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
High
Roger Peppe
juju-deployer
Fix Released
Critical
Unassigned

Bug Description

Traceback (most recent call last):
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/cli.py", line 217, in <module>
    main()
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/cli.py", line 118, in main
    run()
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/cli.py", line 209, in run
    importer.Importer(env, deployment, options).run()
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/action/importer.py", line 191, in run
    self.wait_for_units()
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/action/importer.py", line 166, in wait_for_units
    int(timeout), watch=self.options.watch, no_exit=ignore_error)
  File "/home/ubuntu/projects/amulet/ci-deployer.DLqa/deployer/env/go.py", line 225, in wait_for_units
    self.client.wait_for_units(timeout, goal_state, callback=callback)
  File "/usr/lib/python2.7/dist-packages/jujuclient.py", line 281, in wait_for_units
    return WaitForUnits(watch, goal_state).run(callback)
  File "/usr/lib/python2.7/dist-packages/jujuclient.py", line 495, in run
    for change_set in self.watch:
  File "/usr/lib/python2.7/dist-packages/jujuclient.py", line 206, in next
    return super(TimeoutWatcher, self).next()
  File "/usr/lib/python2.7/dist-packages/jujuclient.py", line 168, in next
    'Id': self.watcher_id})
  File "/usr/lib/python2.7/dist-packages/jujuclient.py", line 135, in _rpc
    raise EnvError(result)
jujuclient.EnvError: <Env Error - Details:
 { u'Error': u'state watcher was stopped', u'RequestId': 3, u'Response': { }}
 >

https://pastebin.canonical.com/103023/

Related branches

Revision history for this message
Kapil Thangavelu (hazmat) wrote :
Download full text (3.9 KiB)

Roger looked over the machine log for this env and isolated the api server conversation from the client to the server as, which seems to intimate an issue on the api server, as the client hasn't closed the watch or state connection before getting the error.

2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:102 <- [1A] <unknown> {"RequestId":0,"Type":"Admin","Request":"Login","Params":{"AuthTag":"user-admin","Password":"70e69ee9b8c02eddd0a4a39d6fb67401","Nonce":""}}
2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:109 -> [1A] user-admin 83.768352ms {"RequestId":0,"Response":{}} Admin[""].Login
2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:102 <- [1A] user-admin {"RequestId":0,"Type":"Client","Request":"WatchAll","Params":{}}
2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:109 -> [1A] user-admin 697.662us {"RequestId":0,"Response":{"AllWatcherId":"1"}} Client[""].WatchAll
2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:102 <- [1A] user-admin {"RequestId":1,"Type":"AllWatcher","Id":"1","Request":"Next","Params":{}}
2014-01-15 17:26:30 DEBUG juju.state.apiserver apiserver.go:109 -> [1A] user-admin 482.836us {"RequestId":1,"Response":{"Deltas":[["machine","change",{"Id":"0","InstanceId":"","Status":"started","StatusInfo":"","StatusData":null}],["unit","change",{"Name":"django/0","Service":"django","Series":"precise","CharmURL":"","PublicAddress":"","PrivateAddress":"","MachineId":"1","Ports":[],"Status":"pending","StatusInfo":"","StatusData":null}],["service","change",{"Name":"django","Exposed":true,"CharmURL":"local:precise/python-django-7","OwnerTag":"user-admin","Life":"alive","MinUnits":0,"Constraints":{},"Config":{"additional_distro_packages":"python-mock,python-oauth","additional_pip_packages":"django-tastypie==0.9.15","application_path":"ppa-assigner/","django_extra_settings":"LAUNCHPAD_PPA_USER=\"ci-engineering-airline\"","django_settings":"ppa_assigner.settings","django_south":true,"python_path":"/srv/django/ci-utils:/srv/django/ppa-assigner","repos_url":"lp:ubuntu-ci-services-itself","vcs":"bzr"}}],["service","change",{"Name":"gunicorn","Exposed":false,"CharmURL":"local:precise/gunicorn-3","OwnerTag":"user-admin","Life":"alive","MinUnits":0,"Constraints":{},"Config":{"python_path":"/srv/django/ci-utils:/srv/django/ppa-assigner"}}],["relation","change",{"Key":"postgres:replication","Id":0,"Endpoints":[{"ServiceName":"postgres","Relation":{"Name":"replication","Role":"peer","Interface":"pgreplication","Optional":false,"Limit":1,"Scope":"global"}}]}],["service","change",{"Name":"postgres","Exposed":false,"CharmURL":"local:precise/postgresql-2","OwnerTag":"user-admin","Life":"alive","MinUnits":0,"Constraints":{},"Config":{}}],["unit","change",{"Name":"postgres/0","Service":"postgres","Series":"precise","CharmURL":"","PublicAddress":"","PrivateAddress":"","MachineId":"2","Ports":[],"Status":"pending","StatusInfo":"","StatusData":null}],["machine","change",{"Id":"2","InstanceId":"","Status":"pending","StatusInfo":"","StatusData":null}],["machine","change",{"Id":"1","InstanceId":"a2b23356-c2ef-4795-92bc-6905cdffa2a6","Status":"pending","StatusInfo":"","StatusData":null}]]}} AllWatcher[...

Read more...

summary: - juju-deployer -T fails with jujuclient.EnvError: <Env Error - Details:
- { u'Error': u'state watcher was stopped', u'RequestId': 3,
- u'Response': { }} >
+ Error on allwatcher api
no longer affects: juju
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

python jujuclient version was 0.0.7+bzr12-0~bzr16~precise1ubuntu1~0.IS.12.04

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

i've gotten a half-dozen reports about this today from folks trying to automate deployments with deployer fwiw.

Changed in juju-deployer:
importance: Undecided → Critical
Marco Ceppi (marcoceppi)
Changed in juju-deployer:
status: New → Confirmed
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

linked roger's branch which adds some additional logging to help debug this.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → In Progress
milestone: none → 1.17.1
importance: Undecided → High
assignee: nobody → Roger Peppe (rogpeppe)
Revision history for this message
Matt Bruzek (mbruzek) wrote :

Roger asked for from-scratch instructions on how to reproduce this bug. I just did it today and here are the steps I took on 1.17.0-trusty-amd64. You will have to grab my rabbitmq-server amulet test that I wrote.

bzr branch lp:~mbruzek/charms/precise/rabbitmq-server/tests

sudo juju boostrap -e local
juju status

cd rabbitmq-server
python3 tests/basic_deploy_test.py

The basic_deploy_test.py does a simple deploy of the rabbitmq-server from the amulet testing harness. I saw the bug here this morning.

Please let me know if you need any more information.

Revision history for this message
Michael Nelson (michael.nelson) wrote :

FWIW, here's how I'm reproducing this issue http://paste.ubuntu.com/6768177/

Revision history for this message
Matt Bruzek (mbruzek) wrote :

I found out a different naming scheme and I may have to remove the basic test from the repository so I will paste it here in case Roger has not had a chance to branch my code.

#!/usr/bin/python3

import os
import amulet

d = amulet.Deployment()
# Add the rabbitmq-server charm to the deployment.
d.add('rabbitmq-server')

# The number of seconds to wait for the environment to setup.
seconds = 900
try:
    # Execute the deployer with the current mapping.
    d.setup(timeout=seconds)
except amulet.helpers.TimeoutError:
    message = 'The environment did not setup in %d seconds.', seconds
    # The SKIP status enables skip or fail the test based on configuration.
    amulet.raise_status(amulet.SKIP, msg=message)
except:
    raise

# Run the command that checks if the rabbitmq-server service is running.
command = 'rabbitmqctl status'
# Execute the command on the deployed service.
output, code = d.sentry.unit['rabbitmq-server/0'].run(command)
# Check the return code for the success and failure of this test.
if (code != 0):
    message = 'The ' + command + ' did not return the expected code of 0.'
    print(output)
    amulet.raise_status(amulet.FAIL, msg=message)
else:
    message = 'The rabbitmq-server passed this test.'
    print(output)
    print(message)

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

hmm.. one possibility might be a lack of pings on the websocket (synchronous client so pings are only explicit) which might be leading to some form of client connection timeout. although i would expect the symptons of that to be a different (and the socket to be closed by the server).

Revision history for this message
Roger Peppe (rogpeppe) wrote :

Kapil: I'm pretty sure that was the problem - the connection was shut down
after exactly 3 minutes, which is the timeout set in state/apiserver for
the maximum allowed interval between ping requests.

I merged revision 2219 to fix this (and it seems to work for me - I can no
longer reproduce the issue from the script above).

The symptoms were as expected - the server shut down the connection,
but before it does that, it closes all resources associated with the connection,
one of which is the state watcher, which sees that it's been closed down
and sends a "i've been stopped" reply to the existing request.

Roger Peppe (rogpeppe)
Changed in juju-core:
status: In Progress → Fix Committed
Changed in juju-deployer:
status: Confirmed → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

marking this fix released in deployer.. users will need juju versions 1.17+ though to avoid hitting it on long running deployments.

Changed in juju-deployer:
status: Fix Committed → Fix Released
Revision history for this message
Chris Johnston (cjohnston) wrote :

I just started seeing this again.

juju-core 1.17.3-0ubuntu1
python-jujuclient 0.15-0ubuntu1

Revision history for this message
Evan (ev) wrote :
Download full text (6.1 KiB)

Just saw this in juju-deployer r103
juju-core 1.17.1-0ubuntu1
python-jujuclient 0.0.9-0ubuntu1

all-machines.log: https://pastebin.canonical.com/105508/

2014-02-25 11:24:43 Resolving configuration
2014-02-25 11:24:44 Deploying services...
2014-02-25 11:24:44 <deployer.env.go.GoEnvironment object at 0x165fd50>
2014-02-25 11:24:44 Deploying service bsb-gunicorn using local:precise/gunicorn
2014-02-25 11:24:46 Deploying service bsb-restish using local:precise/restish
2014-02-25 11:24:47 Deploying service bsb-worker using local:precise/rabbitmq-worker
2014-02-25 11:24:50 Deploying service imagebuild-gunicorn using local:precise/gunicorn
2014-02-25 11:24:51 Deploying service imagebuild-restish using local:precise/restish
2014-02-25 11:24:53 Deploying service imagebuild-worker using local:precise/rabbitmq-worker
2014-02-25 11:24:55 Deploying service lander-gunicorn using local:precise/gunicorn
2014-02-25 11:24:57 Deploying service lander-jenkins using local:precise/jenkins
2014-02-25 11:25:00 Deploying service lander-jenkins-sub using local:precise/lander-jenkins
2014-02-25 11:25:02 Deploying service lander-restish using local:precise/restish
2014-02-25 11:25:03 Deploying service ppa-cleaner using local:precise/upstart
2014-02-25 11:25:05 Deploying service ppa-django using local:precise/python-django
2014-02-25 11:25:08 Deploying service ppa-gunicorn using local:precise/gunicorn
2014-02-25 11:25:10 Deploying service ppa-postgres using local:precise/postgresql
2014-02-25 11:25:12 Deploying service rabbit using local:precise/rabbitmq-server
2014-02-25 11:25:14 Deploying service tr-gunicorn using local:precise/gunicorn
2014-02-25 11:25:19 Deploying service tr-rabbit-worker using local:precise/rabbitmq-worker
2014-02-25 11:25:21 Deploying service tr-restish using local:precise/restish
2014-02-25 11:25:23 Deploying service ts-django using local:precise/python-django
2014-02-25 11:25:25 Deploying service ts-gunicorn using local:precise/gunicorn
2014-02-25 11:25:27 Deploying service ts-postgres using local:precise/postgresql
2014-02-25 11:25:29 Deploying service webui-apache using local:precise/apache2
2014-02-25 11:25:31 Deploying service webui-app using local:precise/webui
2014-02-25 11:25:34 Deploying service ci-juju-gui using local:precise/juju-gui
2014-02-25 11:25:34 Refetching status for placement deploys
2014-02-25 11:26:18 Adding units...
2014-02-25 11:26:19 Config specifies num units for subordinate: bsb-gunicorn
2014-02-25 11:26:19 Service 'bsb-restish' does not need any more units added.
2014-02-25 11:26:19 Service 'bsb-worker' does not need any more units added.
2014-02-25 11:26:19 Config specifies num units for subordinate: imagebuild-gunicorn
2014-02-25 11:26:19 Service 'imagebuild-restish' does not need any more units added.
2014-02-25 11:26:19 Service 'imagebuild-worker' does not need any more units added.
2014-02-25 11:26:19 Config specifies num units for subordinate: lander-gunicorn
2014-02-25 11:26:19 Service 'lander-jenkins' does not need any more units added.
2014-02-25 11:26:19 Config specifies num units for subordinate: lander-jenkins-sub
2014-02-25 11:26:19 Service 'lander-restish' does not need any ...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.