add_unit to deployed service fails

Bug #1430488 reported by Robert C Jennings
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Amulet | Fix Released | Critical | Adam Israel |

Bug Description

The UnitSentry.upload_scripts function appears to attempt the mkdir and scp before the machine is ready. I think this is because Talisman.__init__ does not wait for the machine/unit of an already deployed service to become ready before it calls UnitSentry.fromunitdata(), which in turn calls UnitSentry.upload_scripts().
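
The ordering problem described above can be illustrated with a minimal sketch. This is not Amulet's code: `FakeUnit`, `wait_until_ready`, and `add_unit_safely` are hypothetical names, and the sketch assumes only that a unit offers some readiness check that can be polled before any mkdir/scp is attempted.

```python
import time


class FakeUnit:
    """Hypothetical stand-in for a unit that becomes ready after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready
        self.uploaded = False

    def is_ready(self):
        # Each poll brings the fake machine one step closer to being ready.
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True

    def upload_scripts(self):
        # Mirrors the reported failure: uploading too early raises an error.
        if self._remaining > 0:
            raise RuntimeError("lost connection: machine not ready")
        self.uploaded = True


def wait_until_ready(unit, timeout=300.0, interval=0.01):
    """Poll unit.is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if unit.is_ready():
            return
        time.sleep(interval)
    raise TimeoutError("unit not ready after %.1f seconds" % timeout)


def add_unit_safely(unit, timeout=300.0):
    """Wait for readiness first, then upload, instead of uploading at once."""
    wait_until_ready(unit, timeout=timeout)
    unit.upload_scripts()
```

Calling `upload_scripts()` directly on a fresh `FakeUnit(3)` raises, while `add_unit_safely(FakeUnit(3))` succeeds because the upload only starts after the readiness poll returns True.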

Robert C Jennings (rcj) wrote :

Testing my charm fails when I try to add a unit.

Test: http://bazaar.launchpad.net/~rcj/charms/trusty/ubuntu-repository-cache/trunk/view/head:/tests/110-multi_unit.simple
Failure:
$ JUJU_ENV=amazon-test juju charm test -e amazon-test --timeout 2000s -v --constraints instance-type=m3.medium tests/110-multi_unit.simple
juju-test INFO : Starting test run on amazon-test using Juju 1.21.3
juju-test DEBUG : Loading configuration options from testplan YAML
juju-test DEBUG : Creating a new Conductor
juju-test.conductor DEBUG : Starting a bootstrap for amazon-test, kill after 300
juju-test.conductor DEBUG : Running the following: juju bootstrap --constraints instance-type=m3.medium -e amazon-test
Bootstrapping environment "amazon-test"
Starting new instance for initial state server
Launching instance
 - i-a2e89558
Installing Juju agent on bootstrap instance
Waiting for address
Attempting to connect to 54.242.46.69:22
Attempting to connect to 10.187.27.170:22
Warning: Permanently added '54.242.46.69' (ECDSA) to the list of known hosts.
Logging to /var/log/cloud-init-output.log on remote host
Running apt-get update
Running apt-get upgrade
Installing package: curl
Installing package: cpu-checker
Installing package: bridge-utils
Installing package: rsyslog-gnutls
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --retry 10 -o $bin/tools.tar.gz <[https://juju-dist.s3.amazonaws.com/tools/releases/juju-1.21.3-trusty-amd64.tgz]>
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
Bootstrap complete
juju-test.conductor DEBUG : Waiting for bootstrap
juju-test.conductor DEBUG : Still not bootstrapped
juju-test.conductor DEBUG : Running the following: juju status -e amazon-test
juju-test.conductor DEBUG : State for 1.21.3: started
juju-test.conductor.110-multi_unit.simple DEBUG : Running 110-multi_unit.simple (tests/110-multi_unit.simple)
2015-03-02 09:49:44 Starting deployment of amazon-test
2015-03-02 09:49:46 Deploying services...
2015-03-02 09:49:47 Deploying service ubuntu-repository-cache using local:trusty/ubuntu-repository-cache
2015-03-02 09:52:19 Adding relations...
2015-03-02 09:52:20 Exposing service 'ubuntu-repository-cache'
2015-03-02 09:52:21 Deployment complete in 156.42 seconds
2015-03-02 09:52:34,657 juju-test WARNING : test: Setup complete, waiting for startup to complete.
2015-03-02 09:52:36,948 juju-test WARNING : test: Start complete
2015-03-02 09:53:06,990 juju-test WARNING : test: PASS: Leader is not serving metadata too early
2015-03-02 09:53:37,009 juju-test WARNING : test: PASS: Leader is not serving pool too early
2015-03-02 09:53:37,009 juju-test WARNING : test: INFO: Adding 2nd unit
ERROR exit status 1 (Warning: Permanently added '54.242.46.69' (ECDSA) to the list of known hosts.
ERROR subprocess encountered error code 1
ssh_exchange_identification: Connection closed by remote host
lost connection)
Traceback (most recent call last):
  File "tests/110-multi_unit.simple", line 96, in <module>
    d.add_unit('ubuntu-repository-cache')
  File "/usr/lib/python3/dist-p...

Robert C Jennings (rcj) wrote :

The failure reproduces 100% of the time, always on add_unit() and never on the initial deployment of the service. It reproduces with both the AWS and OpenStack providers.

Robert C Jennings (rcj) wrote :

This is blocking review for the ubuntu-repository-charm.

Robert C Jennings (rcj) wrote :

Comment #3 refers to the ubuntu-repository-charm; that is bug #1366834, which is blocked by this issue.

Changed in amulet:
assignee: nobody → Marco Ceppi (marcoceppi)
Marco Ceppi (marcoceppi)
Changed in amulet:
importance: Undecided → Critical
milestone: none → 1.10.1
Marco Ceppi (marcoceppi)
Changed in amulet:
assignee: Marco Ceppi (marcoceppi) → Adam Israel (aisrael)
status: New → In Progress
Adam Israel (aisrael) wrote :

I've pushed a branch that fixes the bug for me:

https://github.com/AdamIsrael/amulet/tree/lp-1430488

Once Robert's confirmed that it's fixed for him (per our conversation on IRC) I'll open a pull request against amulet.

Robert C Jennings (rcj) wrote :

Adam,

I'm still seeing the problem when I run bundletester on my ubuntu-repository-cache charm. I'm dealing with bad networking at a sprint, but I want to get you a reproducer or a test to add to amulet that will show this.

The problem here is not that the timeout needs to be longer. The behavior suggests that Talisman.__init__() returns from wait_for_status() and proceeds to UnitSentry.fromunitdata(), which calls UnitSentry.upload_scripts() and attempts an scp before the unit is actually ready; the scp failure is what we see whenever this error occurs. Once this is working, I see a separate problem: add_unit has a 5-minute timeout that is not configurable, and the ubuntu-repository-cache charm can take a while to spin up a unit.
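
One possible way to check that a unit really accepts SSH sessions before copying files, rather than relying on its reported status alone, is to probe for the SSH identification banner. The sketch below is an assumption for illustration, not what the fix branch does: `ssh_banner_ready` and `wait_for_ssh` are hypothetical helpers, and the `timeout` parameter shows how the wait could be made configurable.

```python
import socket
import time


def ssh_banner_ready(host, port=22, connect_timeout=5.0):
    """Return True if the server answers with an SSH identification string."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout) as sock:
            sock.settimeout(connect_timeout)
            banner = sock.recv(256)
    except OSError:
        # Refused, reset, or timed out: sshd is not accepting sessions yet.
        return False
    # A connection closed by the remote host yields an empty banner here.
    # Simplification: a server may send other lines before the "SSH-" line;
    # this sketch only looks at the first bytes received.
    return banner.startswith(b"SSH-")


def wait_for_ssh(host, port=22, timeout=300.0, interval=2.0):
    """Poll for the SSH banner until it appears or the timeout is exhausted."""
    deadline = time.monotonic() + timeout
    while True:
        if ssh_banner_ready(host, port):
            return True
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

A caller with a slow-starting charm could pass a larger value, for example `wait_for_ssh(address, timeout=1800)`, and only start the scp once the function returns True.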

If you look at tests/function/test_sentry.py (or any other testing for amulet) there is no test for add_unit after Deployer.setup is called to deploy the first N units.

Robert C Jennings (rcj) wrote :

Adam,

I tested your patches and things look great. Thanks. Can you look at adding unit tests for add_unit (https://github.com/juju/amulet/pull/68 or something like it)? Thanks.

Marco Ceppi (marcoceppi)
Changed in amulet:
status: In Progress → Fix Released