deployer and quickstart are broken in 1.24-alpha1

Bug #1441826 reported by Curtis Hovey
This bug affects 3 people
Affects    Status        Importance  Assigned to     Milestone
juju-core  Fix Released  Critical    Horacio Durán
1.24       Fix Released  Critical    Horacio Durán

Bug Description

There appear to be changes to the API, or a decline in its reliability, that broke deployer and later quickstart in aws, hp, and joyent. MAAS continued to work until recently.

Deployer broke on or before commit e374bae, as seen in the bundle tests section in:
   http://reports.vapour.ws/releases/2481
The deployer stack appears to be up, but I think the relation between the django units and the haproxy is missing.

Quickstart continued to work until commit 0f79f48 as is seen in the bundle tests section in:
    http://reports.vapour.ws/releases/2511
Quickstart isn't informative because it swallows what it is doing.

The last successful run of both deployer and quickstart was
    http://reports.vapour.ws/releases/2478
which looks like this
    http://data.vapour.ws/juju-ci/products/version-2478/aws-deployer-bundle/build-80/consoleText

I believe CI's tests also changed so the issue might be the test script exiting prematurely.

Curtis Hovey (sinzui)
Changed in juju-ci-tools:
status: New → Triaged
importance: Undecided → High
Curtis Hovey (sinzui) wrote :

We can see in the last test of 1.23-beta3 that everything passed
    http://reports.vapour.ws/releases/2521

Then we tested 1.23-beta4 and Hp failed because some instances were left behind from a previous test
    http://reports.vapour.ws/releases/2522
    http://reports.vapour.ws/releases/2523

but when we tested master before and after the 1.23 version we saw total failure. We retested some of the substrates several times.
    http://reports.vapour.ws/releases/2520
    http://reports.vapour.ws/releases/2525

So we know we can test 1.22 and 1.23 and expect all to pass, but one might need retesting because the substrate was dirty.

Curtis Hovey (sinzui) wrote :

We have retested 1.24 and 1.23, and we can see that 1.23 works with deployer and quickstart in maas, hp, aws, and joyent, but none of these work for 1.24. This looks like a regression in 1.24.

The maas 1.7 deployer test is the most informative because it has the highest timeout set. We see that deployer gave up
    2015-04-09 14:19:10 [DEBUG] deployer.env: Delta unit: landscape/0 change:executing
    2015-04-09 14:19:20 [DEBUG] deployer.env: Delta unit: landscape/0 change:idle
    2015-04-09 15:02:43 [DEBUG] deployer.env: Connecting to environment...
    2015-04-09 15:02:44 [DEBUG] deployer.env: Connected to environment
    2015-04-09 15:02:44 [ERROR] deployer.import: Reached deployment timeout.. exiting
    2015-04-09 15:02:44 [INFO] deployer.cli: Deployment stopped. run time: 3041.44

We expect to see messages like this after connecting
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relations...
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relation landscape <-> rabbitmq-server
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relation landscape <-> haproxy
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation landscape:vhost-config <-> apache2:vhost-config
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation landscape:db-admin <-> postgresql:db-admin
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation haproxy:website <-> apache2:reverseproxy
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg <-> rabbitmq-server
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg <-> haproxy
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg:db-admin <-> postgresql:db-admin
    2015-04-09 17:52:49 [DEBUG] deployer.import: Waiting for relation convergence 60s
    2015-04-09 17:53:53 [INFO] deployer.import: Exposing service 'apache2'
    2015-04-09 17:53:53 [INFO] deployer.cli: Deployment complete in 500.56 seconds
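To compare a failing run with a passing one quickly, a console log can be scanned for the phase markers shown above. This is only a rough sketch in Python: the function name `summarize_deployer_log` is made up for illustration, and the line pattern is inferred solely from the log excerpts in this comment, so it may not match other deployer versions or output modes.

```python
import re

# Matches lines such as:
#   2015-04-09 15:02:44 [ERROR] deployer.import: Reached deployment timeout.. exiting
# The pattern is an assumption based on the excerpts in this bug report.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<logger>[\w.]+): (?P<msg>.*)$"
)


def summarize_deployer_log(text):
    """Report whether a run reached the relation phase, timed out, or completed."""
    summary = {"relations_started": False, "timed_out": False, "completed": False}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a deployer log line
        msg = match.group("msg")
        if msg.startswith("Adding relations"):
            summary["relations_started"] = True
        elif msg.startswith("Reached deployment timeout"):
            summary["timed_out"] = True
        elif msg.startswith("Deployment complete"):
            summary["completed"] = True
    return summary
```

A run that reports a timeout without ever starting relations matches the 1.24 failure described here: deployer never gets past waiting on the units.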

tags: added: api deployer quickstart
Ian Booth (wallyworld) wrote :

Issue appears to be related to failure to start lxc instance:

      2/lxc/0:
        agent-state-info: 'failed to retrieve the template to clone: template container
          "juju-trusty-lxc-template" did not stop'
        instance-id: pending
        series: trusty
      2/lxc/1:
        agent-state-info: 'lxc container cloning failed: cannot clone a running container'
        instance-id: pending
        series: trusty

Probably related to, or a duplicate of, bug 1441319

Curtis Hovey (sinzui) wrote :

The lxc container output looks like the deployer test of the many app-servers behind a haproxy. The quickstart test uses the landscape scalable bundle which doesn't use containers
    http://bazaar.launchpad.net/~juju-qa/juju-ci-tools/repository/view/head:/landscape-scalable.yaml

Curtis Hovey (sinzui) wrote :

This might provide more information. Download
   https://bazaar.launchpad.net/~juju-qa/juju-ci-tools/repository/view/head:/landscape-scalable.yaml
With juju 1.24-alpha1, run
   juju --debug deployer --deploy-delay 10 --config landscape-scalable.yaml

Ian Booth (wallyworld) wrote :

I tried the deployer on AWS. Got a deployer timeout. Tried juju debug-log, but that appears broken also:

$ juju debug-log -n 1000
ERROR cannot open log file: open /var/log/juju/all-machines.log: no such file or directory

NB I had to revert to an earlier revision of 1.24 to avoid the ssh bug preventing bootstrap.

The status below shows the landscape charms are stuck installing. sshing into machine 4 and looking at the unit log shows the unit agent restarting due to rsyslog connection errors:

2015-04-23 04:27:08 INFO juju.worker runner.go:261 start "rsyslog"
2015-04-23 04:27:08 DEBUG juju.worker.rsyslog worker.go:93 starting rsyslog worker mode 1 for "unit-landscape-msg-0" ""
2015-04-23 04:27:08 DEBUG juju.worker.rsyslog worker.go:190 making syslog connection for "juju-unit-landscape-msg-0" to 10.236.188.206:6514
2015-04-23 04:27:08 ERROR juju.worker runner.go:219 exited "rsyslog": dial tcp 10.236.188.206:6514: connection refused
2015-04-23 04:27:08 INFO juju.worker runner.go:253 restarting "rsyslog" in 3s

So it appears that recent changes to logging may be:
1. breaking debug-log
2. stopping some unit agents from starting
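The "connection refused" error from the rsyslog worker can be checked independently of the unit agent by probing the port from the affected machine. A minimal sketch in Python follows; the helper name `can_connect` is hypothetical, and the address and port are simply the ones that appear in the log above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the unit log in this comment.
    print(can_connect("10.236.188.206", 6514))
```

A False result from the unit's machine would suggest that nothing is listening on the state server's rsyslog port, rather than a fault in the unit agent itself.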

juju --debug deployer --deploy-delay 10 --config ~/landscape-scalable.yaml
2015-04-23 03:19:31 INFO juju.cmd supercommand.go:37 running juju [1.24-alpha1-utopic-amd64 gc]
2015-04-23 13:19:31 Using deployment landscape-scalable
2015-04-23 13:19:31 Starting deployment of landscape-scalable
2015-04-23 13:19:48 Deploying services...
2015-04-23 13:19:51 Deploying service apache2 using cs:trusty/apache2-4
2015-04-23 13:20:14 Deploying service haproxy using cs:trusty/haproxy-1
2015-04-23 13:20:36 Deploying service landscape using cs:trusty/landscape-server
2015-04-23 13:21:02 Deploying service landscape-msg using cs:trusty/landscape-server
2015-04-23 13:21:26 Deploying service postgresql using cs:trusty/postgresql-3
2015-04-23 13:21:53 Deploying service rabbitmq-server using cs:trusty/rabbitmq-server-7
2015-04-23 14:16:44 Reached deployment timeout.. exiting
2015-04-23 14:16:44 Deployment stopped. run time: 3433.34
2015-04-23 04:16:44 ERROR juju.cmd supercommand.go:430 subprocess encountered error code 1

$ juju status
[Machines]
ID STATE VERSION DNS INS-ID SERIES HARDWARE
0 started 1.24-alpha1.1 54.82.28.180 i-d741242a trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1c
1 started 1.24-alpha1.1 54.158.244.52 i-da1a3626 trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1d
2 started 1.24-alpha1.1 54.91.176.112 i-3b4ba4ed trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1e
3 started 1.24-alpha1.1 54.90.7.109 i-1ceddc33 trusty arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=us-east-1b
4 started 1.24-alpha1.1 54.146.74.189 i-03eddc2c trusty arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=us-east-1b
5 started 1.24-alpha1.1 54.145.18.41 i-8d4b2e70 trust...


Curtis Hovey (sinzui)
tags: added: blocker
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
importance: Critical → High
Ian Booth (wallyworld) wrote :

In comment 6 https://bugs.launchpad.net/juju-core/+bug/1441826/comments/6 the issue was that the option to install apt packages was disabled in the env configuration, so rsyslog was not installed. This is currently being fixed in bug 1424892.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Horacio Durán (hduran-8) wrote :

This was a regression introduced by changes in multiwatcher. I am working on a patch to solve it.

Changed in juju-core:
assignee: nobody → Horacio Durán (hduran-8)
Ian Booth (wallyworld) wrote :

You found the cause, awesome. What was the regression?

Horacio Durán (hduran-8) wrote :

The legacy status for units in the megawatcher was not being set properly.
It did not follow the correct rules, which resulted in very odd statuses
(different from the ones in status output), so units never arrived at
the statuses expected by deployer.
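
To illustrate why a wrong status in the watcher stream shows up as a timeout rather than an error, here is a toy model in Python of a client waiting on unit deltas. It is not deployer's actual code: the function `wait_for_units`, the `max_deltas` budget (standing in for the deployment timeout), and the assumption that the client waits for the status "started" are all hypothetical. The "executing" and "idle" values are the ones seen in the deployer debug log earlier in this report.

```python
def wait_for_units(deltas, units, expected="started", max_deltas=100):
    """Consume (unit, status) deltas until every unit reports `expected`.

    Returns True on convergence, or False if the stream ends or the delta
    budget runs out first (a stand-in for the deployment timeout).
    """
    latest = {}
    for count, (unit, status) in enumerate(deltas, start=1):
        latest[unit] = status  # remember the most recent status per unit
        if all(latest.get(u) == expected for u in units):
            return True
        if count >= max_deltas:
            break
    return False


# With the expected legacy status, the wait converges.
print(wait_for_units([("landscape/0", "pending"), ("landscape/0", "started")],
                     ["landscape/0"]))
# With statuses the client does not recognise, it never converges.
print(wait_for_units([("landscape/0", "executing"), ("landscape/0", "idle")],
                     ["landscape/0"]))
```

In this model the unit may be perfectly healthy in the second case; the waiter simply never sees the value it is looking for, which is consistent with status output looking fine while deployer times out.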

Horacio Durán (hduran-8) wrote :

I just proposed the fix:

http://reviews.vapour.ws/r/1508/ for 1.24

http://reviews.vapour.ws/r/1509/ for master

Changed in juju-core:
assignee: Horacio Durán (hduran-8) → nobody
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
assignee: nobody → Horacio Durán (hduran-8)
Curtis Hovey (sinzui)
no longer affects: juju-ci-tools
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released