Bug #1441826 “deployer and quickstart are broken in 1.24-alpha1” : Series 1.24 : Bugs : juju-core

Curtis Hovey (sinzui) on 2015-04-08

Changed in juju-ci-tools:
status:	New → Triaged
importance:	Undecided → High

Curtis Hovey (sinzui) wrote on 2015-04-08:

#1

We can see in the last test of 1.23-beta3 that everything passed
http://reports.vapour.ws/releases/2521

Then we tested 1.23-beta4 and HP failed because some instances were left behind from a previous test
http://reports.vapour.ws/releases/2522
http://reports.vapour.ws/releases/2523

But when we tested master before and after the 1.23 versions, we saw total failure. We retested some of the substrates several times.
http://reports.vapour.ws/releases/2520
http://reports.vapour.ws/releases/2525

So we know we can test 1.22 and 1.23 and expect all to pass, but one might need retesting because the substrate was dirty.

Curtis Hovey (sinzui) wrote on 2015-04-09:

#2

We have retested 1.24 and 1.23 and we can see that 1.23 works with deployer and quickstart in maas, hp, aws, and joyent, but none of these work for 1.24. This looks like a regression in 1.24.

The maas 1.7 deployer test is the most informative because it has the highest timeout set. We see that deployer gave up:
    2015-04-09 14:19:10 [DEBUG] deployer.env: Delta unit: landscape/0 change:executing
    2015-04-09 14:19:20 [DEBUG] deployer.env: Delta unit: landscape/0 change:idle
    2015-04-09 15:02:43 [DEBUG] deployer.env: Connecting to environment...
    2015-04-09 15:02:44 [DEBUG] deployer.env: Connected to environment
    2015-04-09 15:02:44 [ERROR] deployer.import: Reached deployment timeout.. exiting
    2015-04-09 15:02:44 [INFO] deployer.cli: Deployment stopped. run time: 3041.44

We expect to see messages like this after connecting:
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relations...
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relation landscape <-> rabbitmq-server
    2015-04-09 17:52:46 [INFO] deployer.import: Adding relation landscape <-> haproxy
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation landscape:vhost-config <-> apache2:vhost-config
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation landscape:db-admin <-> postgresql:db-admin
    2015-04-09 17:52:47 [INFO] deployer.import: Adding relation haproxy:website <-> apache2:reverseproxy
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg <-> rabbitmq-server
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg <-> haproxy
    2015-04-09 17:52:48 [INFO] deployer.import: Adding relation landscape-msg:db-admin <-> postgresql:db-admin
    2015-04-09 17:52:49 [DEBUG] deployer.import: Waiting for relation convergence 60s
    2015-04-09 17:53:53 [INFO] deployer.import: Exposing service 'apache2'
    2015-04-09 17:53:53 [INFO] deployer.cli: Deployment complete in 500.56 seconds
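
The difference between the two runs can be checked mechanically. A minimal sketch, assuming the log line format shown in the excerpts above; `classify_deployer_log` is a hypothetical helper, not part of juju-deployer:

```python
import re

# Matches lines such as:
#   2015-04-09 15:02:44 [ERROR] deployer.import: Reached deployment timeout.. exiting
# The format is assumed from the log excerpts in this comment.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<logger>[\w.]+): (?P<msg>.*)$"
)


def classify_deployer_log(text):
    """Return 'timeout', 'complete', or 'unknown' for a deployer log."""
    outcome = "unknown"
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        msg = match.group("msg")
        if msg.startswith("Reached deployment timeout"):
            outcome = "timeout"
        elif msg.startswith("Deployment complete"):
            outcome = "complete"
    return outcome
```

Run against a saved console log, a result of 'timeout' corresponds to the failing 1.24 run and 'complete' to the passing 1.23 run.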

tags:	added: api deployer quickstart

Ian Booth (wallyworld) wrote on 2015-04-22:

#3

Issue appears to be related to failure to start lxc instance:

      2/lxc/0:
        agent-state-info: 'failed to retrieve the template to clone: template container
          "juju-trusty-lxc-template" did not stop'
        instance-id: pending
        series: trusty
      2/lxc/1:
        agent-state-info: 'lxc container cloning failed: cannot clone a running container'
        instance-id: pending
        series: trusty

Probably related to, or a duplicate of, bug 1441319

Curtis Hovey (sinzui) wrote on 2015-04-22:

#4

The lxc container output looks like the deployer test of the many app-servers behind a haproxy. The quickstart test uses the landscape scalable bundle which doesn't use containers
http://bazaar.launchpad.net/~juju-qa/juju-ci-tools/repository/view/head:/landscape-scalable.yaml

Curtis Hovey (sinzui) wrote on 2015-04-22:

#5

This might provide more information. Download
https://bazaar.launchpad.net/~juju-qa/juju-ci-tools/repository/view/head:/landscape-scalable.yaml
With juju 1.24-alpha1
juju --debug deployer --deploy-delay 10 --config landscape-scalable.yaml

Ian Booth (wallyworld) wrote on 2015-04-23:

#6

I tried the deployer on AWS. Got a deployer timeout. Tried juju debug-log, but that appears broken also:

$ juju debug-log -n 1000
ERROR cannot open log file: open /var/log/juju/all-machines.log: no such file or directory

NB I had to revert to an earlier revision of 1.24 to avoid the ssh bug preventing bootstrap.

The status below shows the landscape charms are stuck installing. sshing into machine 4 and looking at the unit log shows the unit agent restarting due to rsyslog connection errors:

2015-04-23 04:27:08 INFO juju.worker runner.go:261 start "rsyslog"
2015-04-23 04:27:08 DEBUG juju.worker.rsyslog worker.go:93 starting rsyslog worker mode 1 for "unit-landscape-msg-0" ""
2015-04-23 04:27:08 DEBUG juju.worker.rsyslog worker.go:190 making syslog connection for "juju-unit-landscape-msg-0" to 10.236.188.206:6514
2015-04-23 04:27:08 ERROR juju.worker runner.go:219 exited "rsyslog": dial tcp 10.236.188.206:6514: connection refused
2015-04-23 04:27:08 INFO juju.worker runner.go:253 restarting "rsyslog" in 3s

So it appears maybe recent changes to logging are:
1. breaking debug-log
2. stopping some unit agents from starting
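
One way to confirm the "connection refused" error from the unit's machine is a plain TCP probe of the rsyslog port. A minimal sketch; `can_connect` is a hypothetical helper, and the address and port in the usage note are simply the ones from the unit log above:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # or a timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("10.236.188.206", 6514)` returning False on machine 4 would be consistent with the rsyslog worker error shown in the unit log.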

juju --debug deployer --deploy-delay 10 --config ~/landscape-scalable.yaml
2015-04-23 03:19:31 INFO juju.cmd supercommand.go:37 running juju [1.24-alpha1-utopic-amd64 gc]
2015-04-23 13:19:31 Using deployment landscape-scalable
2015-04-23 13:19:31 Starting deployment of landscape-scalable
2015-04-23 13:19:48 Deploying services...
2015-04-23 13:19:51  Deploying service apache2 using cs:trusty/apache2-4
2015-04-23 13:20:14  Deploying service haproxy using cs:trusty/haproxy-1
2015-04-23 13:20:36  Deploying service landscape using cs:trusty/landscape-server
2015-04-23 13:21:02  Deploying service landscape-msg using cs:trusty/landscape-server
2015-04-23 13:21:26  Deploying service postgresql using cs:trusty/postgresql-3
2015-04-23 13:21:53  Deploying service rabbitmq-server using cs:trusty/rabbitmq-server-7
2015-04-23 14:16:44 Reached deployment timeout.. exiting
2015-04-23 14:16:44 Deployment stopped. run time: 3433.34
2015-04-23 04:16:44 ERROR juju.cmd supercommand.go:430 subprocess encountered error code 1

$ juju status
[Machines] 
ID         STATE   VERSION       DNS           INS-ID     SERIES HARDWARE                                                                                    
0          started 1.24-alpha1.1 54.82.28.180  i-d741242a trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1c 
1          started 1.24-alpha1.1 54.158.244.52 i-da1a3626 trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1d 
2          started 1.24-alpha1.1 54.91.176.112 i-3b4ba4ed trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1e 
3          started 1.24-alpha1.1 54.90.7.109   i-1ceddc33 trusty arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=us-east-1b 
4          started 1.24-alpha1.1 54.146.74.189 i-03eddc2c trusty arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=us-east-1b 
5          started 1.24-alpha1.1 54.145.18.41  i-8d4b2e70 trusty arch=amd64 cpu-cores=1 cpu-power=300 mem=3840M root-disk=8192M availability-zone=us-east-1c 
6          started 1.24-alpha1.1 54.162.42.99  i-011935fd trusty arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M availability-zone=us-east-1d

[Services]      
NAME            EXPOSED CHARM                         
apache2         false   cs:trusty/apache2-4           
haproxy         false   cs:trusty/haproxy-1           
landscape       false   cs:trusty/landscape-server-10 
landscape-msg   false   cs:trusty/landscape-server-10 
postgresql      false   cs:trusty/postgresql-3        
rabbitmq-server false   cs:trusty/rabbitmq-server-7

[Units]           
ID                WORKLOAD-STATE            AGENT-STATE VERSION       MACHINE PORTS    PUBLIC-ADDRESS 
apache2/0         unknown                   idle        1.24-alpha1.1 1                54.158.244.52  
haproxy/0         unknown                   idle        1.24-alpha1.1 2                54.91.176.112  
landscape-msg/0   maintenance               idle        1.24-alpha1.1 4                54.146.74.189  
                  installing charm software                                                           
landscape/0       maintenance               idle        1.24-alpha1.1 3                54.90.7.109    
                  installing charm software                                                           
postgresql/0      unknown                   idle        1.24-alpha1.1 5       5432/tcp 54.145.18.41   
rabbitmq-server/0 unknown                   idle        1.24-alpha1.1 6       5672/tcp 54.162.42.99

Curtis Hovey (sinzui) on 2015-04-27

tags:	added: blocker
Changed in juju-core:
importance:	High → Critical

Curtis Hovey (sinzui) on 2015-04-27

Changed in juju-core:
milestone:	1.24-alpha1 → 1.25.0
importance:	Critical → High

Ian Booth (wallyworld) wrote on 2015-04-27:

#7

In comment 6 https://bugs.launchpad.net/juju-core/+bug/1441826/comments/6 the issue was that the option to install apt packages was disabled in the env configuration and thus rsyslog was not installed. This is currently being fixed in bug 1424892.

Curtis Hovey (sinzui) on 2015-04-28

Changed in juju-core:
importance:	High → Critical

Horacio Durán (hduran-8) wrote on 2015-04-28:

#8

This was a regression introduced by changes in multiwatcher, I am working on a patch to solve it.

Horacio Durán (hduran-8) on 2015-04-28

Changed in juju-core:
assignee:	nobody → Horacio Durán (hduran-8)

Ian Booth (wallyworld) wrote on 2015-04-28: Re:

#9

You found the cause, awesome. What was the regression?

On 29/04/15 04:04, Horacio Durán wrote:
> This was a regression introduced by changes in multiwatcher, I am
> working on a patch to solve it.
>

Horacio Durán (hduran-8) wrote on 2015-04-28: Re: [Bug 1441826] Re:

#10

The legacy status for units in the megawatcher was not being properly set;
it did not follow the correct rules, resulting in very odd statuses
(different from the ones in status output) and therefore never arriving at
the statuses expected by deployer.
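
To illustrate why this leads to a timeout rather than an error, here is a minimal sketch of a client that waits for every unit to reach an expected status. It is not deployer's actual code: `units_ready` is hypothetical, and treating "started" as the expected legacy status is an assumption. The "executing" and "idle" values are the ones seen in the deployer log in comment 2.

```python
def units_ready(deltas, units, ready_statuses=("started",)):
    """Replay (unit, status) deltas and report whether all units are ready.

    deltas: iterable of (unit_name, status) pairs in arrival order.
    units: names of the units the client is waiting on.
    ready_statuses: statuses the client accepts (assumed, see lead-in).
    """
    latest = {}
    for unit, status in deltas:
        latest[unit] = status  # keep only the most recent status per unit
    return all(latest.get(unit) in ready_statuses for unit in units)


# Statuses like those in the failing log never satisfy the check, so a
# client polling this condition would keep waiting until its timeout.
broken = [("landscape/0", "executing"), ("landscape/0", "idle")]
print(units_ready(broken, ["landscape/0"]))
```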

On Tue, Apr 28, 2015 at 6:12 PM, Ian Booth <email address hidden> wrote:

> You found the cause, awesome. What was the regression?
>
> On 29/04/15 04:04, Horacio Durán wrote:
> > This was a regression introduced by changes in multiwatcher, I am
> > working on a patch to solve it.
> >
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1441826
>
> Title:
> deployer and quickstart are broken in 1.24-alpha1
>
> Status in Juju CI Tools:
> Triaged
> Status in juju-core:
> Triaged
> Status in juju-core 1.24 series:
> In Progress
>
> Bug description:
> There appears to be changes to api or a decline its reliability that
> broken deployer and later quickstart in aws, hp, and joyent. MAAS
> continued to work until recently
>
> Deployer broke in on or before commit e374bae as is seen in the bundle
> tests section in:
> http://reports.vapour.ws/releases/2481
> The deployer stack appears to be up, but I think the relation between
> the django units and the haproxy is missing.
>
>
> Quickstart continued to work until commit 0f79f48 as is seen in the
> bundle tests section in:
> http://reports.vapour.ws/releases/2511
> Quickstart isn't informative because it swallows what it is doing.
>
> The last successful run of both deployer and quickstart was
> http://reports.vapour.ws/releases/2478
> which looks like this
>
> http://data.vapour.ws/juju-ci/products/version-2478/aws-deployer-bundle/build-80/consoleText
>
> I believe CI's tests also changed so the issue might be the test
> script exiting prematurely.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju-ci-tools/+bug/1441826/+subscriptions
>

Horacio Durán (hduran-8) wrote on 2015-04-28:

#11

I just proposed the fix:

http://reviews.vapour.ws/r/1508/ for 1.24

http://reviews.vapour.ws/r/1509/ for master

Changed in juju-core:
assignee:	Horacio Durán (hduran-8) → nobody
status:	Triaged → In Progress

Ian Booth (wallyworld) on 2015-04-29

Changed in juju-core:
status:	In Progress → Fix Committed
assignee:	nobody → Horacio Durán (hduran-8)

Curtis Hovey (sinzui) on 2015-05-01

no longer affects:	juju-ci-tools

Curtis Hovey (sinzui) on 2015-05-02

Changed in juju-core:
status:	Fix Committed → Fix Released

Affects		Status	Importance	Assigned to	Milestone
	juju-core	Fix Released	Critical	Horacio Durán	juju-core 1.25-alpha1
	1.24	Fix Released	Critical	Horacio Durán	juju-core 1.24-alpha1