Juju takes more than 20 minutes to enable voting

Bug #1453805 reported by Aaron Bentley
Affects          Status        Importance  Assigned to          Milestone
Canonical Juju   Fix Released  High        Tim Penhey
juju-core 1.23   Fix Released  Critical    Menno Finlay-Smits
juju-core 1.24   Fix Released  Critical    Menno Finlay-Smits

Bug Description

As seen here:
   http://reports.vapour.ws/releases/issue/5550abbd749a5604f63cf58c

(The timeout is 1200 seconds)

EXCEPTION CAUGHT:

ERROR:root:Timed out waiting for voting to be enabled.
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/assess_recovery.py", line 195, in main
    client.wait_for_ha()
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 416, in wait_for_ha
    raise Exception('Timed out waiting for voting to be enabled.')
Exception: Timed out waiting for voting to be enabled.
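
For context, the CI check that times out here is essentially a status-polling loop with a 1200-second deadline. A minimal Python sketch of that kind of loop (not the actual jujupy code; the 'controller-member-status'/'has-vote' keys are assumptions about the status output and differ between Juju versions):

    import json
    import subprocess
    import time

    def wait_for_voting(expected_voters=3, timeout=1200, poll_interval=30):
        """Poll `juju status` until enough controller machines report that
        they have the vote, or raise after `timeout` seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            raw = subprocess.check_output(
                ['juju', 'status', '--format', 'json'])
            machines = json.loads(raw).get('machines', {})
            # 'controller-member-status' is an assumed key name; older Juju
            # releases report it as 'state-server-member-status'.
            voters = [m for m in machines.values()
                      if m.get('controller-member-status') == 'has-vote']
            if len(voters) >= expected_voters:
                return
            time.sleep(poll_interval)
        raise Exception('Timed out waiting for voting to be enabled.')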

Aaron Bentley (abentley)
Changed in juju-core:
status: New → Incomplete
tags: added: ensure-availability intermittent-failure
Aaron Bentley (abentley)
description: updated
Revision history for this message
Aaron Bentley (abentley) wrote :

We now believe this to be caused by https://github.com/juju/juju/commit/c8c41b048df0e038b75c353856d021080713408f which has landed in 1.23, 1.24 and master.

tags: added: blocker ci regression
Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Walked through the logs with Menno and he saw that for the ensure-availability case, we are trying to connect to mongo before it has been started. He'll be working on a fix, rather than backing out the previous changes. I will be closing the PRs mentioned above.
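
The fix itself is in Juju's Go code, but the shape of the problem is the classic start-up race: connection attempts need to be retried until mongod is actually listening rather than failing on the first try. A rough illustrative sketch (host, port and timings are just examples, not the actual change):

    import socket
    import time

    def wait_for_mongod(host='127.0.0.1', port=37017, timeout=300, delay=5):
        """Retry a plain TCP connection until mongod accepts it or the
        timeout expires. Illustrative only; the real change was made in
        the juju state-server code, not in the CI tooling."""
        deadline = time.time() + timeout
        while True:
            try:
                socket.create_connection((host, port), timeout=10).close()
                return
            except OSError:
                if time.time() >= deadline:
                    raise
                time.sleep(delay)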

Changed in juju-core:
status: Triaged → In Progress
assignee: Cheryl Jennings (cherylj) → Menno Smits (menno.smits)
milestone: none → 1.25.0
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The "blocker" tag can be removed once the functional-ha-backup-restore CI job passes with the following revs:

1.23 branch - 85cea029
1.24 branch - e3a8a512
master - 93e106de

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core 1.24

Juju-CI verified that this issue is Fix Released in juju-core 1.24:
    http://reports.vapour.ws/releases/2633

Revision history for this message
Curtis Hovey (sinzui) wrote :

http://reports.vapour.ws/releases/2632 did not pass because the kvm-charm-store-staging-deploy job failed and destroy-environment didn't remove the lock. This prevented all other juju operations on the kvm-slave from bootstrapping. And because of a bug in maas bootstraps,
    https://bugs.launchpad.net/juju-core/+bug/1392407
nodes are not released when the bootstrap fails. Health checks that run hourly, in combination with revision testing, used up all the nodes.

The i386 unit tests failed because EC2's mirrors were bad; the job was changed to use Ubuntu's archives.

We believe the 1.23 curse was an entire-cloud and substrate failure.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core master

Juju-CI verified that this issue is Fix Released in juju-core master:
    http://reports.vapour.ws/releases/2634

Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Aaron Bentley (abentley) wrote :

We are still seeing this in master as well as in other places. It was seen in master on May 14 even though it was reported as fixed on May 12, and it has been recurring since then. It is an intermittent issue, but it is the top issue for master this week.

Lots more examples can be seen here:
http://reports.vapour.ws/releases/issue/5550abbd749a5604f63cf58c

Changed in juju-core:
status: Fix Released → Triaged
Revision history for this message
Cheryl Jennings (cherylj) wrote :

A new bug should be opened for the new failures. Going through the logs, I see this error:

2016-04-14 16:53:35 WARNING juju.worker.peergrouper worker.go:312 couldn't find a space containing all peer group machines

The CI test should also be updated to include the output of status for the admin model upon failure.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've spent a bit of time digging through the logs for this. I'm not convinced the "couldn't find a space containing all peer group machines" error is related to the cause here. It only appears once, well before juju enable-ha is called, and I see it even in my (so far unsuccessful) attempts to reproduce the problem.

What is perhaps more relevant is that the peergrouper never sees the additional MongoDB nodes after the EnableHA API is called. On the newly added controller machines this is repeatedly logged:

2016-04-14 21:19:35 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"
2016-04-14 21:19:36 DEBUG juju.worker.dependency engine.go:479 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2016-04-14 21:19:36 ERROR juju.worker.dependency engine.go:526 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers

This indicates that MongoDB was running because a TCP connection to it was established, but not fully functioning ("no reachable servers").

We really need the MongoDB logs to determine what's going on, and so far I haven't been able to reproduce the issue in order to get them myself (enable-ha is working for me on AWS).
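
Since a TCP connection succeeds but the replica set is reported as not fully functional, a useful diagnostic on an affected controller is to ask mongod for its replica set status directly. A rough pymongo sketch (illustrative only; a real Juju controller's mongod also needs SSL options and the admin credentials from agent.conf, which are omitted here):

    from pymongo import MongoClient

    # Connect to the local mongod only and dump each replica set member's
    # reported state and health.
    client = MongoClient('localhost', 37017, serverSelectionTimeoutMS=5000)
    status = client.admin.command('replSetGetStatus')
    for member in status.get('members', []):
        print(member['name'], member['stateStr'], member.get('health'))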

Changed in juju-core:
status: Triaged → Incomplete
milestone: 1.25-alpha1 → 2.0-rc1
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Without further information it's not clear whether this is somehow related to the issue which was originally fixed. For now, I've opted to keep everything on this ticket.

Revision history for this message
Aaron Bentley (abentley) wrote :

@Cheryl, I agree that a new bug should be opened for new failures. I started to do that with this one, until I noticed that occurrences of the symptom didn't really stop when the bug was reported fixed. As I noted, although the bug was reported fixed, it recurred a mere two days later. So it seems to me that the bug was never actually fixed.

Curtis Hovey (sinzui)
tags: removed: blocker
Changed in juju-core:
importance: Critical → High
Revision history for this message
Aaron Bentley (abentley) wrote :

I've added /var/log/syslog and /var/log/mongodb/mongodb.log to the set of logs we gather, so the next failure we see should have the information you require.

Revision history for this message
Aaron Bentley (abentley) wrote :
Changed in juju-core:
status: Incomplete → Triaged
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Thank you - having the syslog output is very helpful. Unfortunately, the DEBUG level output in the Juju machine-N.log files now seems to be missing. The log level is set to INFO in the recent runs. Can this be fixed please? I need the DEBUG output from the peergrouper as well as the MongoDB output in syslog.

On the assumption that the problem is always the same, I'll try to correlate the logs from earlier runs (where we have the DEBUG output in machine-N.log) with the newly available syslog data.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Possible fix (merged) is here: https://github.com/juju/juju/pull/5210

Whether it has fixed the problem or not is not known yet (there have been CI quota issues).

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta6 → 2.0-beta7
Revision history for this message
Martin Packman (gz) wrote :

More recent matches on this issue have the syslog and debug peergrouper logging:

<http://reports.vapour.ws/releases/3921/job/functional-ha-recovery/attempt/3152>

2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:43 members: ...
   &peergrouper.machine{id: "0", wantsVote: true, hostPorts: [54.88.23.134:37017 172.31.10.22:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=172.31.10.22:37017
   &peergrouper.machine{id: "2", wantsVote: true, hostPorts: [52.87.183.236:37017 172.31.39.254:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=2, rs_addr=172.31.39.254:37017
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:44 extra: []replicaset.Member(nil)
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:45 maxId: 2
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:117 assessing possible peer group changes:
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:128 machine "2" is a potential voter
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:124 machine "0" is already voting
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:142 assessed

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The HA-related CI runs of master were clean for a while, but then this problem resurfaced, so the cleanup of the peergrouper didn't help.

http://juju-ci.vapour.ws/job/functional-ha-backup-restore/3649/

I will keep looking at this. My current theory is that the watchers for new controller machines aren't always firing for some reason.

Curtis Hovey (sinzui)
description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
status: In Progress → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0.0
Changed in juju-core:
milestone: 2.0.0 → 2.0-beta16
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Revision history for this message
Anastasia (anastasia-macmood) wrote :

This seems to be fixed.

I am looking at the latest build on master at http://reports.vapour.ws/releases/4346
and I see that these tests succeeded:

functional-ha-backup-restore (non-voting)
functional-ha-backup-restore-juju1
functional-ha-recovery-azure (non-voting)
functional-ha-recovery-rackspace

Changed in juju:
status: Triaged → Incomplete
milestone: 2.0-beta18 → none
Revision history for this message
Aaron Bentley (abentley) wrote :

This is an intermittent failure and is tagged as such. The previous revision-build, 4345, did have this failure:
http://reports.vapour.ws/releases/4345/job/functional-ha-backup-restore/attempt/4046

The success of 4346 is consistent with an intermittent failure, so I think there is not enough evidence to conclude it is fixed. If we manage to go a week without this failure, that would be convincing evidence to me.

Revision history for this message
Curtis Hovey (sinzui) wrote :

An alternate path to closing this bug is to learn what upstream believes is a reasonable time for 3 dbs to be put into a working replica set.

Changed in juju:
status: Incomplete → Triaged
Revision history for this message
Curtis Hovey (sinzui) wrote :

Also, the only place to check whether an issue is still happening is the issue page linked in the description. It shows recent occurrences and frequency. No more than 5 seconds is needed to check whether an issue is still present.

Changed in juju:
milestone: none → 2.1.0
tags: added: ha
Changed in juju:
milestone: 2.1.0 → 2.0.0
tags: added: ateam
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The fixes for bug 1588224 may help with this too. Investigating further now regardless.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've spent some time analysing recent CI failures relating to this issue. There appear to be a number of causes.

1. After enable-ha is run, the new controller machines come up but one of them can't be contacted by the other controller machines. An example is here:
http://reports.vapour.ws/releases/4445/job/functional-ha-backup-restore/attempt/4239

In this case, controller machine-2 came up and the machine agent appears to be running (according to the juju status output for the controller model) but the MongoDB logs on machine-0 and machine-1 indicate they are unable to connect to machine-2. I suspect a network or firewall issue.

It would have been very helpful to have had the machine agent logs for machine-2 but the retrieval of that file appears to have failed: scp: /var/log/juju/machine-2.log: Permission denied

The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for some reason.

We are now collecting the ifconfig output for each machine, which is awesome. It would also be very helpful to have the output of "ip route" in case there is a routing issue (possible in this case); see the sketch after point 4 below.

What options do we have for diagnosing firewall issues?

2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583

Here you can see that machine-2 has a provisioning error after 20mins and isn't coming up.

Can we update the CI test infrastructure to identify this as a separate failure? This is a lower level concern that's not really related to activating HA.

3. In other cases, the failure seems to be triggered by a very slow or stuck instance. An example is here:
http://reports.vapour.ws/releases/4413/job/functional-ha-backup-restore/attempt/4190

The MongoDB logs for machine-2 show that replicaset initiation has started but proceeds very slowly and then gets stuck. This is backed up by the machine agent logs which show that the machine agent is unable to connect to MongoDB (because it is still initialising).

Again, it would be helpful if these could be categorised as a separate issue. It's a little harder for this one though as it's not so obvious by looking at "juju status".

4. Finally, there are actual failures to initialise a replicaset. An example is here: http://reports.vapour.ws/releases/4399/job/functional-ha-recovery-rackspace/attempt/1093

Here we see MongoDB on both machine-1 and machine-2 complaining with "replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)" errors.

I'm going to focus my efforts on this issue for now.
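
As referenced under point 1 above, here is a sketch of how the harness might gather the extra per-machine network diagnostics ('ip route' alongside ifconfig) over `juju ssh`. The machine IDs, command list and function name are illustrative, not existing juju-ci-tools code:

    import subprocess

    # Commands whose output helps diagnose routing/connectivity problems
    # between controller machines.
    DIAGNOSTIC_COMMANDS = ['ifconfig -a', 'ip route', 'ip addr']

    def collect_network_state(machine_ids=('0', '1', '2')):
        """Run each diagnostic command on each controller machine and
        return the captured output, keyed by (machine, command)."""
        results = {}
        for machine in machine_ids:
            for cmd in DIAGNOSTIC_COMMANDS:
                try:
                    out = subprocess.check_output(
                        ['juju', 'ssh', machine, cmd],
                        stderr=subprocess.STDOUT, timeout=120)
                except subprocess.SubprocessError as err:
                    out = str(err).encode()
                results[(machine, cmd)] = out
        return results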

Revision history for this message
Aaron Bentley (abentley) wrote : Re: [Bug 1453805] Re: Juju takes more than 20 minutes to enable voting

On 2016-10-05 05:19 AM, Menno Smits wrote:
> In this case, controller machine-2 came up and the machine agent appears
> to be running (according to the juju status output for the controller
> model) but the MongoDB logs on machine-0 and machine-1 indicate they are
> unable to connect to machine-2. I suspect a network or firewall issue.
>
> It would have been very helpful to have had the machine agent logs for
> machine-2 but the retrieval of that file appears to have failed: scp:
> /var/log/juju/machine-2.log: Permission denied
>
> The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for
> some reason.

They also got permission denied:
Warning: Permanently added '54.197.28.112' (ECDSA) to the list of known
hosts.
scp: /var/log/syslog: Permission denied

Also, 2016-10-04 01:38:18 WARNING Could not capture ifconfig state:

I think that the user we're logging in as has no privileges. I wonder if
something went wrong with set-up?

> We are now collecting the ifconfig output for each machine which is
> awesome. It would also be very helpful to have the output of "ip route"
> in case there is a routing issue (possible in this case).

It shouldn't be a problem to add that.

> What options do we have for diagnosing firewall issues?

I imagine it would be possible to dump the security groups and rules,
but that would vary by substrate.

> 2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
>
> Here you can see that machine-2 has a provisioning error after 20mins
> and isn't coming up.
>
> Can we update the CI test infrastructure to identify this as a separate
> failure? This is a lower level concern that's not really related to
> activation HA.

We can do that. We have code that checks for errors in status; we'd
probably apply it here. It checks 'juju-status' among others. Shouldn't
'juju-status' be something other than 'pending' if provisioning failed?

We haven't been checking machine-status. I thought it was a
provider-specific value. What are the 'machine-status' values that
indicate an error?

Aaron

Revision history for this message
Aaron Bentley (abentley) wrote :

I was able to create a new rule for 2, so http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583 and future failures like it will be classified separately.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

> I was able to create a new rule for 2, so
> http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
> and future failures like it will be classified separately.

Excellent - thank you.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-rc3 → 2.0.0
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

> In this case, controller machine-2 came up and the machine agent appears
> > to be running (according to the juju status output for the controller
> > model) but the MongoDB logs on machine-0 and machine-1 indicate they are
> > unable to connect to machine-2. I suspect a network or firewall issue.
> >
> > It would have been very helpful to have had the machine agent logs for
> > machine-2 but the retrieval of that file appears to have failed: scp:
> > /var/log/juju/machine-2.log: Permission denied
> >
> > The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for
> > some reason.
>
> They also got permission denied:
> Warning: Permanently added '54.197.28.112' (ECDSA) to the list of known
> hosts.
> scp: /var/log/syslog: Permission denied
>
> Also, 2016-10-04 01:38:18 WARNING Could not capture ifconfig state:
>
> I think that the user we're logging in as as no privileges. I wonder if
> something went wrong with set-up?

The log files are never accessible directly to the `ubuntu` user. I've
just looked at the code in juju-ci-tools which grabs the logs and it
attempts 'sudo chmod -Rf go+r ' on each log file over ssh before
scping them - it's the ssh part that's actually the problem. See the
following in the output:

2016-10-04 01:33:34 WARNING Could not allow access to the juju logs:
2016-10-04 01:33:34 WARNING

I'm not sure why this is failing though. It appears that the stderr
output isn't being captured when this fails. That could be useful to
see (pass stderr=subprocess.STDOUT to subprocess.Popen).
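
A sketch of the suggested change: run the chmod over ssh but merge stderr
into the captured output so the reason for a failure shows up in the CI
logs. Function name, paths and ssh options here are illustrative, not the
actual juju-ci-tools code:

    import subprocess

    def make_logs_readable(host, log_paths=('/var/log/juju/*.log',
                                            '/var/log/syslog')):
        """Loosen permissions on remote log files before scp'ing them,
        capturing stderr so any sudo/ssh failure is visible."""
        cmd = ['ssh', '-o', 'StrictHostKeyChecking=no',
               'ubuntu@{}'.format(host),
               'sudo chmod -Rf go+r {}'.format(' '.join(log_paths))]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        output, _ = proc.communicate()
        if proc.returncode != 0:
            # Previously the failure was logged without its stderr text.
            print('Could not allow access to the juju logs:\n'
                  + output.decode(errors='replace'))
        return proc.returncode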

> > We are now collecting the ifconfig output for each machine which is
> > awesome. It would also be very helpful to have the output of "ip route"
> > in case there is a routing issue (possible in this case).
>
> It shouldn't be a problem to add that.

That would be great.

> > What options do we have for diagnosing firewall issues?
>
> I imagine it would be possible to dump the security groups and rules,
> but that would vary by substrate.

Yes, not straightforward...

> > 2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
> >
> > Here you can see that machine-2 has a provisioning error after 20mins
> > and isn't coming up.
> >
> > Can we update the CI test infrastructure to identify this as a separate
> > failure? This is a lower level concern that's not really related to
> > activation HA.
>
> We can do that. We have code that checks for errors in status, we'd
> probably apply it here. It checks 'juju-status' among others. Shouldn't
> 'juju-status' be something other than 'pending' if provisioning failed?

It seems that because things are stuck rather than a cloud API
returning an error, Juju hasn't determined that a provisioning error
has occurred.

> We haven't been checking machine-status. I thought it was a
> provider-specific value. What are the 'machine-status' values that
> indicate an error?

They're not provider specific. The provider specific status values are
mapped to standard Juju values with the provider specific content
ending up in the "message" s...

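For reference, the kind of status check the CI could add to flag these cases separately, based on the machine-status value. The key layout assumes Juju 2.x JSON status output and 'provisioning error' as the error value, both of which should be verified against the release in use:

    import json
    import subprocess

    # machine-status values treated as provisioning-level failures
    # (assumed; check the Juju status docs/source for the release in use).
    ERROR_STATES = {'provisioning error'}

    def failed_machines():
        """Return {machine_id: message} for machines whose machine-status
        indicates a provisioning error."""
        raw = subprocess.check_output(['juju', 'status', '--format', 'json'])
        machines = json.loads(raw).get('machines', {})
        return {mid: m.get('machine-status', {}).get('message', '')
                for mid, m in machines.items()
                if m.get('machine-status', {}).get('current') in ERROR_STATES}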

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.0 → 2.0.1
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Changed in juju:
status: In Progress → Confirmed
Changed in juju:
status: Confirmed → Triaged
milestone: none → 2.2.0
Revision history for this message
Anastasia (anastasia-macmood) wrote :

This seems to be related to the interplay between MongoDB replica sets and Juju. Tagging it with "mongodb" for future reference.

tags: added: mongodb
Changed in juju:
milestone: 2.2.0 → none
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing 2.2 milestone as we will not be addressing this issue in 2.2.

Revision history for this message
Tim Penhey (thumper) wrote :

Pretty sure this is the same as bug 1720251.

Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.3.2
assignee: nobody → Tim Penhey (thumper)
Changed in juju:
status: Fix Committed → Fix Released