Juju takes more than 20 minutes to enable voting

Bug #1453805 reported by Aaron Bentley
Affects          Status        Importance  Assigned to          Milestone
Canonical Juju   Fix Released  High        Tim Penhey
juju-core 1.23   Fix Released  Critical    Menno Finlay-Smits
juju-core 1.24   Fix Released  Critical    Menno Finlay-Smits

Bug Description

As seen here:
   http://reports.vapour.ws/releases/issue/5550abbd749a5604f63cf58c

(The timeout is 1200 seconds)

EXCEPTION CAUGHT:

ERROR:root:Timed out waiting for voting to be enabled.
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/assess_recovery.py", line 195, in main
    client.wait_for_ha()
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 416, in wait_for_ha
    raise Exception('Timed out waiting for voting to be enabled.')
Exception: Timed out waiting for voting to be enabled.
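
For context, the CI check that times out here is essentially a status-polling loop with a 1200-second deadline. A minimal Python sketch of that kind of loop (not the actual jujupy code; the 'controller-member-status'/'has-vote' keys are assumptions about the status output and differ between Juju versions):

    import json
    import subprocess
    import time

    def wait_for_voting(expected_voters=3, timeout=1200, poll_interval=30):
        """Poll `juju status` until enough controller machines report that
        they have the vote, or raise after `timeout` seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            raw = subprocess.check_output(
                ['juju', 'status', '--format', 'json'])
            machines = json.loads(raw).get('machines', {})
            # 'controller-member-status' is an assumed key name; older Juju
            # releases report it as 'state-server-member-status'.
            voters = [m for m in machines.values()
                      if m.get('controller-member-status') == 'has-vote']
            if len(voters) >= expected_voters:
                return
            time.sleep(poll_interval)
        raise Exception('Timed out waiting for voting to be enabled.')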

Aaron Bentley (abentley)
Changed in juju-core:
status: New → Incomplete
tags: added: ensure-availability intermittent-failure
Aaron Bentley (abentley)
description: updated
Revision history for this message
Aaron Bentley (abentley) wrote :

We now believe this to be caused by https://github.com/juju/juju/commit/c8c41b048df0e038b75c353856d021080713408f which has landed in 1.23, 1.24 and master.

tags: added: blocker ci regression
Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Walked through the logs with Menno and he saw that for the ensure-availability case, we are trying to connect to mongo before it has been started. He'll be working on a fix, rather than backing out the previous changes. I will be closing the PRs mentioned above.
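
The fix itself is in Juju's Go code, but the shape of the problem is the classic start-up race: connection attempts need to be retried until mongod is actually listening rather than failing on the first try. A rough illustrative sketch (host, port and timings are just examples, not the actual change):

    import socket
    import time

    def wait_for_mongod(host='127.0.0.1', port=37017, timeout=300, delay=5):
        """Retry a plain TCP connection until mongod accepts it or the
        timeout expires. Illustrative only; the real change was made in
        the juju state-server code, not in the CI tooling."""
        deadline = time.time() + timeout
        while True:
            try:
                socket.create_connection((host, port), timeout=10).close()
                return
            except OSError:
                if time.time() >= deadline:
                    raise
                time.sleep(delay)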

Changed in juju-core:
status: Triaged → In Progress
assignee: Cheryl Jennings (cherylj) → Menno Smits (menno.smits)
milestone: none → 1.25.0
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The "blocker" tag can be removed once the functional-ha-backup-restore CI job passes with the following revs:

1.23 branch - 85cea029
1.24 branch - e3a8a512
master - 93e106de

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core 1.24

Juju-CI verified that this issue is Fix Released in juju-core 1.24:
    http://reports.vapour.ws/releases/2633

Revision history for this message
Curtis Hovey (sinzui) wrote :

http://reports.vapour.ws/releases/2632 did not pass because the kvm-charm-store-staging-deploy job failed and destroy-environment didn't remove the lock. This prevented all other juju operations on the kvm-slave from bootstrapping. And because of a bug in maas bootstraps,
    https://bugs.launchpad.net/juju-core/+bug/1392407
nodes are not released when the bootstrap fails. Health checks that run hourly, in combination with revision testing, used up all the nodes.

The i386 unit tests failed because EC2's mirrors were bad; the job was changed to use Ubuntu's archives.

We believe the 1.23 curse was an entire-cloud and substrate failure.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core master

Juju-CI verified that this issue is Fix Released in juju-core master:
    http://reports.vapour.ws/releases/2634

Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Aaron Bentley (abentley) wrote :

We are still seeing this in master as well as in other places. It was seen in master on May 14 even though it was reported as fixed on May 12, and it has been recurring since then. It is an intermittent issue, but it is the top issue for master this week.

Lots more examples can be seen here:
http://reports.vapour.ws/releases/issue/5550abbd749a5604f63cf58c

Changed in juju-core:
status: Fix Released → Triaged
Revision history for this message
Cheryl Jennings (cherylj) wrote :

A new bug should be opened for the new failures. Going through the logs, I see this error:

2016-04-14 16:53:35 WARNING juju.worker.peergrouper worker.go:312 couldn't find a space containing all peer group machines

The CI test should also be updated to include the output of status for the admin model upon failure.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've spent a bit of time digging through the logs for this. I'm not convinced the "couldn't find a space containing all peer group machines" error is related to the cause here. It only appears once, well before juju enable-ha is called, and I see it even in my (so far unsuccessful) attempts to reproduce the problem.

What is perhaps more relevant is that the peergrouper never sees the additional MongoDB nodes after the EnableHA API is called. On the newly added controller machines this is repeatedly logged:

2016-04-14 21:19:35 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"
2016-04-14 21:19:36 DEBUG juju.worker.dependency engine.go:479 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2016-04-14 21:19:36 ERROR juju.worker.dependency engine.go:526 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers

This indicates that MongoDB was running because a TCP connection to it was established, but not fully functioning ("no reachable servers").

We really need the MongoDB logs to determine what's going on, and so far I haven't been able to reproduce the issue in order to get them myself (enable-ha is working for me on AWS).
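
Since a TCP connection succeeds but the replica set is reported as not fully functional, a useful diagnostic on an affected controller is to ask mongod for its replica set status directly. A rough pymongo sketch (illustrative only; a real Juju controller's mongod also needs SSL options and the admin credentials from agent.conf, which are omitted here):

    from pymongo import MongoClient

    # Connect to the local mongod only and dump each replica set member's
    # reported state and health.
    client = MongoClient('localhost', 37017, serverSelectionTimeoutMS=5000)
    status = client.admin.command('replSetGetStatus')
    for member in status.get('members', []):
        print(member['name'], member['stateStr'], member.get('health'))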

Changed in juju-core:
status: Triaged → Incomplete
milestone: 1.25-alpha1 → 2.0-rc1
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Without further information it's not clear whether this is somehow related to the issue which was originally fixed. For now, I've opted to keep everything on this ticket.

Revision history for this message
Aaron Bentley (abentley) wrote :

@Cheryl, I agree that a new bug should be opened for new failures. I started to do that with this one, until I noticed that occurrences of the symptom didn't really stop when the bug was reported fixed. As I noted, although the bug was reported fixed, it recurred a mere two days later. So it seems to me that the bug was never actually fixed.

Curtis Hovey (sinzui)
tags: removed: blocker
Changed in juju-core:
importance: Critical → High
Revision history for this message
Aaron Bentley (abentley) wrote :

I've added /var/log/syslog and /var/log/mongodb/mongodb.log to the set of logs we gather, so the next failure we see should have the information you require.

Revision history for this message
Aaron Bentley (abentley) wrote :
Changed in juju-core:
status: Incomplete → Triaged
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Thank you - having the syslog output is very helpful. Unfortunately, the DEBUG level output in the Juju machine-N.log files now seems to be missing. The log level is set to INFO in the recent runs. Can this be fixed please? I need the DEBUG output from the peergrouper as well as the MongoDB output in syslog.

On the assumption that the problem is always the same, I'll try to correlate the logs from earlier runs (where we have the DEBUG output in machine-N.log) with the newly available syslog data.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Possible fix (merged) is here: https://github.com/juju/juju/pull/5210

Whether it has fixed the problem or not is not known yet (there have been CI quota issues).

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-rc1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta6 → 2.0-beta7
Revision history for this message
Martin Packman (gz) wrote :

More recent matches on this issue have the syslog and debug peergrouper logging:

<http://reports.vapour.ws/releases/3921/job/functional-ha-recovery/attempt/3152>

2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:43 members: ...
   &peergrouper.machine{id: "0", wantsVote: true, hostPorts: [54.88.23.134:37017 172.31.10.22:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=172.31.10.22:37017
   &peergrouper.machine{id: "2", wantsVote: true, hostPorts: [52.87.183.236:37017 172.31.39.254:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=2, rs_addr=172.31.39.254:37017
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:44 extra: []replicaset.Member(nil)
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:45 maxId: 2
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:117 assessing possible peer group changes:
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:128 machine "2" is a potential voter
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:124 machine "0" is already voting
2016-04-25 19:03:38 DEBUG juju.worker.peergrouper desired.go:142 assessed

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The HA-related CI runs of master were clean for a while, but then this problem resurfaced, so the cleanup of the peergrouper didn't help.

http://juju-ci.vapour.ws/job/functional-ha-backup-restore/3649/

I will keep looking at this. My current theory is that the watchers for new controller machines aren't always firing for some reason.

Curtis Hovey (sinzui)
description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
status: In Progress → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0.0
Changed in juju-core:
milestone: 2.0.0 → 2.0-beta16
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Revision history for this message
Anastasia (anastasia-macmood) wrote :

This seems to be fixed.

I am looking at the latest build on master at http://reports.vapour.ws/releases/4346
and I see that these tests succeeded:

functional-ha-backup-restore (non-voting)
functional-ha-backup-restore-juju1
functional-ha-recovery-azure (non-voting)
functional-ha-recovery-rackspace

Changed in juju:
status: Triaged → Incomplete
milestone: 2.0-beta18 → none
Revision history for this message
Aaron Bentley (abentley) wrote :

This is an intermittent failure and is tagged as such. The previous revision-build, 4345, did have this failure:
http://reports.vapour.ws/releases/4345/job/functional-ha-backup-restore/attempt/4046

The success of 4346 is consistent with an intermittent failure, so I think there is not enough evidence to conclude it is fixed. If we manage to go a week without this failure, that would be convincing evidence to me.

Revision history for this message
Curtis Hovey (sinzui) wrote :

An alternate path to closing this bug is to learn what upstream believes is a reasonable time for 3 dbs to be put into a working replica set.

Changed in juju:
status: Incomplete → Triaged
Revision history for this message
Curtis Hovey (sinzui) wrote :

Also, the only place to check whether an issue is still happening is the issue page linked in the description. It shows recent occurrences and frequency. No more than 5 seconds is needed to check whether an issue is still present.

Changed in juju:
milestone: none → 2.1.0
tags: added: ha
Changed in juju:
milestone: 2.1.0 → 2.0.0
tags: added: ateam
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The fixes for bug 1588224 may help with this too. Investigating further now regardless.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've spent some time analysing recent CI failures relating to this issue. There appear to be a number of causes.

1. After enable-ha is run, the new controller machines come up but one of them can't be contacted by the other controller machines. An example is here:
http://reports.vapour.ws/releases/4445/job/functional-ha-backup-restore/attempt/4239

In this case, controller machine-2 came up and the machine agent appears to be running (according to the juju status output for the controller model) but the MongoDB logs on machine-0 and machine-1 indicate they are unable to connect to machine-2. I suspect a network or firewall issue.

It would have been very helpful to have had the machine agent logs for machine-2 but the retrieval of that file appears to have failed: scp: /var/log/juju/machine-2.log: Permission denied

The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for some reason.

We are now collecting the ifconfig output for each machine, which is awesome. It would also be very helpful to have the output of "ip route" in case there is a routing issue (possible in this case); see the sketch after point 4 below.

What options do we have for diagnosing firewall issues?

2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583

Here you can see that machine-2 has a provisioning error after 20mins and isn't coming up.

Can we update the CI test infrastructure to identify this as a separate failure? This is a lower level concern that's not really related to activating HA.

3. In other cases, the failure seems to be triggered by a very slow or stuck instance. An example is here:
http://reports.vapour.ws/releases/4413/job/functional-ha-backup-restore/attempt/4190

The MongoDB logs for machine-2 show that replicaset initiation has started but proceeds very slowly and then gets stuck. This is backed up by the machine agent logs which show that the machine agent is unable to connect to MongoDB (because it is still initialising).

Again, it would be helpful if these could be categorised as a separate issue. It's a little harder for this one though as it's not so obvious by looking at "juju status".

4. Finally, there are actual failures to initialise a replicaset. An example is here: http://reports.vapour.ws/releases/4399/job/functional-ha-recovery-rackspace/attempt/1093

Here we see MongoDB on both machine-1 and machine-2 complaining with "replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)" errors.

I'm going to focus my efforts on this issue for now.
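
As referenced under point 1 above, here is a sketch of how the harness might gather the extra per-machine network diagnostics ('ip route' alongside ifconfig) over `juju ssh`. The machine IDs, command list and function name are illustrative, not existing juju-ci-tools code:

    import subprocess

    # Commands whose output helps diagnose routing/connectivity problems
    # between controller machines.
    DIAGNOSTIC_COMMANDS = ['ifconfig -a', 'ip route', 'ip addr']

    def collect_network_state(machine_ids=('0', '1', '2')):
        """Run each diagnostic command on each controller machine and
        return the captured output, keyed by (machine, command)."""
        results = {}
        for machine in machine_ids:
            for cmd in DIAGNOSTIC_COMMANDS:
                try:
                    out = subprocess.check_output(
                        ['juju', 'ssh', machine, cmd],
                        stderr=subprocess.STDOUT, timeout=120)
                except subprocess.SubprocessError as err:
                    out = str(err).encode()
                results[(machine, cmd)] = out
        return results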

Revision history for this message
Aaron Bentley (abentley) wrote : Re: [Bug 1453805] Re: Juju takes more than 20 minutes to enable voting

On 2016-10-05 05:19 AM, Menno Smits wrote:
> In this case, controller machine-2 came up and the machine agent appears
> to be running (according to the juju status output for the controller
> model) but the MongoDB logs on machine-0 and machine-1 indicate they are
> unable to connect to machine-2. I suspect a network or firewall issue.
>
> It would have been very helpful to have had the machine agent logs for
> machine-2 but the retrieval of that file appears to have failed: scp:
> /var/log/juju/machine-2.log: Permission denied
>
> The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for
> some reason.

They also got permission denied:
Warning: Permanently added '54.197.28.112' (ECDSA) to the list of known
hosts.
scp: /var/log/syslog: Permission denied

Also, 2016-10-04 01:38:18 WARNING Could not capture ifconfig state:

I think that the user we're logging in as has no privileges. I wonder if
something went wrong with set-up?

> We are now collecting the ifconfig output for each machine which is
> awesome. It would also be very helpful to have the output of "ip route"
> in case there is a routing issue (possible in this case).

It shouldn't be a problem to add that.

> What options do we have for diagnosing firewall issues?

I imagine it would be possible to dump the security groups and rules,
but that would vary by substrate.

> 2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
>
> Here you can see that machine-2 has a provisioning error after 20mins
> and isn't coming up.
>
> Can we update the CI test infrastructure to identify this as a separate
> failure? This is a lower level concern that's not really related to
> activation HA.

We can do that. We have code that checks for errors in status; we'd
probably apply it here. It checks 'juju-status' among others. Shouldn't
'juju-status' be something other than 'pending' if provisioning failed?

We haven't been checking machine-status. I thought it was a
provider-specific value. What are the 'machine-status' values that
indicate an error?

Aaron

Revision history for this message
Aaron Bentley (abentley) wrote :

I was able to create a new rule for 2, so http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583 and future failures like it will be classified separately.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

> I was able to create a new rule for 2, so
> http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
> and future failures like it will be classified separately.

Excellent - thank you.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-rc3 → 2.0.0
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

> In this case, controller machine-2 came up and the machine agent appears
> > to be running (according to the juju status output for the controller
> > model) but the MongoDB logs on machine-0 and machine-1 indicate they are
> > unable to connect to machine-2. I suspect a network or firewall issue.
> >
> > It would have been very helpful to have had the machine agent logs for
> > machine-2 but the retrieval of that file appears to have failed: scp:
> > /var/log/juju/machine-2.log: Permission denied
> >
> > The MongoDB logs (in /var/log/syslog) are also missing for machine-2 for
> > some reason.
>
> They also got permission denied:
> Warning: Permanently added '54.197.28.112' (ECDSA) to the list of known
> hosts.
> scp: /var/log/syslog: Permission denied
>
> Also, 2016-10-04 01:38:18 WARNING Could not capture ifconfig state:
>
> I think that the user we're logging in as as no privileges. I wonder if
> something went wrong with set-up?

The log files are never accessible directly to the `ubuntu` user. I've
just looked at the code in juju-ci-tools which grabs the logs and it
attempts 'sudo chmod -Rf go+r ' on each log file over ssh before
scping them - it's the ssh part that's actually the problem. See the
following in the output:

2016-10-04 01:33:34 WARNING Could not allow access to the juju logs:
2016-10-04 01:33:34 WARNING

I'm not sure why this is failing though. It appears that the stderr
output isn't being captured when this fails. That could be useful to
see (pass stderr=subprocess.STDOUT to subprocess.Popen).
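
A sketch of the suggested change: run the chmod over ssh but merge stderr
into the captured output so the reason for a failure shows up in the CI
logs. Function name, paths and ssh options here are illustrative, not the
actual juju-ci-tools code:

    import subprocess

    def make_logs_readable(host, log_paths=('/var/log/juju/*.log',
                                            '/var/log/syslog')):
        """Loosen permissions on remote log files before scp'ing them,
        capturing stderr so any sudo/ssh failure is visible."""
        cmd = ['ssh', '-o', 'StrictHostKeyChecking=no',
               'ubuntu@{}'.format(host),
               'sudo chmod -Rf go+r {}'.format(' '.join(log_paths))]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        output, _ = proc.communicate()
        if proc.returncode != 0:
            # Previously the failure was logged without its stderr text.
            print('Could not allow access to the juju logs:\n'
                  + output.decode(errors='replace'))
        return proc.returncode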

> > We are now collecting the ifconfig output for each machine which is
> > awesome. It would also be very helpful to have the output of "ip route"
> > in case there is a routing issue (possible in this case).
>
> It shouldn't be a problem to add that.

That would be great.

> > What options do we have for diagnosing firewall issues?
>
> I imagine it would be possible to dump the security groups and rules,
> but that would vary by substrate.

Yes, not straightforward...

> > 2. In some cases, enable-ha fails because one of the new controller machines fails to provision. An example of this is here: http://reports.vapour.ws/releases/4437/job/functional-ha-recovery-azure/attempt/583
> >
> > Here you can see that machine-2 has a provisioning error after 20mins
> > and isn't coming up.
> >
> > Can we update the CI test infrastructure to identify this as a separate
> > failure? This is a lower level concern that's not really related to
> > activation HA.
>
> We can do that. We have code that checks for errors in status, we'd
> probably apply it here. It checks 'juju-status' among others. Shouldn't
> 'juju-status' be something other than 'pending' if provisioning failed?

It seems that because things are stuck rather than a cloud API
returning an error, Juju hasn't determined that a provisioning error
has occurred.

> We haven't been checking machine-status. I thought it was a
> provider-specific value. What are the 'machine-status' values that
> indicate an error?

They're not provider specific. The provider specific status values are
mapped to standard Juju values with the provider specific content
ending up in the "message" s...

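For reference, the kind of status check the CI could add to flag these cases separately, based on the machine-status value. The key layout assumes Juju 2.x JSON status output and 'provisioning error' as the error value, both of which should be verified against the release in use:

    import json
    import subprocess

    # machine-status values treated as provisioning-level failures
    # (assumed; check the Juju status docs/source for the release in use).
    ERROR_STATES = {'provisioning error'}

    def failed_machines():
        """Return {machine_id: message} for machines whose machine-status
        indicates a provisioning error."""
        raw = subprocess.check_output(['juju', 'status', '--format', 'json'])
        machines = json.loads(raw).get('machines', {})
        return {mid: m.get('machine-status', {}).get('message', '')
                for mid, m in machines.items()
                if m.get('machine-status', {}).get('current') in ERROR_STATES}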

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.0 → 2.0.1
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Changed in juju:
status: In Progress → Confirmed
Changed in juju:
status: Confirmed → Triaged
milestone: none → 2.2.0
Revision history for this message
Anastasia (anastasia-macmood) wrote :

This seems to be related to the interplay between MongoDB replica sets and Juju. Tagging it with "mongodb" for future reference.

tags: added: mongodb
Changed in juju:
milestone: 2.2.0 → none
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing 2.2 milestone as we will not be addressing this issue in 2.2.

Revision history for this message
Tim Penhey (thumper) wrote :

Pretty sure this is the same as bug 1720251.

Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.3.2
assignee: nobody → Tim Penhey (thumper)
Changed in juju:
status: Fix Committed → Fix Released