Agents losing connection to leader tracker

Bug #1680582 reported by Peter Sabaini
This bug affects 4 people
Affects:      juju-core
Status:       Won't Fix
Importance:   Undecided
Assigned to:  Unassigned
Milestone:    (none)

Bug Description

Hi,

we see this quite often in our agent log files:

2017-04-06 19:16:21 WARNING juju.worker.dependency engine.go:305 failed to start "uniter" manifold worker: "leadership-tracker" not running: dependency not available

This seems to mess with leadership elections. E.g., I can see that the keystone token-flush cronjob (which should only be activated for the leader) is present on 2 of our HA units.
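
For context: charms typically manage leader-only cron jobs from their hooks via the is-leader hook tool, roughly like the sketch below (file names are illustrative, not the actual keystone charm code):

# illustrative hook snippet, not actual charm code
if [ "$(is-leader)" = "True" ]; then
    install -m 0644 files/token-flush.cron /etc/cron.d/keystone-token-flush
else
    rm -f /etc/cron.d/keystone-token-flush
fi

If the uniter never starts, these hooks never run, so a former leader keeps its stale cron entry while the new leader installs its own, which would explain the job turning up on 2 units.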

Also, this seems to impede juju run:
$ juju run --unit keystone/0 true
error: dial unix /var/lib/juju/agents/unit-keystone-0/run.socket: connect: no such file or directory

Frequently (but not always) this also seems to hang 'juju status'.
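
On the affected machine, the missing socket suggests the unit agent never got its uniter running. A quick check (service names follow the Juju 1.x upstart convention jujud-unit-<name> on trusty; adjust for your unit):

$ ls -l /var/lib/juju/agents/unit-keystone-0/run.socket   # absent while the uniter is down
$ sudo status jujud-unit-keystone-0                       # upstart status of the unit agent
$ sudo restart jujud-unit-keystone-0                      # once the uniter starts, the socket is recreated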

Juju version: 1.25.10

Changed in juju-core:
importance: Undecided → Critical
status: New → Triaged
milestone: none → 1.25.12
Anastasia (anastasia-macmood) wrote:

@Peter Sabaini,

What do you do when you see this failure? Do you restart the state server?

Status performance may also be affected by how long the environment has been up. Some new additions to the mgopurge tool deal with data build-up and could help with performance; see https://github.com/juju/juju/wiki/MgoPurgeTool.
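
A typical invocation on a 1.25 state server looks roughly like the sketch below; mgopurge reads the mongo credentials from the local agent configuration, but stages and flags vary between versions, so check mgopurge -help and the wiki first:

$ sudo stop jujud-machine-0    # some stages need the state server's agent stopped
$ ./mgopurge                   # run the repair/cleanup stages against the local juju mongodb
$ sudo start jujud-machine-0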

Changed in juju-core:
status: Triaged → Incomplete
importance: Critical → Undecided
milestone: 1.25.12 → none
Peter Sabaini (peter-sabaini) wrote:

Anastasia,

Indeed, we typically restart the state server when this happens.

The juju-db in one environment where I saw this issue yesterday is ~8 GB on disk; we ran mgopurge there on Feb 21st. Are you advising that we re-run it?

Thanks

Anastasia (anastasia-macmood) wrote:

@Peter Sabaini (peter-sabaini),
The mgopurge tool now has a facility to prune transactions, greatly reducing db size. Pruning can be run on a live, non-paused system, as per the wiki. Please test it on a non-production environment first; we have mostly developed it for Juju 2.x environments, which run on mongo 3.2.

Do you happen to know how many prunable transactions are currently in the db?
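
One rough way to gauge it is to count completed transactions directly in mongo on the state server. A sketch only, assuming the usual admin credentials from the machine agent's agent.conf (in mgo/txn, state 5 is aborted and 6 is applied; those are the prune candidates):

$ mongo --ssl -u admin -p "$PASSWORD" localhost:37017/admin
> use juju
> db.txns.count({s: {$in: [5, 6]}})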

In Juju 2.2, equivalent functionality will run every 2 hours, when/if needed.

If restarting the state server works well, mgopurge may not be needed as frequently.

I am marking this bug as Won't Fix since restarting the state server is a workaround that gets you out of the tight spot, and we do not have the capacity to focus on Juju 1.25.x at the moment; we are full steam ahead on Juju 2.x.

Changed in juju-core:
status: Incomplete → Won't Fix
Peter Sabaini (peter-sabaini) wrote:

The problem here is that we do not necessarily see when/if leadership tracking is broken until services that rely on correct leadership tracking break (e.g. multiple keystone token-flush jobs running, or swift-proxies not being restarted after automatic ring balancing, etc.).

Peter Sabaini (peter-sabaini) wrote:

For clarification I should maybe add: the units fail to connect to the leadership tracker, but the effect is that the unit agent does not start up at all. So, once a unit gets into this state it won't run any hooks (and therefore misses relation updates, config changes, etc.), and juju run cannot be used against it either.
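
A crude way to spot stuck units before a dependent service breaks (a sketch only; unit names are illustrative) is to probe each unit agent via juju run and count the leaders. A socket error or timeout flags a dead uniter, and anything other than exactly one "True" flags broken leadership:

$ for unit in keystone/0 keystone/1 keystone/2; do
>     printf '%s: ' "$unit"
>     timeout 30 juju run --unit "$unit" is-leader || echo AGENT-STUCK
> done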
