ANARCHY!!!!!!! Entirely leaderless application spotted in the wild

Bug #1858693 reported by Daniel Manrique
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

We noticed a Mojo rollout getting stuck/timing out waiting for steady state, with this from Juju:

19:00:47 O: 2020-01-07 18:58:20 [INFO] canonical-livepatch does not have a leader
19:00:47 O: 2020-01-07 18:58:45 [INFO] All units idle since 2020-01-07 18:58:26.540644Z (external-services/0, sca-app-lb/0, sca-app-r25c5378/0, sca-app-r25c5378/1, sca-app-r8ee1873/1, sca-app-r8ee1873/2, sca-cache-lb/0, sca-cache/0, sca-dp-fe/2, sca-dp-fe/3, sca-memcached/0, sca-prometheus/0, sca-rabbitmq/0, sca-worker-r25c5378/0, sca-worker-r25c5378/1, sca-worker-r25c5378/2, sca-worker-r8ee1873/0, sca-worker-r8ee1873/1)
19:00:47 O: 2020-01-07 18:58:45 [INFO] canonical-livepatch does not have a leader
.... (AD NAUSEAM)

Indeed, juju status and is-leader show that no unit considers itself the leader:

juju run --application canonical-livepatch "is-leader"
- Stdout: |
    False
  UnitId: canonical-livepatch/0
- Stdout: |
    False
  UnitId: canonical-livepatch/1826
- Stdout: |
    False
  UnitId: canonical-livepatch/1827
- Stdout: |
    False
  UnitId: canonical-livepatch/2
- Stdout: |
    False
  UnitId: canonical-livepatch/3
- Stdout: |
    False
  UnitId: canonical-livepatch/1578
- Stdout: |
    False
  UnitId: canonical-livepatch/1820
- Stdout: |
    False
  UnitId: canonical-livepatch/6
- Stdout: |
    False
  UnitId: canonical-livepatch/7
- Stdout: |
    False
  UnitId: canonical-livepatch/1579
- Stdout: |
    False
  UnitId: canonical-livepatch/1821
- Stdout: |
    False
  UnitId: canonical-livepatch/1822
- Stdout: |
    False
  UnitId: canonical-livepatch/1823
- Stdout: |
    False
  UnitId: canonical-livepatch/1824
- Stdout: |
    False
  UnitId: canonical-livepatch/1825
- Stdout: |
    False
  UnitId: canonical-livepatch/1828
- Stdout: |
    False
  UnitId: canonical-livepatch/4
- Stdout: |
    False
  UnitId: canonical-livepatch/5

$ juju status |grep canonical-livepatch
canonical-livepatch active 18 canonical-livepatch local 0 ubuntu
  canonical-livepatch/6 active idle 10.50.70.211 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/0 active idle 10.50.70.212 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1824 active idle 10.50.71.190 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1825 active idle 10.50.71.2 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1822 active idle 10.50.71.184 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1823 active idle 10.50.71.185 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/2 active idle 10.50.70.214 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/3 active idle 10.50.70.213 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1579 active idle 185.125.191.231 Running kernel 4.4.0-157.185-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1578 active idle 185.125.191.240 Running kernel 4.4.0-157.185-generic, patchState: kernel-upgrade-required
  canonical-livepatch/4 active idle 10.50.70.216 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/5 active idle 10.50.70.218 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/7 active idle 10.50.70.217 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1826 active idle 10.50.71.195 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1827 active idle 10.50.71.197 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1821 active idle 10.50.71.186 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1820 active idle 10.50.71.187 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1828 active idle 10.50.71.202 Running kernel 4.4.0-170.199-generic, patchState: applied

(note no leadership asterisk)

Logs on all units show this when the agent is restarted or when a new unit is added:

2020-01-07 16:17:25 DEBUG juju.worker.leadership tracker.go:125 canonical-livepatch/1828 making initial claim for canonical-livepatch leadership
2020-01-07 16:17:25 INFO juju.worker.leadership tracker.go:217 canonical-livepatch leadership for canonical-livepatch/1828 denied
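
For reference, the same leadership-tracker messages can be pulled model-wide rather than from per-unit log files; a quick sketch using standard debug-log filtering:

$ juju debug-log --replay --include-module juju.worker.leadership | grep canonical-livepatch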

This is the PS4.5 shared controller with Juju 2.6.10:

$ juju status canonical-livepatch
Model Controller Cloud/Region Version SLA Timestamp
stg-ols-scasnap3 prodstack-is prodstack-45/bootstack-ps45 2.6.10 unsupported 20:43:57Z

Things we've tried to get a new leader elected:

- Restart all Juju agents for that application (see the sketch after these lists)
- Add a new unit, to see whether a leader is elected given a fresh candidate

Things we haven't tried:

- Restart *all* juju agents
- Remove, then re-add the application
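
Sketch of the per-application agent restart mentioned above (the jujud-unit-<app>-<n> systemd service name is an assumption based on the usual agent naming; verify it on one machine before looping over everything):

for u in $(juju status canonical-livepatch | grep -oE 'canonical-livepatch/[0-9]+' | sort -u); do
  # subordinate units can be targeted directly; juju ssh lands on the hosting machine
  juju ssh "$u" "sudo systemctl restart jujud-unit-${u//\//-}"
done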

This seems very similar to https://bugs.launchpad.net/juju/+bug/1656275, which is already fixed, so it's possible our issue is not entirely the same; also, we're seeing this on a subordinate charm.

Also, IS confirmed this controller is using raft leases, not legacy leases.

Let us know if any more diagnostics/logs are needed.

Christian Muirhead (2-xtian) wrote :

Hi, can you please check for errors related to leases or from juju.worker.dependency in the controller logs?
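
For example, something along these lines on a controller machine should surface them (assuming the usual /var/log/juju/machine-N.log location):

$ juju ssh -m controller 0 "grep -E 'lease|juju.worker.dependency' /var/log/juju/machine-0.log | grep -iE 'error|warn' | tail -n 100"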

Christian Muirhead (2-xtian) wrote :

Also, the output from running juju_engine_report on the controller machines will tell us whether the various raft workers are up.

Christian Muirhead (2-xtian) wrote :

From talking to verterok and roadmr in irc, a new application with two units gets a leader, and removing the leader eventually transfers leadership to the other unit. So it seems like the raft workers are up and the lease clock is advancing.
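
A minimal sketch of that check, using a throwaway two-unit application (the ubuntu charm and the leadercheck name here are just examples):

juju deploy ubuntu leadercheck -n 2
juju run --application leadercheck is-leader
juju remove-unit leadercheck/0   # assuming /0 reported True above
juju run --application leadercheck is-leader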

The other bug referenced is about the db and raft state being out of sync, but status and is-leader both use the raft state now, so I don't think it's that.

Christian Muirhead (2-xtian) wrote :

I got hold of the lease snapshots from IS, they look like this:

Earlier snapshot (taken at 2020-01-08T21:59:54.560)

global-time: 0001-11-28T14:46:13.032021499Z
entries:
...
  ? namespace: application-leadership
    model-uuid: e951cb4a-9abf-420a-8ee0-4dcc7716dbe9
    lease: canonical-livepatch
  : holder: canonical-livepatch/1816
    start: 0001-11-28T14:46:00.03012772Z
    duration: 1m0s

Later snapshot (taken at 2020-01-08T22:03:08.582)

global-time: 0001-11-28T14:49:14.156856294Z
entries:
...
  ? namespace: application-leadership
    model-uuid: e951cb4a-9abf-420a-8ee0-4dcc7716dbe9
    lease: canonical-livepatch
  : holder: canonical-livepatch/1816
    start: 0001-11-28T14:48:47.148564464Z
    duration: 1m0s

That makes it look like the canonical-livepatch/1816 agent (a unit that no longer appears in the status output above) was still running and renewing its lease - there's nothing else that would be updating the start time like that.

So it seems like the bug is more likely to be with unit teardown somehow?

Changed in juju:
status: New → Triaged
importance: Undecided → Medium

Peter Sabaini (peter-sabaini) wrote :

I'm seeing something similar with a mongodb application which doesn't have any leader.

Juju controller has this in machine-0.log:

2020-03-17 21:35:15 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-03-20 22:33:44 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:14 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:46 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:56 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:34:06 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation

Let me know if I can provide more diagnostics.

Christian Muirhead (2-xtian) wrote :

Could you please collect an engine report and a goroutine dump from the controller machine? 0 should be fine if it's the raft leader, but it might be best to get all of them just in case. SSH into the machines and run `juju_engine_report` and `juju_goroutines`.

That error seems quite different from the original one even though the end result is the same. It sounds like the raft-forwarder is getting errors from the raft engine because the internal leaderLoop is blocked for some reason - the goroutine dump will hopefully show where it's blocked.
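
A sketch of collecting both from every controller machine, assuming a three-machine HA controller (machines 0-2; adjust to match the controller model). The introspection helpers are typically shell functions set up in the machine's profile, hence the login shell:

for m in 0 1 2; do
  juju ssh -m controller "$m" "bash -lc juju_engine_report" > "engine-report-$m.txt"
  juju ssh -m controller "$m" "bash -lc juju_goroutines" > "goroutines-$m.txt"
done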

Christian Muirhead (2-xtian) wrote :

Also the lease.log files from /var/log/juju on each controller machine might be useful.
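
For example (path as above; repeat for the other controller machines, and note the file may need a permissions tweak to be readable over scp):

$ juju scp -m controller 0:/var/log/juju/lease.log lease-machine-0.log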

Christian Muirhead (2-xtian) wrote :

Also I'd been assuming that this was happening all the time (so the leaderLoop is fully blocked), but is that right? If you deploy a new application, does the new unit become the leader?

Joseph Phillips (manadart) wrote :

Another question in the context of https://bugs.launchpad.net/juju/+bug/1879663.
Were any of the machines in this model subject to upgrade-series?

Guillermo Gonzalez (verterok) wrote :

No upgrade-series was done for those machines.

This happened again in a brand new model, running 2.9.16

John A Meinel (jameinel) wrote :

In 2.9.16 we are aware of https://bugs.launchpad.net/juju/+bug/1942354, which could be this, but we'd need to dig in to confirm whether it is the same symptom.

Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot

Paul Goins (vultaire) wrote :

I've encountered something similar.

I'm trying to migrate a model from a Xenial to a Bionic controller. Both controllers are on Juju 2.9.37, the latest stable 2.9 series Juju at the time of writing.

Please see the attachment for details, but:

* The model status shows that my attempted migration failed due to a missing unit - which no longer exists, and hasn't existed for some time!
* The graylog-mongodb app only has a single unit, but it is not the leader.
