ANARCHY!!!!!!! Entirely leaderless application spotted in the wild

Bug #1858693 reported by Daniel Manrique
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

We noticed a Mojo rollout getting stuck/timing out waiting for steady state, with this from Juju:

19:00:47 O: 2020-01-07 18:58:20 [INFO] canonical-livepatch does not have a leader
19:00:47 O: 2020-01-07 18:58:45 [INFO] All units idle since 2020-01-07 18:58:26.540644Z (external-services/0, sca-app-lb/0, sca-app-r25c5378/0, sca-app-r25c5378/1, sca-app-r8ee1873/1, sca-app-r8ee1873/2, sca-cache-lb/0, sca-cache/0, sca-dp-fe/2, sca-dp-fe/3, sca-memcached/0, sca-prometheus/0, sca-rabbitmq/0, sca-worker-r25c5378/0, sca-worker-r25c5378/1, sca-worker-r25c5378/2, sca-worker-r8ee1873/0, sca-worker-r8ee1873/1)
19:00:47 O: 2020-01-07 18:58:45 [INFO] canonical-livepatch does not have a leader
.... (AD NAUSEAM)

Indeed, juju status and is-leader show that no unit considers itself the leader:

juju run --application canonical-livepatch "is-leader"
- Stdout: |
    False
  UnitId: canonical-livepatch/0
- Stdout: |
    False
  UnitId: canonical-livepatch/1826
- Stdout: |
    False
  UnitId: canonical-livepatch/1827
- Stdout: |
    False
  UnitId: canonical-livepatch/2
- Stdout: |
    False
  UnitId: canonical-livepatch/3
- Stdout: |
    False
  UnitId: canonical-livepatch/1578
- Stdout: |
    False
  UnitId: canonical-livepatch/1820
- Stdout: |
    False
  UnitId: canonical-livepatch/6
- Stdout: |
    False
  UnitId: canonical-livepatch/7
- Stdout: |
    False
  UnitId: canonical-livepatch/1579
- Stdout: |
    False
  UnitId: canonical-livepatch/1821
- Stdout: |
    False
  UnitId: canonical-livepatch/1822
- Stdout: |
    False
  UnitId: canonical-livepatch/1823
- Stdout: |
    False
  UnitId: canonical-livepatch/1824
- Stdout: |
    False
  UnitId: canonical-livepatch/1825
- Stdout: |
    False
  UnitId: canonical-livepatch/1828
- Stdout: |
    False
  UnitId: canonical-livepatch/4
- Stdout: |
    False
  UnitId: canonical-livepatch/5

$ juju status |grep canonical-livepatch
canonical-livepatch active 18 canonical-livepatch local 0 ubuntu
  canonical-livepatch/6 active idle 10.50.70.211 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/0 active idle 10.50.70.212 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1824 active idle 10.50.71.190 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1825 active idle 10.50.71.2 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1822 active idle 10.50.71.184 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1823 active idle 10.50.71.185 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/2 active idle 10.50.70.214 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/3 active idle 10.50.70.213 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1579 active idle 185.125.191.231 Running kernel 4.4.0-157.185-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1578 active idle 185.125.191.240 Running kernel 4.4.0-157.185-generic, patchState: kernel-upgrade-required
  canonical-livepatch/4 active idle 10.50.70.216 Running kernel 4.4.0-148.174-generic, patchState: kernel-upgrade-required
  canonical-livepatch/5 active idle 10.50.70.218 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/7 active idle 10.50.70.217 Running kernel 4.4.0-166.195-generic, patchState: kernel-upgrade-required
  canonical-livepatch/1826 active idle 10.50.71.195 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1827 active idle 10.50.71.197 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1821 active idle 10.50.71.186 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1820 active idle 10.50.71.187 Running kernel 4.4.0-170.199-generic, patchState: applied
  canonical-livepatch/1828 active idle 10.50.71.202 Running kernel 4.4.0-170.199-generic, patchState: applied

(note no leadership asterisk)

Logs on all units show this when the agent is restarted or when a new unit is added:

2020-01-07 16:17:25 DEBUG juju.worker.leadership tracker.go:125 canonical-livepatch/1828 making initial claim for canonical-livepatch leadership
2020-01-07 16:17:25 INFO juju.worker.leadership tracker.go:217 canonical-livepatch leadership for canonical-livepatch/1828 denied
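
For reference, the same leadership-tracker messages can be pulled model-wide rather than from per-unit log files; a quick sketch using standard debug-log filtering:

$ juju debug-log --replay --include-module juju.worker.leadership | grep canonical-livepatch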

This is the PS4.5 shared controller with Juju 2.6.10:

$ juju status canonical-livepatch
Model Controller Cloud/Region Version SLA Timestamp
stg-ols-scasnap3 prodstack-is prodstack-45/bootstack-ps45 2.6.10 unsupported 20:43:57Z

Things we've tried to get a new leader elected:

- Restart all Juju agents for that application (see the sketch after these lists)
- Add a new unit, to see whether a leader is elected given a fresh candidate

Things we haven't tried:

- Restart *all* juju agents
- Remove, then re-add the application
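
Sketch of the per-application agent restart mentioned above (the jujud-unit-<app>-<n> systemd service name is an assumption based on the usual agent naming; verify it on one machine before looping over everything):

for u in $(juju status canonical-livepatch | grep -oE 'canonical-livepatch/[0-9]+' | sort -u); do
  # subordinate units can be targeted directly; juju ssh lands on the hosting machine
  juju ssh "$u" "sudo systemctl restart jujud-unit-${u//\//-}"
done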

This seems very similar to https://bugs.launchpad.net/juju/+bug/1656275, which is already fixed, so it's possible our issue is not entirely the same; also, we're seeing this on a subordinate charm.

Also, IS confirmed this controller is using raft leases, not legacy leases.

Let us know if any more diagnostics/logs are needed.

Christian Muirhead (2-xtian) wrote :

Hi, can you please check for errors related to leases or from juju.worker.dependency in the controller logs?
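
For example, something along these lines on a controller machine should surface them (assuming the usual /var/log/juju/machine-N.log location):

$ juju ssh -m controller 0 "grep -E 'lease|juju.worker.dependency' /var/log/juju/machine-0.log | grep -iE 'error|warn' | tail -n 100"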

Christian Muirhead (2-xtian) wrote :

Also, the output from running juju_engine_report on the controller machines will tell us whether the various raft workers are up.

Christian Muirhead (2-xtian) wrote :

From talking to verterok and roadmr in irc, a new application with two units gets a leader, and removing the leader eventually transfers leadership to the other unit. So it seems like the raft workers are up and the lease clock is advancing.
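
A minimal sketch of that check, using a throwaway two-unit application (the ubuntu charm and the leadercheck name here are just examples):

juju deploy ubuntu leadercheck -n 2
juju run --application leadercheck is-leader
juju remove-unit leadercheck/0   # assuming /0 reported True above
juju run --application leadercheck is-leader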

The other bug referenced is about the db and raft state being out of sync, but status and is-leader both use the raft state now, so I don't think it's that.

Christian Muirhead (2-xtian) wrote :

I got hold of the lease snapshots from IS, they look like this:

Earlier snapshot (taken at 2020-01-08T21:59:54.560)

global-time: 0001-11-28T14:46:13.032021499Z
entries:
...
  ? namespace: application-leadership
    model-uuid: e951cb4a-9abf-420a-8ee0-4dcc7716dbe9
    lease: canonical-livepatch
  : holder: canonical-livepatch/1816
    start: 0001-11-28T14:46:00.03012772Z
    duration: 1m0s

Later snapshot (taken at 2020-01-08T22:03:08.582)

global-time: 0001-11-28T14:49:14.156856294Z
entries:
...
  ? namespace: application-leadership
    model-uuid: e951cb4a-9abf-420a-8ee0-4dcc7716dbe9
    lease: canonical-livepatch
  : holder: canonical-livepatch/1816
    start: 0001-11-28T14:48:47.148564464Z
    duration: 1m0s

That makes it look like the canonical-livepatch/1816 agent (a unit that no longer appears in the status output above) was still running and renewing its lease - there's nothing else that would be updating the start time like that.

So it seems like the bug is more likely to be with unit teardown somehow?

Changed in juju:
status: New → Triaged
importance: Undecided → Medium

Peter Sabaini (peter-sabaini) wrote :

I'm seeing something similar with a mongodb application which doesn't have any leader.

Juju controller has this in machine-0.log:

2020-03-17 21:35:15 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-03-20 22:33:44 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:14 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:46 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:33:56 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-07 21:34:06 ERROR juju.worker.dependency engine.go:671 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation

Let me know if I can provide more diagnostics.

Christian Muirhead (2-xtian) wrote :

Could you please collect an engine report and a goroutine dump from the controller machine? 0 should be fine if it's the raft leader, but it might be best to get all of them just in case. SSH into the machines and run `juju_engine_report` and `juju_goroutines`.

That error seems quite different from the original one even though the end result is the same. It sounds like the raft-forwarder is getting errors from the raft engine because the internal leaderLoop is blocked for some reason - the goroutine dump will hopefully show where it's blocked.
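
A sketch of collecting both from every controller machine, assuming a three-machine HA controller (machines 0-2; adjust to match the controller model). The introspection helpers are typically shell functions set up in the machine's profile, hence the login shell:

for m in 0 1 2; do
  juju ssh -m controller "$m" "bash -lc juju_engine_report" > "engine-report-$m.txt"
  juju ssh -m controller "$m" "bash -lc juju_goroutines" > "goroutines-$m.txt"
done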

Christian Muirhead (2-xtian) wrote :

Also the lease.log files from /var/log/juju on each controller machine might be useful.
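
For example (path as above; repeat for the other controller machines, and note the file may need a permissions tweak to be readable over scp):

$ juju scp -m controller 0:/var/log/juju/lease.log lease-machine-0.log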

Christian Muirhead (2-xtian) wrote :

Also I'd been assuming that this was happening all the time (so the leaderLoop is fully blocked), but is that right? If you deploy a new application, does the new unit become the leader?

Joseph Phillips (manadart) wrote :

Another question in the context of https://bugs.launchpad.net/juju/+bug/1879663.
Were any of the machines in this model subject to upgrade-series?

Guillermo Gonzalez (verterok) wrote :

No upgrade-series was done for those machines.

This happened again in a brand new model, running 2.9.16

John A Meinel (jameinel) wrote :

In 2.9.16 we are aware of https://bugs.launchpad.net/juju/+bug/1942354, which could be this, but we'd need to dig in to confirm whether it is the same symptom.

Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot

Paul Goins (vultaire) wrote :

I've encountered something similar.

I'm trying to migrate a model from a Xenial to a Bionic controller. Both controllers are on Juju 2.9.37, the latest stable 2.9 series Juju at the time of writing.

Please see the attachment for details, but:

* The model status shows that my attempted migration failed due to a missing unit - which no longer exists, and hasn't existed for some time!
* The graylog-mongodb app only has a single unit, but it is not the leader.
