Non leader controller agents lost unexpectedly

Bug #2037308 reported by Alexander Balderson
This bug affects 3 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Running Juju 3.1.6, all the controllers start in HA and the non-leader units come up. After about 2 hours, however, leader election starts to fail and the 2 non-leader controllers go down (and stay down).

From the leader unit we see:
2023-09-25 08:20:30 ERROR juju.worker.dependency engine.go:695 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease operation timed out
2023-09-25 08:20:30 INFO juju.worker.uniter uniter.go:347 unit "controller/0" shutting down: catacomb 0xc004cf4480 is dying
2023-09-25 08:20:30 DEBUG juju.worker.uniter runlistener.go:130 juju-exec listener stopping
2023-09-25 08:20:30 DEBUG juju.worker.uniter runlistener.go:149 juju-exec listener stopped
2023-09-25 08:20:30 DEBUG juju.worker.uniter.remotestate watcher.go:1092 starting secrets rotation watcher
2023-09-25 08:20:30 DEBUG juju.worker.dependency engine.go:603 "uniter" manifold worker completed successfully
2023-09-25 08:20:33 DEBUG juju.worker.dependency engine.go:580 "leadership-tracker" manifold worker started at 2023-09-25 08:20:33.656614138 +0000 UTC
2023-09-25 08:20:33 DEBUG juju.worker.dependency engine.go:580 "uniter" manifold worker started at 2023-09-25 08:20:33.667361616 +0000 UTC
2023-09-25 08:20:33 DEBUG juju.worker.uniter uniter.go:932 starting local juju-exec listener on {unix /var/lib/juju/agents/unit-controller-0/run.socket <nil>}
2023-09-25 08:20:33 INFO juju.worker.uniter uniter.go:363 unit "controller/0" started
2023-09-25 08:20:33 DEBUG juju.worker.uniter runlistener.go:117 juju-exec listener running
2023-09-25 08:20:33 INFO juju.worker.uniter uniter.go:389 hooks are retried false
2023-09-25 08:21:23 DEBUG juju.worker.uniter.remotestate watcher.go:1092 starting secrets rotation watcher
2023-09-25 08:21:23 DEBUG juju.worker.uniter.remotestate watcher.go:1105 starting secret revisions expiry watcher
2023-09-25 08:21:23 DEBUG juju.worker.uniter.remotestate watcher.go:1118 starting obsolete secret revisions watcher
2023-09-25 08:21:23 DEBUG juju.worker.dependency engine.go:618 "leadership-tracker" manifold worker stopped: leadership failure: lease operation timed out
stack trace:
lease operation timed out
github.com/juju/juju/worker/leadership.(*Tracker).refresh:187: leadership failure
github.com/juju/juju/worker/leadership.(*Tracker).loop:127:

At the same time, the other controllers say:
2023-09-25 08:17:48 DEBUG juju.worker.dependency engine.go:618 "secret-drain-worker" manifold worker stopped: connection is shut down
stack trace:
github.com/juju/juju/rpc.init:14: connection is shut down
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1261:

I haven't been able to pick out what caused the worker to stop and fail to start again.
The test run can be found at:
https://solutions.qa.canonical.com/testruns/3f60bdfd-6087-425c-8927-9d432d7e403b
and the controller crashdump can be found at:
https://oil-jenkins.canonical.com/artifacts/3f60bdfd-6087-425c-8927-9d432d7e403b/generated/generated/juju_maas_controller/juju-crashdump-controller-2023-09-25-08.38.37.tar.gz
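
For anyone reproducing this, a rough sketch of how the symptom can be observed on a live deployment (the machine and unit names here are just this run's layout, not anything special):

    juju status -m controller
    juju debug-log -m controller --replay --include machine-0 --include unit-controller-0 | grep -i "lease operation timed out"

The first command shows the two non-leader controller units as lost; the second pulls the lease timeout errors quoted above out of the leader's agent log.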

summary: - [3.1.6] Non leader controller agents lost as soon as they start
+ [3.1.6] Non leader controller agents lost unexpectedly
Jeffrey Chang (modern911) wrote : Re: [3.1.6] Non leader controller agents lost unexpectedly

All failed runs can be found at https://solutions.qa.canonical.com/bugs/2037308

John A Meinel (jameinel) wrote :

This definitely should be investigated, since it shows failures in 3.1.6 that weren't seen before.

Changed in juju:
importance: Undecided → High
status: New → Triaged
Jeffrey Chang (modern911) wrote :

We found a case on Juju 3.1.5:
testrun https://solutions.qa.canonical.com/testruns/bce89dbd-17d7-4a88-9f0d-5d24647cbba2
crashdump https://oil-jenkins.canonical.com/artifacts/bce89dbd-17d7-4a88-9f0d-5d24647cbba2/generated/generated/juju_maas_controller/juju-crashdump-controller-2023-10-01-04.49.43.tar.gz

I see some connection issues:

2023-10-01 04:51:43 DEBUG juju.mgo server.go:329 Ping for 10.246.64.203:37017 is 1 ms
2023-10-01 04:51:43 DEBUG juju.mgo server.go:329 Ping for 10.246.64.201:37017 is 1 ms
2023-10-01 04:51:43 DEBUG juju.mgo server.go:329 Ping for 10.246.64.202:37017 is 6 ms
2023-10-01 04:51:43 DEBUG juju.apiserver request_notifier.go:134 [24] user-admin API connection terminated after 471.125531ms
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:134 [26] user-admin API connection terminated after 541.63376ms
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:189 <- [C] machine-2 {"request-id":101,"type":"ProxyUpdater","version":2,"request":"WatchForProxyConfigAndAPIHostPortChanges","params":"'params redacted'"}
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:221 -> [C] machine-2 10.433376ms {"request-id":101,"response":"'body redacted'"} ProxyUpdater[""].WatchForProxyConfigAndAPIHostPortChanges
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:189 <- [C] machine-2 {"request-id":102,"type":"ProxyUpdater","version":2,"request":"ProxyConfig","params":"'params redacted'"}
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:189 <- [C] machine-2 {"request-id":103,"type":"NotifyWatcher","version":1,"id":"24","request":"Next","params":"'params redacted'"}
2023-10-01 04:51:44 DEBUG juju.apiserver request_notifier.go:221 -> [C] machine-2 10.01048ms {"request-id":102,"response":"'body redacted'"} ProxyUpdater[""].ProxyConfig
2023-10-01 04:51:44 WARNING juju.core.raftlease client.go:134 response timeout waiting for Command(ver: 1, op: extend, ns: singular-controller, model: 5a837b, lease: 5a837ba3-bb3b-4a72-85f0-34afc3c9fa3c, holder: machine-0) to be processed
2023-10-01 04:51:44 WARNING juju.core.raftlease client.go:134 response timeout waiting for Command(ver: 1, op: extend, ns: singular-controller, model: 1072fa, lease: 1072fa17-025d-4f16-8ce8-286b381ad824, holder: machine-0) to be processed
2023-10-01 04:51:44 WARNING juju.core.raftlease client.go:134 response timeout waiting for Command(ver: 1, op: extend, ns: singular-controller, model: 7ab2db, lease: 7ab2db7f-752c-4a51-8271-6a694d91507f, holder: machine-0) to be processed
2023-10-01 04:51:44 DEBUG juju.worker.raft writer.go:29 [raft] 2023-10-01T04:51:44.313Z [WARN] raft: Election timeout reached, restarting election
2023-10-01 04:51:44 DEBUG juju.worker.raft writer.go:29 [raft] 2023-10-01T04:51:44.314Z [INFO] raft: entering candidate state: node="Node at 10.246.64.201:17070 [Candidate]" term=439
2023-10-01 04:51:44 INFO juju.worker.raft.rafttransport dialer.go:53 dialing 10.246.64.202:17070
2023-10-01 04:51:44 INFO juju.worker.raft.rafttransport dialer.go:53 dialing 10.246.64.203:17070
2023-10-01 04:51:44 DEBUG juju.worker.raft writer.go:29 [raft] 2023-10-01T04:51:44.316Z [ERROR] raft: failed to make requestVote RPC: target="{Vot...


Alan Baghumian (alanbach) wrote :

I encountered this error today with my Juju 2.9.37 controller/model. My first attempt was to upgrade my controller and models to 2.9.45, which did not resolve the issue.

Finally, I rebuilt the raft using the magical binary from https://bugs.launchpad.net/juju/+bug/1853055 (comment #16), which resolved the issue.
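
For the record, the rough shape of that workaround (a sketch only; it assumes the procedure from that bug still applies, and the actual fix binary is the one attached to comment #16 there):

    # on every controller machine, stop the agent first (jujud-machine-0 is an example unit name)
    sudo systemctl stop jujud-machine-0
    # back up the raft log before touching it
    sudo cp -a /var/lib/juju/raft /var/lib/juju/raft.bak
    # run the binary from the linked comment against /var/lib/juju/raft, then restart
    sudo systemctl start jujud-machine-0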

summary: - [3.1.6] Non leader controller agents lost unexpectedly
+ Non leader controller agents lost unexpectedly
Alan Baghumian (alanbach) wrote :

I removed the version from the bug title, as this also happens on the 2.9.x series.
