models not logging

Bug #1930899 reported by james beedy
This bug affects 3 people
Affects             Status        Importance  Assigned to                Milestone
Canonical Juju      Fix Released  High        Achilleas Anagnostopoulos
Canonical Juju 2.8  Fix Released  High        Achilleas Anagnostopoulos

Bug Description

Hello,

We are experiencing a situation where juju models stop generating logs.

ubuntu@juju-controller-1:~$ sudo tail -f /<email address hidden>
2021-06-03 14:25:59 DEBUG juju.worker.instancepoller worker.go:534 moving machine "42" (instance ID "e66hhy") to long poll group
2021-06-03 14:25:59 DEBUG juju.worker.instancepoller worker.go:534 moving machine "43" (instance ID "fwsm4a") to long poll group
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 43 already started as instance "fwsm4a"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 36 already started as instance "kqqhng"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 39 already started as instance "qgmmtn"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 40 already started as instance "k33bxg"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 41 already started as instance "x3rp3w"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 42 already started as instance "e66hhy"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:475 provisioner-harvest-mode is set to destroyed; unknown instances not stopped []

juju debug-log does show some interesting output:

$ juju debug-log
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.dependency stack trace:
lease operation timed out
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/worker/leadership/tracker.go:187: leadership failure
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/worker/leadership/tracker.go:153:
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter juju-run listener stopping
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter juju-run listener stopped
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation preparing operation "resign leadership"
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation executing operation "resign leadership"
unit-license-manager-agent-2: 14:22:59 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation committing operation "resign leadership"
controller-0: 14:23:08 INFO juju.worker.provisioner Shutting down provisioner task machine-0
controller-0: 14:23:08 INFO juju.worker.logger logger worker stopped
controller-0: 14:23:08 INFO juju.worker.machineundertaker tearing down machine undertaker

As you can see, both juju debug-log and the controller logs show that the most recent entries were generated yesterday.

Any insight on how to proceed here would be greatly appreciated.

Thank you!
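
One check worth doing first (a rough sketch; "mymodel" stands in for the affected model name) is whether the model's logging-config has been tightened, and whether the controller model itself still streams fresh entries:

$ juju model-config -m mymodel logging-config
$ juju debug-log -m controller --no-tail | tail -n 3

If logging-config still includes something like <root>=INFO but debug-log ends at yesterday's timestamp, the agents are likely still logging and the problem is in forwarding or storage on the controller side.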

Heitor (heitorpbittencourt) wrote:

We can see the debug logs on the machine when looking at /var/log/juju/unit-foo.log, but there's no update when running `juju debug-log`.
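
For comparison, the divergence shows up clearly when the two are tailed side by side (a sketch; unit-foo-0 stands in for an actual unit name):

# on the unit's machine - the local log keeps growing
$ tail -n 3 /var/log/juju/unit-foo-0.log

# from a juju client - the controller-side copy stops at the older timestamp
$ juju debug-log --include unit-foo-0 --replay --no-tail | tail -n 3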

Heitor (heitorpbittencourt) wrote:

This is with juju 2.8.10 (latest/stable), and the controller is on 2.8.6.

james beedy (jamesbeedy)
description: updated
Ian Booth (wallyworld) wrote:

There's not a lot to go on here.
Was the shutdown part of the jujud agent bouncing?
Is mongo still healthy?
It seems like the raft cluster which maintains the leadership leases may have become unhealthy.
Is this an HA setup?
Are there any disk space or other issues on the controllers?
Are all controllers affected, or just one?
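
A couple of checks that could help answer the mongo and lease questions when run on each controller machine (a rough sketch; the agent name, port, and mongo client location below are typical 2.8 defaults and may differ on a given deployment, e.g. the client may be the juju-db snap's juju-db.mongo):

# dependency-engine view of the lease/raft workers, via the juju introspection helpers
$ juju_engine_report | grep -i -A2 'raft\|lease'

# mongo replica-set health, using the machine agent's own credentials
$ cd /var/lib/juju/agents/machine-0
$ user=$(grep '^tag:' agent.conf | awk '{print $2}')
$ password=$(grep '^statepassword:' agent.conf | awk '{print $2}')
$ mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates \
      --username "$user" --password "$password" --eval 'rs.status()'

# and basic disk space on the controller
$ df -h /var/lib/juju /var/log/juju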

james beedy (jamesbeedy) wrote:

Was the shutdown part of the jujud agent bouncing?
    We do not see any bouncing jujud agents.

Is mongo still healthy?
    I'm guessing it is, though I wouldn't necessarily know if it wasn't. Is there some utility I can run to get a dump of the controller and mongo health for you?

It seems like the raft cluster which maintains leadership leases may have become unhealthy.
    Very possibly.

Is this an HA setup?
    Yes.

Are there any disk space or other issues on the controllers?
    From what I can tell, no; there is plenty of disk space.

John A Meinel (jameinel) wrote:

I'm asking Achilleas to meet with you on Mattermost (chat.charmhub.io) at the start of his work day tomorrow. You can then work on live debugging to figure out what is going wrong; doing this via slow polling between time zones isn't going to work.

Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
importance: Undecided → High
status: New → Incomplete
milestone: none → 2.9.6
Changed in juju:
milestone: 2.9.6 → 2.9.7
Changed in juju:
milestone: 2.9.7 → 2.9.8
Ian Booth (wallyworld) wrote:

Marking as "fixed" as per
https://github.com/juju/juju/pull/13097

Extra logging was added to surface the underlying root cause.
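
For anyone who hits this before picking up a release with that change, turning up the lease-related loggers should surface similar diagnostics (a sketch; the logger names below are the usual juju module paths and may need adjusting against the PR):

$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease=TRACE;juju.core.raftlease=TRACE"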

Changed in juju:
status: Incomplete → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released