models not logging

Bug #1930899 reported by james beedy
This bug affects 3 people
Affects             Status        Importance  Assigned to                Milestone
Canonical Juju      Fix Released  High        Achilleas Anagnostopoulos
Canonical Juju 2.8  Fix Released  High        Achilleas Anagnostopoulos

Bug Description

Hello,

We are experiencing a situation where juju models stop generating logs.

ubuntu@juju-controller-1:~$ sudo tail -f /<email address hidden>
2021-06-03 14:25:59 DEBUG juju.worker.instancepoller worker.go:534 moving machine "42" (instance ID "e66hhy") to long poll group
2021-06-03 14:25:59 DEBUG juju.worker.instancepoller worker.go:534 moving machine "43" (instance ID "fwsm4a") to long poll group
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 43 already started as instance "fwsm4a"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 36 already started as instance "kqqhng"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 39 already started as instance "qgmmtn"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 40 already started as instance "k33bxg"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 41 already started as instance "x3rp3w"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:423 machine 42 already started as instance "e66hhy"
2021-06-03 14:26:00 INFO juju.worker.provisioner provisioner_task.go:475 provisioner-harvest-mode is set to destroyed; unknown instances not stopped []

juju debug-log does show some interesting output:

$ juju debug-log
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.dependency stack trace:
lease operation timed out
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/worker/leadership/tracker.go:187: leadership failure
/var/lib/jenkins/workspace/BuildJuju-centos-amd64/_build/src/github.com/juju/juju/worker/leadership/tracker.go:153:
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter juju-run listener stopping
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter juju-run listener stopped
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation preparing operation "resign leadership"
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation executing operation "resign leadership"
unit-license-manager-agent-2: 14:22:59 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-license-manager-agent-2: 14:22:59 DEBUG juju.worker.uniter.operation committing operation "resign leadership"
controller-0: 14:23:08 INFO juju.worker.provisioner Shutting down provisioner task machine-0
controller-0: 14:23:08 INFO juju.worker.logger logger worker stopped
controller-0: 14:23:08 INFO juju.worker.machineundertaker tearing down machine undertaker

As you can see, both juju debug-log and the controller logs show that the most recent entries were generated yesterday.

Any insight on how to proceed here would be greatly appreciated.

Thank you!
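
One check worth doing first (a rough sketch; "mymodel" stands in for the affected model name) is whether the model's logging-config has been tightened, and whether the controller model itself still streams fresh entries:

$ juju model-config -m mymodel logging-config
$ juju debug-log -m controller --no-tail | tail -n 3

If logging-config still includes something like <root>=INFO but debug-log ends at yesterday's timestamp, the agents are likely still logging and the problem is in forwarding or storage on the controller side.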

Heitor (heitorpbittencourt) wrote:

We can see the debug logs on the machine when looking at /var/log/juju/unit-foo.log, but there's no update when running `juju debug-log`.
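
For comparison, the divergence shows up clearly when the two are tailed side by side (a sketch; unit-foo-0 stands in for an actual unit name):

# on the unit's machine - the local log keeps growing
$ tail -n 3 /var/log/juju/unit-foo-0.log

# from a juju client - the controller-side copy stops at the older timestamp
$ juju debug-log --include unit-foo-0 --replay --no-tail | tail -n 3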

Heitor (heitorpbittencourt) wrote:

This is with juju 2.8.10 (latest/stable), and the controller is on 2.8.6.

james beedy (jamesbeedy)
description: updated
Ian Booth (wallyworld) wrote:

There's not a lot to go on here.
Was the shutdown part of the jujud agent bouncing?
Is mongo still healthy?
It seems like the raft cluster which maintains the leadership leases may have become unhealthy.
Is this an HA setup?
Are there any disk space or other issues on the controllers?
Are all controllers affected, or just one?
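
A couple of checks that could help answer the mongo and lease questions when run on each controller machine (a rough sketch; the agent name, port, and mongo client location below are typical 2.8 defaults and may differ on a given deployment, e.g. the client may be the juju-db snap's juju-db.mongo):

# dependency-engine view of the lease/raft workers, via the juju introspection helpers
$ juju_engine_report | grep -i -A2 'raft\|lease'

# mongo replica-set health, using the machine agent's own credentials
$ cd /var/lib/juju/agents/machine-0
$ user=$(grep '^tag:' agent.conf | awk '{print $2}')
$ password=$(grep '^statepassword:' agent.conf | awk '{print $2}')
$ mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates \
      --username "$user" --password "$password" --eval 'rs.status()'

# and basic disk space on the controller
$ df -h /var/lib/juju /var/log/juju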

james beedy (jamesbeedy) wrote:

Was the shutdown part of the jujud agent bouncing?
    We do not see any bouncing jujud agents.

Is mongo still healthy?
    I'm guessing it is, though I wouldn't necessarily know if it wasn't. Is there some utility I can run to get a dump of the controller and mongo health for you?

It seems like the raft cluster which maintains leadership leases may have become unhealthy.
    Very possibly.

Is this an HA setup?
    Yes.

Are there any disk space or other issues on the controllers?
    From what I can tell, no; there is plenty of disk space.

John A Meinel (jameinel) wrote:

I'm asking Achilleas to meet with you on Mattermost (chat.charmhub.io) at the start of his work day tomorrow. You can then work on live debugging to figure out what is going wrong; doing this via slow polling between time zones isn't going to work.

Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
importance: Undecided → High
status: New → Incomplete
milestone: none → 2.9.6
Changed in juju:
milestone: 2.9.6 → 2.9.7
Changed in juju:
milestone: 2.9.7 → 2.9.8
Ian Booth (wallyworld) wrote:

Marking as "fixed" as per
https://github.com/juju/juju/pull/13097

Extra logging was added to surface the underlying root cause.
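
For anyone who hits this before picking up a release with that change, turning up the lease-related loggers should surface similar diagnostics (a sketch; the logger names below are the usual juju module paths and may need adjusting against the PR):

$ juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease=TRACE;juju.core.raftlease=TRACE"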

Changed in juju:
status: Incomplete → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released