high CPU load, controller barely responding

Bug #1934524 reported by Laurent Sesquès
This bug affects 5 people
Affects          Status        Importance   Assigned to        Milestone
Canonical Juju   Incomplete    High         Joseph Phillips    —

Bug Description

On an HA-enabled Juju controller running 2.8.9, we suddenly see high CPU usage on one controller unit, and the controller barely responds, blocking any useful work.

The logs show a lot of:

2021-07-02 12:33:37 ERROR juju.worker.dependency engine.go:671 "application-scaler" manifold worker returned unexpected error: fortress worker shutting down
2021-07-02 12:33:37 ERROR juju.worker.dependency engine.go:671 "status-history-pruner" manifold worker returned unexpected error: fortress worker shutting down
2021-07-02 12:33:37 ERROR juju.worker.dependency engine.go:671 "state-cleaner" manifold worker returned unexpected error: fortress worker shutting down

and:

2021-07-02 12:37:33 ERROR juju.worker.raft.raftforwarder target.go:186 couldn't claim lease "b66ed2e6-912a-40e1-8a86-ff4955965091:application-leadership#telegraf#" for "telegraf/3" in db: state changing too quickly; try again soon
2021-07-02 12:37:33 ERROR juju.worker.raft.raftforwarder target.go:186 couldn't claim lease "24892cae-723c-425a-8434-961abeea5de8:application-leadership#nrpe#" for "nrpe/0" in db: state changing too quickly; try again soon
2021-07-02 12:37:34 ERROR juju.worker.raft.raftforwarder target.go:186 couldn't claim lease "ddaf8bfe-e17a-49e2-87ad-2072c5bc8d66:application-leadership#telegraf#" for "telegraf/2" in db: state changing too quickly; try again soon

A restart of the unit with high CPU usage solves the issue.
@jameinel, I left logs on mombin for your attention.

Revision history for this message
John A Meinel (jameinel) wrote :

Other symptoms we were seeing were goroutine counts growing significantly during that time, and a lot of API calls to Leadership (which could just be worker routines trying to run hooks being restarted and then trying to re-establish their state).

juju_engine_report was failing to return.

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

Also, juju_engine_report just hangs:
$ juju_engine_report
Querying @jujud-machine-2 introspection socket: /depengine

Revision history for this message
John A Meinel (jameinel) wrote :

Debugging the cpu.prof seems to be time spent in:
         0 0% 0.16% 1.12s 18.45% github.com/juju/juju/apiserver/common.(*UnitStateAPI).SetState
and
         0 0% 16.14% 0.94s 15.49% github.com/juju/juju/apiserver/common.UnitStateState.ApplyOperation

Revision history for this message
James Simpson (jsimpso) wrote :

Just had another occurrence of this:

ubuntu@juju-789ae5-controller-1:~$ tail /var/log/juju/machine-1.log -n 1000 | grep 'fortress worker shutting down' | wc -l
275

ubuntu@juju-789ae5-controller-0:~$ tail /var/log/juju/machine-0.log -n 1000 | grep 'fortress worker shutting down' | wc -l
0

ubuntu@juju-789ae5-controller-2:~$ tail /var/log/juju/machine-2.log -n 1000 | grep 'fortress worker shutting down' | wc -l
0

juju_engine_report just hangs as previously seen

I've uploaded the following:

https://private-fileshare.canonical.com/~jsimpso/juju_cpu_profile.log
https://private-fileshare.canonical.com/~jsimpso/juju_goroutines.log
https://private-fileshare.canonical.com/~jsimpso/juju_heap_profile.log

Had to restart the jujud-machine agent twice to resolve it - after the first restart the symptoms persisted and matching log entries shot up from 275/1000 to 500/1000. The second restart did the trick though.

Revision history for this message
Alexandre Gomes (alejdg) wrote :

Another occurrence of the same issue. Collected the reports below and restarted jujud-machine in the worst-performing unit.

https://private-fileshare.canonical.com/~alejdg/juju_goroutines-2021-07-07.21.xz
https://private-fileshare.canonical.com/~alejdg/juju_heap_profile-2021-07-07.21.xz

Revision history for this message
John A Meinel (jameinel) wrote :

Debugging the cpu.prof seems to be time spent in:
         0 0% 0.16% 1.12s 18.45% github.com/juju/juju/apiserver/common.(*UnitStateAPI).SetState
and
         0 0% 16.14% 0.94s 15.49% github.com/juju/juju/apiserver/common.UnitStateState.ApplyOperation

But neither of those is high enough to indicate why things would be blocked.

Revision history for this message
John A Meinel (jameinel) wrote :

Private logs and history are available in the private side of this bug: https://bugs.launchpad.net/juju/+bug/1936346

Changed in juju:
status: New → Triaged
importance: Undecided → High
Tom Haddon (mthaddon)
tags: added: canonical-is-upgrades
Revision history for this message
Gareth Woolridge (moon127) wrote :

Yesterday we ran into unresponsive status and stuck models/agents again and the "small hammer" of bouncing jujud-machine-2 got things back.

Today I had to do a rolling restart of all 3 controller machine agents (with some minutes in between), after which things slowly recovered over 10 minutes or so.

Revision history for this message
Tom Haddon (mthaddon) wrote :

subscribed ~field-high

Revision history for this message
Gareth Woolridge (moon127) wrote :

Unresponsive and bounced again.

Revision history for this message
John A Meinel (jameinel) wrote :

We have a PR that is looking to improve aspects of leadership, which we believe will help with this, though we are focusing the fix on the 2.9 series (vs 2.8.9 that is currently being run).
https://github.com/juju/juju/pull/13203

We do already have a fix from Simon for the particular 'leaking of goroutines', but that addresses more of a symptom than the root cause.

Revision history for this message
John A Meinel (jameinel) wrote :

Our current thinking is that requests for leases can outstrip the backend's ability to process them, and there isn't good back pressure to make the clients slow down. That leads to a cascading failure: the queue of items takes longer to process than the clients' retry interval, so the backend never catches up. (Example: client A requests leadership, the backend replies in 6s, but client A timed out after 5s, has already issued the next request, and is no longer listening for the previous response.)
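To make the timing in that example concrete, here is a minimal, self-contained Go sketch of the failure mode (not Juju code; the channel-based backend, the 5s/6s timings, and all names are invented for illustration):

```go
// retrystorm.go - illustrative only: a client whose retry timeout (5s) is
// shorter than the backend's processing time (6s) abandons every response
// and re-queues the same work, so the backend can never catch up.
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	requests := make(chan chan string, 1024) // backend work queue

	// Backend: handles one claim at a time, taking 6s each.
	go func() {
		for reply := range requests {
			time.Sleep(6 * time.Second)
			reply <- "claimed"
		}
	}()

	// Client: retries the same claim with a 5s timeout per attempt.
	for attempt := 1; attempt <= 3; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		reply := make(chan string, 1) // buffered so the backend never blocks
		requests <- reply             // the queue only grows across retries

		select {
		case r := <-reply:
			fmt.Println("attempt", attempt, "succeeded:", r)
		case <-ctx.Done():
			// The backend still spends 6s on this request, but nobody is
			// listening any more: wasted work plus a fresh retry.
			fmt.Println("attempt", attempt, "timed out; retrying")
		}
		cancel()
	}
}
```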

Interestingly I do see the failure/restart moment:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627400967967&to=1627438668890

I also see one time where it looks like things were falling apart, but there wasn't a need to actually restart:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627549993845&to=1627645534714

(leadership times are spiking, goroutine count is going up, etc.)

There is another instance of it at:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627608107638&to=1627615798850

Which again shows that it was able to recover.

These do end up correlated with spikes in "update leaseholders failed" Mongo operations. (Though the same 'update leaseholders' operations also succeed at the same time, so it is mostly indicative of a lot of lease churn.)
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?viewPanel=89&orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627387701024&to=1627457136614

Revision history for this message
John A Meinel (jameinel) wrote :

I asked to get some more logs generated: turning on 'juju.worker.lease.raft=TRACE' and then grabbing the logs. I'll attach them here for both ps4.5 and ps5.

Revision history for this message
Laurent Sesquès (sajoupa) wrote :

We had another occurrence today.
I tried to set 'juju.worker.lease.raft=TRACE' but the command didn't return after a few minutes.
I restarted jujud-machine-1, then set 'juju.worker.lease.raft=TRACE'. Waited for a bit, set it back to DEBUG and grabbed the machine-* logs. They're in your home dir on mombin.

John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.9-next
Revision history for this message
Gareth Woolridge (moon127) wrote :

Multiple rolling restarts already today!

Revision history for this message
Simon Richardson (simonrichardson) wrote :

So I've been spending some time on this; there will be a Discourse post about it in the near future.

-----

The current implementation forwards all lease messages through Juju's internal central eventing system (the juju/pubsub repo). There are a few flaws with this system:

 1. If no connection to a remote server is found, all messages are dropped. Worse still, no errors are returned in this case. The lease system believes that a claim, extension, or revoke succeeded.
 2. If raft cannot keep up with the demand of applying lease requests to the log, there is no back pressure on the system. Events are placed in a queue and will be processed at some point in the future. You can't cancel an event once it is in the queue; you can only tell the caller (via a subscription) that the event they're waiting for will never come, and move on.

The work I've been doing fixes both of these problems. Instead of sending an event over the central hub, we'll use the API server for all lease requests. That allows us to use the battle-tested facade code, along with cancellable requests from the caller. Any API client request that is not backed by a remote server connection will return an error to the lease manager, allowing an appropriate retry strategy.

Raft is now fronted with a queuing system that can coalesce command operations for batching and can exert the correct back pressure where appropriate.
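As a rough illustration of those two properties (a bounded queue that rejects work when full, and batch hand-off so commands can be coalesced), here is a simplified Go sketch; the names and structure are invented for this comment and are not the code in PR #13356:

```go
// queue_sketch.go - illustrative sketch of a bounded, batching queue in
// front of raft: enqueue fails fast when full (back pressure) and the
// consumer drains whole batches so commands can be coalesced.
package main

import (
	"errors"
	"fmt"
)

var ErrQueueFull = errors.New("queue full: retry later")

type Op struct {
	Lease string // e.g. "application-leadership#telegraf#"
	Kind  string // claim, extend, revoke
}

// OpQueue is a bounded queue of lease operations.
type OpQueue struct {
	ops chan Op
}

func NewOpQueue(size int) *OpQueue {
	return &OpQueue{ops: make(chan Op, size)}
}

// Enqueue applies back pressure: if the buffer is full, the caller gets an
// error immediately instead of the request silently piling up.
func (q *OpQueue) Enqueue(op Op) error {
	select {
	case q.ops <- op:
		return nil
	default:
		return ErrQueueFull
	}
}

// NextBatch blocks for at least one operation, then drains whatever else is
// queued (up to max) so the consumer can apply it to the log as one batch.
func (q *OpQueue) NextBatch(max int) []Op {
	batch := []Op{<-q.ops}
	for len(batch) < max {
		select {
		case op := <-q.ops:
			batch = append(batch, op)
		default:
			return batch
		}
	}
	return batch
}

func main() {
	q := NewOpQueue(2)
	_ = q.Enqueue(Op{Lease: "application-leadership#telegraf#", Kind: "claim"})
	_ = q.Enqueue(Op{Lease: "application-leadership#nrpe#", Kind: "extend"})
	if err := q.Enqueue(Op{Lease: "application-leadership#ubuntu#", Kind: "claim"}); err != nil {
		fmt.Println("third op rejected:", err) // caller retries with backoff
	}
	fmt.Println("batch:", q.NextBatch(10))
}
```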

The new[2] implementation is behind a feature flag[1] `features="[raft-api-leases]"` and requires you to bounce the controller once it is set. **It's important to note** that this is still going through load testing at the moment, and I've found that bouncing the controller and coming back online can cause a stampeding herd issue.

 1. https://discourse.charmhub.io/t/controller-feature-flags/426
 2. https://github.com/juju/juju/pull/13356

Revision history for this message
Haw Loeung (hloeung) wrote :

FWIW, we tried the raft-api-leases feature on one of our main shared Juju controllers and ran into issues with leases timing out. We've had to disable the feature again, and that seems to have stabilised things. It was enabled at the same time as upgrading the controllers from Juju 2.8.7 to 2.9.18.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Simon Richardson (simonrichardson)
Harry Pidcock (hpidcock)
Changed in juju:
assignee: Simon Richardson (simonrichardson) → Harry Pidcock (hpidcock)
assignee: Harry Pidcock (hpidcock) → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

Things are much improved since this was filed.
This can be reopened if it is still a concern.

Changed in juju:
status: In Progress → Incomplete
milestone: 2.9-next → none
Revision history for this message
Romain Couturat (romaincout) wrote :

Our 2.9.44 controller suffers from high load (mongo taking ~600% CPU out of 8 CPUs).
Restarting the controller (all units at once) helps for roughly 40 minutes (which might indicate something).
When we ran mgopurge, the load stayed normal for longer (8h) but eventually came back as well.
Over the last 48 hours (and multiple shifts), we ran mgopurge twice and did 9 restarts.
