Insert/remove cleanups spike caused Juju controllers to become unresponsive.

Bug #1886498 reported by Nick Moffitt
This bug affects 1 person

Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

First, here are the graphs for this incident:

https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&from=1594039330168&to=1594042246856

Second, here's everything mongodb spat to syslog all day:

https://pastebin.canonical.com/p/sdkGJH8nKw/

Per my chat with Simon at https://chat.canonical.com/canonical/pl/u67n6mnaytbaifkyfqgraguxky, it seems like the sequence of events was:

1. Insert/remove cleanups spike.
2. txn_ops and locks go high.
3. Deployments slow way down.
4. txn_ops and locks resolve.
5. Status queries continue to get slower for some time.
6. Everything resolves.

Is there a way we can prevent these "cleanup" spikes?
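
For anyone who wants to watch for this, below is a rough sketch of polling the size of the cleanups backlog directly on a controller's mongod. The database name ("juju"), collection name ("cleanups"), and connection details are assumptions on my part, not anything confirmed in this report:

    # Rough sketch: poll the cleanups backlog on a Juju controller's mongod so
    # a spike like the one in the graphs above is visible early.
    # Assumptions: the state database is "juju", cleanup documents live in a
    # "cleanups" collection, and mongod listens on 37017 with a self-signed cert.
    import time

    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://admin:PASSWORD@localhost:37017/admin",
        tls=True,
        tlsAllowInvalidCertificates=True,  # controller cert is self-signed
    )
    cleanups = client["juju"]["cleanups"]

    while True:
        backlog = cleanups.count_documents({})
        print(f"{time.strftime('%H:%M:%S')} pending cleanups: {backlog}")
        time.sleep(30)

A sustained climb in that count, rather than its absolute value, is what would correspond to the spike in the graphs.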

Revision history for this message
Simon Richardson (simonrichardson) wrote :

From the logs it looks like mongo struggled to get resources from the machine. It took 8.5 seconds to acquire a lock, and Juju really struggled to recover.

------

 Jul 6 12:51:32 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[16302]: [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after wiredTiger: 1010, at end: 1010 }
    Jul 6 12:51:39 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[16302]: [conn188253] command admin.system.users command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:155 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 8443463 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 8455ms
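
For reference, the 8.5 second figure comes from the timeAcquiringMicros field in the second line (8443463 µs ≈ 8.44 s); the saslStart command as a whole took 8455ms. Here is a rough sketch of pulling those lock waits out of syslog; the regex is tuned only to the excerpt above, not a general mongod log parser:

    # Rough sketch: report long lock acquisition waits from mongod syslog
    # lines like the two above. Assumes the same textual log format; the
    # one-second threshold is arbitrary.
    import re
    import sys

    WAIT_RE = re.compile(r"timeAcquiringMicros: \{ r: (\d+) \}")

    for line in sys.stdin:
        match = WAIT_RE.search(line)
        if match:
            wait_s = int(match.group(1)) / 1_000_000
            if wait_s > 1.0:
                print(f"{wait_s:.2f}s lock wait: {line.strip()[:120]}")

Piping the syslog excerpt above through this flags the conn188253 line at roughly 8.44 s of lock wait.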

Revision history for this message
Pen Gale (pengale) wrote :

This might be related to the work that we're doing on reducing txn-watcher sync errors. Triaging as Medium, but I will bring it up in the core team daily.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Revision history for this message
Pen Gale (pengale) wrote :

After some discussion about the incident report: this was originally a noisy neighbors issue, but mongo had a hard time recovering after the controllers were moved away from their neighbors.

Leaving triaged as Medium, as there is work to do in the long run to make Juju more robust in this situation. But the underlying cause was resource starvation, and there aren't simple/immediate fixes.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot