Insert/remove cleanups spike caused Juju controllers to become unresponsive.

Bug #1886498 reported by Nick Moffitt
This bug affects 1 person

Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

First, here are the graphs for this incident:

https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&from=1594039330168&to=1594042246856

Second, here's everything mongodb spat to syslog all day:

https://pastebin.canonical.com/p/sdkGJH8nKw/

Per my chat with Simon at https://chat.canonical.com/canonical/pl/u67n6mnaytbaifkyfqgraguxky, it seems like the sequence of events was:

1. Insert/remove cleanups spike.
2. txn_ops and locks go high.
3. Deployments slow way down.
4. txn_ops and locks resolve.
5. Status queries continue to get slower for some time.
6. Everything resolves.

Is there a way we can prevent these "cleanup" spikes?
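
For anyone who wants to watch for this, below is a rough sketch of polling the size of the cleanups backlog directly on a controller's mongod. The database name ("juju"), collection name ("cleanups"), and connection details are assumptions on my part, not anything confirmed in this report:

    # Rough sketch: poll the cleanups backlog on a Juju controller's mongod so
    # a spike like the one in the graphs above is visible early.
    # Assumptions: the state database is "juju", cleanup documents live in a
    # "cleanups" collection, and mongod listens on 37017 with a self-signed cert.
    import time

    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://admin:PASSWORD@localhost:37017/admin",
        tls=True,
        tlsAllowInvalidCertificates=True,  # controller cert is self-signed
    )
    cleanups = client["juju"]["cleanups"]

    while True:
        backlog = cleanups.count_documents({})
        print(f"{time.strftime('%H:%M:%S')} pending cleanups: {backlog}")
        time.sleep(30)

A sustained climb in that count, rather than its absolute value, is what would correspond to the spike in the graphs.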

Revision history for this message
Simon Richardson (simonrichardson) wrote :

From the logs it looks like mongo struggled to get resources from the machine. It took 8.5 seconds to acquire a lock, and Juju really struggled to recover.

------

 Jul 6 12:51:32 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[16302]: [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after wiredTiger: 1010, at end: 1010 }
    Jul 6 12:51:39 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[16302]: [conn188253] command admin.system.users command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:155 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 8443463 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 8455ms
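
For reference, the 8.5 second figure comes from the timeAcquiringMicros field in the second line (8443463 µs ≈ 8.44 s); the saslStart command as a whole took 8455ms. Here is a rough sketch of pulling those lock waits out of syslog; the regex is tuned only to the excerpt above, not a general mongod log parser:

    # Rough sketch: report long lock acquisition waits from mongod syslog
    # lines like the two above. Assumes the same textual log format; the
    # one-second threshold is arbitrary.
    import re
    import sys

    WAIT_RE = re.compile(r"timeAcquiringMicros: \{ r: (\d+) \}")

    for line in sys.stdin:
        match = WAIT_RE.search(line)
        if match:
            wait_s = int(match.group(1)) / 1_000_000
            if wait_s > 1.0:
                print(f"{wait_s:.2f}s lock wait: {line.strip()[:120]}")

Piping the syslog excerpt above through this flags the conn188253 line at roughly 8.44 s of lock wait.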

Revision history for this message
Pen Gale (pengale) wrote :

This might be related to the work that we're doing on reducing txn-watcher sync errors. Triaging as Medium, but I will bring it up in the core team daily.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Revision history for this message
Pen Gale (pengale) wrote :

After some discussion about the incident report: this was originally a noisy neighbors issue, but mongo had a hard time recovering after the controllers were moved away from their neighbors.

Leaving triaged as Medium, as there is work to do in the long run to make Juju more robust in this situation. But the underlying cause was resource starvation, and there aren't simple/immediate fixes.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot