GetMeterStatus called too frequently
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Unassigned |
2.4 | Won't Fix | High | Unassigned |
2.5 | Fix Released | High | @les |
Bug Description
Looking at production Prometheus metrics, we can see that roughly every 5 minutes the controller sees a spike of calls to GetMeterStatus (around 500/s).
Auditing the code, it seems that the WatchMeterStatus watches 2 documents, the MeterStatus document for the individual application, *and* the MetricManager global document.
It does this because if you start failing to send metrics, you'll go into Amber alert after we have 3 failed metric sends. So we need to know if we are starting to fail sends.
However, the MetricManager global document *also* includes a "last successful upload" key, which means that every successful upload also updates that document, causing all agents to wake up and check whether their MeterStatus has changed.
Also confusing is that the 'last successful upload' document is stored as a global singleton, but it appears the "MetricsWorker" is run for every model. (Possibly on each controller.)
A simple fix might be to split out the 'last-successful upload' field into its own document, so that updating it does not notify the MeterStatus watchers.
We could potentially do things like encode the "upload is in amber state" to the db, so that a single failure doesn't wake everything up. But that is of much lower priority (and potentially affects correctness) than just splitting out the fields so we don't end up waking up on every successful send.
A different possibility would be to use a custom watcher that is internally backed by a DocWatcher but omits changes that don't affect the consecutiveerrors count. However, that is harder to actually implement.
Changed in juju:
status: Fix Committed → Fix Released
https://github.com/juju/juju/pull/9676