GetMeterStatus called too frequently

Bug #1811700 reported by John A Meinel
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Unassigned
2.4             Won't Fix     High        Unassigned
2.5             Fix Released  High        @les

Bug Description

Looking at production Prometheus metrics, we can see that roughly every 5 minutes the controller sees a spike of GetMeterStatus calls (500/s).

Auditing the code, it seems that WatchMeterStatus watches two documents: the MeterStatus document for the individual application *and* the global MetricManager document.

It does this because if you start failing to send metrics, you'll go into Amber alert after we have 3 failed metric sends. So we need to know if we are starting to fail sends.

However, the global MetricManager document *also* includes a "last successful upload" key. This means that every successful upload also updates that document, which wakes up all agents to check whether their MeterStatus has changed.

Also confusing is that the 'last successful upload' document is stored as a global singleton, but it appears the "MetricsWorker" is run for every model. (Possibly on each controller.)

A simple fix might be to split out the 'last-successful-upload' from the 'number of failed uploads', and only update the document if the number of failed attempts actually changes.

We could potentially do things like record an "upload is in amber state" flag in the db, so that a single failure doesn't wake everything up. But that is of much lower priority (and potentially affects correctness) than just splitting out the fields so we don't end up waking up on every successful send.

A different possibility would be to use a custom watcher that internally is backed by a DocWatcher but then omits changes that don't affect the consecutiveerrors count.

However, that is harder to actually implement.
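For comparison, the filtering idea can be sketched with plain channels. This is only an illustration under stated assumptions: `docChange` and `filterErrorChanges` are hypothetical names, the input channel stands in for whatever the underlying document watcher delivers, and a real watcher would also need lifecycle and error handling, which is part of why this route is harder.

```go
package main

import "fmt"

// docChange is a hypothetical event carrying the document's current
// consecutive error count each time the document is written.
type docChange struct {
	ConsecutiveErrors int
}

// filterErrorChanges forwards a notification for the first event and
// then only when ConsecutiveErrors differs from the last forwarded
// value. The returned channel is closed when in is closed.
func filterErrorChanges(in <-chan docChange) <-chan struct{} {
	out := make(chan struct{})
	go func() {
		defer close(out)
		first := true
		last := 0
		for c := range in {
			if !first && c.ConsecutiveErrors == last {
				continue // e.g. only the last-upload timestamp changed
			}
			first = false
			last = c.ConsecutiveErrors
			out <- struct{}{}
		}
	}()
	return out
}

func main() {
	in := make(chan docChange)
	out := filterErrorChanges(in)

	// Simulate a series of document writes, most of which leave the
	// error count unchanged.
	go func() {
		defer close(in)
		for _, n := range []int{0, 0, 0, 1, 1, 0} {
			in <- docChange{ConsecutiveErrors: n}
		}
	}()

	count := 0
	for range out {
		count++
	}
	fmt.Println("notifications forwarded:", count)
}
```

Unlike splitting the document, this approach still reads every change on the controller side; it only avoids forwarding the irrelevant ones to agents.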

Tags: metrics
Casey Marshall (cmars) wrote :
Anastasia (anastasia-macmood) wrote :

This has been forward ported to 2.6. I'll mark as Fix Committed.

But it will not be backported to 2.4. I will mark that as "Won't Fix".

Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.6-beta1
Changed in juju:
status: Fix Committed → Fix Released