CloudKitty bug in _fetch_metrics method (gnocchi.py)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloudkitty | Fix Committed | Undecided | Rafael Weingartner |
Bug Description
We discovered this problem when we started using "rate:mean" for some metrics, which caused CloudKitty to write negative (and some zero) values to InfluxDB. The problem seems to stem from a misunderstanding of how Gnocchi handles aggregation methods. So, how do aggregation methods in Gnocchi work (as far as I understand them)?
Rather than storing raw data points, Gnocchi aggregates them before storing them, according to the archive policies. This built-in behavior differs from most other time-series databases, which usually offer aggregation as an option and compute it (average, minimum, etc.) at query time.
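For illustration, an archive policy with these aggregation methods baked in could be created like this (the policy name, granularity, and timespan are made-up example values; the repeated -m flag is the Gnocchi CLI way of listing aggregation methods):

gnocchi archive-policy create ceph_account_policy \
    -d granularity:1h,timespan:30d \
    -m mean -m rate:mean

Every metric attached to this policy is then aggregated with "mean" and "rate:mean" at write time, and those are the only aggregations that can be read back directly.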
Why is that important?
When retrieving a measurement for a metric, we should do the following:
gnocchi measures show <metricID> --start <start date as YYYY-MM-DD>
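For example, a fuller (hypothetical) invocation could look like this; --aggregation selects one of the methods that the archive policy already stores:

gnocchi measures show <metricID> \
    --aggregation rate:mean \
    --granularity 3600 \
    --start 2020-01-06 \
    --stop 2020-01-07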
Additionally, Gnocchi can also compute new aggregations on the fly. This means that if, for instance, an aggregation method is not defined in the archive policy, but we would like to derive a new one (e.g. rate:mean) from already aggregated data, we can do the following:
gnocchi aggregates --resource-type ceph_account "(aggregate <not foreseen aggregation method> (metric <metric> <original aggregation method>))" "project_id=<project ID>"
One can use this, for instance, to calculate "rate:max" for a metric whose archive policy only defines "max" as the aggregation method.
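Such a query could look like the following sketch (radosgw.objects.size is only an example of a metric commonly attached to the ceph_account resource type):

gnocchi aggregates --resource-type ceph_account \
    "(aggregate rate:max (metric radosgw.objects.size max))" \
    "project_id=<project ID>"

Here "max" must already exist in the archive policy, while "rate:max" is computed on the fly from the stored "max" series.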
For some reason, CloudKitty was coded to always use the aggregates API to retrieve measurements. This does not make sense for metrics whose archive policy already defines the desired aggregation method. It also explains the negative values (and the many zeros) that we noticed in Grafana (which we use to display the data stored in InfluxDB). CloudKitty issues a request to Gnocchi similar to:
gnocchi aggregates --resource-type ceph_account "(aggregate rate:mean (metric <our metric name> rate:mean))" "project_id=<project ID>"
This means taking the rate:mean of a rate:mean measurement. The first rate already turns an (often monotonically increasing) series into small deltas; differentiating those deltas a second time produces values that fluctuate around zero, which is exactly the stream of zeros and negative values we observed.
Looking at the source code history, I noticed that until "Apr 17, 2018" (https:/ ) CloudKitty retrieved measurements with the equivalent of:
gnocchi measures show <metricID> --start <start date as YYYY-MM-DD>
Then, with commit "https:/ ", it was changed to use:
gnocchi aggregates --resource-type ceph_account "(aggregate <not foreseen aggregation method> (metric <metric> <original aggregation method>))" "project_id=<project ID>"
In my opinion, CloudKitty should use the "metric.get_measures" API whenever the desired aggregation method is already defined in the archive policy, and only use the aggregates API when a new aggregation has to be computed on the fly.
Let me show a concrete case:
# gnocchi measures show f3b87a0f-
+-----------+-------------+-------+
| timestamp | granularity | value |
+-----------+-------------+-------+
| 2020-01-  |             |       |
| 2020-01-  |             |       |
| 2020-01-  |             |       |
| 2020-01-  |             |       |
+-----------+-------------+-------+
# gnocchi aggregates --resource-type ceph_account "(aggregate rate:mean (metric <metric name> rate:mean))" "project_id=<project ID>"
+---------------+------+-----------+-------------+-------+
| group         | name | timestamp | granularity | value |
+---------------+------+-----------+-------------+-------+
| id: 7390a9b1- |      |           |             |       |
| id: 7390a9b1- |      |           |             |       |
| id: 7390a9b1- |      |           |             |       |
+---------------+------+-----------+-------------+-------+
description: updated
Changed in cloudkitty:
assignee: nobody → Rafael Weingartner (rafaelweingartner)
After fiddling with the code for quite some time, trying to find a way to change the `aggregates.fetch` call to `metric.get_measures`, I decided to take a different approach. The code (the Gnocchi processor in CloudKitty) is entirely designed around the request (input) and output of the `aggregates.fetch` Gnocchi API.
Therefore, I think we can still use it. What I did is the following: I created a new `extra_args` option called `metric_aggregation_method`. When this extra arg is configured, it overrides the configured `aggregation_method` for the metric expression. That is, `aggregation_method` is used for the aggregate operation in the aggregates API, and `metric_aggregation_method` is used to define the aggregation method of the metric we feed into that operation.
This allows us to use `median` as the `aggregation_method` and any other aggregation method we want as the `metric_aggregation_method`. Because of the way CloudKitty works (processing timeframes of 1 hour by default), each timeframe contains just one data point; thus the outer aggregation is harmless and we obtain the correct values through the aggregates API.
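A minimal sketch of what such a configuration could look like in CloudKitty's metrics.yml (the metric name, unit, and groupby fields are illustrative; metric_aggregation_method is the new extra arg proposed here):

metrics:
  radosgw.objects.size:
    unit: B
    groupby:
      - id
      - project_id
    extra_args:
      resource_type: ceph_account
      aggregation_method: median
      metric_aggregation_method: rate:mean

With this, the collector would build an operation such as "(aggregate median (metric radosgw.objects.size rate:mean))"; since each 1-hour collect period contains a single rate:mean data point, the outer median simply returns that point unchanged.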