CloudKitty bug in _fetch_metrics method (gnocchi.py)

Bug #1860476 reported by Rafael Weingartner
Affects: cloudkitty
Status: Fix Committed
Importance: Undecided
Assigned to: Rafael Weingartner

Bug Description

We discovered this problem when we started using "rate:mean" for some metrics, which was generating negative values (and many zeros) in InfluxDB via CloudKitty. This problem seems to be caused by a misunderstanding of how Gnocchi works concerning aggregation methods. So, how do aggregation methods in Gnocchi work (as far as I understand)?

Rather than storing raw data points, Gnocchi aggregates them before storing them (according to the archiving policies). This built-in feature is different from most other time-series databases, which usually support this mechanism as an option and compute aggregations (average, minimum, etc.) at query time.
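The ingest-time rollup described above can be modeled with a short Python sketch (the function name and the bucketing scheme are illustrative assumptions, not Gnocchi's actual code): points are bucketed by granularity, and only one aggregated value per bucket survives.

```python
from collections import defaultdict

def rollup(points, granularity=3600, aggregation="mean"):
    """Aggregate raw (epoch_seconds, value) points into fixed-size
    buckets, keeping one aggregated value per bucket -- roughly what
    Gnocchi does at ingest time, per its archive policy."""
    buckets = defaultdict(list)
    for ts, value in points:
        # each point falls into the bucket starting at a multiple of
        # the granularity
        buckets[ts // granularity * granularity].append(value)
    agg = {"mean": lambda v: sum(v) / len(v), "min": min, "max": max}[aggregation]
    return {bucket: agg(values) for bucket, values in sorted(buckets.items())}
```

Once the rollup has happened, later queries can no longer see the raw points, which is why the aggregation methods must be chosen up front in the archiving policy.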

Why is that important?

When retrieving a measurement for a metric, we should do the following:
gnocchi measures show <metricID> --start <start date as YYYY-MM-DD'T'HH:mm:SS> --stop <stop date as YYYY-MM-DD'T'HH:mm:SS> --granularity <granularity; according to the archiving policies, we can use 3600> --aggregation <one of the aggregation methods defined in the archiving policies; the default is "mean">

Additionally, Gnocchi can compute new aggregations on the fly. This means that if, for instance, we do not have an aggregation method defined in the archiving policy, but we would like to calculate a new one (e.g. rate:mean) from already aggregated data, we can do the following:
gnocchi aggregates --resource-type ceph_account "(aggregate <not foreseen aggregation method> (metric <metric> <original_aggregation_method>))" "project_id='<project_id_when_filtering_by_project>'" --groupby id --start <start date as YYYY-MM-DD'T'HH:mm:SS> --stop <stop date as YYYY-MM-DD'T'HH:mm:SS> --granularity <granularity; according to the archiving policies, we can use 3600>

One can use this, for instance, to calculate the "rate:max" from a metric that has been configured only with "max" as the aggregation method in the archiving policy.
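As a simplified model of such an on-the-fly reaggregation (my own sketch, not Gnocchi's implementation), a `rate:<agg>` operation can be thought of as taking the first difference of the stored series and then aggregating those differences:

```python
def rate(series):
    """First difference of consecutive (timestamp, value) aggregates --
    a simplified model of what a 'rate:...' operation derives from an
    already-stored series."""
    return [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(series, series[1:])]

def rate_max(series):
    """'rate:max' derived on the fly from a metric archived only with
    'max' (same simplified model)."""
    return max(value for _, value in rate(series))
```

The key point is that the inner aggregation method (`max` here) must be one that is actually stored by the archiving policy; the outer operation is then computed at query time.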

For some reason, CloudKitty was coded in a way that it always uses the aggregates API to retrieve measurements. This does not make sense for metrics whose archiving policy already defines the desired aggregation method. It also explains the negative values (and the many zeros) that we noticed in Grafana (which is used to display the data stored in InfluxDB). CloudKitty issues a request to Gnocchi similar to:
gnocchi aggregates --resource-type ceph_account "(aggregate rate:mean (metric <our metric name> rate:mean))" "project_id='<project_id>'" --groupby id --start <start time> --stop <stop time> --granularity 3600

This means taking the rate:mean of a rate:mean measurement.

Looking at the source code history, I noticed that until "Apr 17, 2018" (https://github.com/openstack/cloudkitty/blob/5035de30a8e5c7fe2a81bbf1e2c4270abaf06d12/cloudkitty/collector/gnocchi.py#L159), the code always used the method "metric.get_measures", which is equivalent to:
gnocchi measures show <metricID> --start <start date as YYYY-MM-DD'T'HH:mm:SS> --stop <stop date as YYYY-MM-DD'T'HH:mm:SS> --granularity <granularity; according to the archiving policies, we can use 3600> --aggregation <one of the aggregation methods defined in the archiving policies; the default is "mean">

Then, with commit "https://github.com/openstack/cloudkitty/commit/059a94039209653c0ef256a0f076d749381f6822", the call was changed to "aggregates.fetch", which is equivalent to:

gnocchi aggregates --resource-type ceph_account "(aggregate <not foreseen aggregation method> (metric <metric> <original_aggregation_method>))" "project_id='<project_id_when_filtering_by_project>'" --groupby id --start <start date as YYYY-MM-DD'T'HH:mm:SS> --stop <stop date as YYYY-MM-DD'T'HH:mm:SS> --granularity <granularity; according to the archiving policies, we can use 3600>

In my opinion, CloudKitty should use the "metric.get_measures" method, as we want to retrieve an already computed and archived measurement for a metric. If we want CloudKitty to work with on-the-fly aggregation methods, we should create a way to configure that in CloudKitty; currently, that is not possible. Furthermore, we can see at line 258 that the Gnocchi operation in CloudKitty uses the aggregation method configured in CloudKitty both as the aggregation method of the operation and as the aggregation method of the stored metric, which does not make sense in some cases.

Let me show a concrete case:
# gnocchi measures show f3b87a0f-41e0-4721-aaaa-23435651f --start 2020-01-17T19:00:00 --stop 2020-01-17T23:00:00 --granularity 3600 --aggregation "rate:mean"
+---------------------------+-------------+-------------------+
| timestamp                 | granularity | value             |
+---------------------------+-------------+-------------------+
| 2020-01-17T19:00:00+01:00 |      3600.0 | 295184.8333333333 |
| 2020-01-17T20:00:00+01:00 |      3600.0 | 494694.5833333333 |
| 2020-01-17T21:00:00+01:00 |      3600.0 | 494694.5833333333 |
| 2020-01-17T22:00:00+01:00 |      3600.0 |               0.0 |
+---------------------------+-------------+-------------------+

# gnocchi aggregates --resource-type ceph_account "(aggregate rate:mean (metric <metric name> rate:mean))" "project_id='7390a9b1d4be4d7dfg428107e4ff'" --groupby id --start 2020-01-17T19:00:00 --stop 2020-01-17T23:00:00 --granularity 3600
+------------------------------------------+------------+---------------------------+-------------+--------------------+
| group                                    | name       | timestamp                 | granularity | value              |
+------------------------------------------+------------+---------------------------+-------------+--------------------+
| id: 7390a9b1-d4be-4d75-b4bd-08ab8107e4ff | aggregated | 2020-01-17T20:00:00+00:00 |      3600.0 |                0.0 |
| id: 7390a9b1-d4be-4d75-b4bd-08ab8107e4ff | aggregated | 2020-01-17T21:00:00+00:00 |      3600.0 | -494694.5833333333 |
| id: 7390a9b1-d4be-4d75-b4bd-08ab8107e4ff | aggregated | 2020-01-17T22:00:00+00:00 |      3600.0 |                0.0 |
+------------------------------------------+------------+---------------------------+-------------+--------------------+
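A minimal Python sketch shows why the second command produces the zeros and the large negative value (the two outputs above use different timezones, so the rows do not line up one-to-one, but the pattern matches): applying a rate operation to a series that is already a rate differentiates it a second time.

```python
def rate(values):
    # the rate of a series is the difference between consecutive values
    return [b - a for a, b in zip(values, values[1:])]

# per-hour values returned by 'gnocchi measures show ... --aggregation rate:mean'
stored_rate_mean = [295184.8333333333, 494694.5833333333, 494694.5833333333, 0.0]

# '(aggregate rate:mean (metric ... rate:mean))' effectively differentiates the
# already-differentiated series once more: the flat stretch becomes 0.0 and the
# drop to zero becomes a large negative value
double_rate = rate(stored_rate_mean)
```
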

description: updated
Changed in cloudkitty:
assignee: nobody → Rafael Weingartner (rafaelweingartner)
Revision history for this message
Rafael Weingartner (rafaelweingartner) wrote :

After fiddling with the code for quite some time, trying to find a way to change the `aggregates.fetch` call to `metric.get_measures`, I decided to take a different approach. The code (the Gnocchi processor in CloudKitty) is all designed to work with the output and the request (input) of the `aggregates.fetch` Gnocchi API.

Therefore, I think that we can still use it. What I did is the following: I created a new `extra_args` entry called `metric_aggregation_method`. When this extra arg is configured, it overrides the configured `aggregation_method` for the metric side of the call. That is, the `aggregation_method` is used for the aggregation operation in the aggregates API, and the `metric_aggregation_method` is used to define the aggregation method of the metric that we feed into the aggregates API.

This would allow us to use the `aggregation_method` as `median`, and the `metric_aggregation_method` as any other aggregation method we want. Then, because of the way CloudKitty works (processing timeframes of 1 hour by default), we have just one datapoint per timeframe; thus, we can obtain the correct values through the aggregates API.
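In CloudKitty's metrics.yml this would look roughly as follows (a sketch of the proposal above; the metric name is just an example, and `metric_aggregation_method` is the extra arg proposed here, not necessarily the option name eventually merged upstream):

```yaml
metrics:
  radosgw.objects.size:                       # example metric name
    unit: B
    extra_args:
      resource_type: ceph_account
      aggregation_method: mean                # operation used in the aggregates API
      metric_aggregation_method: rate:mean    # proposed: aggregation of the stored metric
```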

summary: - CloudKitty using "aggregates.fetch" instead of "metric.get_measures"
+ CloudKitty bug in _fetch_metrics method (gnocchi.py)
Revision history for this message
Rafael Weingartner (rafaelweingartner) wrote :

After creating the fix, I checked master and there was a similar solution, which was introduced by https://github.com/openstack/cloudkitty/commit/8a0f80ad915e9dff5ab1ec244fdaae5682f6f195. Therefore, we can close the bug report.

Changed in cloudkitty:
status: New → Fix Committed