aodh uses deprecated gnocchi api to aggregate metrics and doesn't work properly

Bug #1946793 reported by Tristan Zhang
This bug affects 3 people
Affects               Status        Importance   Assigned to   Milestone
Aodh                  Fix Released  Undecided    James Page
Ubuntu Cloud Archive  Fix Released  Undecided    Unassigned
  Wallaby             Fix Released  High         Unassigned
  Xena                Fix Released  High         Unassigned
  Yoga                Fix Released  High         Seyeong Kim
aodh (Ubuntu)         Fix Released  High         James Page

Bug Description

[Impact]
Aodh uses an older gnocchi API. This causes issues when using the metric commands:

openstack metric measures aggregation
openstack metric aggregates

[Test Case]
1. Deploy an OpenStack env with telemetry and heat (a heat template can be found in the comments).
2. The heat template should be adjusted for the env from step 1:
- any OpenStack-specific variables
- a desired count of 2
3. openstack stack create test -t ./heat
4. Assume the stack id is 136bf93d-9dc9-4b3f-862d-6fdec1b6abf7
5. Access an instance and run a dd command to generate CPU load.
6. Test the openstack metric commands:

openstack metric measures aggregation --query 'server_group=136bf93d-9dc9-4b3f-862d-6fdec1b6abf7' --aggregation rate:mean --metric cpu --resource-type instance --fill null

openstack metric aggregates '(aggregate rate:mean (metric cpu mean))' 'server_group=136bf93d-9dc9-4b3f-862d-6fdec1b6abf7' --resource-type instance --granularity 300 --fill null

7. Then check the gnocchi (apache) log to see whether it calls v1/aggregation or v1/aggregates.
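For step 7, one way to tell the two endpoints apart is to count their occurrences in the access log. A small sketch (the log path in the comment is only an example; it varies by deployment):

```shell
# Count deprecated (/v1/aggregation/metric) vs dynamic (/v1/aggregates)
# aggregation calls in a gnocchi/apache access log.
count_aggregation_calls() {
    grep -oE '/v1/(aggregation/metric|aggregates)[^ "?]*' "$1" \
        | sort | uniq -c
}
# e.g.: count_aggregation_calls /var/log/apache2/gnocchi-api_access.log
```

After the fix, the deprecated endpoint should stop appearing in the counts.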

[Where problems could occur]
There could be a short outage while upgrading.
Retrieving metrics could be affected.

[Others]

Original Description below

In the gnocchi API docs, there are two API methods to aggregate metrics:

1. /v1/aggregation/metric?

See: https://gnocchi.osci.io/rest.html#aggregation-across-metrics-deprecated

This one is deprecated.

2. /v1/aggregates?

See: https://gnocchi.osci.io/rest.html#dynamic-aggregates

Aodh uses the first one to aggregate metrics, for example:

```
        if isinstance(start, datetime.datetime):
            start = start.isoformat()
        if isinstance(stop, datetime.datetime):
            stop = stop.isoformat()

        params = dict(start=start, stop=stop, aggregation=aggregation,
                      reaggregation=reaggregation, granularity=granularity,
                      needed_overlap=needed_overlap, groupby=groupby,
                      refresh=refresh, resample=resample, fill=fill)
        if query is None:
            for metric in metrics:
                self._ensure_metric_is_uuid(metric)
            params['metric'] = metrics
            measures = self._get("v1/aggregation/metric",
                                 params=params).json()
```

Aodh doesn't work properly in our production environment after upgrading to Ussuri.

When there is only one instance, aodh works properly and alarms are triggered when the load on the instance exceeds the threshold.

However, after the stack is scaled up and a second instance is created, the average cpu usage that the aodh evaluator gets from gnocchi is incorrect. The metric measures are sometimes negative.

I manually pulled metrics with the gnocchi commands.

The aggregation of metrics is correct with the command

```
openstack metric aggregates
```

It uses the new API in the backend.

The aggregation of metrics is incorrect with the command

```
openstack metric measures aggregation
```

It uses the deprecated API which aodh is using.
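A minimal standalone sketch (not gnocchi code; made-up numbers) of why the two endpoints disagree: averaging the cumulative counters across metrics and *then* taking the rate goes negative when a second instance joins mid-series, while taking the rate per metric and then aggregating does not.

```python
# Cumulative `cpu` counters of two instances; vm2 boots at the third
# interval, so its counter starts near zero.
vm1 = [100, 200, 300, 400, 500]
vm2 = [None, None, 10, 20, 30]

def mean(vals):
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals)

def rate(series):
    """Differences between successive measures of a cumulative counter."""
    return [b - a for a, b in zip(series, series[1:])]

# Deprecated order (v1/aggregation): average across metrics, then rate.
per_interval_mean = [mean(col) for col in zip(vm1, vm2)]
old_api = rate(per_interval_mean)   # [100.0, -45.0, 55.0, 55.0]
# The mean collapses when vm2's near-zero counter joins -> negative rate.

# Dynamic-aggregates order: rate per metric, then average across metrics.
rates = [rate([v for v in vm if v is not None]) for vm in (vm1, vm2)]
new_api = [mean(pair) for pair in zip(*(r[-2:] for r in rates))]
# [55.0, 55.0] -- never negative
```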

Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

Subscribing field-high

Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

I can confirm that metrics are showing as negative in the aodh evaluator.

In my case I used a heat template to create stacks https://paste.ubuntu.com/p/zQdtWRPMYd/.

Measures are positive and cumulative of cpu time usage. But after some time of aodh evaluating the metric, it turns into a negative number, and thus the threshold is incorrectly evaluated.

https://paste.ubuntu.com/p/b7QTfd6Y2c/

Revision history for this message
James Page (james-page) wrote :

Aodh is using the older deprecated API in Gnocchi *but* it is deprecated, not removed or expected to be broken.

Changed in aodh (Ubuntu):
status: New → Incomplete
Revision history for this message
James Page (james-page) wrote :

@Gustavo - could you please attach logs from both the gnocchi and aodh services to this bug report - they might cast a little more light on what's going wrong.

Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

Gnocchi and Ceilometer logs

Revision history for this message
Tristan Zhang (tzmtl) wrote :

@James I don't think there are any errors in the logs. The deprecated API works; it just doesn't give a proper aggregated metric.

You can try both of these APIs with the openstack commands I pasted above.

# openstack metric aggregates
# openstack metric measures aggregation

For only one running instance they give the same value, but for two instances the values differ, and the one from the old method is wrong.

Or you can just call the old and new gnocchi APIs with a curl command; you will see the difference.

Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

Sorry, forgot to add aodh logs

Changed in aodh (Ubuntu):
status: Incomplete → New
Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

Results from using both APIs. The old one returns negative metrics.

https://paste.ubuntu.com/p/gSQKv63K6Z/

Revision history for this message
James Page (james-page) wrote :

I've put up a proposed change to switch to the Dynamic Aggregates API:

  https://review.opendev.org/c/openstack/aodh/+/829870

I believe this will produce the desired behaviour and reproduce a lot of what worked with the cpu_util metric from earlier OpenStack releases.

tl;dr it passes unit tests - testing now to see how it works in a deployment.

Changed in aodh:
status: New → In Progress
assignee: nobody → James Page (james-page)
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in aodh (Ubuntu):
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh 14.0.0.0rc1

This issue was fixed in the openstack/aodh 14.0.0.0rc1 release candidate.

James Page (james-page)
Changed in aodh:
status: In Progress → Fix Released
Changed in aodh (Ubuntu):
status: Confirmed → In Progress
James Page (james-page)
Changed in aodh (Ubuntu):
assignee: nobody → James Page (james-page)
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package aodh - 1:14.0.0~rc1-0ubuntu1

---------------
aodh (1:14.0.0~rc1-0ubuntu1) jammy; urgency=medium

  * New upstream release candidate (LP: #1946793).
  * d/p/*: Refresh.

 -- James Page <email address hidden> Mon, 14 Mar 2022 09:09:40 +0000

Changed in aodh (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Recently this was merged into stable/xena as well.

Could you please backport it to our xena-staging?

Thanks.

tags: added: sts
Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

@Seyeong, it looks like we're fixing bug #1974682 in this as well. Can you update the xena debdiff changelog accordingly for both SRU bug #'s? Also the wallaby debdiff only includes one patch.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
Seyeong Kim (seyeongkim) wrote (last edit ):

@corey,

Thanks for the reminder; I mixed up wallaby and yoga. I re-uploaded the debdiffs for them and added the bug #s for both LPs.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

While building the verification env (reproduction), I ran into a mysql issue and am stuck there. I'll update once it is fixed (or a workaround is found).

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Tristan, or anyone else affected,

Accepted aodh into xena-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:xena-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-xena-needed to verification-xena-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-xena-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-xena-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Tristan, or anyone else affected,

Accepted aodh into wallaby-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:wallaby-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-wallaby-needed to verification-wallaby-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-wallaby-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-wallaby-needed
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
description: updated
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Can anyone help me verify this?

I've built an openstack env and created a stack (per the comment above),

and upgraded aodh in the aodh unit (juju), but after scaling up to more than 2 instances the results still differ.

This result is from scaling from 1 instance to 3 instances and then back to 1 instance:

Every 1.0s: openstack metric measures aggregation --query 'server_group=28dd6467-d453-480d-8436-791bff253451' --aggregation rate:mean --metric cpu --resource-type instance --fill null xtrusia-bastion: Wed Feb 1 04:16:02 2023

+---------------------------+-------------+-----------------+
| timestamp | granularity | value |
+---------------------------+-------------+-----------------+
| 2023-02-01T03:25:00+00:00 | 300.0 | -50000000.0 |
| 2023-02-01T03:30:00+00:00 | 300.0 | 10000000.0 |
| 2023-02-01T03:35:00+00:00 | 300.0 | 10000000.0 |
| 2023-02-01T03:40:00+00:00 | 300.0 | 276040000000.0 |
| 2023-02-01T03:45:00+00:00 | 300.0 | 23100000000.0 |
| 2023-02-01T03:50:00+00:00 | 300.0 | -380000000.0 |
| 2023-02-01T03:55:00+00:00 | 300.0 | 380000000.0 |
| 2023-02-01T04:00:00+00:00 | 300.0 | -1830000000.0 |
| 2023-02-01T04:05:00+00:00 | 300.0 | -297330000000.0 |
| 2023-02-01T04:10:00+00:00 | 300.0 | 0.0 |
+---------------------------+-------------+-----------------+

Every 1.0s: openstack metric aggregates '(aggregate rate:mean (metric cpu mean))' 'server_group=28dd6467-d453-480d-8436-791bff253451' --resource-type instance --granularity 300 --fill null xtrusia-bastion: Wed Feb 1 04:16:27 2023

+------------+---------------------------+-------------+-----------------+
| name | timestamp | granularity | value |
+------------+---------------------------+-------------+-----------------+
| aggregated | 2023-02-01T03:20:00+00:00 | 300.0 | 430000000.0 |
| aggregated | 2023-02-01T03:25:00+00:00 | 300.0 | 380000000.0 |
| aggregated | 2023-02-01T03:30:00+00:00 | 300.0 | 390000000.0 |
| aggregated | 2023-02-01T03:35:00+00:00 | 300.0 | 400000000.0 |
| aggregated | 2023-02-01T03:40:00+00:00 | 300.0 | 276440000000.0 |
| aggregated | 2023-02-01T03:45:00+00:00 | 300.0 | 299540000000.0 |
| aggregated | 2023-02-01T03:50:00+00:00 | 300.0 | 299160000000.0 |
| aggregated | 2023-02-01T03:55:00+00:00 | 300.0 | 299540000000.0 |
| aggregated | 2023-02-01T04:00:00+00:00 | 300.0 | -684670000000.0 |
| aggregated | 2023-02-01T04:05:00+00:00 | 300.0 | -490620000000.0 |
| aggregated | 2023-02-01T04:10:00+00:00 | 300.0 | 380000000.0 |
| aggregated | 2023-02-01T04:15:00+00:00 | 300.0 | 410000000.0 |
+------------+---------------------------+-------------+-----------------+

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

I checked the api log directly to verify this.

After upgrading, it doesn't call the deprecated api anymore. Please refer to the [Test Case] section for details.

ii aodh-api 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - API server
ii aodh-common 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - common files
ii aodh-evaluator 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - alarm evaluator
ii aodh-expirer 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - expirer
ii aodh-listener 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - listener
ii aodh-notifier 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - alarm notifier
ii python3-aodh 1:13.0.0-0ubuntu2~cloud1 all OpenStack Telemetry (Ceilometer) Alarming - Python 3 libraries

description: updated
tags: added: verification-xena-done
removed: verification-xena-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

same check as above

ii aodh-api 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - API server
ii aodh-common 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - common files
ii aodh-evaluator 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - alarm evaluator
ii aodh-expirer 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - expirer
ii aodh-listener 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - listener
ii aodh-notifier 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - alarm notifier
ii python3-aodh 1:12.0.0-0ubuntu1~cloud2 all OpenStack Telemetry (Ceilometer) Alarming - Python 3 libraries

tags: added: verification-wallaby-done
removed: verification-wallaby-needed
Seyeong Kim (seyeongkim)
description: updated
description: updated
Revision history for this message
Gustavo Sanchez (gustavosr98) wrote :

Hi @seyeongkim

These are some notes I took after applying the patch and trying to auto-scale VMs with heat. Template -> https://paste.ubuntu.com/p/zQdtWRPMYd/

"""
Things to consider when Autoscaling:
----- 1. Granularity comes tied to the metric. Check ceilometer + gnocchi configs for the metric.
----- 2. cpu_util deprecated. Using cpu metric instead considering the vcpu count of the flavor.

# The difference between successive measures
%CPU * vCPUs * Granularity * 10,000,000 = Δ cpu metric
1% * 1 * 300 s * 10,000,000 = 3,000,000,000 [Eg. m1.nano]
1% * 2 * 300 s * 10,000,000 = 6,000,000,000 [Eg. m1.small]

----- 3. Granularity (openstack configuration) < Cooldown (heat template)
Test A. granularity=300 < cooldown=600 -> Ok https://paste.ubuntu.com/p/mHvmBNq7KF/
Test B. granularity=300 > cooldown=300 -> Not ok https://paste.ubuntu.com/p/dKDtGYcdG8/
Test C. granularity=300 == cooldown=300 -> Not ok https://paste.ubuntu.com/p/pdc7rGFtMY/

What to do if you want a smaller granularity:
1. Change the ceilometer polling interval for that metric (cpu in this case)
2. Change the metric's archive-policy in gnocchi. This seemingly cannot be updated once it is initially set on the metric; I haven't found how to make an update take effect.

---------- Other notes on telemetry
-- Ceilometer
https://docs.openstack.org/ceilometer/latest/admin/telemetry-measurements.html
# Enable more metrics
juju config ceilometer enable-all-pollsters=true
juju config ceilometer-agent enable-all-pollsters=true

# Metrics granularity vs Ceilometer polling freq.
juju config ceilometer polling-interval=300
juju config ceilometer-agent polling-interval=300

# Check configs
juju ssh ceilometer/0 'sudo cat /etc/ceilometer/pipeline.yaml'
juju ssh ceilometer-agent/0 'sudo cat /etc/ceilometer/polling.yaml'

-- Gnocchi
https://gnocchi.osci.io/operating.html
# Resource / Metric
$ openstack metric resource show -c metrics --type instance $VM_UUID

# Metric / Measures
$ openstack metric measures show -r $VM_UUID cpu
$ openstack metric measures show $METRIC_UUID

# Archive policies
$ openstack metric archive-policy list

# Mapping Metric <-> Archive policy
# NOTE: Archive policy of a metric cannot be changed
$ openstack metric archive-policy-rule create <rule-name> --archive-policy-name <archive-policy-name>

-- Aodh
openstack alarm create \
  --name cpu_70_percent_1vcpu \
  --type gnocchi_resources_threshold \
  --description 'Instance CPU High' \
  --metric cpu \
  --threshold 210000000000 \
  --comparison-operator gt \
  --aggregation-method mean \
  --granularity 300 \
  --evaluation-periods 1 \
  --alarm-action 'log://' \
  --resource-type instance \
  --resource-id $INSTANCE_ID

openstack alarm-history show $ALARM_UUID
"""

Revision history for this message
Corey Bryant (corey.bryant) wrote : Update Released

The verification of the Stable Release Update for aodh has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package aodh - 1:13.0.0-0ubuntu2~cloud1
---------------

 aodh (1:13.0.0-0ubuntu2~cloud1) focal-xena; urgency=medium
 .
   * gnocchi: Use Dynamic Aggregates API (LP: #1946793, LP: #1974682)
     - d/p/0001-Bump-minimum-version-of-gnocchiclient-for-aggregats-.patch
     - d/p/0002-Ignore-Gnocchi-API-error-when-the-metric-is-not-yet-.patch
     - d/p/0003-gnocchi-Use-Dynamic-Aggregates-API.patch
     - d/control: Align min version of python3-gnocchiclient with patch above.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for aodh has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package aodh - 1:12.0.0-0ubuntu1~cloud2
---------------

 aodh (1:12.0.0-0ubuntu1~cloud2) focal-wallaby; urgency=medium
 .
   * gnocchi: Use Dynamic Aggregates API (LP: #1946793, LP: #1974682)
     - d/p/0001-Bump-minimum-version-of-gnocchiclient-for-aggregats-.patch
     - d/p/0002-Ignore-Gnocchi-API-error-when-the-metric-is-not-yet-.patch
     - d/p/0003-gnocchi-Use-Dynamic-Aggregates-API.patch
     - d/control: Align min version of python3-gnocchiclient with patch above.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@seyeongkim can you please provide an explanation as to why this sru is being re-opened?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@seyeongkim the patches in your debdiff seem to come from bug 1974682, and the other one has no bug associated. Shouldn't we be using 1974682 for this SRU?

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Ok, so it seems we neglected to SRU this to Yoga before X and W, hence it is re-opened.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

jammy sru has been in the unapproved queue [1] since 2023-01-19

[1] https://launchpad.net/ubuntu/jammy/+queue?queue_state=1&queue_text=

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh 13.1.0

This issue was fixed in the openstack/aodh 13.1.0 release.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This was already fix released for jammy-updates in 1:14.0.0-0ubuntu1.1 and yoga 1:14.0.0-0ubuntu1.1~cloud0.

tags: added: verification-yoga-needed
tags: added: verification-yoga-done
removed: verification-yoga-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh train-eol

This issue was fixed in the openstack/aodh train-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh ussuri-eol

This issue was fixed in the openstack/aodh ussuri-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh victoria-eom

This issue was fixed in the openstack/aodh victoria-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/aodh wallaby-eom

This issue was fixed in the openstack/aodh wallaby-eom release.
