partition coordinator cannot work correctly

Bug #1575530 reported by Liusheng on 2016-04-27
This bug affects 2 people
Affects      Status         Importance   Assigned to    Milestone
Aodh         Fix Released   High         Liusheng
Ceilometer   Won't Fix      High         ZhiQiang Fan

Bug Description

When the partition coordinator of the aodh-evaluator service is enabled and multiple aodh-evaluator services are started, the same alarm may be evaluated by different aodh-evaluator services at the same time. See the following log:

2016-04-27 15:59:29.989 104633 DEBUG aodh.coordination [-] Members of group: ['d33ee413-059f-460e-b6fb-7c99a3bc7af0', '8b2b0a72-6a3a-4bda-ba9d-0fc67d501692'] extract_my_subset /opt/stack/aodh/aodh/coordination.py:200
2016-04-27 15:59:29.991 104633 DEBUG aodh.coordination [-] My subset: [<aodh.storage.models.Alarm object at 0x7f8a0941a7d0>, <aodh.storage.models.Alarm object at 0x7f8a0937de50>, <aodh.storage.models.Alarm object at 0x7f8a093b7450>] extract_my_subset /opt/stack/aodh/aodh/coordination.py:204
2016-04-27 15:59:29.991 104633 INFO aodh.evaluator [-] initiating evaluation cycle on 3 alarms
2016-04-27 15:59:29.991 104633 DEBUG aodh.evaluator [-] evaluating alarm b8e92d6f-abc8-4d7f-a39e-d31a79ae3810 _evaluate_alarm /opt/stack/aodh/aodh/evaluator/__init__.py:224
2016-04-27 15:59:29.992 104633 DEBUG aodh.evaluator.threshold [-] query stats from 2016-04-27 07:57:29.992183 to 2016-04-27 07:59:29.992183 _bound_duration /opt/stack/aodh/aodh/evaluator/threshold.py:79
2016-04-27 15:59:29.992 104633 DEBUG aodh.evaluator.threshold [-] stats query [{'field': 'timestamp', 'value': '2016-04-27T07:59:29.992183', 'op': 'le'}, {'field': 'timestamp', 'value': '2016-04-27T07:57:29.992183', 'op': 'ge'}] _statistics /opt/stack/aodh/aodh/evaluator/threshold.py:114
2016-04-27 15:59:30.229 104632 DEBUG aodh.storage [-] looking for 'mysql+pymysql' driver in 'aodh.storage' get_connection_from_config /opt/stack/aodh/aodh/storage/__init__.py:64
2016-04-27 15:59:30.337 104632 DEBUG oslo_db.sqlalchemy.engines [-] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
2016-04-27 15:59:30.347 104632 DEBUG aodh.coordination [-] Members of group: ['d33ee413-059f-460e-b6fb-7c99a3bc7af0', '8b2b0a72-6a3a-4bda-ba9d-0fc67d501692'] extract_my_subset /opt/stack/aodh/aodh/coordination.py:200
2016-04-27 15:59:30.348 104632 DEBUG aodh.coordination [-] My subset: [<aodh.storage.models.Alarm object at 0x7f6da037e750>, <aodh.storage.models.Alarm object at 0x7f6da5feac90>, <aodh.storage.models.Alarm object at 0x7f6da031b3d0>] extract_my_subset /opt/stack/aodh/aodh/coordination.py:204
2016-04-27 15:59:30.351 104632 INFO aodh.evaluator [-] initiating evaluation cycle on 3 alarms
2016-04-27 15:59:30.351 104632 DEBUG aodh.evaluator [-] evaluating alarm b8e92d6f-abc8-4d7f-a39e-d31a79ae3810 _evaluate_alarm /opt/stack/aodh/aodh/evaluator/__init__.py:224
2016-04-27 15:59:30.351 104632 DEBUG aodh.evaluator.threshold [-] query stats from 2016-04-27 07:57:30.351661 to 2016-04-27 07:59:30.351661 _bound_duration /opt/stack/aodh/aodh/evaluator/threshold.py:79

The alarm b8e92d6f-abc8-4d7f-a39e-d31a79ae3810 is evaluated by two aodh-evaluator services at the same time. The root cause is that we use the list of alarm objects to extract each service's own subset: the partition coordinator uses the alarm object to calculate the hash value and select the node, but the same alarm is represented by a different alarm object in each aodh-evaluator service, so the services do not agree on which node owns it.
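
The effect described above can be illustrated with a minimal sketch. This is not the aodh implementation: the Alarm class, stable_hash() and owner() are hypothetical stand-ins, and a plain modulo over the member list replaces a real hash ring. It only shows why a key derived from the object itself is unstable (the default string form embeds a memory address, as in the "Alarm object at 0x..." entries in the log), while a key derived from the alarm ID is the same in every service.

```python
import hashlib
import struct


class Alarm(object):
    """Hypothetical stand-in for an alarm model; not the aodh class."""

    def __init__(self, alarm_id):
        self.alarm_id = alarm_id


def stable_hash(key):
    """Map the string form of key to a 32-bit integer using MD5."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0]


def owner(key, members):
    """Pick the member responsible for key (simple modulo, not a hash ring)."""
    ordered = sorted(members)
    return ordered[stable_hash(key) % len(ordered)]


# Member IDs taken from the log above.
members = ["d33ee413-059f-460e-b6fb-7c99a3bc7af0",
           "8b2b0a72-6a3a-4bda-ba9d-0fc67d501692"]

# Each evaluator loads its own copy of the same alarm from storage.
copy_a = Alarm("b8e92d6f-abc8-4d7f-a39e-d31a79ae3810")
copy_b = Alarm("b8e92d6f-abc8-4d7f-a39e-d31a79ae3810")

# The default string form embeds the memory address, so the two copies
# produce different keys and may be assigned to different members.
assert str(copy_a) != str(copy_b)

# Keying on the alarm ID gives every evaluator the same answer.
assert owner(copy_a.alarm_id, members) == owner(copy_b.alarm_id, members)
```

With object-based keys, each service can conclude that it owns the alarm; with ID-based keys, exactly one member is selected regardless of which service performs the calculation.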

Liusheng (liusheng) on 2016-04-27
Changed in aodh:
assignee: nobody → Liusheng (liusheng)
importance: Undecided → High
Changed in aodh:
status: New → In Progress
ZhiQiang Fan (aji-zqfan) wrote :

The Ceilometer polling agent has the same issue.

Changed in ceilometer:
assignee: nobody → ZhiQiang Fan (aji-zqfan)
status: New → In Progress
ZhiQiang Fan (aji-zqfan) on 2016-05-03
Changed in ceilometer:
importance: Undecided → High

Reviewed: https://review.openstack.org/310337
Committed: https://git.openstack.org/cgit/openstack/aodh/commit/?id=dd06bf9277774c56121be0b4878c8973f38e761d
Submitter: Jenkins
Branch: master

commit dd06bf9277774c56121be0b4878c8973f38e761d
Author: liusheng <email address hidden>
Date: Wed Apr 27 10:33:30 2016 +0800

    Fix and improve the partition coordinator

    * Fix the partition coordinator to distribute tasks properly.

    * Improve the partition coordination mechanism in retry logic, exception
      handling, and log messages, etc. Refer to the Ceilometer's changes:

    - Icf60381e30f3baf986cf9e008e133287765d9827
    - I6a48cf38b24a00a0db94d3dea0c6746b52526026
    - Ic0b6b62dace88e4e1ce7932024350bb211efb9ef
    - I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    - I2aed2241ded798464089b3eec5e1394422a45844

    Closes-Bug: #1575530
    Change-Id: I5729ae3080898e8a6d92889f8c520174dc371113

Changed in aodh:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/319303
Committed: https://git.openstack.org/cgit/openstack/aodh/commit/?id=966f7692abda7896e1389ee6c8e030f0e37dce0f
Submitter: Jenkins
Branch: stable/mitaka

commit 966f7692abda7896e1389ee6c8e030f0e37dce0f
Author: liusheng <email address hidden>
Date: Wed Apr 27 10:33:30 2016 +0800

    Fix and improve the partition coordinator

    * Fix the partition coordinator to distribute tasks properly.

    * Improve the partition coordination mechanism in retry logic, exception
      handling, and log messages, etc. Refer to the Ceilometer's changes:

    - Icf60381e30f3baf986cf9e008e133287765d9827
    - I6a48cf38b24a00a0db94d3dea0c6746b52526026
    - Ic0b6b62dace88e4e1ce7932024350bb211efb9ef
    - I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    - I2aed2241ded798464089b3eec5e1394422a45844

    Closes-Bug: #1575530
    Change-Id: I5729ae3080898e8a6d92889f8c520174dc371113
    (cherry picked from commit dd06bf9277774c56121be0b4878c8973f38e761d)

tags: added: in-stable-mitaka

This issue was fixed in the openstack/aodh 2.0.1 release.

Reviewed: https://review.openstack.org/326573
Committed: https://git.openstack.org/cgit/openstack/aodh/commit/?id=73c64955d560db2c97f81b381bc122a7697352c4
Submitter: Jenkins
Branch: stable/liberty

commit 73c64955d560db2c97f81b381bc122a7697352c4
Author: liusheng <email address hidden>
Date: Wed Apr 27 10:33:30 2016 +0800

    Fix and improve the partition coordinator

    * Fix the partition coordinator to distribute tasks properly.

    * Improve the partition coordination mechanism in retry logic, exception
      handling, and log messages, etc. Refer to the Ceilometer's changes:

    - Icf60381e30f3baf986cf9e008e133287765d9827
    - I6a48cf38b24a00a0db94d3dea0c6746b52526026
    - Ic0b6b62dace88e4e1ce7932024350bb211efb9ef
    - I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    - I2aed2241ded798464089b3eec5e1394422a45844

    Closes-Bug: #1575530
    Change-Id: I5729ae3080898e8a6d92889f8c520174dc371113
    (cherry picked from commit dd06bf9277774c56121be0b4878c8973f38e761d)
    (cherry picked from commit 966f7692abda7896e1389ee6c8e030f0e37dce0f)

tags: added: in-stable-liberty

This issue was fixed in the openstack/aodh 3.0.0.0b2 development milestone.

This issue was fixed in the openstack/aodh 1.1.3 release.

gordon chung (chungg) wrote :

Alarms in Ceilometer have been deprecated for a while.

Changed in ceilometer:
status: In Progress → Won't Fix