evaluation periods effectively ignored for threshold alarm

Bug #1380216 reported by Mike Spreitzer
This bug affects 3 people
Affects     Status        Importance  Assigned to   Milestone
Ceilometer  Fix Released  Undecided   ZhiQiang Fan
Juno        Fix Released  Undecided   Unassigned

Bug Description

In the file ceilometer/alarm/evaluator/threshold.py, in the class ThresholdEvaluator, consider this method:

    def _transition(self, alarm, statistics, compared):
        """Transition alarm state if necessary.

           The transition rules are currently hardcoded as:

           - transitioning from a known state requires an unequivocal
             set of datapoints

           - transitioning from unknown is on the basis of the most
             recent datapoint if equivocal

           Ultimately this will be policy-driven.
        """

and the _sufficient method:

    def _sufficient(self, alarm, statistics):
        """Check for the sufficiency of the data for evaluation.

        Ensure there is sufficient data for evaluation, transitioning to
        unknown otherwise.
        """
        sufficient = len(statistics) >= self.quorum
        ...

Note that self.quorum==1, regardless of evaluation_periods.

The current hard-wired policy effectively ignores the evaluation_periods parameter of the alarm.
Every alarm starts in the unknown state, so the first time there are any statistics at all available,
_sufficient() will return true and _transition will set the state based on how that first statistic
compares to the threshold.
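
The behaviour described above can be reproduced with a minimal, self-contained sketch. This is not the Ceilometer source: the function names `sufficient`, `next_state` and `evaluate`, the state strings, and the strict `>` comparison are assumptions made for illustration, and the transition rules are a simplified reading of the docstring quoted earlier.

```python
# Simplified model of the reported behaviour (illustrative names, not Ceilometer's API).
UNKNOWN, OK, ALARM = "insufficient data", "ok", "alarm"

QUORUM = 1  # fixed value, as described in this report


def sufficient(statistics, quorum=QUORUM):
    """Return True when there are at least `quorum` datapoints."""
    return len(statistics) >= quorum


def next_state(current_state, compared):
    """Apply the hardcoded rules from the docstring.

    `compared` holds one boolean per datapoint, True where the datapoint
    breaches the threshold.
    """
    if all(compared):
        return ALARM              # unequivocal: every datapoint breaches
    if not any(compared):
        return OK                 # unequivocal: no datapoint breaches
    if current_state == UNKNOWN:
        # equivocal data, unknown state: follow the most recent datapoint
        return ALARM if compared[-1] else OK
    return current_state          # equivocal data, known state: no change


def evaluate(current_state, values, threshold, evaluation_periods):
    # evaluation_periods is accepted here but never consulted by the
    # sufficiency check, which is the problem this report describes.
    if not sufficient(values):
        return UNKNOWN
    compared = [v > threshold for v in values]
    return next_state(current_state, compared)


# One datapoint is enough to leave the unknown state, even though the
# alarm asks for three evaluation periods.
print(evaluate(UNKNOWN, [90.0], threshold=80.0, evaluation_periods=3))  # alarm
```

With a single datapoint the list of comparisons is trivially unequivocal, so the alarm fires on one sample regardless of `evaluation_periods`.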

Dina Belova (dbelova) wrote :

Actually, I don't understand how evaluation_periods connects with quorum. Evaluation periods is the number of historical periods over which the threshold is evaluated (this is the evaluation window); quorum is the minimum number of datapoints within that sliding window needed to avoid the unknown state. The hardcoded quorum is the only problem here, I suppose.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ceilometer (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/127909

Mike Spreitzer (mike-spreitzer) wrote :

Dina, the connection is this: the logic in _transition will set the alarm's state to "alarmed" as soon as there are "quorum" data points and the last is alarming. That logic I just outlined pays no attention to "evaluation periods".

Phil Neal (nealph) wrote :

Mike, I think if you consider the context of the _sufficient method, which evaluates the number of samples within the result set of _bound_duration, it follows that the quorum setting is applied only against the set that is within the evaluation period.

That is: given a sample set within the bounds of x evaluation periods, determine whether the number of samples meets the criteria of quorum = y, and if so proceed with evaluation.

Mike Spreitzer (mike-spreitzer) wrote :

Yes, Phil, that is the problem. Currently we have fixed quorum=1, so the _transition method will be called as soon as _statistics(..) returns any data at all. The first time this happens, the alarm is in the unknown state before _transition is called, so _transition decides to set the alarm state to something definite --- based on exactly 1 datum from _statistics(..). Note that the outline I just gave pays no attention to the evaluation_periods setting.

ZhiQiang Fan (aji-zqfan)
Changed in ceilometer:
assignee: nobody → ZhiQiang Fan (aji-zqfan)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/132146

Changed in ceilometer:
status: New → In Progress
tags: added: juno-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/132146
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=553d8d96e60cf354406568ed7dd4c563e768e4d0
Submitter: Jenkins
Branch: master

commit 553d8d96e60cf354406568ed7dd4c563e768e4d0
Author: ZhiQiang Fan <email address hidden>
Date: Fri Oct 31 03:33:34 2014 +0800

    Use alarm's evaluation periods in sufficient test

    Currently, we use a constant value, quorum=1, to check whether there are
    enough datapoints; however, this is not quite right for an alarm rule.

    Imagine evaluation periods is set to, for example, 3 for an alarm on an
    instance's cpu_util being greater than or equal to 80%. Here are the
    cases where the current code may not work as expected:

    1. when the system starts or an instance has just been created, we may
    only get one or two samples for the instance
    2. when the system is broken somewhere, or an instance is restarted
    (after being shut off), samples may fail to be collected for some time,
    so we only get one or two samples in that time range

    We want to avoid a spurious data peak. For example, if instance cpu_util
    is 50%, 50%, 50%, 90%, the alarm will not be triggered, but if instance
    cpu_util is None, None, None, 90%, the current code will decide that the
    alarm should be triggered, which is inconsistent and may confuse end
    users.

    This patch puts the alarm into the insufficient data state when there
    are fewer datapoints than evaluation periods.

    Change-Id: Ie64a537434493a5965c8e9e165cf028d57689da2
    Closes-Bug: #1380216
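
The rule this commit message describes can be sketched as follows. This is only an illustration under stated assumptions, not the merged patch: the function `sufficient`, its signature, and the use of `None` to stand for a period with no sample are hypothetical.

```python
def sufficient(values, evaluation_periods):
    """Return True only when every evaluation period has a datapoint.

    `values` holds one entry per period; None marks a period for which
    no sample was collected (an assumption made for this sketch).
    """
    datapoints = [v for v in values if v is not None]
    return len(datapoints) >= evaluation_periods


# A single peak after missing samples no longer counts as enough data ...
print(sufficient([None, None, 90.0], evaluation_periods=3))  # False
# ... while a full window of datapoints does.
print(sufficient([50.0, 50.0, 90.0], evaluation_periods=3))  # True
```

Under this rule an alarm with fewer datapoints than evaluation periods would be placed in the insufficient data state rather than being evaluated against the threshold.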

Changed in ceilometer:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/150446

Thierry Carrez (ttx)
Changed in ceilometer:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: kilo-2 → 2015.1.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/juno)

Reviewed: https://review.openstack.org/150446
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=0639d0d62999f3d8d77d027ce612ebe2498cb1e3
Submitter: Jenkins
Branch: stable/juno

commit 0639d0d62999f3d8d77d027ce612ebe2498cb1e3
Author: ZhiQiang Fan <email address hidden>
Date: Fri Oct 31 03:33:34 2014 +0800

    Use alarm's evaluation periods in sufficient test

    Currently, we use a constant value, quorum=1, to check whether there are
    enough datapoints; however, this is not quite right for an alarm rule.

    Imagine evaluation periods is set to, for example, 3 for an alarm on an
    instance's cpu_util being greater than or equal to 80%. Here are the
    cases where the current code may not work as expected:

    1. when the system starts or an instance has just been created, we may
    only get one or two samples for the instance
    2. when the system is broken somewhere, or an instance is restarted
    (after being shut off), samples may fail to be collected for some time,
    so we only get one or two samples in that time range

    We want to avoid a spurious data peak. For example, if instance cpu_util
    is 50%, 50%, 50%, 90%, the alarm will not be triggered, but if instance
    cpu_util is None, None, None, 90%, the current code will decide that the
    alarm should be triggered, which is inconsistent and may confuse end
    users.

    This patch puts the alarm into the insufficient data state when there
    are fewer datapoints than evaluation periods.

    Conflicts:
            ceilometer/alarm/evaluator/threshold.py

    NOTE(mriedem): The conflict is due to the oslo.i18n imports on master
    and oslo.i18n wasn't used in stable/juno so the _LW usage is removed.

    Change-Id: Ie64a537434493a5965c8e9e165cf028d57689da2
    Closes-Bug: #1380216
    (cherry picked from commit 553d8d96e60cf354406568ed7dd4c563e768e4d0)

tags: added: in-stable-juno