Event alarms do not take effect immediately

Bug #1651273 reported by Zane Bitter
This bug affects 2 people
Affects: Aodh
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

When a new event alarm is created, it may take up to event_alarm_cache_ttl (by default, 60) seconds for the tenant's alarm cache in aodh-evaluator to expire and the new alarm to be loaded from the DB. In the meantime, any incoming oslo_messaging notifications that would have matched the alarm do not trigger it. Creating or updating an event alarm in the Aodh API should invalidate the cache for the alarm's tenant in aodh-evaluator.
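
To make the stale window concrete, here is a minimal sketch of a per-project TTL cache. It is a toy model and not Aodh's actual code: the class name `ProjectAlarmCache`, the in-memory `db` dict and the fake clock are all assumptions made for illustration. Only the 60-second TTL comes from the default of event_alarm_cache_ttl mentioned above.

```python
import time


class ProjectAlarmCache:
    """Toy per-project alarm cache with a TTL (hypothetical, not Aodh code)."""

    def __init__(self, load_alarms, ttl=60.0, clock=time.monotonic):
        self._load = load_alarms   # callable: project_id -> iterable of alarms
        self._ttl = ttl            # stands in for event_alarm_cache_ttl
        self._clock = clock
        self._entries = {}         # project_id -> (loaded_at, alarms)

    def get(self, project_id):
        now = self._clock()
        entry = self._entries.get(project_id)
        # Reload from the "DB" only if nothing is cached or the entry expired.
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, list(self._load(project_id)))
            self._entries[project_id] = entry
        return entry[1]


# Simulated database and clock, so the example runs deterministically.
db = {"proj-a": []}
now = [0.0]
cache = ProjectAlarmCache(lambda p: db.get(p, []), ttl=60.0,
                          clock=lambda: now[0])

first = cache.get("proj-a")      # first event populates the cache (no alarms)
db["proj-a"].append("alarm-1")   # alarm is created through the API
now[0] = 30.0
stale = cache.get("proj-a")      # event inside the TTL: new alarm is invisible
now[0] = 61.0
fresh = cache.get("proj-a")      # after expiry the alarm is finally loaded

print(first, stale, fresh)
```

In this model, any event arriving between the alarm's creation and the cache expiry is evaluated against the old alarm list, which is the behaviour described in this report.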

Ideally there would be a hard guarantee that the cache has been invalidated by the time the alarm create call returns to the client, to prevent race conditions entirely. However, IIUC there is no RPC API between aodh-api and aodh-evaluator over which to do this, so a fairly intrusive architecture change would be required.

A partial solution might be for aodh-evaluator to listen for the aodh-api notifications about the alarm state changes, and respond by invalidating its cache before further processing. However, it's less clear that this is guaranteed to eliminate race conditions, even with only a single aodh-evaluator process. When scaling out aodh-evaluator, this solution would only work if the messages are sharded across the different evaluators by tenant, which is probably not the case.
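
A rough sketch of that partial solution, under stated assumptions: the cache gains an `invalidate` method, and a hypothetical handler `on_alarm_notification` drops the cached entry for the project named in the notification payload. The handler name, the `project_id` payload key and the cache class are illustrative only and are not taken from Aodh or oslo_messaging.

```python
import time


class InvalidatingAlarmCache:
    """Toy per-project TTL cache with explicit invalidation (hypothetical)."""

    def __init__(self, load_alarms, ttl=60.0, clock=time.monotonic):
        self._load = load_alarms
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # project_id -> (loaded_at, alarms)

    def get(self, project_id):
        now = self._clock()
        entry = self._entries.get(project_id)
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, list(self._load(project_id)))
            self._entries[project_id] = entry
        return entry[1]

    def invalidate(self, project_id):
        # Forget the cached alarms so the next get() reloads them.
        self._entries.pop(project_id, None)


def on_alarm_notification(cache, payload):
    """Hypothetical handler for an alarm create/update notification.

    The 'project_id' key is an assumed payload field.
    """
    project_id = payload.get("project_id")
    if project_id is not None:
        cache.invalidate(project_id)


db = {"proj-a": []}
cache = InvalidatingAlarmCache(lambda p: db.get(p, []), clock=lambda: 0.0)

before = cache.get("proj-a")                          # cache populated, empty
db["proj-a"].append("alarm-1")                        # alarm created via the API
on_alarm_notification(cache, {"project_id": "proj-a"})
after = cache.get("proj-a")                           # reloaded despite the TTL

print(before, after)
```

This only refreshes the cache in the process that actually receives the notification, and an event that is processed before the notification arrives still sees the stale list, so the race is narrowed rather than removed.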

Changed in aodh:
assignee: nobody → Vishakha Agarwal (vishakha.agarwal)
Vishakha Agarwal (vishakha.agarwal) wrote :

Hi Zane,

Are you seeing this problem on master too? I am not able to reproduce the issue; could you kindly help?

Vishakha Agarwal (vishakha.agarwal) wrote :

Also, event alarms are handled by aodh-listener, not aodh-evaluator.

Zane Bitter (zaneb) wrote :

I haven't tried recently, but the cache still exists (in aodh-evaluator):

http://git.openstack.org/cgit/openstack/aodh/tree/aodh/evaluator/event.py#n185

and it is only cleared after self.conf.event_alarm_cache_ttl seconds. Given that this is the case, there must be cases where the cache is stale.

Note that to reproduce you must first populate the cache: e.g. generate an event for a project, then add an alarm, then generate another event less than 60s later. The second event won't trigger the alarm even if it matches.

Changed in aodh:
assignee: Vishakha Agarwal (vishakha.agarwal) → nobody