periodic tasks should avoid synchronized execution

Bug #1326020 reported by Paul Murray
This bug affects 1 person
Affects: oslo-incubator
Status: Fix Released
Importance: Medium
Assigned to: Tom Cammann

Bug Description

Periodic tasks run at defined intervals measured from the start time of the last execution. The run_periodic_tasks method loops through the list of registered tasks, determines whether each has passed its spacing interval since its last execution, and, if so, records the current time as the task's last run before executing it. This is shown in the following code:

    def run_periodic_tasks(self, context, raise_on_error=False):
        """Tasks to be run at a periodic interval."""
        idle_for = DEFAULT_INTERVAL
        for task_name, task in self._periodic_tasks:
            full_task_name = '.'.join([self.__class__.__name__, task_name])

            now = timeutils.utcnow()
            spacing = self._periodic_spacing[task_name]
            last_run = self._periodic_last_run[task_name]

            # If a periodic task is _nearly_ due, then we'll run it early
            if spacing is not None and last_run is not None:
                due = last_run + datetime.timedelta(seconds=spacing)
                if not timeutils.is_soon(due, 0.2):
                    idle_for = min(idle_for, timeutils.delta_seconds(now, due))
                    continue

            if spacing is not None:
                idle_for = min(idle_for, spacing)

            self._periodic_last_run[task_name] = timeutils.utcnow()

            # ... code to execute task here ...

        return idle_for

If a periodic task is blocked for a period of time and then becomes unblocked, the rest of the tasks in the loop will run at, or close to, the time that it became unblocked.

If the cause of the blockage also affects other nodes, periodic tasks at all nodes will become synchronized.

An example of this behavior has been observed when the nova database has had blocked transactions for a period of time due to an error. Many periodic tasks access the database and so many became blocked. When the error in the database was cleared, the periodic tasks across all Nova compute managers executed at the same time and became synchronized from then on.
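
The effect can be illustrated with a toy model. This is only a sketch, not the oslo code: the function simulate and its parameters are hypothetical, times are plain seconds, and it assumes that a task falling due while the loop is stuck starts the moment the blockage clears and is stamped with that time.

```python
def simulate(last_runs, spacing, blocked_from, blocked_until):
    """Return each node's next due time after a shared blockage.

    last_runs holds, per node, the time (in seconds) of the task's last run
    before the blockage. Hypothetical helper for illustration only.
    """
    next_due = []
    for last_run in last_runs:
        due = last_run + spacing
        # A run that falls due during the blockage cannot start until it clears.
        started = blocked_until if blocked_from <= due < blocked_until else due
        # Current behaviour: last run is stamped with the actual start time.
        next_due.append(started + spacing)
    return next_due


# Three nodes whose tasks were originally staggered by 20 seconds.
staggered = [0.0, 20.0, 40.0]
after_blockage = simulate(staggered, spacing=60.0,
                          blocked_from=50.0, blocked_until=200.0)
print(after_blockage)  # [260.0, 260.0, 260.0]
```

All three nodes end up due at the same instant, and because each later run is again stamped relative to that shared start, the original stagger never comes back.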

Two changes can avoid or lessen this behavior. Both are close to standard practice in distributed systems:

1. Instead of setting the value of _periodic_last_run[task_name] to the current time, it could be incremented by an appropriate multiple of the spacing interval. This would ensure the task runs at a regular interval instead of tending toward a synchronized one.

2. Add some jitter to the value of _periodic_last_run[task_name] to cause synchronized tasks to spread out. This option would tend to be cancelled out by the code that checks spacing, as it will run tasks that are "_nearly_ due". It might be useful to consider dropping that "_nearly_ due" code.
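
A minimal sketch of the two suggestions, assuming datetime values as in the code above. The helper names advance_last_run and add_jitter, and the symmetric jitter range, are assumptions made for illustration and are not part of oslo.

```python
import datetime
import random


def advance_last_run(last_run, spacing, now):
    """Option 1: move last_run forward by whole spacing intervals, never past now."""
    elapsed = (now - last_run).total_seconds()
    intervals = int(elapsed // spacing)
    return last_run + datetime.timedelta(seconds=intervals * spacing)


def add_jitter(last_run, max_jitter, rng=random):
    """Option 2: shift last_run by a random offset within +/- max_jitter seconds."""
    offset = rng.uniform(-max_jitter, max_jitter)
    return last_run + datetime.timedelta(seconds=offset)


# A task with 60 second spacing last ran at 12:00:00 and is unblocked at 12:03:25.
last_run = datetime.datetime(2014, 6, 12, 12, 0, 0)
now = last_run + datetime.timedelta(seconds=205)
print(advance_last_run(last_run, 60, now))  # 2014-06-12 12:03:00
```

With option 1 the recorded time stays on the task's original 60 second grid, so the next run falls due at 12:04:00 rather than 60 seconds after the unblock time.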

Ben Nemec (bnemec) wrote :

Yeah, I feel like that nearly due stuff was intended to address some very specific case, and I wouldn't be sorry to see it go. That said, we've caused plenty of angst in the past by removing odd behaviors like that so we'll have to be a little careful if we do decide we want to remove it. :-)

Changed in oslo:
status: New → Triaged
importance: Undecided → Medium
Paul Murray (pmurray)
Changed in oslo:
assignee: nobody → Paul Murray (pmurray)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/99695

Changed in oslo:
assignee: Paul Murray (pmurray) → Tom Cammann (tom-cammann)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo-incubator (master)

Reviewed: https://review.openstack.org/99695
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=4dbd3aa4699d32d5b97647a5e4ed8829436c6237
Submitter: Jenkins
Branch: master

commit 4dbd3aa4699d32d5b97647a5e4ed8829436c6237
Author: Tom Cammann <email address hidden>
Date: Thu Jun 12 15:49:41 2014 +0100

    Make periodic tasks run on regular spacing interval

    Instead of setting the value of _periodic_last_run[task_name] to be the
    current time it is set as the nearest multiple of the spacing interval
    which is in the past with added jitter. This will ensure the task runs
    regularly but avoids synchronizing that interval with other nodes.

    This patch also removes the coalescing of tasks which are _nearby_ (0.2
    seconds away).

    Change-Id: I118ce960c2a7f53130fb4240de3e1574b7e06e63
    Closes-Bug: #1326020
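
The rule described in this commit message could look roughly like the sketch below. It is not the merged patch: the function name regular_last_run, the use of plain float seconds, and the jitter range of 0 to max_jitter are all assumptions made for illustration.

```python
import random


def regular_last_run(last_run, spacing, now, max_jitter=0.0, rng=random):
    """Latest point on the grid last_run + k * spacing not after now, plus jitter."""
    intervals = (now - last_run) // spacing
    on_grid = last_run + intervals * spacing
    # Assumed jitter policy: a random offset of up to max_jitter seconds.
    return on_grid + rng.uniform(0.0, max_jitter)


# Two nodes with different phases, both unblocked at t=200 with 60 second spacing.
for node_last_run in (0.0, 25.0):
    print(regular_last_run(node_last_run, spacing=60.0, now=200.0))
# 180.0
# 145.0
```

Because each node keeps its own phase, the two nodes fall due again at 240 and 205 seconds instead of both at 260, and a non-zero max_jitter would further separate nodes that happen to share a phase.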

Changed in oslo:
status: In Progress → Fix Committed
Matt Riedemann (mriedem) wrote :

@Tom, were you going to sync this over to nova and the other projects now (thinking at least Cinder and Neutron also)?

Tom Cammann (tom-cammann) wrote :

@Matt Yes, I've just made one for nova and will create one for Cinder and Neutron now.

Tom Cammann (tom-cammann) wrote :
Changed in oslo:
milestone: none → juno-2
status: Fix Committed → Fix Released