Scheduler won't reschedule a task if it takes too long

Bug #677257 reported by Don Kirkby
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Odoo Server (MOVED TO GITHUB)
Fix Released
Low
OpenERP's Framework R&D
5.0
Fix Released
Low
Jay Vora (Serpent Consulting Services)

Bug Description

We ran into this problem because we were running the mrp scheduler every two minutes and it started to take longer than a minute to run. Suddenly, it would just stop being scheduled.

It looks like this is what happens in the ir_cron._poolJobs() method:

1. Get the current time and hold it in the "now" variable.
2. Find all active jobs whose next call time has passed.
3. Run each job.
4. Increment the next call time by the job's interval until it passes the "now" variable which may be a few minutes in the past if the jobs took a while to execute.
5. Update the job's next call time in the database.
6. Find all active jobs whose next call time is in the future and schedule the first one.

If a job ends up getting scheduled for a time after the "now" variable but before step 6 executes then it will no longer be executed.

Here's a scenario where that could happen. The mrp scheduler is scheduled to run every five minutes and it takes two minutes to run.

10:00 mrp starts
10:02 mrp finishes, scheduled for 10:05
10:03 admin shuts down the server for maintenance
10:09 admin starts up the server and connects to database. mrp starts
10:11 mrp finishes, scheduled for 10:10 (10:05 plus 5 minutes, it's after 10:09). mrp is no longer in the list of tasks in memory.

Once this happens, I think there is a two-in-five chance that the mrp task will run once at start up and not be scheduled after that.

Why does step six above need to check that the next time is in the future? Shouldn't it just schedule the minimum time for all active tasks?

Tags: maintenance

Related branches

Revision history for this message
Don Kirkby (donkirkby) wrote :

We're using version OpenERP 5.0.15 on Ubuntu 10.04.

Revision history for this message
Don Kirkby (donkirkby) wrote :

I created a merge proposal that stops checking that the next time for a job is in the future. I asked Christophe to review it, since he modified some of the ir_cron code back in September.

Changed in openobject-server:
assignee: nobody → OpenERP's Framework R&D (openerp-dev-framework)
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Olivier Dony (Odoo) (odo-openerp) wrote :

Hello Don,

Your patch has been merged in trunk in revision 3147 <email address hidden>.
I am re-assigning the backport in 5.0 branch to the maintenance team.

Thanks a lot!

Changed in openobject-server:
milestone: none → 6.0-rc2
status: Confirmed → Fix Released
tags: added: maintenance
Revision history for this message
Jay Vora (Serpent Consulting Services) (jayvora) wrote :

Thank you Don.
Your branch has been merged into stable by revision 2168 <email address hidden>.

Revision history for this message
Claire (cjin) wrote :

just wondering whether Openerp can reschedule if there is any change or updated sequence for manufacturing orders?

Revision history for this message
Don Kirkby (donkirkby) wrote :

Hi Claire,
The last time I looked at the scheduler code, tasks could only be triggered by the clock, not by events. Some entities like production orders, sales orders, and invoices can trigger code from events using the workflow, but products and bills of materials don't have workflow.

This isn't really a good place for asking questions like that. You'd be better to ask them at http://www.openerp.com/forum/ or http://stackoverflow.com/

Revision history for this message
Ghislain Nebra (INCB) (ghislain-nebra-3) wrote :

I would also add that in _poolJos function, sql requests are using "now()" of Postgre whereas in the if statement, the "DateTime.now()" of Python is used.
Only one "now" origin should be used : the Python one.

Here is my suggestion :
cr.execute("select * from ir_cron where numbercall<>0 and active and nextcall<=now() order by priority")
should be
cr.execute("select * from ir_cron where numbercall<>0 and active and nextcall<='" + now.strftime('%Y-%m-%d %H:%M:%S') + "' order by priority")

This is very important if your PostgreSQL database is not on the same computer than the OpenERP server because a small difference in clock could lead to non-working cron jobs.

Moreover in this function, the "while nextcall < now and numbercall:" should be "while nextcall <= now and numbercall:" because if the cron job is planned at 1:00, the timer will wake up at 1:00 ... and you want your cron job to be executed !

Revision history for this message
Olivier Dony (Odoo) (odo-openerp) wrote : Re: [Bug 677257] Re: Scheduler won't reschedule a task if it takes too long
Download full text (3.4 KiB)

On 12/19/2011 10:25 AM, Ghislain Nebra (INCB) wrote:
> I would also add that in _poolJos function, sql requests are using
> "now()" of Postgre whereas in the if statement, the "DateTime.now()"
> of Python is used. Only one "now" origin should be used : the Python
> one.
>
> Here is my suggestion :
> cr.execute("select * from ir_cron where numbercall<>0 and active
> and nextcall<=now() order by priority")
> should be
> cr.execute("select * from ir_cron where numbercall<>0 and active
> and nextcall<='" + now.strftime('%Y-%m-%d %H:%M:%S')
> + "' order by priority")
>
> This is very important if your PostgreSQL database is not on the same
> computer than the OpenERP server because a small difference in clock
> could lead to non-working cron jobs.

I agree that it would be more consistent to use the same reference time
everywhere in the code. Practically however the time offset should be
minimal because as of 6.1 we have switched the server to use UTC
everywhere, and we're storing only UTC timestamps in the database.
So the only difference here would be a clock sync difference between the
database machine and the OpenERP machine, as you say.

As of 6.1 the cron mechanism is multi-threaded and supports
load-balancing on multiple OpenERP servers talking to the same database.
In such environments it will be important to have all machines properly
time-sync'ed, e.g. with NTP, because there's a limit to what magic the
system can do to avoid sync issues.

In a load-balancing configuration there is always a way for things to go
wrong if all the servers are not properly sync'ed. Imagine a 1 hour
offset between the machines (while all of them think they're all at
UTC!): you'll get different results depending on which machine runs the
job vs. which was used to configure its execution time.

So if you can come up with a nice patch to improve this logic, feel free
to propose a merge for trunk (see the doc[1]), but keep in mind that
there's a limit to what we can do in case of bad time sync.
In time-critical deployment environments, setting up a proper clock sync
across all machines seems like a pre-requisite.

Note about the patch:
- you should pass the time value as a query parameter instead of
mangling the query string (bad practice!)
- you need to study the latest trunk code and make a patch against it,
such changes won't be accepted in a stable branch (this is a minor
improvement that can be replaced by proper config and clock sync)

> Moreover in this function, the "while nextcall < now and numbercall:"
> should be "while nextcall <= now and numbercall:" because if the cron
> job is planned at 1:00, the timer will wake up at 1:00 ... and you want
> your cron job to be executed !

Actually this will never be a problem because even if the job execution
was started exactly on the second where it was scheduled to run, the
value of datetime.now() includes precision up to the microsecond. It is
compared with the nextcall value from the database, which is rounded to
the second, so you'll get such a comparison:
 2011-12-19 10:33:07 < 2011-12-19 10:33:07.290877

Granted, changing the comparison to `<=` would not hurt (it's on...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.