Batch jobs intermittently fail to leave "=" queue when complete
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
at (Ubuntu) | Won't Fix | Undecided | Unassigned |
Bug Description
This is a bit complicated to describe, but I will do my best. I estimate that the bug manifests itself more than 0.1% of the time, but less than 0.5% of the time (so I suspect some kind of race condition). This is almost certainly an upstream bug.
So here is what I am doing:
1. I have a series of several thousand batch jobs to be run.
2. Once a minute, with a cron job, I check the number of lines of output from the "atq" command.
3. If the result of step #2 is 2 or greater, I do nothing.
4. If the result of step #2 is 1 or 0, I submit either 1 or 2 batch jobs, to bring the total up to 2. (This is so that I can insert additional jobs into the queue if I want to, without having several thousand other jobs in front of the new job).
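The steps above could be sketched roughly as follows. This is not the actual script, only an illustration under stated assumptions: `pending_commands` is a hypothetical list of shell command strings waiting to be run, and the names `count_queued_jobs`, `jobs_to_submit` and `top_up` are made up for this example. It relies only on "atq" printing one line per job and on "batch" reading the commands to run from standard input.

```python
import subprocess

TARGET_QUEUE_DEPTH = 2  # step 4: keep the queue topped up to two jobs


def count_queued_jobs(atq_output: str) -> int:
    """Step 2: count the non-empty lines of atq output (one line per job)."""
    return sum(1 for line in atq_output.splitlines() if line.strip())


def jobs_to_submit(atq_output: str, target: int = TARGET_QUEUE_DEPTH) -> int:
    """Steps 3 and 4: how many new jobs are needed to reach the target depth."""
    return max(0, target - count_queued_jobs(atq_output))


def top_up(pending_commands: list) -> None:
    """Submit jobs from pending_commands (hypothetical) until the target is met."""
    atq_output = subprocess.run(
        ["atq"], capture_output=True, text=True, check=True
    ).stdout
    needed = min(jobs_to_submit(atq_output), len(pending_commands))
    for _ in range(needed):
        command = pending_commands.pop(0)
        # batch reads the commands to execute from standard input
        subprocess.run(["batch"], input=command + "\n", text=True, check=True)
```

Running `top_up` once a minute from cron would correspond to the procedure described above. Note that jobs stuck on the "=" queue still count as lines of "atq" output, so with two such jobs present `jobs_to_submit` returns 0 and nothing new is started.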
atd is running with the parameter "-l 2.2", on an SMP dual-core system.
Approximately once every couple of days, I discover that the CPU is not busy. Upon investigation I discover that the "atq" output states that two jobs are running, both on the "=" queue. However, the two jobs in question have completed and are no longer requesting OS resources. (The output has been created and "ps" confirms that the processes no longer exist on the system.)
The system remains in this state until I manually use "atrm" to remove the jobs, at which point the next time that the cron job executes two new jobs are (correctly) started.
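For anyone wanting to watch for this state, here is a minimal sketch of picking the "=" queue entries out of "atq" output. It assumes each line starts with the job number and ends with the queue letter followed by the user name; the date format in between varies between versions of at, so it is ignored. The function name `running_job_ids` is made up for this example.

```python
def running_job_ids(atq_output: str) -> list:
    """Return the ids of jobs that atq reports on the "=" (running) queue."""
    ids = []
    for line in atq_output.splitlines():
        fields = line.split()
        # Assumed layout: job id first, queue letter second to last, user last.
        if len(fields) >= 4 and fields[-2] == "=":
            ids.append(fields[0])
    return ids
```

The ids returned can then be compared by hand against "ps" output. If the list stays the same across several checks while the CPU is idle, those are candidates for the "atrm" workaround.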
So the problem is that "atq" doesn't seem to realise that the jobs have completed.
This is on 64-bit dapper with all updates applied.
You are going to need to provide a usable test case to replicate this, preferably a minimal one, or failing that the code that leads to this problem. Otherwise, there is little a developer can do.