Comment 2 for bug 126204

N7DR (doc-evans) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "=" queue when complete

On 15/07/07, Scott Kitterman <email address hidden> wrote:
> You are going to need to provide a usable test case to replicate this.
> Preferably a minimal test case, but otherwise the code that leads to
> this problem. Otherwise, there is little a developer can do.

OK, here is the simplest test case that I am sure will work. Several
items can probably be changed or removed and the problem will still
show up, but since I am not completely certain of that, I would rather
give instructions that I know will cause the problem to appear.

On an SMP 64-bit system:

1. Create a program that takes several minutes to execute. Call this
program X. It doesn't matter at all what this program does; we just
want something that will occupy the CPU for about five minutes.

2. Create a file called jobs_to_run, at least several hundred lines
long, where each line creates a batch job to run X. That is, each line
looks like this:
  batch < <some-file>

3. Create a cron job that, once per minute, executes the following
Python script:

#! /usr/bin/python

import popen2

# get the list of jobs in Q1 (the jobs_to_run file)
r, w, e = popen2.popen3('cat jobs_to_run')

jobs_in_Q1 = r.readlines() # the list of jobs

n_jobs_in_Q1 = len(jobs_in_Q1)

r.close()
w.close()
e.close()

if (n_jobs_in_Q1 > 0):

 # find the number of jobs in Q2 (the at/batch queue, as listed by atq)
 r, w, e = popen2.popen3('atq | wc -l')
 n_jobs_in_Q2 = r.readline()
 n_jobs_in_Q2 = n_jobs_in_Q2[0:len(n_jobs_in_Q2)-1]
 n_jobs_in_Q2 = int(n_jobs_in_Q2)

 r.close()
 w.close()
 e.close()

 if n_jobs_in_Q2 < 2:

  # add a job to Q2
  job = jobs_in_Q1[0]
  job = job[0:len(job)-1]

  command = job

  r, w, e = popen2.popen3(command)
  r.close()
  w.close()
  e.close()

  # remove this job from Q1
  jobs_in_Q1 = jobs_in_Q1[1:]

  jobs_to_run = open("jobs_to_run", "w")

  for line in jobs_in_Q1:
   jobs_to_run.write(line)
  jobs_to_run.close()

4. Let it run.
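
For anyone trying to reproduce this on a system without the popen2
module (it was removed in Python 3), the feeder logic of step 3 can be
sketched with subprocess instead. This is only a rough equivalent, not
the script that actually triggered the bug: the function names
(count_queued_jobs, submit, feed_one) and the MAX_QUEUED constant are
made up for this sketch, and it assumes that atq prints one line per
queued job.

```python
import subprocess

JOBS_FILE = "jobs_to_run"   # same file name as in step 2
MAX_QUEUED = 2              # submit only while atq lists fewer than 2 jobs


def count_queued_jobs():
    """Count the jobs listed by atq (assumes one output line per job)."""
    result = subprocess.run(["atq"], capture_output=True, text=True, check=True)
    return len(result.stdout.splitlines())


def submit(command):
    """Run one line of jobs_to_run through the shell.

    The shell is needed because each line uses input redirection
    ("batch < <some-file>").
    """
    subprocess.run(command, shell=True, check=False)


def feed_one(path=JOBS_FILE, count=count_queued_jobs, run=submit):
    """Submit the first pending job if fewer than MAX_QUEUED jobs are queued.

    Returns the submitted command, or None if nothing was submitted.
    """
    with open(path) as f:
        jobs = f.readlines()
    if not jobs or count() >= MAX_QUEUED:
        return None
    command = jobs[0].rstrip("\n")
    run(command)
    # Remove the submitted job from the file, keeping the rest in order.
    with open(path, "w") as f:
        f.writelines(jobs[1:])
    return command
```

A script run from cron once per minute would simply call feed_one()
with no arguments. The count and run parameters exist only so that the
logic can be exercised without touching the real queue.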

Approximately every day or two, you will end up in the buggy state,
where "atq" says that you have two jobs running, but the jobs have in
fact completed. Once this happens, the system will stay in that state
until you notice and manually remove the jobs from the "=" job queue.
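
To spot this state without watching by hand, something like the
following could pick out the jobs that atq reports on the "=" queue.
It is a sketch under an assumption about the atq output format: that
each line starts with the job number and ends with the queue letter
followed by the user name. The sample text is invented for
illustration and is not output from the affected machine.

```python
def running_jobs(atq_output):
    """Return the job numbers that atq lists on the "=" (running) queue.

    Assumes each line looks like:
        <job number> <date and time fields> <queue> <user>
    """
    jobs = []
    for line in atq_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[-2] == "=":
            jobs.append(int(fields[0]))
    return jobs


# Invented sample in the assumed format: two running jobs, one pending.
sample = (
    "101\tSun Jul 15 10:00:00 2007 = user\n"
    "102\tSun Jul 15 10:01:00 2007 = user\n"
    "103\tSun Jul 15 10:02:00 2007 b user\n"
)
print(running_jobs(sample))  # [101, 102]
```

If the numbers returned stay the same long after X should have
finished, those are the jobs that need to be removed manually.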

As far as I can see, the bug lies in whatever method is used to tell
the queue manager that jobs have completed -- occasionally, that
mechanism fails so that "atq" thinks that the jobs are still on the
"=" (i.e., the running) queue, even though the jobs have actually
completed processing.

It *may* be that one can cause the bug to appear more often by one of:
  a) making the jobs run for a shorter amount of time than the several
minutes that my jobs take
  b) queuing more than two jobs at a time
Since this is a work system currently running through approximately
10,000 jobs, I am not in a position to try either of these methods to
make the problem appear more frequently than once every day or two.

I am attempting to add more instrumentation to my machine, to produce
more information about the bug, or at least some output that shows it
occurring.
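
One possible form for such instrumentation is to record when each job
number is first seen on the "=" queue and flag any that stay there far
longer than a job should take. This is a hypothetical sketch, not what
is installed on the machine: the function names and the 30-minute
limit are assumptions, and the first_seen dictionary would have to be
saved between cron runs (not shown).

```python
def update_first_seen(first_seen, running_ids, now):
    """Record when each running job was first observed; forget finished ones.

    first_seen maps job number -> timestamp (seconds) of first observation.
    """
    for job_id in running_ids:
        first_seen.setdefault(job_id, now)
    for job_id in list(first_seen):
        if job_id not in running_ids:
            del first_seen[job_id]
    return first_seen


def stuck_jobs(first_seen, now, limit_seconds=30 * 60):
    """Return job numbers that have been on the "=" queue longer than the limit."""
    return sorted(j for j, t in first_seen.items() if now - t > limit_seconds)


# Example with made-up job numbers and timestamps (in seconds).
seen = {}
update_first_seen(seen, {1, 2}, now=0)
update_first_seen(seen, {2, 3}, now=600)   # job 1 finished, job 3 started
print(stuck_jobs(seen, now=1900))          # [2]
```

Since the jobs here take about five minutes each, anything reported by
stuck_jobs with a limit of that order would be a strong hint that the
queue has entered the buggy state.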