Comment 3 for bug 126204

Revision history for this message
N7DR (doc-evans) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "="queue when complete

OK, I can at least show an example of this bug happening now.

I instrumented the machine to print the following information once per minute:

the time
the output of "atq"
the start of the output of "top"

So here is an example of three minutes of normal operation (i.e., when
everything is working OK):

----

Wed Jul 18 07:33:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4233 Wed Jul 18 07:30:00 2007 = n7dr

top - 07:33:01 up 14 days, 1:12, 2 users, load average: 2.37, 2.11, 2.04
Tasks: 243 total, 3 running, 238 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4014428k used, 32144k free, 564304k buffers
Swap: 1020116k total, 1005376k used, 14740k free, 548508k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27683 n7dr 39 14 28428 15m 1740 R 50 0.4 2:54.83 model2
27594 n7dr 39 14 32152 19m 1740 R 49 0.5 3:53.68 model2
27906 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.01 top
    1 root 18 0 2640 504 468 S 0 0.0 0:04.61 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:34:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:34:02 up 14 days, 1:14, 2 users, load average: 1.91, 2.04, 2.02
Tasks: 246 total, 3 running, 241 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4006700k used, 39872k free, 564616k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 548916k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27594 n7dr 39 14 36128 22m 1740 R 48 0.6 4:53.11 model2
28014 n7dr 37 14 16476 3864 1740 R 45 0.1 0:00.47 model2
    1 root 16 0 2640 504 468 S 1 0.0 0:04.62 init
28018 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.01 top
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:35:01 MDT 2007

4235 Wed Jul 18 07:35:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:35:02 up 14 days, 1:15, 2 users, load average: 1.91, 2.02, 2.01
Tasks: 246 total, 5 running, 239 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4002512k used, 44060k free, 564988k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 549564k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28014 n7dr 39 14 28824 15m 1740 R 45 0.4 0:58.55 model2
28149 n7dr 39 14 16204 3604 1740 R 42 0.1 0:00.44 model2
28144 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.02 top
28150 n7dr 21 4 44976 2992 2052 S 1 0.1 0:00.01 mail
    1 root 19 0 2640 504 468 S 0 0.0 0:04.62 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0

----

You can see the CPU is being consumed by the "model2" jobs, which are
the jobs on the running batch queue.

Now look what happens a few minutes later:

----

Wed Jul 18 07:40:02 MDT 2007

4235 Wed Jul 18 07:35:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:40:02 up 14 days, 1:20, 2 users, load average: 2.69, 2.73, 2.34
Tasks: 247 total, 4 running, 240 sleeping, 0 stopped, 3 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4023040k used, 23532k free, 704060k buffers
Swap: 1020116k total, 1001744k used, 18372k free, 195968k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28014 n7dr 39 14 36444 22m 1740 R 50 0.6 5:40.84 model2
26130 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.02 top
    1 root 15 0 2640 504 468 S 0 0.0 0:04.62 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1
    6 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1

----

There is now only one model2 job consuming CPU resources, but "atq"
still lists two jobs as on the running queue.

And then, a minute later...

----

Wed Jul 18 07:41:02 MDT 2007

4235 Wed Jul 18 07:35:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:41:02 up 14 days, 1:21, 2 users, load average: 1.95, 2.53, 2.29
Tasks: 243 total, 2 running, 238 sleeping, 0 stopped, 3 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4024852k used, 21720k free, 612560k buffers
Swap: 1020116k total, 1001744k used, 18372k free, 159620k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27744 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.02 top
    1 root 16 0 2640 504 468 S 0 0.0 0:04.62 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1
    6 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
    7 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1

----

There are no model2 jobs, but "atq" thinks that they are still running.

an out-of-band check confirmed at this point that there were no
"model2" processes on the machine, and that the expected output from
them had been created.

So (when I noticed that we had triggered the bug, a few minutes later)
I had to manually remove jobs 4234 and 4235 from the queue (using
"atrm"), and then everything went back to normal operation again:

----

Wed Jul 18 08:27:01 MDT 2007

4236 Wed Jul 18 08:25:00 2007 = n7dr
4237 Wed Jul 18 08:26:00 2007 = n7dr

top - 08:27:01 up 14 days, 2:07, 2 users, load average: 2.38, 1.48, 1.16
Tasks: 218 total, 4 running, 213 sleeping, 0 stopped, 1 zombie
Cpu(s): 3.8% us, 1.2% sy, 79.9% ni, 14.3% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 2660228k used, 1386344k free, 203016k buffers
Swap: 1020116k total, 99044k used, 921072k free, 815020k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29473 n7dr 39 14 25460 11m 1740 R 50 0.3 0:56.02 model2
29389 n7dr 39 14 34168 19m 1740 R 45 0.5 1:49.27 model2
28779 n7dr 15 0 323m 59m 33m S 3 1.5 0:17.45 amarokapp
 4814 root 15 0 0 0 0 S 2 0.0 1:54.15 nfsd
29553 n7dr 15 0 10692 1284 880 R 1 0.0 0:00.01 top
    1 root 18 0 2640 504 468 S 0 0.0 0:04.63 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.40 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0

----