OK, I can at least show an example of this bug happening now. I instrumented the machine to print the following information once per minute: the time the output of "atq" the start of the output of "top" So here is an example of three minutes of normal operation (i.e., when everything is working OK): ---- Wed Jul 18 07:33:01 MDT 2007 4232 Wed Jul 18 07:29:00 2007 = n7dr 4233 Wed Jul 18 07:30:00 2007 = n7dr top - 07:33:01 up 14 days, 1:12, 2 users, load average: 2.37, 2.11, 2.04 Tasks: 243 total, 3 running, 238 sleeping, 0 stopped, 2 zombie Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 4014428k used, 32144k free, 564304k buffers Swap: 1020116k total, 1005376k used, 14740k free, 548508k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 27683 n7dr 39 14 28428 15m 1740 R 50 0.4 2:54.83 model2 27594 n7dr 39 14 32152 19m 1740 R 49 0.5 3:53.68 model2 27906 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.01 top 1 root 18 0 2640 504 468 S 0 0.0 0:04.61 init 2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1 Wed Jul 18 07:34:01 MDT 2007 4232 Wed Jul 18 07:29:00 2007 = n7dr 4234 Wed Jul 18 07:34:00 2007 = n7dr top - 07:34:02 up 14 days, 1:14, 2 users, load average: 1.91, 2.04, 2.02 Tasks: 246 total, 3 running, 241 sleeping, 0 stopped, 2 zombie Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 4006700k used, 39872k free, 564616k buffers Swap: 1020116k total, 1005372k used, 14744k free, 548916k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 27594 n7dr 39 14 36128 22m 1740 R 48 0.6 4:53.11 model2 28014 n7dr 37 14 16476 3864 1740 R 45 0.1 0:00.47 model2 1 root 16 0 2640 504 468 S 1 0.0 0:04.62 init 28018 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.01 top 2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1 Wed Jul 18 07:35:01 MDT 2007 4235 Wed Jul 18 07:35:00 2007 = n7dr 4234 Wed Jul 18 07:34:00 2007 = n7dr top - 07:35:02 up 14 days, 1:15, 2 users, load average: 1.91, 2.02, 2.01 Tasks: 246 total, 5 running, 239 sleeping, 0 stopped, 2 zombie Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 4002512k used, 44060k free, 564988k buffers Swap: 1020116k total, 1005372k used, 14744k free, 549564k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 28014 n7dr 39 14 28824 15m 1740 R 45 0.4 0:58.55 model2 28149 n7dr 39 14 16204 3604 1740 R 42 0.1 0:00.44 model2 28144 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.02 top 28150 n7dr 21 4 44976 2992 2052 S 1 0.1 0:00.01 mail 1 root 19 0 2640 504 468 S 0 0.0 0:04.62 init 2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 ---- You can see the CPU is being consumed by the "model2" jobs, which are the jobs on the running batch queue. Now look what happens a few minutes later: ---- Wed Jul 18 07:40:02 MDT 2007 4235 Wed Jul 18 07:35:00 2007 = n7dr 4234 Wed Jul 18 07:34:00 2007 = n7dr top - 07:40:02 up 14 days, 1:20, 2 users, load average: 2.69, 2.73, 2.34 Tasks: 247 total, 4 running, 240 sleeping, 0 stopped, 3 zombie Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 4023040k used, 23532k free, 704060k buffers Swap: 1020116k total, 1001744k used, 18372k free, 195968k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 28014 n7dr 39 14 36444 22m 1740 R 50 0.6 5:40.84 model2 26130 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.02 top 1 root 15 0 2640 504 468 S 0 0.0 0:04.62 init 2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1 6 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1 ---- There is now only one model2 job consuming CPU resources, but "atq" still lists two jobs as on the running queue. And then, a minute later... ---- Wed Jul 18 07:41:02 MDT 2007 4235 Wed Jul 18 07:35:00 2007 = n7dr 4234 Wed Jul 18 07:34:00 2007 = n7dr top - 07:41:02 up 14 days, 1:21, 2 users, load average: 1.95, 2.53, 2.29 Tasks: 243 total, 2 running, 238 sleeping, 0 stopped, 3 zombie Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 4024852k used, 21720k free, 612560k buffers Swap: 1020116k total, 1001744k used, 18372k free, 159620k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 27744 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.02 top 1 root 16 0 2640 504 468 S 0 0.0 0:04.62 init 2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1 6 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1 7 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1 ---- There are no model2 jobs, but "atq" thinks that they are still running. an out-of-band check confirmed at this point that there were no "model2" processes on the machine, and that the expected output from them had been created. So (when I noticed that we had triggered the bug, a few minutes later) I had to manually remove jobs 4234 and 4235 from the queue (using "atrm"), and then everything went back to normal operation again: ---- Wed Jul 18 08:27:01 MDT 2007 4236 Wed Jul 18 08:25:00 2007 = n7dr 4237 Wed Jul 18 08:26:00 2007 = n7dr top - 08:27:01 up 14 days, 2:07, 2 users, load average: 2.38, 1.48, 1.16 Tasks: 218 total, 4 running, 213 sleeping, 0 stopped, 1 zombie Cpu(s): 3.8% us, 1.2% sy, 79.9% ni, 14.3% id, 0.4% wa, 0.2% hi, 0.3% si Mem: 4046572k total, 2660228k used, 1386344k free, 203016k buffers Swap: 1020116k total, 99044k used, 921072k free, 815020k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 29473 n7dr 39 14 25460 11m 1740 R 50 0.3 0:56.02 model2 29389 n7dr 39 14 34168 19m 1740 R 45 0.5 1:49.27 model2 28779 n7dr 15 0 323m 59m 33m S 3 1.5 0:17.45 amarokapp 4814 root 15 0 0 0 0 S 2 0.0 1:54.15 nfsd 29553 n7dr 15 0 10692 1284 880 R 1 0.0 0:00.01 top 1 root 18 0 2640 504 468 S 0 0.0 0:04.63 init 2 root RT 0 0 0 0 S 0 0.0 0:03.40 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0 ----