Comment 26 for bug 1011792

Stefan Bader (smb) wrote :

It is hard to say for sure. We do see the same jump in time here. However, the problem in the lkml report looked more like one task going into schedule and never really getting scheduled again, while this looks like a real locking issue. All CPUs (except CPU1) appear to be waiting on some spinlock. Several are waiting on runqueue locks (in code paths that try to avoid potential A-B lock-ordering deadlocks). CPU0 is waiting on a block request queue lock. It is not possible to tell from the traces alone, but it feels like one of the runqueue locks may not have been released on some rarely used special path.
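To illustrate what avoiding A-B locks means here: when two locks have to be held at once, taking them in one fixed global order prevents two CPUs from each holding one lock while waiting for the other. The following is only a minimal user-space sketch of that idea in Python, not the kernel code. The names `RunQueue`, `double_lock`, `double_unlock`, `migrate` and `run_demo` are hypothetical, and ordering by CPU number is an assumption made for the example.

```python
import threading


class RunQueue:
    """Hypothetical stand-in for a per-CPU runqueue; only the lock matters."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()


def double_lock(rq1, rq2):
    """Take both locks in a fixed global order (lowest CPU number first)."""
    if rq1 is rq2:
        rq1.lock.acquire()
        return
    first, second = sorted((rq1, rq2), key=lambda rq: rq.cpu)
    first.lock.acquire()
    second.lock.acquire()


def double_unlock(rq1, rq2):
    """Release both locks; release order does not matter for deadlock safety."""
    rq1.lock.release()
    if rq1 is not rq2:
        rq2.lock.release()


def migrate(src, dst, rounds, counter):
    """Repeatedly take both locks, as a task migration between queues might."""
    for _ in range(rounds):
        double_lock(src, dst)
        try:
            counter[0] += 1  # protected: both locks are held here
        finally:
            double_unlock(src, dst)


def run_demo(rounds=1000):
    """Run two threads that name the queues in opposite order."""
    rq0, rq1 = RunQueue(0), RunQueue(1)
    counter = [0]
    threads = [
        threading.Thread(target=migrate, args=(rq0, rq1, rounds, counter), daemon=True),
        threading.Thread(target=migrate, args=(rq1, rq0, rounds, counter), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    stuck = any(t.is_alive() for t in threads)
    return counter[0], stuck
```

Even though the two threads pass the queues in opposite order, both end up acquiring the locks in the same order, so neither can block the other forever. A path that takes such a lock and then returns without releasing it would leave every later caller spinning, which is the kind of pattern the traces seem to hint at.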
I hope that the kernels with lock debugging enabled will show something here. At the very least, new stack traces taken with those kernels would be known not to be affected by the task_group race. They will also help if we need to think of a special way to catch or print information about the problem. In any further debug kernels I will try to include something like the change mentioned on lkml, to see whether that also avoids the time jumps and the other issues. Another suspect could be a higher rate of reads of /proc task info (one of the CPUs is running in that code in the traces). That might be something a database does more often than any other workload.