Comment 23 for bug 503138

Christian Ehrhardt  (paelzer) wrote :

Hi Arie,
I saw you also posting on some other old bugs with the same symptom (e.g. 1346917).
And that is exactly the problem with this issue: the message only represents the symptom, not the root cause. Fixing the symptom makes no sense, because if your IRQ delivery is stalled then it is stalled, and there is very little the system can do other than throttling it down.
If you follow the discussion around [1] you'll see that this was the thinking back then, and AFAIK it has not changed since.

There were cases around KSM (bug 1346917), but also others around broken RAM; some people had simply overloaded their CPUs, some hit a scheduler bug triggering the same message, and in other cases there was a thundering herd issue. The only thing all of these cases share is that eventually something (tm) happened which made IRQ/hrtimers stutter, so that they were then throttled down.

Many of the underlying issues causing this have been fixed over the years, so it has become rarer nowadays. But for each case still left, a "this happened again here" message does not help. What is needed instead is a way to reproduce it, in order to debug the root cause of that exact case and then check for a fix.

Therefore three recommendations for people affected:
1. Try as best you can to get it to reproduce reliably and then outline the steps; this will hopefully help people get a grasp of the root cause. Do -not- just report the "hrtimer: interrupt took" message, which is only the symptom.
2. Always try the same setup you have, but with the very latest virtualization stack available. Quite often things are fixed there, and if that is confirmed it becomes "only" a binary search for what would need to be backported.
3. Open a new bug for each of these, because until the root cause of a given case is found and confirmed to be the same, we don't know whether it is the same issue.
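
As a possible aid for recommendation 1, here is a minimal sketch that pulls the symptom lines out of a saved kernel log together with whatever was logged shortly before them, so a report can carry some context instead of only the symptom. It assumes the log was saved with the usual `[seconds.micros]` timestamp prefix and that the message ends in `<N> ns`; only the "hrtimer: interrupt took" prefix is taken from this thread. The function names, the 5 second window, and the sample log lines are all made up for illustration.

```python
import re

# Assumed line shape: "[  120.500000] hrtimer: interrupt took 4821337 ns".
# Only the "hrtimer: interrupt took" prefix comes from this bug; the
# timestamp format and the "<N> ns" suffix are assumptions.
HRTIMER_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+hrtimer: interrupt took (?P<ns>\d+) ns"
)
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_hrtimer_events(log_text):
    """Return (timestamp_seconds, duration_ns) for each symptom line."""
    events = []
    for line in log_text.splitlines():
        m = HRTIMER_RE.search(line)
        if m:
            events.append((float(m.group("ts")), int(m.group("ns"))))
    return events


def lines_before(log_text, event_ts, window_s=5.0):
    """Return log lines stamped within window_s seconds before event_ts."""
    selected = []
    for line in log_text.splitlines():
        m = TS_RE.match(line)
        if m and event_ts - window_s <= float(m.group(1)) < event_ts:
            selected.append(line)
    return selected


# Made-up sample log, purely for illustration.
SAMPLE_LOG = """\
[  100.000001] placeholder: unrelated earlier line
[  118.250000] placeholder: something happened shortly before
[  120.500000] hrtimer: interrupt took 4821337 ns
"""

if __name__ == "__main__":
    for ts, ns in find_hrtimer_events(SAMPLE_LOG):
        print(f"symptom at {ts:.6f}s, took {ns} ns; preceding lines:")
        for line in lines_before(SAMPLE_LOG, ts):
            print("  " + line)
```

The preceding lines are not a root cause by themselves, but together with the exact steps that trigger the message they give whoever debugs the case a place to start.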

[1]: http://<email address hidden>/msg23491.html