Comment 6 for bug 1534345

Stefan Bader (smb) wrote :

Looking at both traces, this appears to consistently happen inside run_timer_softirq(), and from the offset I would guess we are in the inlined __run_timers(). Another noteworthy part is the value of RAX: it is LIST_POISON2, which is used to mark the (hlist_node *)->pprev pointer of a deleted node as invalid.
So I would guess something modified the list of pending timers (those exist per-cpu) while softirq processing was working on them. The hard part is saying what. I am not sure a dump will help, as in those races the clues often go away right after causing the problem.
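To illustrate why LIST_POISON2 in a register points at a stale list node, here is a minimal userspace sketch of the hlist removal pattern. It is not the kernel code itself. The poison constants are what I believe include/linux/poison.h defines in 4.2-era kernels with the x86_64 default for CONFIG_ILLEGAL_POINTER_VALUE, so treat them as an assumption. The helpers is_poisoned() and stale_node_is_poisoned() are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: 4.2-era poison.h, x86_64 default illegal pointer delta. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL
#define LIST_POISON1 ((void *)(uintptr_t)(0x100 + POISON_POINTER_DELTA))
#define LIST_POISON2 ((void *)(uintptr_t)(0x200 + POISON_POINTER_DELTA))

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Insert n at the front of the list headed by h. */
void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n, then poison its pointers so later use is detectable. */
void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
}

/* Hypothetical helper: was this node already removed from its list? */
bool is_poisoned(const struct hlist_node *n)
{
    return n->pprev == LIST_POISON2;
}

/*
 * Hypothetical scenario: a walker caches a pointer to node b, another
 * code path deletes b, and the walker then looks at the node again.
 */
bool stale_node_is_poisoned(void)
{
    struct hlist_head head = { NULL };
    struct hlist_node a, b;
    struct hlist_node *saved;

    hlist_add_head(&b, &head);
    hlist_add_head(&a, &head);
    saved = &b;        /* walker's cached pointer */
    hlist_del(&b);     /* removal behind the walker's back */

    assert(head.first == &a && a.next == NULL);
    return is_poisoned(saved);
}
```

In the sketch the walker only compares the pointer. In the kernel, a second unlink of such a node would write through pprev, and since the poison value is not a valid address that would fault, with the poison value showing up in a register.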
I would maybe suspect the area of xen-netfront, given that, as far as I can tell, this has not happened on bare-metal servers and, from the description, rather seems to affect high-traffic instances.
Would it be possible to volunteer one affected instance and try mainline kernels (https://wiki.ubuntu.com/Kernel/MainlineBuilds) between 3.19 and 4.2 (4.0, 4.1) and/or after (4.3, maybe 4.4)? The 4.0 and 4.1 kernels would give a smaller delta to look at for what broke things, and 4.3 or 4.4 would show whether it maybe got fixed upstream without the fix being identified as a stable patch.