Comment 12 for bug 1559194

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-02 09:22 EDT-------
From the dmesg it looks like this time the ext4 page allocation is the first to stumble over the doubly freed page. That happens immediately after the page was corrupted by the double free (indicated by the WARNING), so it only means that ext4 happened to be the first to get its hands on the corrupted page during a page alloc. It could have hit anyone, and later we also see another occurrence where copy_pte_range() stumbles over another corrupted page (there is no WARNING before that one because it is a WARN_ONCE).

We still need to find the root cause of the double free and the resulting page corruption (count -1), and the WARNING trace is the only reliable hint we have for a double free. So my analysis from comment #5 is still valid: even though this time genwqe itself did not stumble over the corrupted page, it was still involved in the double free. Anyone can hit the corrupted page afterwards; genwqe was just a more likely candidate because it was an active consumer at the time.

BTW, a call of dma_free() on previously unmapped addresses would of course result in the same issue as a "double free", but a double free is much more likely, e.g. caused by broken error handling with an "off by one" or other issues. Speaking of error handling, the "genwqe 0001:00:00.0: [genwqe_map_pages] err: no dma addr daddr=ffffffffffffffff!" messages may be a good starting point to verify the genwqe error handling and the page freeing strategy. Those messages by themselves are no problem and are even expected given the nature of the test (online/offline and failing rpcit), but there is some error handling involved which may have issues that could lead to a double free.
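
To illustrate the kind of "off by one" in an error path that I have in mind, here is a minimal userspace sketch. It is not genwqe code and not the kernel DMA API: toy_map(), toy_free(), unwind_buggy(), unwind_ok() and the page_count[] array are all made-up names, and the counter only mimics a page count going to -1 when something is freed that was never (or is no longer) mapped.

```c
#include <assert.h>
#include <stdio.h>

#define NPAGES 4

/* Toy stand-in for a page count: 1 = mapped, 0 = free, -1 = corrupted. */
static int page_count[NPAGES];

/* Hypothetical free: freeing an already-free page drives the count to -1. */
static void toy_free(int i)
{
	page_count[i]--;
}

/*
 * Hypothetical mapping loop: map pages 0..n-1, but the mapping of page
 * fail_at fails (pass -1 for "no failure").
 * Returns the number of pages that were actually mapped.
 */
static int toy_map(int n, int fail_at)
{
	int i;

	assert(n <= NPAGES);
	for (i = 0; i < n; i++) {
		if (i == fail_at)
			break;		/* error: page i was NOT mapped */
		page_count[i] = 1;
	}
	return i;
}

/* Buggy unwind: off by one, also frees the entry that failed to map. */
static void unwind_buggy(int mapped)
{
	int i;

	assert(mapped < NPAGES);	/* only meaningful on the failure path */
	for (i = mapped; i >= 0; i--)
		toy_free(i);
}

/* Correct unwind: frees exactly the entries that were mapped. */
static void unwind_ok(int mapped)
{
	int i;

	for (i = mapped - 1; i >= 0; i--)
		toy_free(i);
}

/* Returns the number of pages whose count went negative. */
static int count_corrupted(void)
{
	int i, bad = 0;

	for (i = 0; i < NPAGES; i++)
		if (page_count[i] < 0)
			bad++;
	return bad;
}

/* Put all toy pages back into the "free" state. */
static void reset_pages(void)
{
	int i;

	for (i = 0; i < NPAGES; i++)
		page_count[i] = 0;
}
```

With a failure injected at page 2, toy_map(4, 2) maps only pages 0 and 1. unwind_buggy() then also "frees" page 2, which leaves it at count -1, while unwind_ok() leaves all counts at 0. Whether the real genwqe error path has this pattern is exactly what would need to be checked.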