Comment 15 for bug 1497428

Dan Streetman (ddstreet) wrote :

To clarify the bug, a bit of background is needed first (specific numbers apply only to this situation).

The kernel refers to all pages under a single PMD (mid-level page table) as a "pageblock". It's the same size as a hugepage, 2M. The function triggering the BUG(), move_freepages(), expects the start and end pages to be inside the same zone; that isn't the case here, so the BUG() is triggered. One function up, move_freepages_block(), is where the start and end PFNs are set; the function takes one page and calculates the pageblock-aligned start and end PFNs of the pageblock that contains the provided page. It then verifies that both PFNs are inside the original page's zone, and passes the start/end pages to move_freepages().
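
To make the alignment step concrete, here is a rough shell sketch of how the start and end PFNs of a pageblock can be derived from a single PFN. It assumes 4k pages and 512 pages per pageblock; the function name pageblock_range and the sample PFN are made up for illustration, and this is not the kernel's actual code.

```shell
# Assumption: 4k pages and 2M pageblocks, i.e. 512 pages per pageblock.
PAGEBLOCK_NR_PAGES=512

# Print the first and last PFN of the pageblock containing the given PFN.
# (Hypothetical helper, named here only for illustration.)
pageblock_range() {
    pr_pfn=$(( $1 ))
    # Clear the low bits to round down to the start of the pageblock.
    pr_start=$(( pr_pfn & ~(PAGEBLOCK_NR_PAGES - 1) ))
    # The last PFN of the block is one pageblock further on, minus one page.
    pr_end=$(( pr_start + PAGEBLOCK_NR_PAGES - 1 ))
    printf '0x%x 0x%x\n' "$pr_start" "$pr_end"
}

pageblock_range 0x3e08ff   # prints "0x3e0800 0x3e09ff"
```

Any PFN inside the same 512-page block gives the same pair, which is why the end PFN can land past the end of real memory when memory stops mid-pageblock.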

The problem is that the zone's PFN range is wrong. In this particular case, the zone's memory ends in the middle of a pageblock, which is unusual. So when move_freepages_block() checks if the end PFN of the pageblock is inside the zone (i.e. < zone end PFN), it *should* fail, and cause the function to return. However, it doesn't fail, meaning the zone's end PFN is wrong, and when move_freepages() checks the page_zone() of the start and end pages, they don't match - because the end page isn't valid - and the BUG() is triggered.
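
The zone check described above can be sketched in shell as follows. This only models the description in this comment, not the kernel source. The PFNs are illustrative: the pageblock is 0x3e0800..0x3e09ff, the zone is assumed to start at PFN 0x100000, and the "wrong" zone end of 0x3e0a00 is a hypothetical value chosen to show the failure mode.

```shell
# Succeed (exit status 0) only if both ends of the pageblock lie inside
# the zone, where the zone covers PFNs [zone_start, zone_end).
# (Hypothetical helper; argument order is pb_start pb_end zone_start zone_end.)
pageblock_in_zone() {
    pz_start=$(( $1 )); pz_end=$(( $2 ))
    pz_zone_start=$(( $3 )); pz_zone_end=$(( $4 ))
    [ "$pz_start" -ge "$pz_zone_start" ] && [ "$pz_end" -lt "$pz_zone_end" ]
}

# Correct zone end PFN (memory stops mid-pageblock): the check fails,
# so the pageblock is skipped and move_freepages() is never reached.
if pageblock_in_zone 0x3e0800 0x3e09ff 0x100000 0x3e0900; then echo move; else echo skip; fi

# Hypothetical wrong zone end PFN: the check passes even though the end
# page is not valid, which is the situation that leads to the BUG().
if pageblock_in_zone 0x3e0800 0x3e09ff 0x100000 0x3e0a00; then echo move; else echo skip; fi
```

The first call prints "skip" and the second prints "move".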

In my testing, if I manually limit memory to end in the middle of a pageblock, the zone's end PFN is correctly set, so it seems that something is changing the zone PFN range (specifically the zone's spanned_pages value) at runtime - or the particular environment for this bug is different than my test setup and is getting the zone end PFN wrong somehow. I'm going to create a debug module that will jprobe these functions to check for this condition, and then print debug output and avoid the BUG().

As a workaround for this, if the amount of memory is set so that it ends at a multiple of the pageblock size (512 4k pages == 2M), this bug should not happen. On x86, the boot mem= param sets the maximum address, which should allow changing the zone's end PFN to be aligned to a pageblock boundary; e.g. if the dmesg e820 output lists the last line of the memory ranges as:

[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000003e08fffff] usable

then the last valid physical address is 0x3e08fffff, so the zone's end address (1 more than the last valid address, corresponding to end PFN 0x3e0900 with 4k pages) is 0x3e0900000, which isn't a multiple of the pageblock size (2M):

$ echo $(( 0x3e0900000 % (2 * 1024 * 1024) ))
1048576

In this example case, restricting the last 1M of memory by setting mem=0x3e0800000 should work around this bug - although since I can't reproduce it yet, I've no way to verify the workaround; and it may simply cause the bug to appear at a different location.
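
For a different e820 layout, the same value can be worked out by rounding the end address down to a 2M boundary. A small sketch, assuming a 2M pageblock; the function name aligned_mem_limit is made up for illustration:

```shell
# Assumption: 2M pageblocks (512 4k pages).
PAGEBLOCK_BYTES=$(( 2 * 1024 * 1024 ))

# Round an end address (1 more than the last valid address) down to the
# nearest multiple of the pageblock size and print it in hex.
# (Hypothetical helper, named here only for illustration.)
aligned_mem_limit() {
    am_end=$(( $1 ))
    printf '0x%x\n' $(( am_end - am_end % PAGEBLOCK_BYTES ))
}

aligned_mem_limit 0x3e0900000   # prints 0x3e0800000
```

The printed value is what would go after mem= on the kernel command line; an end address that is already aligned is returned unchanged.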