Comment 5 for bug 634487

Kent Forschmiedt (kentf) wrote : Re: t1.micro instance hangs when installing sun java

Linux takes a page fault and, in the process of updating the page tables to map the new page, needs to allocate a new "PMD", or mid-level page directory page. A PV kernel cannot write its own page tables; Xen has to do that for it. So a "multicall" that encapsulates two hypercalls is dispatched to Xen:
* update_va_mapping: Map the PMD page in the guest memory and flush the mapping cache (TLB)
* mmu_update: Update the contents of the PMD page
Xen returns a fatal error, which causes the guest kernel to crash.
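
To illustrate why a batched call can hide which operation failed, here is a toy model in C. It is only a sketch: every name in it (toy_op, toy_mc_entry, toy_hypervisor_run, toy_flush_coarse, toy_first_failure) is hypothetical, and it is not the Xen ABI or the Linux multicall code. It models a batch of two entries, a caller that only learns "something in the batch failed", and a diagnostic helper that points at the failing entry.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the Xen ABI or Linux code. */
enum toy_op { TOY_UPDATE_VA_MAPPING, TOY_MMU_UPDATE };

struct toy_mc_entry {
    enum toy_op op;  /* which hypercall this entry stands for */
    int result;      /* status recorded by the pretend hypervisor */
};

/* Pretend hypervisor: processes entries in order and records a status
 * for each one. fail_op selects the operation that is made to fail;
 * pass -1 to let everything succeed. */
static void toy_hypervisor_run(struct toy_mc_entry *batch, size_t n, int fail_op)
{
    assert(batch != NULL);
    for (size_t i = 0; i < n; i++)
        batch[i].result = ((int)batch[i].op == fail_op) ? -1 : 0;
}

/* Coarse flush: submits the batch and collapses all statuses into one
 * pass/fail value, so the caller cannot tell which entry failed. */
static int toy_flush_coarse(struct toy_mc_entry *batch, size_t n, int fail_op)
{
    int failed = 0;

    toy_hypervisor_run(batch, n, fail_op);
    for (size_t i = 0; i < n; i++)
        if (batch[i].result < 0)
            failed = 1;
    return failed ? -1 : 0;
}

/* Diagnostic helper: index of the first failing entry, or -1 if none.
 * This is the kind of detail extra logging would have to surface. */
static int toy_first_failure(const struct toy_mc_entry *batch, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (batch[i].result < 0)
            return (int)i;
    return -1;
}
```

With a batch of { TOY_UPDATE_VA_MAPPING, TOY_MMU_UPDATE } and TOY_MMU_UPDATE forced to fail, toy_flush_coarse reports only a failure for the whole batch, while toy_first_failure identifies entry 1.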

The "multicall" scheme does not return the failure status. Most of the failure paths have no logging, and the log levels are not usually turned up high enough to show anything.

The next steps are:
* Turn up the Xen logging level and repro.
* Add logging to all of the failure cases in the hypercalls and load the test host with the diagnostic Xen build.
* Repro and see what we learn.

Some other notes:
* I wonder how easy it is to repro. I wrote some simple test programs and only got the expected out-of-memory errors. It may be that memory has to run out at exactly the moment a new PMD is needed. That is trickier than it sounds, because it depends on the number of virtual pages allocated vs. exactly how many have actually been touched.
* I studied the memory manager code in the Ubuntu kernel extensively and compared it with other kernels that did not repro the problem. It may be that this bug exists in many, most or all versions, but needs exactly the right circumstances to trigger it.
* Good news - the micro is a single-CPU instance, so this is not a race in the guest and is almost certainly not a race in Xen.
* Some of the warning oopses in the Ubuntu 2.6.35 kernel are similar to this. I have not studied those yet.
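
The difference between allocated and touched pages in the first note can be sketched in C. This is not the test program mentioned above, just a minimal illustration under stated assumptions: a Linux guest, anonymous mmap, and hypothetical helper names (reserve_pages, touch_pages, release_pages). Reserving the region only sets up virtual address space; the page faults, and any page table pages they need, happen when each page is first written.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve total_pages of anonymous virtual memory without touching it.
 * Returns NULL on failure. Helper names here are hypothetical. */
static unsigned char *reserve_pages(size_t total_pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, total_pages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Write one byte into each of the first touch_count pages, so the
 * kernel has to fault each one in and back it with real memory.
 * Returns the number of pages actually touched (capped at total_pages). */
static size_t touch_pages(unsigned char *base, size_t total_pages,
                          size_t touch_count)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = touch_count < total_pages ? touch_count : total_pages;
    volatile unsigned char *p = base;

    assert(base != NULL);
    for (size_t i = 0; i < n; i++)
        p[i * page] = 1;
    return n;
}

/* Unmap a region obtained from reserve_pages. */
static void release_pages(unsigned char *base, size_t total_pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    munmap(base, total_pages * page);
}
```

A repro attempt along these lines would reserve a region much larger than the instance's free memory and raise touch_count step by step, so that memory runs out at different points relative to page table growth. Whether that hits the exact condition described above is untested.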