Comment 13 for bug 1725350

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-11-10 12:18 EDT-------
So far I've only been able to reproduce this in 2 ways:
a) Booting up a Debian Jessie guest (kernel 3.16). Generally the crash
happens some time after boot, but in some situations it needs some
"help", like running "useradd <newuser>".
b) Booting up an Ubuntu 16.04 guest, which doesn't seem to ever trigger
the issue itself, then chrooting into that same Debian Jessie
image (attached as a 2nd virtio disk) and running that same
"useradd <newuser>".

Using these test cases, the crash appears to occur at the first
call to compound_head(page) within mm/gup.c:gup_pte_range():

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c?h=v4.13#n1312

The compound_head(page) call dereferences a struct page pointer,
via page->compound_head, and that generates a page fault which
leads to the crash. The faulting address in one instance of the
crash was 0xf00000000783af20:

[95667.639406] Unable to handle kernel paging request for data at address 0xf00000000783af20
[95667.639518] Faulting instruction address: 0xc000000000309714
90:mon> t
[c000001e3c4db900] c00000000030a3d0 get_user_pages_fast+0x110/0x160
[c000001e3c4db950] d0000000181be21c kvmppc_book3s_hv_page_fault+0x384/0xc60 [kvm_hv]
[c000001e3c4dba40] d0000000181ba94c kvmppc_vcpu_run_hv+0x314/0x790 [kvm_hv]
[c000001e3c4dbb10] d0000000181059ec kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000001e3c4dbb30] d000000018101aa0 kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[c000001e3c4dbbd0] d0000000180f5018 kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[c000001e3c4dbd40] c0000000003bd6e4 do_vfs_ioctl+0xd4/0xa00
[c000001e3c4dbde0] c0000000003be0d4 SyS_ioctl+0xc4/0x130
[c000001e3c4dbe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 000079d53a595550
SP (79d5354ede40) is in userspace

The 0xf address corresponds to the vmemmap area, where page structs are
allocated sequentially for all PFNs in the system, so it isn't obviously
a bad address. Some of our kernel folks took a look at this and worked out
that with a 64 byte sizeof(struct page), 0xf00000000783af20
corresponds to 0x783af20 / 64, i.e. PFN 1969852 (rounding down). For a
64K page size this corresponds to 1969852*64K, an address at around
120GB, which is in the range of physical memory on the system (0-128GB
in this case).
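
For anyone wanting to redo that arithmetic on another crash address, here
is a rough Python sketch. It assumes the vmemmap area starts at
0xf000000000000000 (inferred from the "0xf address" above, not checked
against the kernel config), a 64 byte struct page and a 64K page size.
The function name is made up for illustration.

```python
# Assumptions, not verified against the running kernel:
#   - vmemmap starts at 0xf000000000000000
#   - sizeof(struct page) == 64
#   - base page size is 64K
VMEMMAP_BASE = 0xF000000000000000
STRUCT_PAGE_SIZE = 64
PAGE_SIZE = 64 * 1024


def vmemmap_addr_to_pfn(addr):
    """Return (pfn, byte offset inside that PFN's struct page)."""
    return divmod(addr - VMEMMAP_BASE, STRUCT_PAGE_SIZE)


fault_addr = 0xF00000000783AF20  # from the "Unable to handle" line
pfn, offset = vmemmap_addr_to_pfn(fault_addr)
phys_addr = pfn * PAGE_SIZE      # physical address backing that PFN
phys_gib = phys_addr / 2**30

print(f"PFN {pfn}, offset {offset}, physical {phys_addr:#x} (~{phys_gib:.1f} GiB)")
```

Under those assumptions the division leaves a 32 byte remainder, which
would be consistent with the faulting access hitting a field inside the
struct page rather than its first byte, and the physical address lands
inside the 0-128GB range.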

Since the *page address appeared valid, it was suggested that the issue
was with the vmemmap area being "unbolted" by KVM, leading to a page
fault for an address that should always be pinned/bolted within the
host, and the following fix was suggested:

commit 67f8a8c1151c9ef3d1285905d1e66ebb769ecdf7
Author: Paul Mackerras <email address hidden>
Date: Tue Sep 12 13:47:23 2017 +1000
KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly

I've tested this patch against kernel 4.13.0-16-generic, and at least for
test cases a) and b) above, this does appear to resolve the issue.

So it looks like we need kernel commit 67f8a8c115 pulled into 17.10 to resolve
this bug.