KVM: Fix zero_page reference counter overflow when using KSM on KVM compute host
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Medium | Matthew Ruffell |
Focal | Fix Released | Medium | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
We are seeing a problem on OpenStack compute nodes and KVM hosts where a kernel oops is generated and all running KVM machines are placed into the paused state.
This is caused by the kernel's reserved zero_page reference counter overflowing from a positive number to a negative number, and hitting a (WARN_ON_
This only happens if the machine has Kernel Samepage Merging (KSM) enabled, with "use_zero_pages" turned on. Each time a new VM starts and the kernel does a KSM merge run during an EPT violation, the reference counter for the zero_page is incremented in try_async_pf() and never decremented. Eventually, the reference counter will overflow, causing the KVM subsystem to fail.
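The leak described above can be modelled in a few lines. This is a minimal sketch, not kernel code: it assumes the counter behaves like a signed 32-bit integer (which matches the 0x80000000 / -2147483648 reading later in this report), and the names `to_int32` and `simulate_faults` are hypothetical helpers made up for the illustration.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range, as a C int would store it."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def simulate_faults(start: int, faults: int, put_after_get: bool) -> int:
    """Return the counter value after `faults` faults that touch the zero page.

    Every fault takes one reference. If `put_after_get` is False the reference
    is never dropped again, which is the leak described in this report.
    """
    count = start
    for _ in range(faults):
        count = to_int32(count + 1)      # reference taken while handling the fault
        if put_after_get:
            count = to_int32(count - 1)  # matching release
    return count


if __name__ == "__main__":
    # Leaking path: three more references from just below the limit wrap negative.
    print(simulate_faults(INT32_MAX - 2, 3, put_after_get=False))  # -2147483648
    # Balanced path: the counter stays where it started.
    print(simulate_faults(1, 3, put_after_get=True))               # 1
```

The model only shows the arithmetic: with unbalanced increments the counter has to cross from 0x7fffffff to 0x80000000 eventually, while balanced get/put pairs leave it unchanged.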
Syslog:
error : qemuMonitorJSON
QEMU Logs:
error: kvm run failed Bad address
EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7040 00000037
IDT= 000f707e 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000000
DR6=00000000fff
EFER=0000000000
Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31
Kernel Oops:
[ 167.695986] WARNING: CPU: 1 PID: 3016 at /build/
[ 167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G OE 4.15.0-106-generic #107~16.04.1-Ubuntu
[ 167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 167.696025] RIP: 0010:follow_
[ 167.696026] RSP: 0018:ffffa81802
[ 167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 0000000080000000
[ 167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 80000001b8cea225
[ 167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: ffff90c4d55fa340
[ 167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed8786e33a80
[ 167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: ffff90c4d55fa340
[ 167.696030] FS: 00007f6a7798c70
[ 167.696030] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 0000000000162ee0
[ 167.696033] Call Trace:
[ 167.696047] follow_
[ 167.696049] follow_
[ 167.696051] __get_user_
[ 167.696052] get_user_
[ 167.696068] __gfn_to_
[ 167.696079] ? mmu_set_
[ 167.696090] try_async_
[ 167.696101] tdp_page_
[ 167.696104] ? vmexit_
[ 167.696114] kvm_mmu_
[ 167.696117] handle_
[ 167.696119] vmx_handle_
[ 167.696129] vcpu_enter_
[ 167.696138] ? kvm_arch_
[ 167.696148] kvm_arch_
[ 167.696157] ? kvm_arch_
[ 167.696165] kvm_vcpu_
[ 167.696166] ? do_futex+
[ 167.696171] ? __switch_
[ 167.696174] ? __switch_
[ 167.696176] do_vfs_
[ 167.696177] SyS_ioctl+0x79/0x90
[ 167.696180] ? exit_to_
[ 167.696181] do_syscall_
[ 167.696182] entry_SYSCALL_
[ 167.696184] RIP: 0033:0x7f6a80482007
[ 167.696184] RSP: 002b:00007f6a77
[ 167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f6a80482007
[ 167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
[ 167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 0000000000000001
[ 167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 000055fe135f3240
[ 167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
[ 167.696200] ---[ end trace 7573f6868ea8f069 ]---
[Fix]
This was fixed in 5.6-rc1 with the following commit:
commit 7df003c85218b5f
Author: Zhuang Yanying <email address hidden>
Date: Sat Oct 12 11:37:31 2019 +0800
Subject: KVM: fix overflow of zero page refcount with ksm running
Link: https:/
The fix adds a check to see if the Page Frame Number (pfn) refers to the zero page, and if it does, KVM no longer treats the page as reserved. This has the effect that put_page() is called on the zero_page when KVM releases it, so every reference taken in try_async_pf() is dropped again and the counter stops growing.
This is a clean cherry pick to Bionic and Focal kernels.
[Testcase]
Create a new KVM host, and make sure it has plenty of RAM. 16 GB should be enough.
Install KVM packages:
$ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage virtinst
Enable Kernel Samepage Merging, and use_zero_pages:
$ echo 10000 | sudo tee /sys/kernel/
$ echo 1 | sudo tee /sys/kernel/
$ echo 1 | sudo tee /sys/kernel/
I wrote a script which creates and destroys Xenial KVM VMs in an infinite loop:
https:/
Save the script to disk, and execute it:
$ chmod +x ksm_refcnt_
$ ./ksm_refcnt_
Each time a VM is created and destroyed the reference counter will increase.
I wrote a kernel module which exposes a /proc interface, which we can use to look at the value of the zero_page reference counter. It works by taking the memory allocated for the zero page: empty_zero_page, which is defined in arch/x86/
https:/
Save the module to disk, create its Makefile from the included documentation, and build it:
$ make
$ sudo insmod zero_page_
From there, we can examine the reference counter with:
$ cat /proc/zero_
Zero Page Refcount: 0x687 or 1671
$ cat /proc/zero_
Zero Page Refcount: 0x846 or 2118
$ cat /proc/zero_
Zero Page Refcount: 0x9f8 or 2552
$ cat /proc/zero_
Zero Page Refcount: 0xcb2 or 3250
We see it steadily increase. Instead of waiting months for it to overflow, I implemented a /proc entry to set it to near overflow. You can use it with:
$ cat /proc/zero_
Zero Page Refcount set to 0x1FFFFFFFFF000
After that, wait a few seconds and the reference counter will overflow:
$ cat /proc/zero_
Zero Page Refcount: 0x7fffff16 or 2147483414
$ cat /proc/zero_
Zero Page Refcount: 0x80000000 or -2147483648
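When watching the counter over time, it can help to turn these readings into numbers. The following is a rough sketch that assumes the line format shown above ("Zero Page Refcount: 0x... or N") and a 32-bit counter; `parse_refcount` and `headroom` are hypothetical helper names, not part of the module.

```python
import re

# Matches the readings shown in this report, e.g. "Zero Page Refcount: 0x687 or 1671".
LINE_RE = re.compile(r"Zero Page Refcount: (0x[0-9a-fA-F]+) or (-?\d+)")


def parse_refcount(line: str) -> int:
    """Return the signed 32-bit counter value from one reading."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised reading: {line!r}")
    raw = int(match.group(1), 16)
    signed = raw - 0x100000000 if raw & 0x80000000 else raw
    if signed != int(match.group(2)):
        raise ValueError("hex and decimal fields disagree")
    return signed


def headroom(refcount: int) -> int:
    """Increments left before the counter reaches 0x80000000 (0 if already negative)."""
    return 0 if refcount < 0 else 0x80000000 - refcount


if __name__ == "__main__":
    readings = [
        "Zero Page Refcount: 0x687 or 1671",
        "Zero Page Refcount: 0x846 or 2118",
        "Zero Page Refcount: 0x9f8 or 2552",
        "Zero Page Refcount: 0xcb2 or 3250",
    ]
    values = [parse_refcount(r) for r in readings]
    # Growth between consecutive readings.
    print([b - a for a, b in zip(values, values[1:])])  # [447, 434, 698]
    print(headroom(parse_refcount("Zero Page Refcount: 0x7fffff16 or 2147483414")))  # 234
```

Cross-checking the hex field against the decimal field is a cheap way to catch a truncated or mis-copied reading.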
All VMs will become paused:
$ virsh list
Id Name State
-------
1 instance-0 paused
2 instance-1 paused
QEMU will error out, and the kernel will oops with the messages in the impact section.
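For a test loop that should stop as soon as the failure appears, the paused state can be detected from captured `virsh list` output. This is only a sketch: it assumes the three-column layout shown above and domain names without spaces, and `paused_domains` is a hypothetical helper name.

```python
from typing import List


def paused_domains(virsh_output: str) -> List[str]:
    """Return the names of domains whose State column reads 'paused'."""
    paused = []
    for line in virsh_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric Id ("-" for inactive domains);
        # this skips the header and the dashed separator.
        if len(fields) < 3 or not (fields[0].isdigit() or fields[0] == "-"):
            continue
        name = fields[1]
        state = " ".join(fields[2:])  # states such as "shut off" span two words
        if state == "paused":
            paused.append(name)
    return paused


if __name__ == "__main__":
    sample = """ Id Name State
-------
 1 instance-0 paused
 2 instance-1 paused
"""
    print(paused_domains(sample))  # ['instance-0', 'instance-1']
```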
I built a test kernel, which is available here:
https:/
If you install the test kernel and try to reproduce, you will notice the reference counter is never incremented past 1:
$ cat /proc/zero_
Zero Page Refcount: 0x1 or 1
$ cat /proc/zero_
Zero Page Refcount: 0x1 or 1
$ cat /proc/zero_
Zero Page Refcount: 0x1 or 1
This resolves the problem.
[Regression Potential]
While the change itself seems simple, it changes how the kernel treats the zero_page. The zero_page is important, since it is just a page full of 0's. Whenever allocated memory contains only 0's, the kernel maps it to the zero_page to save memory. When an application writes to the buffer, an EPT violation happens, and the kernel does a COW to new pages to hold the data.
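The sharing and copy-on-write behaviour described above can be pictured with a toy model. It is purely illustrative and says nothing about the real kernel data structures: `ToyAddressSpace`, `ZERO_FRAME`, and the frame numbering are all made up for the sketch.

```python
ZERO_FRAME = 0    # hypothetical frame number standing in for the shared zero page
PAGE_SIZE = 4096


class ToyAddressSpace:
    """Every page starts out mapped to one shared, read-only frame of zeros."""

    def __init__(self, pages: int) -> None:
        self.frames = {ZERO_FRAME: bytes(PAGE_SIZE)}  # frame number -> contents
        self.mapping = {vpn: ZERO_FRAME for vpn in range(pages)}
        self._next_frame = 1

    def read(self, vpn: int, offset: int) -> int:
        return self.frames[self.mapping[vpn]][offset]

    def write(self, vpn: int, offset: int, value: int) -> None:
        if self.mapping[vpn] == ZERO_FRAME:
            # Copy-on-write: the first write gives this page a private copy.
            self.frames[self._next_frame] = bytearray(self.frames[ZERO_FRAME])
            self.mapping[vpn] = self._next_frame
            self._next_frame += 1
        self.frames[self.mapping[vpn]][offset] = value

    def private_frames(self) -> int:
        """Number of frames allocated beyond the shared zero frame."""
        return len(self.frames) - 1


if __name__ == "__main__":
    space = ToyAddressSpace(pages=4)
    print(space.private_frames())  # 0
    space.write(2, 10, 0xFF)
    print(space.read(2, 10), space.read(1, 10), space.private_frames())  # 255 0 1
```

Until the first write, four pages cost no extra memory; afterwards only the written page has its own frame, and the other pages still read zeros from the shared one.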
The change is limited to how the KVM subsystem handles the zero_page. This will not break the entire kernel if a regression occurs, only KVM.
If a regression were to occur, users could turn off KSM and disable KSM use_zero_pages until a fix is ready, as this particular use of zero_pages is limited to KSM.
The fix landed in upstream 5.6, and has not been backported to stable kernels.
I have read a bit of the paging code, especially around where the zero_page is used, and where its reference counters were being incorrectly incremented.
I think the fix is correct, and I believe it won't cause any regressions.
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in linux (Ubuntu Focal): | |
status: | New → In Progress |
Changed in linux (Ubuntu): | |
status: | Confirmed → Fix Released |
Changed in linux (Ubuntu Bionic): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Focal): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Focal): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
summary: |
- qemu instance gets paused with error: kvm run failed Bad address
+ KVM: Fix zero_page reference counter overflow when using KSM on KVM
+ compute host
description: | updated |
tags: |
added: bionic focal sts
removed: xenial
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Status changed to 'Confirmed' because the bug affects multiple users.