KVM CPU slowdown on 4.4 kernel
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
qemu-kvm (Ubuntu) | Expired | Undecided | Unassigned |
Bug Description
HP DL380 G9 Openstack (Kilo 2015.1.4) hypervisor
755GB RAM/2x18 core E5-2699v3 2.30GHz
Ubuntu 14.04.3 with 4.4.0-28-generic kernel
Qemu version 2.0.0+dfsg-
Libvirt 1.2.2-0ubuntu13
With up to about 100 instances on the hypervisor, no issues are noticed. Once the server reaches about 100 instances, the ksoftirqd processes jump to the top of the list of processes consuming CPU. Load shoots into the high 100s and low 200s, and instances become inaccessible as the hypervisor begins dropping packets (due to CPU).
Moved to Libvirt 1.2.2-0ubuntu13
This appears to have resolved the issue and the server can once again handle 150+ instances without skipping a beat.
I did find this error message repeated thousands of times in libvirtd.log, up until the kernel and package upgrade that fixed the issue:
error : qemuMonitorJSON
Let me know if I can provide extra detail
Thanks
Thank you a lot for reporting this issue.
I haven't hit it in my environment, but then I only had about 80 guests so far.
Might I ask about the sizes (memory, CPU, number of disks) of the guests and the sizing of the host you use?
For ksoftirqd, the first thing to look at is likely which softirq (and irq) is starting to rise.
http://www-05.ibm.com/de/events/linux-on-system-z/downloads/Tools-MK2-V7-Web.pdf page 79ff.
I'd recommend starting something like
mpstat -I ALL 5
And then raise your load.
Gather timestamps when the case was ok, and when it started to become broken.
sysstat data collected alongside it, with the same timestamps, can be useful as well.
Please attach that data so we can take a look.
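If mpstat is not handy, roughly the same information can be pulled from /proc/softirqs directly. The following is only a minimal sketch, assuming Python 3.6+ on the host; the function names (parse_softirqs, softirq_deltas, sample) are made up for illustration and are not part of sysstat or any other tool. It sums each softirq type over all CPUs and prints the fastest-growing ones with a timestamp, so the output can be lined up with the times when the load was still fine and when it broke.

```python
import time
from datetime import datetime


def parse_softirqs(text):
    """Return {softirq_name: count summed over all CPUs} from /proc/softirqs text."""
    totals = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header row
        name, _, counts = line.partition(":")
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def softirq_deltas(before, after):
    """Per-softirq increase between two snapshots, largest increase first."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


def sample(interval=5, path="/proc/softirqs", top=3):
    """Print timestamped softirq deltas every `interval` seconds until interrupted."""
    with open(path) as f:
        prev = parse_softirqs(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            cur = parse_softirqs(f.read())
        stamp = datetime.now().isoformat(timespec="seconds")
        busiest = softirq_deltas(prev, cur)[:top]
        print(stamp, ", ".join(f"{name}={delta}" for name, delta in busiest))
        prev = cur
```

Calling sample() in a terminal (stop it with Ctrl-C) while raising the load should show whether it is, for example, NET_RX or TIMER that starts to climb around the 100-instance mark. It only reads a small file every few seconds, so the overhead should be low.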
In general, for KVM exit issues a good thing to do is to get stats on KVM exits.
Look at this for some basic info about it: https://lwn.net/Articles/513317/
e.g. ./perf kvm stat report --event=vmexit
In your case, though, on a system of your size, gathering that can have a serious impact.
The stat mentioned above has the lowest overhead of all perf kvm actions, but be careful if you run it on a production system.