KVM CPU slowdown on 4.4 kernel

Bug #1646639 reported by Kevin Stevens
This bug affects 1 person
Affects            Status   Importance  Assigned to  Milestone
qemu-kvm (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

HP DL380 G9 OpenStack (Kilo 2015.1.4) hypervisor
755GB RAM/2x18 core E5-2699v3 2.30GHz
Ubuntu 14.04.3 with 4.4.0-28-generic kernel
Qemu version 2.0.0+dfsg-2ubuntu1.22
Libvirt 1.2.2-0ubuntu13.1.16

With up to about 100 instances on the hypervisor, no issues are noticed. Once the server reaches about 100 instances, the ksoftirqd processes jump to the top of the list of CPU consumers. Load shoots into the high 100s and low 200s, and instances become inaccessible as the hypervisor begins dropping packets (due to CPU load).

I then moved to Libvirt 1.2.2-0ubuntu13.1.17, qemu 2.0.0+dfsg-2ubuntu1.30, and kernel 3.13.0-101-generic.

This appears to have resolved the issue and the server can once again handle 150+ instances without skipping a beat.

I did find this error message repeated thousands of times in libvirtd.log, up until the kernel and package change that fixed the issue:
error : qemuMonitorJSONCheckError:354 : internal error: unable to execute QEMU command 'qom-get': guest hasn't updated any stats yet
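
To line this error up with the point where load became a problem, it may help to count occurrences per hour. The following is a minimal sketch, not something taken from this report: it assumes each libvirtd.log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, and the function names `count_by_hour` and `report` are hypothetical.

```python
from collections import Counter

# The distinctive tail of the error message quoted above.
NEEDLE = "guest hasn't updated any stats yet"


def count_by_hour(lines, needle=NEEDLE):
    """Count lines containing `needle`, keyed by 'YYYY-MM-DD HH'.

    Assumption: every line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp,
    so the first 13 characters identify the date and hour.
    """
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:13]] += 1
    return counts


def report(path):
    """Print one line per hour with the number of matching log lines."""
    with open(path, errors="replace") as f:
        for hour, n in sorted(count_by_hour(f).items()):
            print(f"{hour}:00  {n}")
```

Calling `report()` with the path to libvirtd.log (the location depends on the installation) prints an hourly count that can be compared against the times when instances became inaccessible.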

Let me know if I can provide extra detail.

Thanks

Christian Ehrhardt (paelzer) wrote:

Thank you for reporting this issue.
I haven't hit it in my environment, but then I have only had about 80 guests so far.

May I ask about the sizes (memory, CPU, number of disks) of the guests and the sizing of the host you use?

For ksoftirqd, the first thing to check is probably which softirq (and IRQ) counts are starting to rise.
I'd recommend starting something like the following (see http://www-05.ibm.com/de/events/linux-on-system-z/downloads/Tools-MK2-V7-Web.pdf page 79ff.):
  mpstat -I ALL 5
and then raising your load.
Note timestamps for when things were still OK and for when they started to break.
sysstat data covering the same timestamps can be useful as well.
Please attach that data so we can take a look.
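
If mpstat is not at hand, a rough equivalent can be sketched by sampling /proc/softirqs directly and printing timestamped per-second rates. This is only an illustrative sketch: the helper names (`parse_softirqs`, `softirq_rates`, `monitor`) are made up here, and it sums counts across all CPUs rather than reporting per CPU as mpstat does.

```python
import time
from datetime import datetime


def parse_softirqs(text):
    """Parse /proc/softirqs content into {softirq_name: total_across_cpus}."""
    totals = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header row
        name, _, counts = line.partition(":")
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def softirq_rates(before, after, interval):
    """Return the per-second rate of each softirq type between two snapshots."""
    return {k: (after[k] - before.get(k, 0)) / interval for k in after}


def read_softirqs(path="/proc/softirqs"):
    with open(path) as f:
        return parse_softirqs(f.read())


def monitor(interval=5, samples=3):
    """Print the three busiest softirq types every `interval` seconds."""
    prev = read_softirqs()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_softirqs()
        rates = softirq_rates(prev, cur, interval)
        stamp = datetime.now().isoformat(timespec="seconds")
        top = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(stamp, " ".join(f"{name}={rate:.0f}/s" for name, rate in top))
        prev = cur
```

Running `monitor()` while the load is being raised gives a timestamped view of which softirq type (for example NET_RX or TIMER) starts to climb, which can then be matched against the mpstat and sysstat data.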

In general, for KVM exit issues it is a good idea to gather statistics on KVM exits.
See https://lwn.net/Articles/513317/ for a basic introduction,
e.g. ./perf kvm stat report --event=vmexit
On a system of your size, though, gathering that data can have a serious impact.
The stat mode mentioned above has the lowest overhead of all perf kvm actions, but be careful if you run it on a production system.

Christian Ehrhardt (paelzer) wrote:

I haven't received an update, seen a similar report, or hit this in my environments, so I'm marking the bug Incomplete for now to reflect that.

Also, when reproducing, it might be interesting to run "perf top" to see what ksoftirqd is really doing (that needs some preparation, such as installing debug symbols and making perf available).

Changed in qemu-kvm (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote:

[Expired for qemu-kvm (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu-kvm (Ubuntu):
status: Incomplete → Expired