Qemu using CPU even when the VM is paused

Bug #1851062 reported by Pedro Côrte-Real
This bug affects 3 people
Affects                 Status      Importance  Assigned to  Milestone
virt-manager            Won't Fix   Undecided
qemu (Ubuntu)           Confirmed   Undecided   Unassigned
virt-manager (Ubuntu)   Confirmed   Undecided   Unassigned

Bug Description

I run a Windows 10 VM for work and sometimes pause it when I'm not using it. However even when paused it keeps using ~15% of CPU. I'm running it with virt-manager and that's what I'm using to pause it.

ProblemType: Bug
DistroRelease: Ubuntu 19.04
Package: qemu-system-x86 1:3.1+dfsg-2ubuntu3.5
ProcVersionSignature: Ubuntu 5.0.0-31.33-generic 5.0.21
Uname: Linux 5.0.0-31-generic x86_64
ApportVersion: 2.20.10-0ubuntu27.2
Architecture: amd64
CurrentDesktop: Unity
Date: Sat Nov 2 18:12:11 2019
InstallationDate: Installed on 2019-05-09 (176 days ago)
InstallationMedia: Ubuntu 19.04 "Disco Dingo" - Release amd64 (20190416)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 138a:0090 Validity Sensors, Inc. VFS7500 Touch Fingerprint Sensor
 Bus 001 Device 003: ID 5986:0706 Acer, Inc
 Bus 001 Device 002: ID 8087:0a2b Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 20FAS1DA00
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.0.0-31-generic root=UUID=3819189f-6ddf-4730-b629-f67c85b114a4 ro quiet splash vt.handoff=1
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/19/2019
dmi.bios.vendor: LENOVO
dmi.bios.version: N1CET75W (1.43 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20FAS1DA00
dmi.board.vendor: LENOVO
dmi.board.version: 0B98417 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN1CET75W(1.43):bd04/19/2019:svnLENOVO:pn20FAS1DA00:pvrThinkPadT460s:rvnLENOVO:rn20FAS1DA00:rvr0B98417WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad T460s
dmi.product.name: 20FAS1DA00
dmi.product.sku: LENOVO_MT_20FA_BU_Think_FM_ThinkPad T460s
dmi.product.version: ThinkPad T460s
dmi.sys.vendor: LENOVO

Revision history for this message
Pedro Côrte-Real (pedrocr) wrote :
Christian Ehrhardt  (paelzer) wrote :

A paused guest still can have some action in I/O threads or UI, but 15% seems a lot.

None of the automatic logs helps for this.
You should gather some metrics on which thread is using the CPU and where.
  pidstat -p $(pidof qemu-system-x86_64) -T ALL -rtuw 5 5

Report that back here on the bug to discuss what it might be.
In a check with a paused Linux guest, usage stays <1% for me.

If you have more guests up identify the PID of the one in question and use it instead of `$(pidof qemu-system-x86_64)`
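
With several guests running, the PID of the one in question can be picked out of the process list. A minimal sketch, assuming the guest was started by libvirt so that its QEMU command line contains `-name guest=<name>,...`; the function name `pick_guest_pid` and the guest name `win10` are made up for illustration:

```shell
# Pick the PID of one guest from `pgrep -a`-style lines ("PID command line...").
# Assumption: libvirt passed "-name guest=<name>,..." to QEMU and the guest
# name contains no whitespace.
pick_guest_pid() {
    guest="$1"
    awk -v g="$guest" '
        {
            for (i = 2; i <= NF; i++) {
                # match "guest=<name>" exactly, or followed by a comma
                if ($i == ("guest=" g) || index($i, "guest=" g ",") == 1) {
                    print $1
                    exit
                }
            }
        }'
}

# Usage (hypothetical guest name "win10"):
#   pid=$(pgrep -a qemu-system-x86 | pick_guest_pid win10)
#   pidstat -p "$pid" -T ALL -rtuw 5 5
```

If the command line looks different on your system, inspect the `pgrep -a` output by hand and pass the PID directly.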

Christian Ehrhardt  (paelzer) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

Since there isn't enough information in your report to differentiate between a local configuration problem and a bug in Ubuntu, I'm marking this bug as Incomplete.

If indeed this is a local configuration problem, you can find pointers to get help for this sort of problem here: http://www.ubuntu.com/support/community

Or if you believe that this is really a bug, then you may find it helpful to read "How to report bugs effectively" http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. We'd be grateful if you would then provide a more complete description of the problem, explain why you believe this is a bug in Ubuntu rather than a problem specific to your system, and then change the bug status back to New.

Changed in qemu (Ubuntu):
status: New → Incomplete
Pedro Côrte-Real (pedrocr) wrote :

I can't replicate this on demand. Doing it now only gave me 3% CPU usage. I'll keep an eye on this and run this diagnostic when it's happening again.

Pedro Côrte-Real (pedrocr) wrote :

Here's an example where the VM was at 100% CPU when I paused it so that it wouldn't consume as much CPU while I wasn't using it. After pausing, it is now at 30% CPU usage continuously even though it's paused.

Pedro Côrte-Real (pedrocr) wrote :

Unpausing the VM, letting it get to a lower level of CPU usage and then pausing again brings qemu to the more usual ~2% of continuous CPU usage. Which still seems high but isn't as bad.

Christian Ehrhardt  (paelzer) wrote :

Interesting, so the main qemu in the host consumes ~17% user and ~18% system while this is going on. And this goes on for at least the 15 seconds we traced things (and from your report I assume it keeps going on).

We also see context switches, but no faults - this really just seems to "do something".
Very odd and very unexpected @pedro.

No time is spent in the guest, matching what you'd expect after pausing.
I'd understand some buffer flushes or cleanups, but those should stop after a while.
But if you say this "keeps going", I agree that it feels odd.

With these logs I'd at least call it confirmed for now.
But it seems we need to go step by step, analyzing what is going on.

The next obvious step once you again see such a case would be to check "what" the qemu is doing on the host.

I'm not sure how experienced you are with profiling workloads.
If you are, please just dive deep and report what you find.
If you are not, I can give suggestions at every step, but we will have to ping-pong this back and forth until we have something.

When in this paused-but-busy state you could next time run:
$ sudo strace -rT -ff -f -o paused-but-busy -p $(pidof qemu-system-x86_64)
Please at the same time also run the other one you already used so that we can match thread IDs:
$ pidstat -p $(pidof qemu-system-x86_64) -T ALL -rtuw 5 5

Also with the guest paused I'd expect the following not to move:
for i in $(seq 1 5); do virsh domstats "<yourguestname>"; sleep 5; done
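
Whether the stats "move" can also be checked mechanically by comparing one counter between two samples. A rough sketch, assuming `virsh domstats` prints a `cpu.time=<nanoseconds>` line for the domain; the helper names are made up for illustration, and which counters actually change in the paused-but-busy state is not established here:

```shell
# Print the cpu.time value (nanoseconds) from `virsh domstats` output on stdin.
domstats_cpu_time() {
    sed -n 's/^[[:space:]]*cpu\.time=//p' | head -n 1
}

# Succeed (exit 0) only if both samples are non-empty and identical.
cpu_time_unchanged() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (replace <yourguestname> as above):
#   t0=$(virsh domstats "<yourguestname>" | domstats_cpu_time); sleep 5
#   t1=$(virsh domstats "<yourguestname>" | domstats_cpu_time)
#   cpu_time_unchanged "$t0" "$t1" && echo "cpu.time did not move" \
#                                  || echo "cpu.time moved: $t0 -> $t1"
```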

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Pedro Côrte-Real (pedrocr) wrote :

As far as I can tell qemu will stay at 30% indefinitely in these situations. I'll produce more diagnostics when I can reproduce it again. One possible clue is that this may be happening when the VM is waiting for network responses. I'm not certain of that but it definitely doesn't happen just because the VM had high cpu usage. In other situations the VM was at 100% CPU usage (or 200% since it has two cores) and pausing it dropped it down to ~3%. The 3% are also continuous though and don't disappear after a while. That also seems odd to me.

Kevin Locke (kevinoid) wrote :

I just ran into what I think is the same issue. Windows 10 VM controlled from virt-manager using ~30% CPU while paused.

I'm attaching the output of
pidstat -p $(pidof qemu-system-x86_64) -T ALL -rtuw 5 5
and the first thousand lines of
strace -rT -ff -f -o paused-but-busy -p $(pidof qemu-system-x86_64)
that you requested from the other reporter. As a non-expert, the repeated calls to ioctl(10, KVM_IRQ_LINE_STATUS, ...)+ppoll() timeout seem suggestive to me.

Note: I'm omitting strace for all threads except the one consuming the CPU, since they each contain a single futex() call. If you'd like to see the futex addr/val, let me know.

Thanks for investigating,
Kevin

Full disclosure: I'm using Debian rather than Ubuntu. If you'd like me to repro using the Ubuntu package versions (or on a fully Ubuntu system) I can give it a shot. Currently using qemu-system-x86 1:5.0-5, libvirt 6.0.0-6, and virt-manager 1:2.2.1-4.
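
A repeating pattern such as the ioctl()+ppoll() loop mentioned above can be confirmed by counting syscalls in the per-thread strace output. A minimal sketch, assuming the files were written with `strace -rT -ff -o paused-but-busy` so that each line starts with a relative timestamp followed by `name(args...)`; the function name `syscall_histogram` is made up for illustration:

```shell
# Count syscalls by name in strace -r output (files as arguments, or stdin).
# Field 1 is the relative timestamp from -r; field 2 begins with "name(".
syscall_histogram() {
    awk '{
             name = $2
             sub(/\(.*$/, "", name)          # keep only the text before "("
             if (name ~ /^[a-z_0-9]+$/)      # skip "+++ exited" style lines
                 n[name]++
         }
         END { for (s in n) print n[s], s }' "$@" | sort -rn
}

# Usage, where <tid> is the busy thread ID reported by pidstat:
#   syscall_histogram paused-but-busy.<tid> | head
```

This only counts names; it does not distinguish ioctl request codes, so the KVM_IRQ_LINE_STATUS calls still need to be checked in the raw file.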

Kevin Locke (kevinoid) wrote :
Christian Ehrhardt  (paelzer) wrote :

That is great data to have @Kevin, thank you.
There is no problem at all that you've done that with slightly different versions.

The loop around KVM_IRQ_LINE_STATUS really seems interesting and certainly will help to identify things. In fact with that I already found:
- https://bugzilla.redhat.com/show_bug.cgi?id=1638289
- https://www.spinics.net/lists/kvm/msg157148.html

All of them unfortunately ended with the assumption that Windows programs timers badly and that this is just a consequence of it.
I still can't reproduce this on my side to take a look myself, but seeing that the upstream experts have already looked at it and didn't have more ideas indicates it will be a complex case for them anyway.
The bug is against Fedora and not upstream kvm/qemu (mostly the same people), which is why it was closed as EOL.
It seems to me we need the main developers back on it to really get a grip on it. Maybe you want to rekindle this by replying to the old mail thread with something like "this still is an issue", or open an upstream bug.

But OTOH that might just be an interaction between Windows expecting to run on real HW and pausing a VM. https://www.spinics.net/lists/kvm/msg157371.html summarizes that nicely.
So I'm unsure how much more will happen bringing it up again.

Pedro Côrte-Real (pedrocr) wrote :

Annoyingly (for solving the bug but not for me) I haven't been able to reproduce this in quite a while, which is why I never posted any more information. I was hoping it was fixed but apparently it's just hard to reproduce.

It seems odd that the problem could be on the Windows side. I'm assuming there's no VM code actually running when the VM is paused so the hypervisor needs to be to blame, no?

Christian Ehrhardt  (paelzer) wrote :

Hi Pedro,
it is more like this: while still running, the guest programs the RTC timers; later on, when the guest is paused, no guest code runs, but this bit seems to be something the host can't pause, so it continues. I'm not deep enough into (virt) timer programming to explain it further, but that is how it overall looks to me.

If you follow the threads and bug I linked, the discussions there are about e.g. changing devices to cause less of that and/or considering IRQ storming it when it wakes up (but that seemed to be against some specification).

If the guest goes into a proper pause state that it is aware of, like S3/S4, it will most likely disable these timers. So it might also depend on how exactly you pause things (guest-aware or not, for example).

In , kevin (kevin-redhat-bugs-1) wrote :

Description of problem:

virt-install currently enables hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff for guests which support Hyper-V. There are several additional Hyper-V enlightenments supported by QEMU and libvirt (see https://github.com/qemu/qemu/blob/master/docs/hyperv.txt and https://libvirt.org/formatdomain.html#elementsFeatures) which could be useful to enable. In particular, hv_stimer significantly reduces CPU usage when supporting VMs are paused (see https://lore.kernel.org/kvm/20200625201046.GA179502@kevinolos/).

Version-Release number of selected component (if applicable): 2.2.1/master

How reproducible: 100%

Steps to Reproduce:
1. virt-install --os-variant win10 --print-xml

Actual results:
<hyperv>
  <relaxed state="on"/>
  <vapic state="on"/>
  <spinlocks state="on" retries="8191"/>
</hyperv>

Expected results:
Something more like:
<hyperv>
  <relaxed state='on'/>
  <vapic state='on'/>
  <spinlocks state="on" retries="8191"/>
  <vpindex state='on'/>
  <runtime state='on'/>
  <synic state='on'/>
  <stimer state='on'/>
  <frequencies state='on'/>
  <tlbflush state='on'/>
  <ipi state='on'/>
</hyperv>

Additional info:
I'm not advocating for any specific Hyper-V enlightenments to be enabled, but I suspect the current defaults are sub-optimal and I would be willing to help improve them. Is there a process for evaluating which to enable, or is it simply a matter of adding defaults for features which weren't available previously?

Thanks,
Kevin

Kevin Locke (kevinoid) wrote :

I re-opened the discussion that @Christian found on the kvm mailing list and Paolo Bonzini helped identify a fix: enabling the Hyper-V hv_stimer enlightenment. <https://lore.kernel.org/kvm/20200625201046.GA179502@kevinolos/> Although it may not fix the issue for guests which don't support that enlightenment, it works well for my Windows 10 guest, which now has negligible CPU usage when paused.

I've also opened https://bugzilla.redhat.com/1851244 to discuss improving the virt-manager/virt-install defaults for Windows guests.
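
To see whether an existing guest already has the enlightenment enabled, its libvirt XML can be checked for the corresponding element. A small sketch, assuming the domain XML uses elements of the form `<stimer state='on'/>` inside `<hyperv>` as in the libvirt documentation; the function name `hyperv_feature_on` and the guest name `win10` are made up for illustration:

```shell
# Succeed (exit 0) if the domain XML on stdin enables the given <hyperv>
# feature, accepting either single or double quotes around the state value.
hyperv_feature_on() {
    feature="$1"
    grep -Eq "<${feature}[[:space:]]+state=['\"]on['\"]"
}

# Usage (hypothetical guest name "win10"):
#   virsh dumpxml win10 | hyperv_feature_on stimer \
#       && echo "stimer enabled" || echo "stimer not enabled"
```

As far as I know, QEMU documents hv-stimer as depending on hv-vpindex, hv-synic and hv-time, so it may be worth checking `vpindex` and `synic` the same way before editing the domain with `virsh edit`.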

Changed in virt-manager (Ubuntu):
status: New → Confirmed
Pedro Côrte-Real (pedrocr) wrote :

Thanks for pursuing this. I tried those settings and my paused VM now uses less than 1% CPU when before it used around 3%. So although I haven't been able to reproduce the ~30% CPU usage of before it was still an improvement. It's still odd to me that a paused VM consumes any CPU at all but at least this is a bit better.

Kevin Locke (kevinoid) wrote :

I agree that QEMU shouldn't use much CPU when paused, regardless of VM settings. For reference, I narrowed down a (more) minimal test case:

Using the Windows 10 May 2020 English 64-bit ISO from https://www.microsoft.com/en-us/software-download/windows10ISO

qemu-system-x86_64 \
 -no-user-config \
 -machine pc-q35-5.0,accel=kvm \
 -m 1024 \
 -blockdev driver=file,node-name=win10iso,filename=Win10_2004_English_x64.iso \
 -device ide-cd,drive=win10iso \
 -no-hpet \
 -rtc driftfix=slew

Then pause the VM after the "Windows Setup" window appears; qemu-system-x86_64 uses ~40% CPU. Without -rtc driftfix=slew, ~10%. Without -no-hpet, ~1%.
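
Percentages like these can be measured without extra tools by sampling the process's CPU ticks from /proc twice. A rough sketch, assuming a Linux host where fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks; the helper names are made up for illustration and this is not necessarily how the numbers above were obtained:

```shell
# Sum utime+stime (clock ticks) from one /proc/<pid>/stat line on stdin.
# The comm field may contain spaces, so strip everything up to the last ") ";
# after that, utime and stime are fields 12 and 13.
stat_cpu_ticks() {
    sed 's/^.*) //' | awk '{ print $12 + $13 }'
}

# Integer CPU percentage from two tick samples taken `interval` seconds apart.
# hz is the tick rate reported by `getconf CLK_TCK` (commonly 100).
cpu_percent() {
    before="$1"; after="$2"; interval="$3"; hz="$4"
    echo $(( (after - before) * 100 / (interval * hz) ))
}

# Usage:
#   pid=$(pidof qemu-system-x86_64)
#   t0=$(stat_cpu_ticks < "/proc/$pid/stat"); sleep 5
#   t1=$(stat_cpu_ticks < "/proc/$pid/stat")
#   cpu_percent "$t0" "$t1" 5 "$(getconf CLK_TCK)"
```

The same functions work per thread by reading /proc/<pid>/task/<tid>/stat instead.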

I added it to the ML discussion this morning to see if there was interest in fixing the issue, but so far no takers: https://lore.kernel.org/kvm/20200626151432.GA231100@kevinolos/

In , crobinso (crobinso-redhat-bugs) wrote :

Thanks for the report. This will take a discussion with qemu developers to understand what it will take to safely enable more options, whether there are tradeoffs, etc.

We are moving to using github issues for upstream virt-manager, so I moved this issue there: https://github.com/virt-manager/virt-manager/issues/154

Changed in virt-manager:
importance: Unknown → Undecided
status: Unknown → Won't Fix