qemu-1.4.0 and onwards, Linux kernel 3.2.x: heavy I/O leads to kernel hung_task_timeout_secs messages and an unresponsive qemu process
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
QEMU | Invalid | Undecided | Unassigned |
Bug Description
Hi,
after some testing I tried to narrow down a problem, which was initially reported by some users.
Seen on different distros - Debian 7.1, Ubuntu 12.04 LTS, IPFire-2.3 as reported so far.
All are using some flavour of a linux-3.2.x kernel.
Under Ubuntu, for example, an upgrade to "Linux 3.8.0-27-generic x86_64" solves the problem.
The problem can be triggered with a workload such as:
spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
and, in parallel, some apt-get install/remove/whatever.
That results in a somewhat stuck qemu-session with the bad "kernel_hung_task ..." messages.
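For readers trying to reproduce this where spew is not installed, a rough stand-in for the write load can be sketched with dd. This is a hypothetical substitute, not the reporter's tool: it writes sequentially rather than with spew's random 4k pattern, and is scaled down to 1 MiB here.

```shell
# Rough, scaled-down stand-in for the spew workload above (assumption:
# GNU coreutils dd; sequential 4k writes, not spew's random offsets).
# Scale count up (e.g. count=262144 for 1G) to approach the original load.
dd if=/dev/urandom of=/tmp/doof.dat bs=4k count=256 conv=fsync 2>/dev/null
ls -l /tmp/doof.dat
```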
A typical command-line is as follows:
/usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-kvm \
  -daemonize -pidfile /var/run/qemu-server/760.pid \
  -monitor unix:/var/run/qemu-server/760.mon,server,nowait \
  -vnc unix:/var/run/qemu-server/760.vnc,password \
  -qmp unix:/var/run/qemu-server/760.qmp,server,nowait \
  -nodefaults -serial none -parallel none \
  -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 \
  -netdev type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh \
  -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1 \
  -device virtio-blk-pci,drive=virtio0 \
  -drive format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native \
  -drive format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native \
  -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on \
  -drive if=ide,media=cdrom,id=ide1-cd1,readonly=on \
  -boot order=dc
No "system_reset", "sendkey ctrl-alt-delete" or "q" in the monitor session is accepted; I need to hard-kill the process.
Please give any advice on what to do for tracing/debugging, because the number of tickets here is rising, and no one knows what users are doing inside their VMs.
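For anyone reproducing the hang, the monitor commands mentioned above can be sent over the human-monitor UNIX socket from a shell, e.g. with socat. This is a sketch: the socket path below matches the `-monitor` option in the example command line and will differ for your VM.

```shell
# Sketch: send "system_reset" to the QEMU human monitor socket.
# The path matches the -monitor unix:... option in the command line above;
# adjust it for your VM. Requires socat to be installed.
MON=/var/run/qemu-server/760.mon
if [ -S "$MON" ]; then
    echo "system_reset" | socat - "UNIX-CONNECT:$MON"
else
    echo "monitor socket $MON not found; is the VM running?"
fi
```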
Kind regards,
Oliver Francke.
On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>
> All using some flavour of linux-3.2.x kernel.
>
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.
Is that a guest kernel upgrade?
> The problem can be triggered with a workload such as:
>
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and, in parallel, some apt-get install/remove/whatever.
>
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task ..." messages.
>
> A typical command-line is as follows:
>
> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-kvm \
>   -daemonize -pidfile /var/run/qemu-server/760.pid \
>   -monitor unix:/var/run/qemu-server/760.mon,server,nowait \
>   -vnc unix:/var/run/qemu-server/760.vnc,password \
>   -qmp unix:/var/run/qemu-server/760.qmp,server,nowait \
>   -nodefaults -serial none -parallel none \
>   -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 \
>   -netdev type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh \
>   -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1 \
>   -device virtio-blk-pci,drive=virtio0 \
>   -drive format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native \
>   -drive format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native \
>   -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on \
>   -drive if=ide,media=cdrom,id=ide1-cd1,readonly=on \
>   -boot order=dc
>
> No "system_reset", "sendkey ctrl-alt-delete" or "q" in the monitor
> session is accepted; I need to hard-kill the process.
Yesterday I saw a possibly related report on IRC. It was a Windows
guest running under OpenStack with images on Ceph.
They reported that the QEMU process would lock up - ping would not work
and their management tools showed 0 CPU activity for the guest.
However, they were able to "kick" the guest by taking a VNC screenshot
(I think). Then it would come back to life.
If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.
Please confirm that the hung task message is from inside the guest.
If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.
Stefan
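As a concrete way to answer Stefan's question about where the hung-task message originates, it can be checked from inside the guest. This is a generic sketch, not part of the original thread; the dmesg wording and the sysctl value depend on the guest kernel.

```shell
# Inside the guest: look for the kernel's hung-task warnings
# ("INFO: task ... blocked for more than N seconds") and show the
# timeout that triggers them (120s by default on most kernels).
dmesg 2>/dev/null | grep -i "blocked for more than" \
    || echo "no hung-task messages in dmesg"
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
    || echo "hung_task_timeout_secs not available"
```

If the message appears in the guest's dmesg but the host's kernel log is clean, the hang is being observed from inside the guest, matching the scenario Stefan describes.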