QEMU

Bug #1207686
Comment #1

Comment 1 for bug 1207686

Revision history for this message

Stefan Hajnoczi (stefanha) wrote on 2013-08-02: Re: [Qemu-devel] [Bug 1207686] [NEW] qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>
> All using some flavour of linux-3.2.x kernel.
>
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.

Is that a guest kernel upgrade?

> Problem could be triggert with some workload ala:
>
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and in parallel do some apt-get install/remove/whatever.
>
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task..." messages.
>
> A typical command-line is as follows:
>
> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
> server/760.vnc,password -qmp unix:/var/run/qemu-
> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
> -device virtio-blk-pci,drive=virtio0 -drive
> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
> -drive
> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
>
> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
> session is accepted, need to hard-kill the process.

Yesterday I saw a possibly related report on IRC. It was a Windows
guest running under OpenStack with images on Ceph.

They reported that the QEMU process would lock up - ping would not work
and their management tools showed 0 CPU activity for the guest.
However, they were able to "kick" the guest by taking a VNC screenshot
(I think). Then it would come back to life.

If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.

Please confirm that the hung task message is from inside the guest.

If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.

Stefan

On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
> 
> All using some flavour of linux-3.2.x kernel.
> 
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.

Is that a guest kernel upgrade?

> Problem could be triggert with some workload ala:
> 
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and in parallel do some apt-get install/remove/whatever.
> 
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task..." messages.
> 
> A typical command-line is as follows:
> 
> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
> server/760.vnc,password -qmp unix:/var/run/qemu-
> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
> -device virtio-blk-pci,drive=virtio0 -drive
> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
> -drive
> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
> 
> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
> session is accepted, need to hard-kill the process.

Yesterday I saw a possibly related report on IRC.  It was a Windows
guest running under OpenStack with images on Ceph.

If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.

Please confirm that the hung task message is from inside the guest.

If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.

Stefan