qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

Bug #1207686 reported by Oliver Francke
Affects: QEMU
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

after some testing I tried to narrow down a problem which was initially reported by some users.
It has been seen on different distros - Debian 7.1, Ubuntu 12.04 LTS, IPFire-2.3 - as reported so far.

All of them run some flavour of a linux-3.2.x kernel.

Under Ubuntu, for example, an upgrade to "Linux 3.8.0-27-generic x86_64" solves the problem.
The problem can be triggered with a workload like:

spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
and in parallel do some apt-get install/remove/whatever.
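Put together, a rough reproduction inside the guest could look like this (spew has to be installed in the guest; the package names and target path are only examples):

# heavy random 4k I/O in the background (same spew invocation as above)
spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat &

# keep dpkg/apt busy at the same time
while true; do apt-get install -y --reinstall bash coreutils; done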

That results in a more or less stuck qemu session with the dreaded "kernel_hung_task..." messages.

A typical command-line is as follows:

/usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-server/760.vnc,password -qmp unix:/var/run/qemu-server/760.qmp,server,nowait -nodefaults -serial none -parallel none -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1 -device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native -drive format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc

No "system_reset", "sendkey ctrl-alt-delete" or "q" in the monitor session is accepted; I need to hard-kill the process.
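For reference, the monitor in this setup is reached over the UNIX socket given on the command line above, e.g. with socat; the socket and pid-file paths below are the ones from this example:

socat - UNIX-CONNECT:/var/run/qemu-server/760.mon    # then type e.g. system_reset or quit
kill -9 "$(cat /var/run/qemu-server/760.pid)"        # the only thing that actually gets rid of the stuck guest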

Please give any advice on what to do for tracing/debugging, because the number of tickets here is rising, and no one knows what users are doing inside their VMs.

Kind regards,

Oliver Francke.

Revision history for this message
Stefan Hajnoczi (stefanha) wrote : Re: [Qemu-devel] [Bug 1207686] [NEW] qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>
> All using some flavour of linux-3.2.x kernel.
>
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.

Is that a guest kernel upgrade?

> Problem could be triggert with some workload ala:
>
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and in parallel do some apt-get install/remove/whatever.
>
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task..." messages.
>
> A typical command-line is as follows:
>
> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
> server/760.vnc,password -qmp unix:/var/run/qemu-
> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
> -device virtio-blk-pci,drive=virtio0 -drive
> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
> -drive
> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
>
> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
> session is accepted, need to hard-kill the process.

Yesterday I saw a possibly related report on IRC. It was a Windows
guest running under OpenStack with images on Ceph.

They reported that the QEMU process would lock up - ping would not work
and their management tools showed 0 CPU activity for the guest.
However, they were able to "kick" the guest by taking a VNC screenshot
(I think). Then it would come back to life.

If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.

Please confirm that the hung task message is from inside the guest.
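A quick way to check, assuming you have console or SSH access to the guest, is to look at the guest's kernel log:

dmesg | grep -i "blocked for more than"        # the hung-task warnings come from the guest kernel
cat /proc/sys/kernel/hung_task_timeout_secs    # the timeout that triggers them (120s by default)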

If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.
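One way to do that, assuming enough local disk space, is to copy the RBD image to a local file and boot from that instead (pool and image name taken from your command line; the target path is only an example):

qemu-img convert -p -f raw rbd:1155823384/vm-760-disk-1.rbd -O qcow2 /var/lib/images/vm-760-disk-1.qcow2
# then replace the rbd: drive with something like
#   -drive format=qcow2,file=/var/lib/images/vm-760-disk-1.qcow2,if=none,id=virtio0,cache=writeback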

Stefan

Revision history for this message
Oliver Francke (oliver-francke) wrote :

Hi Stefan,

On 02.08.2013 at 17:24, Stefan Hajnoczi <email address hidden> wrote:

> On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
>> after some testing I tried to narrow down a problem, which was initially reported by some users.
>> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>>
>> All using some flavour of linux-3.2.x kernel.
>>
>> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.
>
> Is that a guest kernel upgrade?

yeah, sorry if that was not clear enough.

>
>> Problem could be triggert with some workload ala:
>>
>> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
>> and in parallel do some apt-get install/remove/whatever.
>>
>> That results in a somewhat stuck qemu-session with the bad
>> "kernel_hung_task..." messages.
>>
>> A typical command-line is as follows:
>>
>> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
>> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
>> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
>> server/760.vnc,password -qmp unix:/var/run/qemu-
>> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
>> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
>> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
>> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
>> -device virtio-blk-pci,drive=virtio0 -drive
>> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
>> -drive
>> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
>> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
>> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
>>
>> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
>> session is accepted, need to hard-kill the process.
>
> Yesterday I saw a possibly related report on IRC. It was a Windows
> guest running under OpenStack with images on Ceph.
>
> They reported that the QEMU process would lock up - ping would not work
> and their management tools showed 0 CPU activity for the guest.
> However, they were able to "kick" the guest by taking a VNC screenshot
> (I think). Then it would come back to life.
>
> If you have a Linux guest that is reporting kernel_hung_task, then it
> could be a similar scenario.
>
> Please confirm that the hung task message is from inside the guest.
>

confirmed.

> If you are able to reproduce this and have an alternative non-Ceph
> storage pool, please try that since Ceph is common to both these bug
> reports.
>

I can reproduce it with: kernel 3.2.something + qemu-1.[456] (never spent much time on 1.3) and high I/O.
Later that day I took this VM and converted it to local qcow2 storage - no problem with any kernel. I have already asked on the ceph-users list for assistance, especially from Josh (if he's not on summer holiday ;) )

What is strange: I have a session open via the VNC console and a loop running like:
...


Revision history for this message
Oliver Francke (oliver-francke) wrote :

Hi,

I opened a ticket with the Ceph guys, and it turned out to be a bug in "librados aio flush".

With latest "wip-librados-aio-flush (bobtail)" I got no error even with _very_ high load.
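For anyone hitting the same symptoms, it may be worth checking which librados the QEMU binary is actually linked against, for example:

ldd /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 | grep librados
ceph --version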

Thnx for the attention ;)

Oliver.

Revision history for this message
Thomas Huth (th-huth) wrote :

Closing as "Invalid" since this was not a QEMU bug according to comment #3.

Changed in qemu:
status: New → Invalid