qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

Bug #1207686 reported by Oliver Francke
Affects: QEMU
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

after some testing I tried to narrow down a problem which was initially reported by some users.
It has been seen on different distros - Debian 7.1, Ubuntu 12.04 LTS, IPFire-2.3 - as reported so far.

All of them run some flavour of a linux-3.2.x kernel.

Under Ubuntu, for example, an upgrade to "Linux 3.8.0-27-generic x86_64" solves the problem.
The problem can be triggered with a workload like:

spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
and in parallel do some apt-get install/remove/whatever.
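Put together, a rough reproduction inside the guest could look like this (spew has to be installed in the guest; the package names and target path are only examples):

# heavy random 4k I/O in the background (same spew invocation as above)
spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat &

# keep dpkg/apt busy at the same time
while true; do apt-get install -y --reinstall bash coreutils; done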

That results in a more or less stuck qemu session with the dreaded "kernel_hung_task..." messages.

A typical command-line is as follows:

/usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-server/760.vnc,password -qmp unix:/var/run/qemu-server/760.qmp,server,nowait -nodefaults -serial none -parallel none -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1 -device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native -drive format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc

No "system_reset", "sendkey ctrl-alt-delete" or "q" in the monitor session is accepted; I need to hard-kill the process.
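For reference, the monitor in this setup is reached over the UNIX socket given on the command line above, e.g. with socat; the socket and pid-file paths below are the ones from this example:

socat - UNIX-CONNECT:/var/run/qemu-server/760.mon    # then type e.g. system_reset or quit
kill -9 "$(cat /var/run/qemu-server/760.pid)"        # the only thing that actually gets rid of the stuck guest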

Please give any advice on what to do for tracing/debugging, because the number of tickets here is rising, and no one knows what users are doing inside their VMs.

Kind regards,

Oliver Francke.

Revision history for this message
Stefan Hajnoczi (stefanha) wrote : Re: [Qemu-devel] [Bug 1207686] [NEW] qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process

On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>
> All using some flavour of linux-3.2.x kernel.
>
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.

Is that a guest kernel upgrade?

> Problem could be triggert with some workload ala:
>
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and in parallel do some apt-get install/remove/whatever.
>
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task..." messages.
>
> A typical command-line is as follows:
>
> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
> server/760.vnc,password -qmp unix:/var/run/qemu-
> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
> -device virtio-blk-pci,drive=virtio0 -drive
> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
> -drive
> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
>
> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
> session is accepted, need to hard-kill the process.

Yesterday I saw a possibly related report on IRC. It was a Windows
guest running under OpenStack with images on Ceph.

They reported that the QEMU process would lock up - ping would not work
and their management tools showed 0 CPU activity for the guest.
However, they were able to "kick" the guest by taking a VNC screenshot
(I think). Then it would come back to life.

If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.

Please confirm that the hung task message is from inside the guest.
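A quick way to check, assuming you have console or SSH access to the guest, is to look at the guest's kernel log:

dmesg | grep -i "blocked for more than"        # the hung-task warnings come from the guest kernel
cat /proc/sys/kernel/hung_task_timeout_secs    # the timeout that triggers them (120s by default)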

If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.
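One way to do that, assuming enough local disk space, is to copy the RBD image to a local file and boot from that instead (pool and image name taken from your command line; the target path is only an example):

qemu-img convert -p -f raw rbd:1155823384/vm-760-disk-1.rbd -O qcow2 /var/lib/images/vm-760-disk-1.qcow2
# then replace the rbd: drive with something like
#   -drive format=qcow2,file=/var/lib/images/vm-760-disk-1.qcow2,if=none,id=virtio0,cache=writeback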

Stefan

Revision history for this message
Oliver Francke (oliver-francke) wrote :

Hi Stefan,

On 02.08.2013 at 17:24, Stefan Hajnoczi <email address hidden> wrote:

> On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
>> after some testing I tried to narrow down a problem, which was initially reported by some users.
>> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>>
>> All using some flavour of linux-3.2.x kernel.
>>
>> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.
>
> Is that a guest kernel upgrade?

yeah, sorry if that was not clear enough.

>
>> Problem could be triggert with some workload ala:
>>
>> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
>> and in parallel do some apt-get install/remove/whatever.
>>
>> That results in a somewhat stuck qemu-session with the bad
>> "kernel_hung_task..." messages.
>>
>> A typical command-line is as follows:
>>
>> /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-
>> kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor
>> unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-
>> server/760.vnc,password -qmp unix:/var/run/qemu-
>> server/760.qmp,server,nowait -nodefaults -serial none -parallel none
>> -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev
>> type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh
>> -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1
>> -device virtio-blk-pci,drive=virtio0 -drive
>> format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native
>> -drive
>> format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native
>> -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive
>> if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
>>
>> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
>> session is accepted, need to hard-kill the process.
>
> Yesterday I saw a possibly related report on IRC. It was a Windows
> guest running under OpenStack with images on Ceph.
>
> They reported that the QEMU process would lock up - ping would not work
> and their management tools showed 0 CPU activity for the guest.
> However, they were able to "kick" the guest by taking a VNC screenshot
> (I think). Then it would come back to life.
>
> If you have a Linux guest that is reporting kernel_hung_task, then it
> could be a similar scenario.
>
> Please confirm that the hung task message is from inside the guest.
>

confirmed.

> If you are able to reproduce this and have an alternative non-Ceph
> storage pool, please try that since Ceph is common to both these bug
> reports.
>

I can reproduce it with: kernel 3.2.something + qemu-1.[456] (never spent much time on 1.3) and high I/O.
Later that day I took this VM and converted it to local qcow2 storage - no problem with any kernel. I have already asked on the ceph-users list for assistance, especially from Josh (if he's not on summer holiday ;) )

What is strange: I have a session open via the VNC console and a loop running like:
...


Revision history for this message
Oliver Francke (oliver-francke) wrote :

Hi,

I opened a ticket with the Ceph guys, and it turned out to be a bug in "librados aio flush".

With latest "wip-librados-aio-flush (bobtail)" I got no error even with _very_ high load.
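For anyone hitting the same symptoms, it may be worth checking which librados the QEMU binary is actually linked against, for example:

ldd /usr/local/qemu-1.6.0/bin/qemu-system-x86_64 | grep librados
ceph --version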

Thnx for the attention ;)

Oliver.

Revision history for this message
Thomas Huth (th-huth) wrote :

Closing as "Invalid" since this was not a QEMU bug according to comment #3.

Changed in qemu:
status: New → Invalid