qemu-kvm crashes in qcow2 code

Bug #1223907 reported by Dr. Stefan Schimanski
This bug affects 1 person
Affects              Status         Importance   Assigned to   Milestone
qemu-kvm (Fedora)    Fix Released   Medium
qemu-kvm (Ubuntu)    Fix Released   High         Unassigned
  Precise            Won't Fix      High         Unassigned

Bug Description

qemu-kvm machines die randomly when under heavy load.
(see also possibly similar RHEL bug: https://bugzilla.redhat.com/show_bug.cgi?id=812705)
Note that we have a qcow2 cluster size of 64k.
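
The cluster size is relevant because it determines how much of the virtual disk a single L2 (second-level mapping) table covers. The sketch below works that out for the 64k clusters used here and for the 4096-byte clusters used in the RHEL report. It assumes the standard qcow2 layout, where an L2 table occupies exactly one cluster and each entry is 8 bytes; the function name is made up for illustration.

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry (assumption: standard qcow2 layout)


def l2_table_coverage(cluster_size: int) -> int:
    """Return the bytes of guest address space mapped by one L2 table."""
    # One L2 table fills exactly one cluster, so this is its entry count.
    entries_per_table = cluster_size // L2_ENTRY_SIZE
    # Each entry maps one cluster of guest data.
    return entries_per_table * cluster_size


for cluster_size in (4096, 65536):
    mib = l2_table_coverage(cluster_size) // (1024 * 1024)
    print(f"cluster_size={cluster_size}: one L2 table maps {mib} MiB")
```

With 4096-byte clusters one table maps only 2 MiB, so a burst of writes spreads over many more L2 tables than with 64k clusters, where one table maps 512 MiB.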

Here is the head of the apport crash dump:

ProblemType: Crash
Package: qemu-kvm 1.0+noroms-0ubuntu14.11
Architecture: amd64
Date: Thu Sep 5 14:11:25 2013
DistroRelease: Ubuntu 12.04
ExecutablePath: /usr/bin/qemu-system-x86_64
ExecutableTimestamp: 1376604277
ProcCmdline: /usr/bin/kvm -name instance-00003544 -S -M pc-1.0 -cpu core2duo,+lahf_lm,+rdtscp,+pdpe1gb,+aes,+popcnt,+sse4.2,+sse4.1,+dca,+pdcm,+xtpr,+cx16,+tm2,+est,+smx,+vmx,+ds_cpl,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -uuid 34d732f4-4d7f-446b-912d-b4bdfc395942 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00003544.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/instance-00003544/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/instance-00003544/disk.local,if=none,id=drive-virtio-disk1,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,ifname=tap2ad62d47-fb,script=,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:a0:c6:f4,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00003544/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 10.248.33.25:3 -k de -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
ProcCwd: /
ProcEnviron: PATH=(custom, no user)

------------------------------------------------------------------------------------------------------------------------------------------------

And here is the backtrace:

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/kvm -name instance-00003544 -S -M pc-1.0 -cpu core2duo,+lahf_lm,+rdtsc'.
Program terminated with signal 6, Aborted.
#0 0x00007f359b5e4425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f359b5e4425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f359b5e7b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f359f81d3db in qcow2_cache_find_entry_to_replace (c=<optimized out>) at block/qcow2-cache.c:209
#3 qcow2_cache_do_get (bs=0x7f35a1183cd0, c=0x7f35a11859c0, offset=84475904, table=0x7f35a8b10a18, read_from_disk=<optimized out>)
    at block/qcow2-cache.c:229
#4 0x00007f359f81de6c in l2_load (l2_table=0x7f35a8b10a18, l2_offset=<optimized out>, bs=0x7f35a1183cd0) at block/qcow2-cluster.c:121
#5 qcow2_get_cluster_offset (bs=0x7f35a1183cd0, offset=9153032192, num=0x7f35a8b10aec, cluster_offset=0x7f35a8b10ae0)
    at block/qcow2-cluster.c:442
#6 0x00007f359f81e706 in qcow2_read (nb_sectors=104, buf=0x7f35931f7200 "", sector_num=17877016, bs=0x7f35a1183cd0) at block/qcow2-cluster.c:305
#7 copy_sectors (bs=0x7f35a1183cd0, start_sect=<optimized out>, cluster_offset=964231168, n_start=24, n_end=<optimized out>)
    at block/qcow2-cluster.c:360
#8 0x00007f359f81eae3 in qcow2_alloc_cluster_link_l2 (bs=0x7f35a1183cd0, m=0x7f35a8b10be0) at block/qcow2-cluster.c:631
#9 0x00007f359f8222ad in qcow2_co_writev (bs=0x7f35a1183cd0, sector_num=17876992, remaining_sectors=24, qiov=0x7f35a8e34e60) at block/qcow2.c:596
#10 0x00007f359f81495a in bdrv_co_do_writev (bs=<optimized out>, sector_num=17876992, nb_sectors=24, qiov=0x7f35a8e34e60) at block.c:1311
#11 0x00007f359f8149ee in bdrv_co_do_rw (opaque=0x7f35a1cf77d0) at block.c:2617
#12 0x00007f359f844eab in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:125
#13 0x00007f359b5f6650 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x00007fffeb9b3ae0 in ?? ()
#15 0x0000000000000000 in ?? ()
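
The numbers in this backtrace can be cross-checked against the 64k cluster size. The sketch below uses only values copied from frames #5, #6 and #9, plus the assumption of 512-byte sectors. It shows that the 24-sector guest write starts on a cluster boundary and that the read issued from copy_sectors covers the remaining 104 sectors of the same cluster. That would be consistent with the abort happening while the rest of a newly allocated cluster was being filled in, but this reading is an interpretation, not something the report states.

```python
SECTOR_SIZE = 512          # bytes per sector (assumption: QEMU block layer sectors)
CLUSTER_SIZE = 64 * 1024   # the 64k cluster size stated in the report

# Values copied from the backtrace frames.
write_sector, write_count = 17876992, 24   # frame #9, qcow2_co_writev
read_sector, read_count = 17877016, 104    # frame #6, qcow2_read
lookup_offset = 9153032192                 # frame #5, qcow2_get_cluster_offset

sectors_per_cluster = CLUSTER_SIZE // SECTOR_SIZE

# The guest write begins exactly at the start of a cluster.
starts_on_boundary = write_sector % sectors_per_cluster == 0
# The read begins where the guest write ends ...
read_follows_write = read_sector == write_sector + write_count
# ... and together they span exactly one cluster.
fills_one_cluster = write_count + read_count == sectors_per_cluster
# The byte offset being looked up is the first sector of that read.
offset_matches = read_sector * SECTOR_SIZE == lookup_offset

print(starts_on_boundary, read_follows_write, fills_one_cluster, offset_matches)
```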

Revision history for this message
In , daiwei (daiwei-redhat-bugs) wrote :

Description of problem:

Install a guest with a virtio-scsi interface and cluster_size=4096. When formatting the disk, qemu-kvm aborts.

Version-Release number of selected component (if applicable):

# uname -r;rpm -q qemu-kvm
2.6.32-259.el6.x86_64
qemu-kvm-0.12.1.2-2.269.el6.scsifixes.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Create a qcow2 image with cluster_size=4096
e.g.
# qemu-img create -f qcow2 sysdisk.qcow2 20G -o cluster_size=4096

2. Install guest

/usr/libexec/qemu-kvm -cpu cpu64-rhel6 -rtc base=localtime,clock=host,driftfix=slew -M rhel6.3.0 -enable-kvm -name rhel6.3-64 -smp 4,cores=2,threads=1,sockets=2 -m 4G -uuid c944829b-9aa0-46a2-b3d0-493c135da24d -boot menu=on -drive file=/home/sysdisk.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native,media=disk,werror=stop,rerror=stop -device virtio-scsi-pci,id=bus1 -device scsi-hd,bus=bus1.0,drive=drive-virtio-disk0,id=virtio-scsi-pci0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup-switch -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:5E:A3:F7 -spice port=9000,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -monitor stdio -usb -device usb-tablet,id=input1 -drive file=/home/RHEL6.3-20120329.0-Server-x86_64-DVD1.iso,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native,media=cdrom -device virtio-scsi-pci,id=bus2 -device scsi-cd,bus=bus2.0,drive=drive-virtio-disk1,id=virtio-scsi-pci1,bootindex=1

3.

Actual results:
When formatting the disk, qemu-kvm aborts.

Program received signal SIGABRT, Aborted.
0x00007ffff57788a5 in raise () from /lib64/libc.so.6

(gdb) bt
#0 0x00007ffff57788a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff577a085 in abort () from /lib64/libc.so.6
#2 0x00007ffff7e3c42e in qcow2_cache_find_entry_to_replace (bs=0x7ffff8854c30, c=0x7ffff87027d0, offset=3279413248,
    table=0x7fffd75764b8, read_from_disk=false) at block/qcow2-cache.c:209
#3 qcow2_cache_do_get (bs=0x7ffff8854c30, c=0x7ffff87027d0, offset=3279413248, table=0x7fffd75764b8, read_from_disk=false)
    at block/qcow2-cache.c:229
#4 0x00007ffff7e3a299 in l2_allocate (bs=0x7ffff8854c30, offset=6463520768, new_l2_table=0x7fffd7576548, new_l2_offset=0x7fffd7576550,
    new_l2_index=0x7fffd757655c) at block/qcow2-cluster.c:180
#5 get_cluster_table (bs=0x7ffff8854c30, offset=6463520768, new_l2_table=0x7fffd7576548, new_l2_offset=0x7fffd7576550,
    new_l2_index=0x7fffd757655c) at block/qcow2-cluster.c:512
#6 0x00007ffff7e3a6e6 in qcow2_alloc_cluster_offset (bs=0x7ffff8854c30, offset=6463520768, n_start=0, n_end=1008, num=0x7fffd757666c,
    m=0x7fffd7576600) at block/qcow2-cluster.c:714
#7 0x00007ffff7e363bf in qcow2_co_writev (bs=0x7ffff8854c30, sector_num=<value optimized out>, remaining_sectors=1008,
    qiov=0x7fffd855e838) at block/qcow2.c:555
#8 0x00007ffff7e215fa in bdrv_co_do_writev (bs=0x7ffff8854c30, sector_num=12624064, nb_sectors=1008, qiov=0x7fffd855e838,
    flags=<value optimized out>) at block.c:1734
#9 0x00007ffff7e216a1 in bdrv_co_do_rw (opaque=0x7fffd855e890) at block.c:3032
#10 0x00007ffff7e26b6b in coroutine_trampoline (i0=<value optimized out>, i1=<value...


Revision history for this message
In , Xiaoqing (xiaoqing-redhat-bugs) wrote :

reproduced on:
2.6.32-262.el6.x86_64
qemu-kvm-0.12.1.2-2.275.el6.x86_64

using virtio-blk

Revision history for this message
In , daiwei (daiwei-redhat-bugs) wrote :

Sorry for reporting this bug against a private tree. I can reproduce it on the latest qemu-kvm using a virtio-scsi disk:

# uname -r;rpm -q qemu-kvm
2.6.32-262.el6.x86_64
qemu-kvm-0.12.1.2-2.275.el6.x86_64

Revision history for this message
In , Dor (dor-redhat-bugs) wrote :

Kevin, can you check whether it's a symptom of a critical issue?

Revision history for this message
In , Kevin (kevin-redhat-bugs) wrote :

The problem could in theory occur even with the default cluster size, even though it's rather unlikely. It happens when allocating requests to more than 16 different L2 tables are queued because they depend on other requests.

It causes a qemu abort(), but image consistency is not harmed. Critical enough to be fixed in 6.3, I'd say.
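
The failure mode described above can be illustrated with a toy reference-counted cache. This is a minimal sketch, not QEMU's code: the class and function names are hypothetical, and the only figure taken from the comment is the limit of 16 tables. An entry can be replaced only when nothing references it, so once 16 queued requests each hold a different table, a 17th has nothing to evict.

```python
class CacheExhausted(Exception):
    """Stands in for the abort() seen in the backtraces."""


class TableCache:
    """Toy reference-counted cache; all names here are hypothetical."""

    def __init__(self, size=16):
        self.size = size
        self.refs = {}  # table offset -> number of outstanding references

    def get(self, offset):
        """Take a reference on a table, evicting an unreferenced one if full."""
        if offset not in self.refs:
            if len(self.refs) >= self.size:
                victim = next((o for o, n in self.refs.items() if n == 0), None)
                if victim is None:
                    raise CacheExhausted("every cached table is still referenced")
                del self.refs[victim]
            self.refs[offset] = 0
        self.refs[offset] += 1

    def put(self, offset):
        """Drop a reference so the table becomes eligible for eviction."""
        self.refs[offset] -= 1


def queue_requests(n_tables, cache):
    """Each queued request keeps its own L2 table referenced while it waits."""
    try:
        for i in range(n_tables):
            cache.get(i * 4096)  # arbitrary distinct table offsets
    except CacheExhausted:
        return "aborted"
    return "ok"


print(queue_requests(16, TableCache()))  # fits in the cache
print(queue_requests(17, TableCache()))  # one table too many
```

Nothing is written in the failing path of this sketch, which matches the statement that the abort does not harm image consistency.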

I found a reproducer, added it to qemu-iotests and sent upstream patches. RHEL code is different, so I'm working on a different fix there.

Revision history for this message
In , langfang (langfang-redhat-bugs) wrote :

Reproduced this issue with the following steps and environment:
# uname -r
2.6.32-269.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.287.el6.x86_64

steps:
1) # qemu-img create -f qcow2 sysdisk.qcow2 20G -o cluster_size=4096
2) Boot the guest with a virtio-scsi interface and cluster_size=4096:
 /usr/libexec/qemu-kvm -cpu cpu64-rhel6 -rtc base=localtime,clock=host,driftfix=slew -M rhel6.3.0 -enable-kvm -name rhel6.3 -smp 4,cores=2,threads=1,sockets=2 -m 4G -uuid a3d13230-f1c1-4dc9-95de-bb92b2017674 -boot menu=on -drive file=/home/sysdisk.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native,media=disk,werror=stop,rerror=stop -device virtio-scsi-pci,id=bus1 -device scsi-hd,bus=bus1.0,drive=drive-virtio-disk0,id=virtio-scsi-pci0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:97:58:89 -spice port=9000,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -monitor stdio -usb -device usb-tablet,id=input1 -drive file=/home/RHEL6.3-20120426.2-Server-x86_64-DVD1.iso,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native,media=cdrom -device virtio-scsi-pci,id=bus2 -device scsi-cd,bus=bus2.0,drive=drive-virtio-disk1,id=virtio-scsi-pci1,bootindex=1

Results: when formatting the disk, qemu-kvm aborts.
bt
#0 0x00007ffff57798a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff577b085 in abort () from /lib64/libc.so.6
#2 0x00007ffff7e3d7ae in qcow2_cache_find_entry_to_replace (bs=0x7ffff86ef010, c=0x7ffff86d8df0, offset=70295552, table=0x7fffda5a4108,
    read_from_disk=false) at block/qcow2-cache.c:209
#3 qcow2_cache_do_get (bs=0x7ffff86ef010, c=0x7ffff86d8df0, offset=70295552, table=0x7fffda5a4108, read_from_disk=false)
    at block/qcow2-cache.c:229
#4 0x00007ffff7e3b619 in l2_allocate (bs=0x7ffff86ef010, offset=2693144576, new_l2_table=0x7fffda5a4198, new_l2_offset=0x7fffda5a41a0,
    new_l2_index=0x7fffda5a41ac) at block/qcow2-cluster.c:180
#5 get_cluster_table (bs=0x7ffff86ef010, offset=2693144576, new_l2_table=0x7fffda5a4198, new_l2_offset=0x7fffda5a41a0,
    new_l2_index=0x7fffda5a41ac) at block/qcow2-cluster.c:512
#6 0x00007ffff7e3ba66 in qcow2_alloc_cluster_offset (bs=0x7ffff86ef010, offset=2693144576, n_start=0, n_end=1008, num=0x7fffda5a42bc,
    m=0x7fffda5a4250) at block/qcow2-cluster.c:714
#7 0x00007ffff7e3773f in qcow2_co_writev (bs=0x7ffff86ef010, sector_num=<value optimized out>, remaining_sectors=1008,
    qiov=0x7fffda4a4088) at block/qcow2.c:555
#8 0x00007ffff7e2293a in bdrv_co_do_writev (bs=0x7ffff86ef010, sector_num=5260048, nb_sectors=1008, qiov=0x7fffda4a4088,
    flags=<value optimized out>) at block.c:1741
#9 0x00007ffff7e229e1 in bdrv_co_do_rw (opaque=0x7fffda4a4290) at block.c:3039
#10 0x00007ffff7e27eeb in coroutine_trampoline (i0=<value optimized out>, i1=<value optimized out>) at coroutine-ucontext.c:129
#11 0x00007ffff578a630 in ?? () from /lib64/libc.so.6
#12 0x00007fffed148530 in ?? ()
#13 0x0000000000000000 in ?? ()

Verified this issue with the following steps and environment:
Version:
# uname -r
2.6.32-262.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.290.el6.x86_64

The steps are the same as for reproducing.

results:

qemu-k...


Revision history for this message
In , Michal (michal-redhat-bugs) wrote :

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    New Contents:
No Documentation Needed

Revision history for this message
In , Paolo (paolo-redhat-bugs) wrote :

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.

    Diffed Contents:
@@ -1 +1 @@
-No Documentation Needed
+NEEDINFO

Revision history for this message
In , errata-xmlrpc (errata-xmlrpc-redhat-bugs) wrote :

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0746.html

Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thank you for reporting this bug. It looks like commit 7242411460eb1cd6e850d51ef15ae734b59e2edf (qcow2: Don't hold cache references across yield) should be the fix for this. I will build a package with that for testing.
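
As a rough illustration of what the commit title describes, the sketch below models requests as Python generators that pause at a yield while waiting on something else. It compares holding a cache reference across that pause with dropping it first and re-acquiring it afterwards. All names are hypothetical and this is not the code from the commit; it only shows why the ordering changes how many cache entries are pinned at once.

```python
class PinCounter:
    """Tracks how many cache entries are pinned at once (hypothetical helper)."""

    def __init__(self):
        self.pinned = 0
        self.peak = 0

    def pin(self):
        self.pinned += 1
        self.peak = max(self.peak, self.pinned)

    def unpin(self):
        self.pinned -= 1


def request(cache, hold_across_yield):
    """One queued request that has to wait once before it can finish."""
    cache.pin()                # look up the table
    if not hold_across_yield:
        cache.unpin()          # drop the reference before waiting
    yield                      # wait for a dependency to complete
    if not hold_across_yield:
        cache.pin()            # look the table up again after resuming
    cache.unpin()


def peak_pins(n_requests, hold_across_yield):
    """Run n requests up to their wait point, then let them all finish."""
    cache = PinCounter()
    pending = [request(cache, hold_across_yield) for _ in range(n_requests)]
    for r in pending:
        next(r)                # advance each request to its yield
    for r in pending:
        next(r, None)          # resume each request and let it complete
    return cache.peak


print(peak_pins(17, hold_across_yield=True))
print(peak_pins(17, hold_across_yield=False))
```

In this model, 17 waiting requests pin 17 entries when references are held across the yield, but never more than one when they are released first.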

Changed in qemu-kvm (Ubuntu Precise):
importance: Undecided → High
Changed in qemu-kvm (Ubuntu):
status: New → Fix Released
Changed in qemu-kvm (Ubuntu Precise):
status: New → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I've pushed the package to ppa:serge-hallyn/lucid-kvm-test. It should finish building in a few hours, after which you can run:

  sudo add-apt-repository ppa:serge-hallyn/lucid-kvm-test
  sudo apt-get update
  sudo apt-get dist-upgrade

If the problem is solved after that, we can SRU this change to get it into the archive.

Revision history for this message
Fabian Eichstädt (fabian-eichstaedt) wrote :

Thanks for the quick response! Unfortunately the build failed ...

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1223907] Re: qemu-kvm crashes in qcow2 code

Quoting Fabian Eichstädt (<email address hidden>):
> Thanks for the quick response! Unfortunately the build failed ...

Yes sorry about that, I thought the patch had been applied locally
when I pushed the package but apparently not. I've backported the
patch and re-pushed a ppa2 version.

Revision history for this message
Fabian Eichstädt (fabian-eichstaedt) wrote :

The patch did not help, the machines still die. Unfortunately I cannot create a backtrace since the debugging symbols
for the patched qemu-kvm package are not available (or are they and if yes, where?).

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Fabian Eichstädt (<email address hidden>):
> The patch did not help, the machines still die. Unfortunately I cannot create a backtrace since the debugging symbols
> for the patched qemu-kvm package are not available (or are they and if yes, where?).

The wiki page

https://wiki.ubuntu.com/DebuggingProgramCrash?action=show&redirect=DebuggingProgramCrashes

(section Debug Symbol Packages) shows how to install the qemu-kvm-dbgsym package.

Revision history for this message
Dr. Stefan Schimanski (sttts) wrote :

True. However I do need the debug symbols for the PATCHED qemu-kvm package, specifically for the ppa2 version. These symbols are of course not in the official ddebs repos and also not in the PPA.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Stefan Schimanski (sts@1stein.org):
> True. However I do need the debug symbols for the PATCHED qemu-kvm
> package, specifically for the ppa2 version. These symbols are of course
> not in the official ddebs repos and also not in the PPA.

Oh - since the patched version did not fix it, I think it best that
you downgrade to the precise-updates version. My backport of the
patch listed in the redhat bug will only confuse matters.

Revision history for this message
Fabian Eichstädt (fabian-eichstaedt) wrote :

Still having the same problem. (Please change the status) Any more suggestions? Can we provide any more info?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The status for precise is 'confirmed' - did you mean you experienced it on saucy?

Revision history for this message
Fabian Eichstädt (fabian-eichstaedt) wrote :

Sorry, I was confused: I thought "Fix released" applied to precise also, which is not true obviously.
However I was successful with the quantal packages (qemu-*-1.2.0+noroms-0ubuntu2.12.10.5, seabios_1.7.0-1, vgabios_0.7a-3ubuntu2). After installing those and a night full of test runs, things appear to be fine again.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for that info. So the fix is between 1.0..1.2.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I'm trying to reproduce this (using a 64k cluster size 2G qcow2 rootfs). Does simply having heavy filesystem activity in parallel trigger this, or should I be doing something else as well? Does it help to have qcow2 snapshots?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Could not reproduce this with parallel tar or kernel builds.

Changed in qemu-kvm (Fedora):
importance: Unknown → Medium
status: Unknown → Fix Released
Revision history for this message
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in qemu-kvm (Ubuntu Precise):
status: Confirmed → Won't Fix