kvm Ubuntu12.04 64bit guest got frozen

Bug #965525 reported by Sasa Paporovic
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Unassigned
qemu-kvm (Ubuntu)
Fix Released
High
Unassigned

Bug Description

I am on Ubuntu11.10 64bit and have had running a Ubuntu12.04 64bit guest(updated to 26.3.2012) with kvm in it.

After several hours of work the guest enters a frozen state. Only a forced shutdown was possible to me.

Neither the kernel.log of the host nor the libvirt logs showing something special to me.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: qemu-kvm 0.14.1+noroms-0ubuntu6.3
ProcVersionSignature: Ubuntu 3.0.0-17.30-generic 3.0.22
Uname: Linux 3.0.0-17-generic x86_64
ApportVersion: 1.23-0ubuntu4
Architecture: amd64
Date: Mon Mar 26 19:30:12 2012
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64+mac (20111011)
KvmCmdLine:
 UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
 115 27627 1 35 588109 2128236 0 17:42 ? 00:38:05 /usr/bin/kvm -S -M pc-0.14 -enable-kvm -m 2048 -smp 3,sockets=3,cores=1,threads=1 -name preciseDesktop-clone -uuid a35f2704-32c6-8e25-132b-0590ad56180b -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/preciseDesktop-clone.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc -boot order=c,menu=off -drive file=/home/user/VirtualDisks-clone,if=none,id=drive-virtio-disk0,boot=on,format=raw -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=19,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:44:3d:11,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
MachineType: MICRO-STAR INTERNATIONAL CO.,LTD MS-7596
ProcEnviron:
 PATH=(custom, no User Name)
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.0.0-17-generic root=UUID=ab49260a-c922-44e8-9686-9e7848aad0ab ro quiet splash vt.handoff=7
SourcePackage: qemu-kvm
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/03/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: V1.8
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: 760GM -E51 (MS-7596)
dmi.board.vendor: MICRO-STAR INTERNATIONAL CO.,LTD
dmi.board.version: 1.0
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: MICRO-STAR INTERNATIONAL CO.,LTD
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrV1.8:bd09/03/2010:svnMICRO-STARINTERNATIONALCO.,LTD:pnMS-7596:pvr1.0:rvnMICRO-STARINTERNATIONALCO.,LTD:rn760GM-E51(MS-7596):rvr1.0:cvnMICRO-STARINTERNATIONALCO.,LTD:ct3:cvr1.0:
dmi.product.name: MS-7596
dmi.product.version: 1.0
dmi.sys.vendor: MICRO-STAR INTERNATIONAL CO.,LTD

Revision history for this message
Sasa Paporovic (melchiaros) wrote :
Revision history for this message
Sasa Paporovic (melchiaros) wrote :

Additional:

The guest was under heavy harddrive load, with performing a aggregation of parts of the whole fielsystem with rmastersys.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :
Revision history for this message
Sasa Paporovic (melchiaros) wrote :
Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for reporting this bug.

It looks like you did have some disk errors. Lots of

Mar 26 13:46:40 ubuntu kernel: [ 4094.560082] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 26 13:46:40 ubuntu kernel: [ 4094.569174] ata1.00: configured for UDMA/33
Mar 26 13:46:40 ubuntu kernel: [ 4094.584104] ata1: EH complete
Mar 26 13:46:40 ubuntu kernel: [ 4094.598626] ata1.00: exception Emask 0x50 SAct 0x1 SErr 0x280901 action 0x6 frozen
Mar 26 13:46:40 ubuntu kernel: [ 4094.598637] ata1.00: irq_stat 0x08000000, interface fatal error
Mar 26 13:46:40 ubuntu kernel: [ 4094.598646] ata1: SError: { RecovData UnrecovData HostInt 10B8B BadCRC }
Mar 26 13:46:40 ubuntu kernel: [ 4094.598654] ata1.00: failed command: READ FPDMA QUEUED
Mar 26 13:46:40 ubuntu kernel: [ 4094.598668] ata1.00: cmd 60/08:00:74:55:f3/00:00:02:00:00/40 tag 0 ncq 4096 in
Mar 26 13:46:40 ubuntu kernel: [ 4094.598671] res 40/00:04:74:55:f3/00:00:02:00:00/40 Emask 0x50 (ATA bus error)
Mar 26 13:46:40 ubuntu kernel: [ 4094.598679] ata1.00: status: { DRDY }
Mar 26 13:46:40 ubuntu kernel: [ 4094.598688] ata1: hard resetting link
Mar 26 13:46:40 ubuntu kernel: [ 4095.088098] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 26 13:46:40 ubuntu kernel: [ 4095.095878] ata1.00: configured for UDMA/33
Mar 26 13:46:40 ubuntu kernel: [ 4095.108060] ata1: EH complete
Mar 26 13:46:40 ubuntu kernel: [ 4095.114596] ata1.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
Mar 26 13:46:40 ubuntu kernel: [ 4095.114606] ata1.00: irq_stat 0x08000000, interface fatal error
Mar 26 13:46:40 ubuntu kernel: [ 4095.114616] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
Mar 26 13:46:40 ubuntu kernel: [ 4095.114624] ata1.00: failed command: READ FPDMA QUEUED

in syslog and kernel log.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

Yes, but that was not the run I mention here.

The disk errors occure in a more widly sense than only with qemu-kvm.

See

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/965213

The run I ment was between 18:00 and 20:00.

As far as I have seen there are no special entries in the kernl log for this time.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

But it seems that you are still correct.

I have tested it the whole day.

Everytime when I indroduce high harddrive traffic in a qemu-kvm guest(uck or remastersys running) the ata1 entries occures in the host kernel.log and when I ignore them, sooner or later not only the guest get frozen, also the host just behind.

That causes fielesytem corruption to my host, which until now I get managed to repair succesfully every time.

This is dangerous!

On the opposite when I introduce high harddrive traffic direct on my host(uck for .iso remastering) with no qemu-kvm guest started there are no ata1 errors in the kernel.log of my hostsytems.

Logically follows that the ata1 entries are caused by qemu-kvm only.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for the info.

Unless I'm missing (entirely possible) where you've accounted for this, the fact that disk errors only happen when kvm is running could still be due to bad sectors only getting hit by the kvm disk.

Could you try cloning your VM's disk image and running from the cloned image?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Marking this bug as affecting linux kernel - qemu-kvm userspace should not be able to affect the disk like this. THe kvm kernel module shipped with the kernel package would be more likely.

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Sasa Paporovic (melchiaros) wrote :

Q:Could you try cloning your VM's disk image and running from the cloned image?

A:Yes, I am just performing, but the story is a bit longer.

Yesterday, in the late evening the harddrive performance with linux-generic3.0.17 Ubuntu serie become worser in a sense, that the ata1 errors also have their rising on only host activity(without any qemu-kvm starting).

With this my thesis from #7 lost it`s consistence.

Today, the errors occurs just on startup the host. I have tried it two times and within minutes my host was unusable.
The recovery on following system start has worked.

----------------------------------------------------------------------------------------
I decided to boot up with the still installed old linux-generic 3.0.16 ubuntu serie.

Result:

No, ata1 errors until now on just hostsystem runnig.

I observe this further through the day, but post it just now, because I am not so sure how long my system keeps living.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :
Download full text (3.7 KiB)

To:

Q:Could you try cloning your VM's disk image and running from the cloned image?

A:

The process of cloning(with virt-manager0.9.0-1ubuntu3) has just finished.
During the cloning process with hardrive cloning/copying, there are some stacktraces in the kern.log of linux-generic 3.0.16 ubuntu serie.

One example:

Mar 29 10:28:15 ubuntu kernel: [ 481.256085] INFO: task NetworkManager:989 blocked for more than 120 seconds.
Mar 29 10:28:15 ubuntu kernel: [ 481.256095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 29 10:28:15 ubuntu kernel: [ 481.256102] NetworkManager D ffffffff81805120 0 989 1 0x00000000
Mar 29 10:28:15 ubuntu kernel: [ 481.256113] ffff8802f9fa1a88 0000000000000086 ffff880308633270 ffff8801e4908f70
Mar 29 10:28:15 ubuntu kernel: [ 481.256124] ffff8802f9fa1fd8 ffff8802f9fa1fd8 ffff8802f9fa1fd8 0000000000012a40
Mar 29 10:28:15 ubuntu kernel: [ 481.256133] ffff88030d320000 ffff880308f5c560 ffff8802f9fa1a98 ffff8802f9fa1af8
Mar 29 10:28:15 ubuntu kernel: [ 481.256142] Call Trace:
Mar 29 10:28:15 ubuntu kernel: [ 481.256157] [<ffffffff815f0ecf>] schedule+0x3f/0x60
Mar 29 10:28:15 ubuntu kernel: [ 481.256168] [<ffffffff8124765d>] do_get_write_access+0x27d/0x500
Mar 29 10:28:15 ubuntu kernel: [ 481.256178] [<ffffffff810819c7>] ? bit_waitqueue+0x17/0xd0
Mar 29 10:28:15 ubuntu kernel: [ 481.256187] [<ffffffff81081e40>] ? autoremove_wake_function+0x40/0x40
Mar 29 10:28:15 ubuntu kernel: [ 481.256197] [<ffffffff81197767>] ? __find_get_block+0x87/0xe0
Mar 29 10:28:15 ubuntu kernel: [ 481.256205] [<ffffffff81247a20>] jbd2_journal_get_write_access+0x30/0x50
Mar 29 10:28:15 ubuntu kernel: [ 481.256215] [<ffffffff8122d9ae>] __ext4_journal_get_write_access+0x3e/0x80
Mar 29 10:28:15 ubuntu kernel: [ 481.256226] [<ffffffff812031ce>] ext4_new_inode+0x29e/0xbc0
Mar 29 10:28:15 ubuntu kernel: [ 481.256234] [<ffffffff81222c79>] ? ext4_journal_start_sb+0x69/0x170
Mar 29 10:28:15 ubuntu kernel: [ 481.256243] [<ffffffff81211947>] ext4_create+0xb7/0x140
Mar 29 10:28:15 ubuntu kernel: [ 481.256252] [<ffffffff811736d3>] ? generic_permission+0x23/0xd0
Mar 29 10:28:15 ubuntu kernel: [ 481.256261] [<ffffffff81174ae4>] vfs_create+0xb4/0x120
Mar 29 10:28:15 ubuntu kernel: [ 481.256269] [<ffffffff81175e69>] do_last+0x649/0x730
Mar 29 10:28:15 ubuntu kernel: [ 481.256276] [<ffffffff81177aca>] path_openat+0xca/0x3f0
Mar 29 10:28:15 ubuntu kernel: [ 481.256284] [<ffffffff81177ef2>] do_filp_open+0x42/0xa0
Mar 29 10:28:15 ubuntu kernel: [ 481.256293] [<ffffffff812f61a1>] ? strncpy_from_user+0x31/0x40
Mar 29 10:28:15 ubuntu kernel: [ 481.256301] [<ffffffff8117327a>] ? do_getname+0x10a/0x180
Mar 29 10:28:15 ubuntu kernel: [ 481.256309] [<ffffffff815f2dae>] ? _raw_spin_lock+0xe/0x20
Mar 29 10:28:15 ubuntu kernel: [ 481.256316] [<ffffffff81185377>] ? alloc_fd+0xf7/0x150
Mar 29 10:28:15 ubuntu kernel: [ 481.256325] [<ffffffff81167a7d>] do_sys_open+0xed/0x220
Mar 29 10:28:15 ubuntu kernel: [ 481.256331] [<ffffffff81169bb5>] ? fput+0x25/0x30
Mar 29 10:28:15 ubuntu kernel: [ 481.256340] [<ffffffff81167bd0>] sys_open+0x20/0x30
Mar 29 10:28:15 ubuntu kernel: [ 481.256348] [<...

Read more...

Revision history for this message
Sasa Paporovic (melchiaros) wrote :
Revision history for this message
Sasa Paporovic (melchiaros) wrote :

I have seen that my additional text from #11 is hidden by a "ReadMore", so I copy it here again:

I see that(blocked processes) as not so nice, but more as a result of not diplomatic resource allocation(copy of 80GB for cloning) than as a result of a heavy I/O error.

The systems keeps alive and was working well after finishing the cloning process.

I attach the latest log.

There are some more processes that where blocked during cloning, but no ata1 entries.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

I have rerunned remastersys process in the original vm like I have written in first posts:

[
"I am on Ubuntu11.10 64bit and have had running a Ubuntu12.04 64bit guest(updated to 26.3.2012) with kvm in it."
"The guest was under heavy harddrive load, with performing a aggregation of parts of the whole fielsystem with remastersys."
]

The only change is that my Ubuntu11.10 64 host is now running on linux-generic3.0.0.16 ubuntu serie, instead of linux-generic 3.0.017 ubuntu serie.

Result:

No ata1 entries!

The relevant part of the kern.log:

Mar 29 11:10:54 ubuntu kernel: [ 3039.728085] audit_printk_skb: 48 callbacks suppressed
Mar 29 11:10:54 ubuntu kernel: [ 3039.728088] type=1400 audit(1333012254.318:28): apparmor="STATUS" operation="profile_load" name="libvirt-a35f2704-32c6-8e25-132b-0590ad56180b" pid=6723 comm="apparmor_parser"
Mar 29 11:10:54 ubuntu kernel: [ 3040.144719] device vnet0 entered promiscuous mode
Mar 29 11:10:54 ubuntu kernel: [ 3040.149856] virbr0: topology change detected, propagating
Mar 29 11:10:54 ubuntu kernel: [ 3040.149860] virbr0: port 1(vnet0) entering forwarding state
Mar 29 11:10:54 ubuntu kernel: [ 3040.149864] virbr0: port 1(vnet0) entering forwarding state
Mar 29 11:10:54 ubuntu kernel: [ 3040.151021] ADDRCONF(NETDEV_CHANGE): virbr0: link becomes ready
Mar 29 11:10:57 ubuntu kernel: [ 3042.738391] kvm: 6730: cpu0 unhandled rdmsr: 0xc0010001
Mar 29 11:11:05 ubuntu kernel: [ 3050.696087] vnet0: no IPv6 routers present
Mar 29 11:31:37 ubuntu kernel: [ 4282.711738] type=1400 audit(1333013497.298:29): apparmor="STATUS" operation="profile_load" name="libvirt-b7685d07-b779-a45f-2b87-4801bba1152a" pid=8363 comm="apparmor_parser"

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

The run of the clone as requested in comment #8, but on on linux-generic3.0.0.16 ubuntu serie, gave the ata1 error with first frozen guest than crashing host again:

Mar 29 11:31:37 ubuntu kernel: [ 4282.711738] type=1400 audit(1333013497.298:29): apparmor="STATUS" operation="profile_load" name="libvirt-b7685d07-b779-a45f-2b87-4801bba1152a" pid=8363 comm="apparmor_parser"
Mar 29 11:31:39 ubuntu kernel: [ 4284.744460] device vnet1 entered promiscuous mode
Mar 29 11:31:39 ubuntu kernel: [ 4284.749717] virbr0: topology change detected, propagating
Mar 29 11:31:39 ubuntu kernel: [ 4284.749729] virbr0: port 2(vnet1) entering forwarding state
Mar 29 11:31:39 ubuntu kernel: [ 4284.749736] virbr0: port 2(vnet1) entering forwarding state
Mar 29 11:31:43 ubuntu kernel: [ 4288.904465] kvm: 8372: cpu0 unhandled rdmsr: 0xc0010001
Mar 29 11:31:46 ubuntu kernel: [ 4291.631723] virbr0: port 1(vnet0) entering forwarding state
Mar 29 11:31:46 ubuntu kernel: [ 4291.632067] device vnet0 left promiscuous mode
Mar 29 11:31:46 ubuntu kernel: [ 4291.632073] virbr0: port 1(vnet0) entering disabled state
Mar 29 11:31:47 ubuntu kernel: [ 4292.868939] type=1400 audit(1333013507.458:30): apparmor="STATUS" operation="profile_remove" name="libvirt-a35f2704-32c6-8e25-132b-0590ad56180b" pid=8446 comm="apparmor_parser"
Mar 29 11:31:49 ubuntu kernel: [ 4294.928080] vnet1: no IPv6 routers present
Mar 29 11:45:07 ubuntu kernel: [ 5092.451441] ata1.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
Mar 29 11:45:07 ubuntu kernel: [ 5092.451453] ata1.00: irq_stat 0x08000000, interface fatal error
Mar 29 11:45:07 ubuntu kernel: [ 5092.451462] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
Mar 29 11:45:07 ubuntu kernel: [ 5092.451470] ataMar 29 11:55:27 ubuntu kernel: imklog 5.8.1, log source = /proc/kmsg started.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

So it is not related to local limited bad sectors on the host harddrive that are allocated to the kvm guest.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

Also it could not only be linux-generic3.0.0.17 ubuntu serie, because it also happends with linux-generic3.0.0.16 ubuntu serie.

Revision history for this message
Sasa Paporovic (melchiaros) wrote :

I have found what it is.

In respect to:

http://lime-technology.com/wiki/index.php?title=The_Analysis_of_Drive_Issues

with the text passage(citing):

"Drive Interface Issues

These are problems with the cables and connections to the drive, both power and data, or the quality of the power supplied. If your errors match one of these, then almost certainly, your drive is completely fine. There have been many drives returned or thrown out, after numerous errors similar to the following issues, that were entirely the fault of the cables or power or connectors used, NOT the drive itself.
"

-> You may read on on the linked website for which kernel entries are indicating this.

----------------------------------------------------

I went in the next computer store and came back with a new SATA data and a new SATA energy cable.

I changed both against my old ones.

I have had runned the system since many days to be sure.
-> No ata1 errors anymore with the same harddrive!

A deeper look at the cables shows me that the plastic connector of the SATA energy supply to the harddrive was deformed a bit.
It was nearly invisible but seems to be enough to interrupt the harddrive and causes this errors.

-----------------------------------------------------

So this here was not a problem of qemu-kvm. It was a hard(wired) one.

-----------------------------------------------------

I would like to bring this issue up as a blueprint for Ubuntu/apport, because it was a time stealing one which is not easy to identify.

The system should inform the user/superuser when there are entries of this type in the kernel log that this affected with the possibility of defect cable or cable connectors to the harddrive.

So, for now I close this here as fixed released.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in qemu-kvm (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.