Ubuntu 12.04 64bit Guest (qemu/kvm) dies randomly

Bug #1022901 reported by Todor Andreev
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Incomplete
High
Unassigned
qemu-kvm (Ubuntu)
Incomplete
High
Unassigned

Bug Description

Using qemu/kvm
Host and Guest: Ubuntu server 12.04 64bit

The thing is that the guest randomly dies and it is soooo frustrating. There is PDC on the guest and people cannot work. It happens almost every day.
 There are no logs in the syslogs of host and guest in the time of crash.
 The only mentioning of the dying is in this log /var/log/libvirt/qemu/ubuntu-pdc.log:

 2012-07-05 14:38:30.798+0000: shutting down

 When I manually shut down the guest at least there is a line before the "shutting down" like this:

 qemu: terminating on signal 15 from pid 1160
 2012-07-05 14:39:01.150+0000: shutting down

 I saw that there are also these errors in /var/log/libvirt/libvirtd.log:

 1160: error : qemuMonitorIO:603 : internal error End of file from monitor

 but they are appearing when I shutdown the guest manually, too.

Update:
I had leaved a console in the guest turned on with htop. So, when the machine died, there was a message - "Power button pressed.", which leads me to acpid. There was no power button pressed and no script is turning the machine off. No crons either.

Update:
Well, now it did it again, with acpid stopped. It just didn't print the message. It just dies...
---
ApportVersion: 2.0.1-0ubuntu8
Architecture: amd64
DistroRelease: Ubuntu 12.04
InstallationMedia: Ubuntu-Server 12.04 LTS "Precise Pangolin" - Release amd64 (20120424.1)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
MachineType: To be filled by O.E.M. To be filled by O.E.M.
Package: qemu-kvm 1.0+noroms-0ubuntu13
PackageArchitecture: amd64
ProcCmdline: BOOT_IMAGE=/boot/vmlinuz-3.2.0-26-generic root=UUID=506ef9ea-f5aa-4e99-acce-c9fda1afb9e2 ro
ProcEnviron:
 LANGUAGE=en_US:en
 TERM=xterm
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-26-generic root=UUID=506ef9ea-f5aa-4e99-acce-c9fda1afb9e2 ro
ProcVersionSignature: Ubuntu 3.2.0-26.41-generic 3.2.19
Tags: precise precise
Uname: Linux 3.2.0-26-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 06/24/2011
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F5
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: To be filled by O.E.M.
dmi.board.vendor: To be filled by O.E.M.
dmi.board.version: x.x
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF5:bd06/24/2011:svnTobefilledbyO.E.M.:pnTobefilledbyO.E.M.:pvrTobefilledbyO.E.M.:rvnTobefilledbyO.E.M.:rnTobefilledbyO.E.M.:rvrx.x:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.name: To be filled by O.E.M.
dmi.product.version: To be filled by O.E.M.
dmi.sys.vendor: To be filled by O.E.M.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1022901/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Todor Andreev (toshko3) wrote :

I think it is the qemu-kvm package responsible...

affects: ubuntu → qemu-kvm (Ubuntu)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Todor,

do you know if any other virsh activity is going on every time that happens? For instance, create, destroying, or even just listing guests?

I'm going to mark this as also affect libvirt (which i think is more likely to be the problem). Coudl you run 'apport-collect 1022901' to have apport upload some debugging information?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Todor,

actually, before you run apport-collect, could you

cd /usr/share/apport/package-hooks
sudo mv source_libvirt-bin.py source_libvirt.py

to make sure that the libvirt debugging information is collected.

thanks!

Changed in libvirt (Ubuntu):
importance: Undecided → High
Changed in qemu-kvm (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Todor Andreev (toshko3) wrote :

Maybe listing, because virt-manager is going on constantly. I'll stop it and see what happens. I'm not sure what are these apport things about, but will do, immediately!
Thanks!

Revision history for this message
Todor Andreev (toshko3) wrote : BootDmesg.txt

apport information

tags: added: apport-collected precise
description: updated
Revision history for this message
Todor Andreev (toshko3) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : Dependencies.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : KernLog.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : KvmCmdLine.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : Lspci.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : ProcModules.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : RelatedPackageVersions.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : UdevDb.txt

apport information

Revision history for this message
Todor Andreev (toshko3) wrote : UdevLog.txt

apport information

Changed in libvirt (Ubuntu):
status: Incomplete → New
Changed in qemu-kvm (Ubuntu):
status: Incomplete → New
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Have you been able to verify whether this does not happen when virt-manager is not running?

Please try the following:

sudo mv /var/log/libvirt/libvirtd.log /var/log/libvirt/libvirtd.log.orig
echo "log_level = 1" | sudo tee -a /etc/libvirt/libvirtd.conf
sudo stop libvirt-bin; sudo start libvirt-bin

Then after one of the VMs has shut itself down,

sudo cp /var/log/libvirt/libvirtd.log /var/log/libvirt/log.submit
sudo stop libvirt-bin
sudo sed -i '/^log_level/d' /etc/libvirt/libvirtd.conf
sudo start libvirt-bin

Then attach /var/log/libvirt/log.submit to this report. This should provide more debugging information from libvirt about what has happened.

Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in qemu-kvm (Ubuntu):
status: New → Incomplete
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sorry, one more thing - could you also attach /var/log/kernel.log from a guest which has recently shut down? This may include useful acpi and kernel information.

Revision history for this message
Todor Andreev (toshko3) wrote :

Of course. It has shut off at these hours (directly from host's /var/log/libvirt/qemu/ubuntu-pdc.log):
2012-07-10 09:56:23.767+0000: shutting down
and
2012-07-10 11:50:09.191+0000: shutting down

Revision history for this message
Todor Andreev (toshko3) wrote :

I have set the log_level and have virt_manager disconnected. Thank you for your cooperation! Now I'm waiting, when it happens will write immideately.

Revision history for this message
Todor Andreev (toshko3) wrote :

Now it happened again. By the way, I have made a script to autostart it 10 seconds after ping lose. This time the virt-manager was not connected at the time it died. Also, the script couldn't start it again, because of some error. When I connected with virt-manager to see what is the status - it was Paused. I had to do a right-mouse-button > Force Off.
Now it did it again, at the time I'm writing this. Again had to Force Off, because it was Paused.

The log is about 100MB in size. The machine was lost at 7:47 and 8:05 (log times) (the logs are in UTC time, please write me if I can make them in my time-zone, which is +3 hours now).

Revision history for this message
Todor Andreev (toshko3) wrote :

There are no errors in the guest's syslog. Well I now saw the log and it seems it has IO errors on "device": "drive-virtio-disk1". How could I determine which of the two is this virtual disk - is there drive-virtio-disk0 or this is the first?
Also, now it was obviously in Paused state, but the majority of cases it just dies, it is not in Paused state, so I'm not sure this is the actual case we are waiting. Only god knows :-)
Thanks for any help!

Revision history for this message
Todor Andreev (toshko3) wrote :

Now it did it again - at 10:47. Same errors and symptoms (Paused).

Revision history for this message
Todor Andreev (toshko3) wrote :

Now it did it again - at 11:07. Same errors and symptoms (Paused).
Well, I see a pattern here. Every max. 15 min. before a round hour (for example 11.00) and every max. 10min. after this hour there are IO_ERRORs in the log and therefore pausing the machine.
I saw that other dyings are also 15 minutes around the hour. Not exactly at some minute, but definetly within 15 minutes before or after the round hour.

Revision history for this message
Alexander Hawk (saclista) wrote :

Hi Serge & Todor,

Same problem here :(

Ubuntu 12.04 server 64 bit host
Guests:
1. Debian 6 32 bit graphic desktop
2. Debian 6 64 bit graphic desktop

I`m new to kvm.
I created and maintain 2 debian machines over LAN from Virtual Machine Manager(http://virt-manager.org/).

Once or twice a day error message is:
qemuMonitorIO:603 : internal error End of file from monitor

and one or both of Debian shut down :(

Any help appreciated.
Alex

Revision history for this message
Alexander Hawk (saclista) wrote :
Download full text (3.5 KiB)

A part of /var/log/libvirtd/libvirtd.log file:

2012-07-11 15:15:11.083+0000: 1230: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-11 15:15:32.479+0000: 1225: error : virNetSocketReadWire:996 : End of file while reading data: Input/output error
2012-07-11 18:06:25.243+0000: 1225: error : qemuMonitorIO:603 : internal error End of file from monitor
2012-07-12 09:25:37.092+0000: 1231: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 09:25:37.092+0000: 1231: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 09:25:37.857+0000: 1236: error : virConnectNumOfInterfaces:9803 : this function is not supported by the connection driver: virConnectNumOfInterfaces
2012-07-12 09:25:47.537+0000: 1232: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 09:25:47.537+0000: 1232: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 09:25:48.291+0000: 1233: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 09:25:48.291+0000: 1233: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 09:26:08.761+0000: 1229: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 09:26:08.761+0000: 1229: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 09:26:09.636+0000: 1230: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 09:26:09.636+0000: 1230: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 14:03:55.930+0000: 1225: error : virNetSocketReadWire:996 : End of file while reading data: Input/output error
2012-07-12 14:04:16.928+0000: 1231: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 14:04:16.928+0000: 1231: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 14:04:17.536+0000: 1237: error : virConnectNumOfInterfaces:9803 : this function is not supported by the connection driver: virConnectNumOfInterfaces
2012-07-12 14:04:28.160+0000: 1232: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 14:04:28.160+0000: 1232: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 14:04:29.021+0000: 1230: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 14:04:29.021+0000: 1230: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 14:04:53.009+0000: 1230: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 14:04:53.009+0000: 1230: warning : qemuCapsInit:856 : Failed to get host power management capabilities
2012-07-12 14:04:53.645+0000: 1233: error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
2012-07-12 14:04:53.645+0000: 1233...

Read more...

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Alexander,

do you also see the same 15 minute pattern as Todor?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Todor,

could you run
   sudo grep -i cron /var/log/* > /tmp/crongrep
   ps -ef > /tmp/psout
   ls -l /etc/cron* > /tmp/cronls

in the ubuntu-pdc guest, and attach the resulting /tmp/crongrep, /tmp/cronls, and /tmp/psout files?

How many VMs do you have, how much memory is each allocated, and how much RAM do you have in the main machine (I think the boot messages said 4G?) Is it possible that VMs are shutting down because of memory pressure on the host, or do the VMs shut down even when you leave at least 500M free on the host?

Revision history for this message
Todor Andreev (toshko3) wrote :

I can confirm that the "Paused" case with IO_ERROR is happening when a user (samba client) starts to copy files (70GB of pictures with 10MB/s). It crashes almost immediately!
@Serge, the Ubuntu machine is with 1GB allocated and the Windows machine is with 2GB allocated. The host is 4GB. I had a look at this possibility already, but it is not the case.
At the time of writing this, I saw that the space on host's partition is finished and now it cannot start at all, becoming Paused right after the guest tries to write something. Well, OK, but the image size by definition is this:

root@nobel-host1:/home/images# qemu-img info ubuntu-pdc-vdb.img
image: ubuntu-pdc-vdb.img
file format: qcow2
virtual size: 1.7T (1825361100800 bytes)
disk size: 1.7T
cluster_size: 65536
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK
11 snapshot.sh 0 2012-07-06 00:02:02 00:00:00.000
12 snapshot.sh 0 2012-07-07 00:02:02 00:00:00.000
13 snapshot.sh 0 2012-07-10 00:02:02 00:00:00.000
14 snapshot.sh 0 2012-07-12 00:06:01 00:00:00.000
15 snapshot.sh 0 2012-07-13 00:06:03 00:00:00.000

But the real size is (as I am observing from a while) increasing. I don't know if it is from snapshot taking into the image file itself? Also now it didn't do any snapshot, the only thing is copying files into the guest. Isn't the size of the image supposed to be fixed at that, mentioned above? Why is it now this:

root@nobel-host1:/home/images# du -sm ubuntu-pdc-vdb.img
1785438 ubuntu-pdc-vdb.img

and at the beginning it was this:
1741809 ubuntu-pdc-vdb.img

It has grown with over 40GB for about half a month. I'm doing a snapshot every evening at working days, but am deleting the old ones too, leaving only 5 snapshots at a time.

Revision history for this message
Todor Andreev (toshko3) wrote :

I have copied all of the content of the guest machine, deleted the image ubuntu-pdc-vdb.img and created a new one without "fallocate", but only with preallocation=metadata. The problem is that the image is growing progressively and has obviously no limit, although I gave it one. The root filesystem's image is the same case:

qemu-img info ubuntu-pdc-vda.img
image: ubuntu-pdc-vda.img
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 14G
cluster_size: 65536

and for confirmation:
du -sh ubuntu-pdc-vda.img
15G ubuntu-pdc-vda.img

I made a test and saw that when I delete something from the guest, the real size of the image is not decreasing (I read it is normal). OK, but when I write something again, it doesn't use the freed space, but instead grows the image. So for example:
1. The initial physical size of the image is 1GB.
2. I copy 1GB of data in the guest. It's physical size becomes 2GB.
3. I delete this data (1GB). The physical size of the image remains 2GB.
4. I copy another 1GB of data to the guest.
5. The physical size of the image becomes 3GB.
6. And so on with no limit. It doesn't care if the virtual size is less.

Is this normal - the real/physical size of the image to be larger than the virtual limit???

Revision history for this message
Todor Andreev (toshko3) wrote :

OK, I have created another bug report for the increasing size of image file:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1025244

I'm monitoring now, after creating a new image (with no fallocate) and having enough space on the host, if the machine will die again. I suppose it will, because when there were no space, libvirtd.log said IO_ERROR and the machine was ALWAYS paused, not dead. My original problem, however, was the machine dies randomly and appeared as Shutoff. So now I'm monitoring with libvirtd.log in debugging mode and waiting.
Will write as soon as I have info.

Revision history for this message
Alexander Hawk (saclista) wrote :

@Serge There are no "15 minutes patterns" and no any IO_ERRORS (in guests` /var/log/syslog files)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Alexander,

could you please file a new bug? It sounds like yours has a different cause from Todor's. Please first rename /usr/share/apport/package-hooks/source_libvirt-bin.py to /usr/share/apport/package-hooks/source_libvirt.py. In other words:

1. sudo mv /usr/share/apport/package-hooks/source_libvirt-bin.py /usr/share/apport/package-hooks/source_libvirt.py
2. apport-bug libvirt-bin

@Todor,

I will mark this a duplicate of the new bug you created. thanks.

Revision history for this message
Todor Andreev (toshko3) wrote :

I can confirm that the machine doesn't die or pause when there is no "space" problem. All of the problems were caused by the space issue!
Although now I have another problem - qcow2 image is broken after 6 snapshots :-) Will file another bug.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Todor,

since you filed this bug, should this bug be marked invalid?

Revision history for this message
Todor Andreev (toshko3) wrote :

Well, I don't think that this is a desired behavior. I don't think it should be put in the wishlist neither.
Since the machine dies, pauses and whatever, I think this is not a desired/predicted way of managing such a situation.
I think it should manage to be at least alive! Don't you think so?
Then it should give some warnings in any way to the administrator.
I imagine how many admins will go through this and only after some nightmare days will realize what is happening. Because, strangely enough, the space was not at the end when I was checking it, it has been going to end only when the client was moving some 100GB of files to the virtual machine, through samba. And after getting an error, he was canceling the transfer and the image was back at its original size, again. Black magic or whatever, but a fact!
I'm sorry I can't help with it... I'm not a programmer and can help only with filling bugs :-(

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Todor,

sorry, please disregard - I was misremembering the cause.

Revision history for this message
Johan Boger (jboger) wrote :

I have the same issue, without filling virtual space. All the other things are exactly the same, though.

Is there a patch available yet?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Johan

Could you please file a new bug using 'ubuntu-bug qemu-kvm' and describe exactly what you see and how you reproduce it?

Revision history for this message
Davide Guerri (davide-guerri) wrote :

I also think this is not a duplicate of bug #1025244 although they might be somewhat related to each other.

Same issue here: random "internal error End of file from monitor" and "graceful" shutdown. No space issue on the VM

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Davide,

what exactly do you mean by not a duplicate? Do you mean your bug is not caused by out of space? If so, then again please file a new bug.

Technically I'd probably agree that this is not a duplicate of bug 1025244 - rather it is correct behavior (cleanly shutting down when unable to get enough space) in the face of the other bug. But this bug, as filed by the submitter, is about shutdowns caused by lack of disk space.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.