Live migration failed because of changed file permissions

Bug #1252519 reported by Barrow Kwan
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

OpenStack: Havana
OS: CentOS 6.4
Shared storage with GlusterFS: /var/lib/nova/instances mounted on a GlusterFS share
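
For context, every compute node mounts the same GlusterFS volume at the instances path; a typical /etc/fstab entry would look like this (the hostname and volume name here are hypothetical):

    gluster01:/nova-instances  /var/lib/nova/instances  glusterfs  defaults,_netdev  0 0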

The instance starts up fine on node01. When a live migration happens, the instance moves to node02 but fails with the following error:

2013-11-18 16:27:37.813 9837 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/aa1deb40-ae1d-45e4-a37e-7b0607df372f/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/var/lib/nova/instances/aa1deb40-ae1d-45e4-a37e-7b0607df372f/disk'\n"
2013-11-18 16:27:37.813 9837 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2013-11-18 16:27:37.813 9837 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/openstack/common/periodic_task.py", line 180, in run_periodic_tasks
2013-11-18 16:27:37.813 9837 TRACE nova.openstack.common.periodic_task task(self, context)

The problem is with the file ownership of "console.log" and "disk". Those files should be owned by user "qemu" and group "qemu", but after the migration both files are owned by root:

drwxr-xr-x 2 nova nova 53 Nov 18 13:40 .
drwxr-xr-x 6 nova nova 110 Nov 18 13:43 ..
-rw-rw---- 1 root root 1546 Nov 18 13:43 console.log
-rw-r--r-- 1 root root 12058624 Nov 18 13:42 disk
-rw-r--r-- 1 nova nova 1569 Nov 18 13:42 libvirt.xml
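
For reference, the expected ownership can be restored by hand as a stopgap (the next migration flips it back; the path is taken from the listing above):

    chown qemu:qemu /var/lib/nova/instances/aa1deb40-ae1d-45e4-a37e-7b0607df372f/console.log \
                    /var/lib/nova/instances/aa1deb40-ae1d-45e4-a37e-7b0607df372f/disk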

tags: added: compute live-migrate
tags: added: security
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
tags: added: libvirt
Changed in nova:
assignee: nobody → Facundo Maldonado (facundo-n-maldonado)
Revision history for this message
Facundo Maldonado (facundo-n-maldonado) wrote :

Cannot reproduce.
OpenStack: fresh devstack installation, 1 controller / 3 compute nodes
Shared storage with GlusterFS: 2 nodes/ 2 bricks
OS: Ubuntu server 12.04

Instances were migrated between compute nodes without problem.
The files "console.log" and "disk" are owned by root after the first migration, but that do not
prevent to keep migrating.

Any other clue to try to reproduce this bug?

Revision history for this message
haruka tanizawa (h-tanizawa) wrote :

Hi Barrow Kwan,
What do you think about Maldonado's comment?

Revision history for this message
Maurice Leeflang (malicure) wrote :

I have the same problem.
I am currently trying to isolate its cause.

The first live migration of an instance works, but the second one (back to the first node) fails.
The ownership changes, and even manually as root I am not allowed to read the disk file.
Only when KVM closes the file descriptors on these files (when I suspend the instance, for example) can the files be read again. The files can still be read when the instance is resumed, and a live migration is even possible again then. The ownership after a resume is qemu again (not root), so I can understand why Barrow points that way.

It smells like some locking situation on the Gluster side, but I am not able to pinpoint it to a configuration option or bug yet.
I will do some more tests to see why (and how) Gluster is spoiling the fun.
Please do not put the blame on some Gluster bug or behaviour yet.
The fact that the ownership of instance files on shared storage changes indicates that a resize or something similar is still being done post-migration, which, IMHO, is not needed in the case of a live migration with the nova instances directory on shared storage for all participating compute nodes.

Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Dylan (dylan-hager) wrote :

I think this is related to a bug I found on Bugzilla. The workaround solved my issue.

Excerpt from the bug describing the workaround for this issue (a shell sketch of these steps follows the link below):

* stop your guests
* stop libvirt-bin
* edit /etc/libvirt/qemu.conf - this contains a commented-out entry 'dynamic_ownership=1', which is the default. Change this to 0 and remove the comment.
* Do a chown to libvirt-qemu:kvm for all your stopped images.
* Start the service libvirt-bin again
* Bring up the guests
* Repeat on the other half of your cluster
* Test a live migration - for me, they work again.

https://bugzilla.redhat.com/show_bug.cgi?id=1057645
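
A minimal sketch of those steps as shell commands, assuming the Ubuntu service name libvirt-bin and the libvirt-qemu:kvm ownership from the excerpt (on CentOS the service would be libvirtd and the owner qemu:qemu):

    # stop the guests first, then keep libvirt from re-chowning files while we edit
    sudo service libvirt-bin stop

    # in /etc/libvirt/qemu.conf, uncomment dynamic_ownership and set it to 0
    sudo sed -i 's/^#\?dynamic_ownership *= *1/dynamic_ownership = 0/' /etc/libvirt/qemu.conf

    # hand the stopped images back to the qemu process user
    sudo chown libvirt-qemu:kvm /var/lib/nova/instances/*/disk /var/lib/nova/instances/*/console.log

    sudo service libvirt-bin start

Repeat on each compute node, then test a live migration.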

Revision history for this message
Hendrik Frenzel (hfrenzel) wrote :

Great, this works for me.
Thank you, Dylan.

Revision history for this message
Maurice Leeflang (malicure) wrote :

Works for me too!
Thanks a lot Dylan!

Changed in nova:
assignee: Facundo Maldonado (facundo-n-maldonado) → nobody
Revision history for this message
Hendrik Frenzel (hfrenzel) wrote :

I would suggest setting user="nova" and group="nova" in qemu.conf too.
Without them, a permission denied error occurred on instance creation:

2014-09-08 20:08:33.267 4740 TRACE nova.compute.manager [instance: 6fbc9080-772d-4c18-a630-2196763742bf] qemu-kvm: -drive file=/var/lib/nova/instances/6fbc9080-772d-4c18-a630-2196763742bf/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none: could not open disk image /var/lib/nova/instances/6fbc9080-772d-4c18-a630-2196763742bf/disk: Permission denied
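
Put together, the relevant /etc/libvirt/qemu.conf lines would then read as follows (a sketch; whether "nova" or a distribution-specific qemu user is right depends on your deployment):

    user = "nova"
    group = "nova"
    dynamic_ownership = 0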

Revision history for this message
Maurice Leeflang (malicure) wrote :

Agreed, and I managed to avoid bringing all running instances down by chowning the disk and monitor files to group nova and then running chmod g+w on those files.
Then, assuming all nova nodes are using the working qemu configuration, it is possible to live migrate the instances, and after that migration the files can also be chowned to user nova.
That way there is no need for downtime.
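
A rough sketch of that sequence for a single instance, assuming the instance path format from the logs above (the UUID is a placeholder, and console.log stands in for the monitor file mentioned; the exact file set may differ):

    INST=/var/lib/nova/instances/aa1deb40-ae1d-45e4-a37e-7b0607df372f

    # while the instance is still running: give group nova write access
    sudo chgrp nova $INST/disk $INST/console.log
    sudo chmod g+w $INST/disk $INST/console.log

    # live migrate the instance to a node with the fixed qemu.conf, then:
    sudo chown nova $INST/disk $INST/console.log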

Revision history for this message
chrone (chrone81) wrote :

I still get the following errors when trying to launch an instance on GlusterFS shared storage on an up-to-date Ubuntu 14.04.1 LTS, even after changing the user, group, and dynamic_ownership settings as suggested above:

"libvirtError: internal error: process exited while connecting to monitor: qemu-system-x86_64: -chardev file,id=charserial0,path=/var/lib/nova/instances/3ce9acb6-fcfd-4042-93ee-372fe2221f4d/console.log: Could not open '/var/lib/nova/instances/3ce9acb6-fcfd-4042-93ee-372fe2221f4d/console.log': Permission denied"

"libvirtError: internal error: process exited while connecting to monitor: Could not access KVM kernel module: Permission denied
2014-10-31 04:03:36.246 5008 TRACE nova.compute.manager [instance: 76c0f9bf-d438-4f19-841a-0cc1a8cc3013] failed to initialize KVM: Permission denied"
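
The second error usually means the user QEMU now runs as cannot open /dev/kvm; a hedged check, assuming the qemu user was changed to nova as suggested above:

    ls -l /dev/kvm                 # typically root:kvm with mode 0660
    sudo usermod -a -G kvm nova    # grant the (assumed) qemu user access
    # restart libvirt so the supplementary group takes effect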

Eli Qiao (taget-9)
Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

Eli,

do you have an idea how to fix this issue? I believe it should be fixed in libvirt (don't chown if Gluster is detected):

https://bugzilla.redhat.com/show_bug.cgi?id=1057645
https://bugzilla.redhat.com/show_bug.cgi?id=1004673
https://bugzilla.redhat.com/show_bug.cgi?id=714997

Revision history for this message
Eli Qiao (taget-9) wrote :

Thanks Pawel, that seems to need a fix in libvirt.

Changed in nova:
assignee: Eli Qiao (taget-9) → nobody
Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

This is a regression introduced in Gluster 3.4.1, as pointed out here: http://www.gluster.org/pipermail/gluster-users/2014-January/015885.html
There is a workaround for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1057645#c7

Marking as Invalid as this is not a nova bug.

Changed in nova:
status: Confirmed → Invalid