permission denied on console.log during some migrations when using Quobyte for cinder volumes

Bug #1549828 reported by Silvan Kaiser on 2016-02-25
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Unassigned

Bug Description

The core issue are console.log files with ownership root:root and permissions 600 during migration where a target hosts nova process is unable to access these console.log files.

The setup is kilo version OpenStack with Quobyte backend running with nas_secure_file_permissions and nas_secure_file_operations set to true and libvirts qemu.conf set to dynamic_ownership=0 and qemu running as nova:cinder user:group.

After testing the different types migration (cold migration and live migration) i see two scenarios so far that produce a console.log file with the afore mentioned access issue:
1) cold migration on a stopped instance who's image resides in a (Quobyte) Cinder volume
2) live migrations of instances who's images reside either in ephemeral storage or in a (Quobyte) Cinder volume

On all occasions all files related to the instance are found to have ownership nova:nova with permissions 644 with the sole exception of the console.log file, e.g.:

[root@server06 log]# ls -lah /prod/openstack-nova/instances/0dfdfd1e-7f5e-47ea-a63c-06c4f84ae2d8*
total 22M
drwxr-xr-x. 1 nova nova 0 Feb 25 14:05 .
drwxrwxrwx. 1 root root 0 Feb 25 14:05 ..
-rw-------. 1 root root 0 Feb 25 14:05 console.log
-rw-r--r--. 1 nova nova 22M Feb 25 14:03 disk

Libvirt does not have a detection for Quobyte as a shared filesystem at this point and thus believes the files to reside on a local filesystem.

Now i'm trying to find out
1) Who creates console.log at which point (so i can check where the ownership & permissions for that are taken from)
2) Who manipulates console.log during migration (so i can check for the same as above)

General observation: libvirts qemu settings for dynamic ownership and user:group do work as the other files are managed according to these settings but console.log seems to be an exception (maybe not touched by libvirt at all? Is Nova setting this?)

Any help appreciated.

Silvan Kaiser (2-silvan) on 2016-02-25
tags: added: cinder volume
Sean Dague (sdague) wrote :

Can you actually build a reproduce for this? Because get_console_output does a chown, so I'm not sure what the concern here is. https://github.com/openstack/nova/blob/058e21ec984fb602e8ed293e2aea32267ca1143e/nova/virt/libvirt/driver.py#L2855

Changed in nova:
status: New → Incomplete
stgleb (gstepanov) wrote :

All activities for copying console.log file are made by libvirt. So console log, after copying has
qemu-libvirt owner.
So , as Sean mentioned above you can do nova console-log <instance-name> and nova wil chown this file.

FYI https://bugs.launchpad.net/nova/+bug/1203193 console.log related bug in block live migration.
and review https://review.openstack.org/284674/

Silvan Kaiser (2-silvan) on 2016-03-01
description: updated
Silvan Kaiser (2-silvan) wrote :

Thanks for the feedback!

console-log is a good hint on how to mitigate this issue, prevents the user from having to manually find the file and correct this. Nice.

Nevertheless live-migrations or starting an instance after a stopped instance migration can fail because of this.

I can reproduce this with a CentOS 7, RDO / Kilo release OpenStack installation running with Quobyte storage. Cinder is running with nas_secure_file_permissions and nas_secure_file_operations, libvirt is running qemu with dynamic_ownership = 0 (off) and nova:cinder ownership. If you want me to try any reconfigurations, etc. please feel free to contact me (kaisers@irc) or comment in here.

I can mitigate this by running e.g.: 'nova console-log <instanceid> && nova live-migration <instanceid> <targethost>' which works more often than not. But this is not reliable. I'd prefer to have the (live-)migration commands to work reliably on itself.

I'd think this issue should come up with GlusterFS setups running nas_secure_file_* = true, too. Has anybody experiences with that kind of setup?

Silvan Kaiser (2-silvan) wrote :

BTW, thanks for the console.log related change links!

Changed in nova:
status: Incomplete → New
Silvan Kaiser (2-silvan) wrote :

I did some more detailed testing by migrating a VM up and down and all around my setup and monitoring it's files, some insights:

1) console.log belonging to root:root with 600 is created when the live-migration command is issued. This command silently fails, meaning Nova reports no error and the VM does not migrate. Only digging on the source hosts logs reveals:

2016-03-02 11:48:14.250 5338 ERROR nova.virt.libvirt.driver [req-b84dd9be-6f61-4c9c-80d7-edead59fdd37 d544aad717f14bd1a20ccdc9d78281e1 5a74d35b546e46039ff2daf48565f186 - - -] [instance: 6c49d8ea-9a62-4db7-a46e-c
462c1c4a08a] Live Migration failure: internal error: process exited while connecting to monitor: 2016-03-02T10:48:14.073332Z qemu-kvm: -chardev file,id=charserial0,path=/prod/openstack-nova/instances/6c49d8ea-9a
62-4db7-a46e-c462c1c4a08a/console.log: Could not open '/prod/openstack-nova/instances/6c49d8ea-9a62-4db7-a46e-c462c1c4a08a/console.log': Permission denied

2) Once the console-log command is run the ownership for console.log changes to nova:root 600, as pointed out previously. This is sufficient to complete a second live-migration attempt that works without issues now.

So this issue should be solvable by doing a chown during the live-migration action, preferably during/after creation of console.log .

Silvan Kaiser (2-silvan) wrote :

Might this be the point where console.log is created with root:root 600 ?
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6676

Then a solution would be to add a chown right afterwards or change the call to use the correct ownership.

I'm not sure why there's two or three chown calls for the console.log file, througout the driver.py file. Shouldn't this be set right at the creation of this file?

I'll probably add a change for further discussion on this.

tags: added: live-migration
Silvan Kaiser (2-silvan) wrote :

After some unsuccessful fix testing.

How does the Nova process create a console.log file with root:root ownership in the first place???
If this file is created by the aforementioned 'touch' command it should have the owner of the Nova process, shouldn't it? This should be the user nova. So how is the file created with root:root?

Silvan Kaiser (2-silvan) wrote :

More testing done.

- Adding a chown prior to migrating helps but only in case console.log is already present
- console.log is created with root:root if it does not exist during the migration process, which fails because of this
- The weird fact is that despite qemu-kvm running as nova:cinder ownership the file console.log is still created with root:root ???

Seems to me this is related to an earlier issue i ran into: #1473116

Silvan Kaiser (2-silvan) wrote :

Seems to be related to this issue (although in our case libvirts qemu is configured as specified in the described workaround in this ticket):
https://bugzilla.redhat.com/show_bug.cgi?id=1004673

Matt Riedemann (mriedem) on 2016-03-03
tags: added: kilo-backport-potential liberty-backport-potential
stgleb (gstepanov) wrote :

https://review.openstack.org/#/c/284674/ this patch may solve this. It actually happens during block
live migration, right after that console.log owned by qemu-libvirt or root and should be owned by nova.

Silvan Kaiser (2-silvan) wrote :

Thanks for the hint! However I've not been using block live migration, only live migration on a shared file system. Therefore i fear this will not solve this.

stgleb (gstepanov) on 2016-03-09
Changed in nova:
assignee: nobody → stgleb (gstepanov)
status: New → In Progress
stgleb (gstepanov) wrote :

Hello, i've read your bug report more carefully. If you are interested who
operates with console.log - it is libvirt. libvirt-qemu or root are owners of this file and they move it during migration.
but you can simply call
nova console-log <instance>

and observe that in instance directory console log is owned by stack or nova or
something else user that runs nova.
So there is no problem that console.log is denied, use root permissions or open it using
nova console log.

Changed in nova:
status: In Progress → Invalid
Silvan Kaiser (2-silvan) wrote :

Hi Gleb, thanks for looking into this!

The remaining Problem in this issue is that the live migration fails (without throwing an error, it just does not migrate). Please see comment #5.
Furthermore VMs affected by this cannot be started by Nova, e.g. if beeing restarted.

Beeing familiar with this ticket means one can easily mitigate by running console-log on the instance but if a user does not know that he is unable to run these migrations.

Changed in nova:
status: Invalid → New
stgleb (gstepanov) on 2016-03-14
Changed in nova:
status: New → Invalid
status: Invalid → In Progress
Silvan Kaiser (2-silvan) wrote :

Btw, if more questions arise, please feel free to contact me in IRC (kaisers), i'd be happy to help.

stgleb (gstepanov) wrote :

I think, cloud operator should use nova tools to observe logs, instead of manually open these files.

I'll try to reproduce it in Kilo. But in my current installation made from devstack/master this bug is not reproducable.

Silvan Kaiser (2-silvan) wrote :

For reproduction you need a filesystem based volume driver. Iirc GlusterFS won't work, sadly, as libvirt has builtin GlusterFS support which detects the shared filesystem and thus causes libvirt to refrain from touching the files. So it probably never changes ownership for console.log with GlusterFS. I'm not sure wich filesystems are available to you.
Regarding options you need to set the following:
- nas_secure_file_owernship/-permissions = True should be used with Cinder (otherwise all files will run with root:root / 666 in insecure mode)
- libvirt has to run with the following options set in /etc/libvirt/qemu.conf:
    - dynamic_ownership = 0
    - user = "nova"
    - group = "cinder"

If this is not reproducable for you perhaps i can try to write a fix myself but i need some nova insight & help. So far i've not been able to find the right point where to fix this.

Btw this has nothing to do with manually opening those files. When running the live migration the console.log is changed to root:root which a) causes the live-migration to 'not happen' (fail but no error thrown) and b) causes subsequent vm starts to fail (this time including a thrown error). The fact that the nova console-log command includes a chown on the console.log file which repairs the ownership is a nice mitigation to these issues but no final solution imho.

Augustina Ragwitz (auggy) wrote :

I've set this back to "New" because it looks like it hasn't been reproduced. If there is a patch for this bug, please make sure it has the "closes-bug" tag in the commit message.

Changed in nova:
status: In Progress → Confirmed
status: Confirmed → New
assignee: stgleb (gstepanov) → nobody
Silvan Kaiser (2-silvan) wrote :

I can reproduce this continuously in the setup described in the earlier comments.

Augustina Ragwitz (auggy) wrote :

We need someone other than the person reporting the bug to be able to reproduce it. If we can't follow the steps to reproduce then we can't debug it.

Silvan Kaiser (2-silvan) wrote :

For anyone trying to reproduce, please see comment #16 where i outlined the env.

Sean Dague (sdague) wrote :

I think we can mark this as confirmed, but as has been said, this only happens when specific filesystem based volume drivers are used as backends. Being specific about that means this is definitely a low priority bug as it impacts a small slice of our users, and is not as dire as failing all the time.

summary: - permission denied on console.log during some migrations
+ permission denied on console.log during some migrations when using
+ Quobyte for cinder volumes
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
yatin (yatinkarel) wrote :

I too faced this ERROR when executed: nova evacuate:
Some Environment Info:- Two compute nodes
Openstack Newton
Distro: Centos7
openstack-nova-api-14.0.2-1.el7.noarch
python2-novaclient-6.0.0-1.el7.noarch
openstack-nova-compute-14.0.3-1.el7.noarch
libvirt-daemon-2.0.0-10.el7_3.4.x86_64

Workaround used:-
chown -R nova:nova /var/lib/nova/instances on target host and
then "nova evacuate" didn't face the error.

Charles Pike (luowei) wrote :

I also see this ERROR when doing an evacuate:
Two compute nodes
Openstack Ocata
Distro: Ubuntu 16.04
Standard tutorial manual build plus shared /var/lib/nov/instances

Workaround used:-
chmod 666 /var/lib/nova/instances/console.log on target host and
then "nova evacuate" didn't face the error.

information type: Public → Public Security
information type: Public Security → Public
Jan Walzer (jwalzer) wrote :

We are also facing this error here on Ubuntu/Xenial and for our customers it is a bugger when 20% of the vms are not migrating, when they rather should. It simply means this will break our autoscaling and autoscheduling and update-process on the plattform, because we can't drain hosts of all the VMs.

We're using Quobyte-Latest(currently 1.3.17) here as shared storage.

nova-api: 13.1.3-0ubuntu1
nova-compute: 13.1.3-0ubuntu1
python-novaclient: 3.3.1-2ubuntu1
libvirt0: 1.3.1-1ubuntu10.8

As Silvan pointed out the extra step of getting the console and then doing a live-migration is nothing more than a desperate kludge and far from feasable for us as operators as well as for our customers.
We could as well place an inotify in place that monitors any file and changes permissions when needed.
Thats the same level of uggly mess and this should be fixed at the root instead: Whoever creates the file with the wrong permissions shouldn't do that in the first place.

Sean Dague (sdague) wrote :

Automatically discovered version kilo in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.kilo
Silvan Kaiser (2-silvan) on 2017-06-28
tags: added: openstack-version. openstack-version.liberty openstack-version.mitaka openstack-version.newton
tags: removed: openstack-version.
Maciej Kucia (maciejkucia) wrote :

This seems to be alive and kicking on Ocata

[...]
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2964, in _ensure_console_log_for_instance
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server libvirt_utils.file_open(console_file, 'a').close()
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/utils.py", line 350, in file_open
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server return open(*args, **kwargs)
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server IOError: [Errno 13] Permission denied: '/var/lib/nova/instances/e23edae7-f259-4fbf-b6ff-0d016044acbd/console.log'
2017-07-18 11:28:48.949 12914 ERROR oslo_messaging.rpc.server

Maciej Kucia (maciejkucia) wrote :

The problem was resolved for me in Ocata when following fix was introduced:
 Correct _ensure_console_log_for_instance implementation
 https://review.openstack.org/#/c/469012/

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.