Failed to start instances with encrypted volumes

Bug #1821696 reported by Magnus Lööf
Affects                    Status          Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released    Medium      Lee Yarwood
  Queens                   Fix Committed   Medium      Lee Yarwood
  Rocky                    Fix Released    Medium      Lee Yarwood
  Stein                    Fix Released    Medium      Lee Yarwood
kolla-ansible              Invalid         Undecided   Unassigned

Bug Description

Description
===========
We hit this bug after doing a complete cluster shutdown due to server room maintenance. The bug is, however, more easily reproducible (see the note under Steps to reproduce).

When cold starting an instance with an encrypted volume attached, it fails to start with a VolumeEncryptionNotSupported error.

https://github.com/openstack/os-brick/blob/stable/rocky/os_brick/encryptors/cryptsetup.py#L52

Steps to reproduce
==================

* Deploy OpenStack with Barbican support using Kolla.
* Create an encrypted volume type
* Create an encrypted volume
* Create an instance and attach the encrypted volume
* Enjoy your new instance and volume, install software and store data
* In our case, we shut down the entire cluster and restarted it again. First, all instances were stopped in Horizon using the Shut Down Instance command. We use Ceph, so we then stopped it following these procedures: https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/. After that we shut down the compute/storage nodes and then the controller nodes, one by one. Finally we started the cluster in the reverse order, verified all services were up and running, examined the logs, and started the instances.
* Instances without encrypted volumes started fine.
* Instances with encrypted volumes fail to start with VolumeEncryptionNotSupported.

Note: It is possible to recreate the problem by using a Hard Reboot (possibly related to https://bugs.launchpad.net/nova/+bug/1597234) or by shutting down instances and then restarting all OpenStack services on that compute node.

Expected results
================
Instances with encrypted volumes should start fine, even after a Hard Reboot or a complete cluster shutdown.

Actual results
==============
Instances with encrypted volumes failed to start with VolumeEncryptionNotSupported

https://pastebin.com/mvMbJQRb

Environment
===========

1. OpenStack version
The environment is deployed by Kolla (Rocky release).

2. Hypervisor
KVM on RHEL

3. Storage type
Ceph using Kolla (Rocky release)

Analysis
========
There seems to be a problem related to this code not behaving as expected:

https://github.com/openstack/nova/blob/stable/rocky/nova/virt/libvirt/driver.py#L1049

It seems that the exception is expected to be ignored and logged, but for some reason `ctxt.reraise = False` does not take effect:

self.force_reraise() is called at https://github.com/openstack/oslo.utils/blob/stable/rocky/oslo_utils/excutils.py#L220, which should not be reached since `reraise` is expected to be `False`.
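For reference, a minimal sketch of the pattern as we read it (simplified, not the actual driver code; only the oslo.utils API is real):

from oslo_utils import excutils

def disconnect_volume_sketch(disconnect, destroy_disks):
    # save_and_reraise_exception() re-raises the saved exception by default
    # when the with-block exits; setting ctxt.reraise = False before that
    # point makes it log the exception instead of re-raising it.
    try:
        disconnect()
    except Exception:
        with excutils.save_and_reraise_exception() as ctxt:
            if destroy_disks:
                # Only swallow the error when the disks are being
                # destroyed anyway.
                ctxt.reraise = False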

We did some hacking and just swallowed the exception by commenting out the `excutils.save_and_reraise_exception()` section and replacing it with a simple `pass`.

The instance then started, but it could not boot from the image. However, it was then possible to remove the encrypted volume attachment, reboot the server, and reattach the encrypted volume.

description: updated
Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Note that it was not possible to remove the encrypted volume attachment from the affected hosts - that would also yield a VolumeEncryptionNotSupported error.

tags: added: volumes
tags: added: encryption
pandatt (pandatt)
Changed in nova:
status: New → Confirmed
assignee: nobody → pandatt (pandatt)
Revision history for this message
Lee Yarwood (lyarwood) wrote :

The value of reraise should definitely be True when we hit VolumeEncryptionNotSupported.

Looking at the provided trace it's pretty clear that the LuksEncryptor encryptor class provided by os-brick is calling the following __init__ code within CryptsetupEncryptor, causing this mess:

https://github.com/openstack/os-brick/blob/00a4d96d2506bed5c5507282a774bc75df9f790f/os_brick/encryptors/cryptsetup.py#L47-L54
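In rough terms that check amounts to the following (a simplified sketch, not the verbatim os-brick code; the exception kwargs are approximate):

from os_brick import exception

def check_local_device(connection_info):
    # CryptsetupEncryptor (and therefore LuksEncryptor) needs a local block
    # device to operate on. Network volumes such as rbd provide no
    # device_path, so the constructor bails out immediately.
    data = connection_info['data']
    if not data.get('device_path'):
        raise exception.VolumeEncryptionNotSupported(
            volume_id=data.get('volume_id'),
            volume_type='rbd')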

Revision history for this message
Lee Yarwood (lyarwood) wrote :

> The value of reraise should definitely be True when we hit VolumeEncryptionNotSupported.

Apologies, to be clear: I mean that it should be True when we hit VolumeEncryptionNotSupported in the context of starting an instance, reboot, etc., where destroy_disks is also False.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

So the fact that Nova is even creating an encryptor object in the first place is incorrect. Reviewing our _detach_encryptor code suggests that the volume secrets have gone missing in order for this to happen:

https://github.com/openstack/nova/blob/9bb78d5765dab01e38327f57312583c189a352d5/nova/virt/libvirt/driver.py#L1381-L1382

Did you manually remove the associated libvirt secrets for this volume?

These should persist across reboots as they are created with ephemeral=False by default.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

> Did you manually remove the associated libvirt secrets for this volume?

No, we did no such thing. We followed the procedure described in the issue description: shut down the instances, then Ceph, then the OpenStack services.

I could reproduce this in our lab, and also reproduce the issue by performing a `Hard Reboot` in Horizon.

> These should persist reboots as they are created with ephemeral=False by default.

Reboots from "within" the instance, as well as Live Migration and Soft Reboot, work fine.

I believe (I cannot verify from where I am right now...) that even `Shut Down Instance` followed by `Start Instance` works fine.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

What process did you follow to restart the services being deployed as Kolla containers?

Can you run the following command as root within the nova-libvirt container before and after a restart when instances are running with encrypted volumes attached?

$ virsh secret-list

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

> What process did you follow to restart the services being deployed as Kolla containers?

Restart the physical servers, then verify the service status in Horizon, and `docker restart` if there are any errors.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

> What process did you follow to restart the services being deployed as Kolla containers?

Also note that it is not necessary to restart any containers to reproduce the error. A Hard Reboot from the Horizon GUI is sufficient.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

> Also note that it is not necessary to restart any containers to reproduce the error.
> It is sufficient with a Hard Reboot from Horizon GUI.

I can't reproduce this in an F29 devstack env using Ceph.

I'm still convinced that you're hitting this due to the associated Libvirt volume secrets going AWOL in your environment before the instance hard reboots.

Again it would be super useful if you could run the following commands within the nova-libvirt container to confirm if the secrets are present before you hard reboot:

$ sudo virsh secret-list
$ sudo ls /etc/libvirt/secrets/

If they aren't then IMHO this is a kolla/kolla-ansible issue.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

FWIW I can reproduce this artificially in devstack by manually removing the associated volume secret:

$ sudo virsh secret-list
 UUID Usage
--------------------------------------------------------------------------------
 6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4 volume ba7486b3-4ea5-4715-89f3-1ec86b0d9812
 e4897c8d-b271-44e8-b366-367ecddb8a3d ceph client.cinder secret

$ nova stop test
Request to stop server test has been accepted.

$ virsh secret-undefine 6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4

$ nova start test

$ journalctl -u devstack@n-*
[..]
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 2895, in start_instance
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server self._power_on(context, instance)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 2865, in _power_on
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2992, in power_on
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server self._hard_reboot(context, instance, network_info, block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2839, in _hard_reboot
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server block_device_info=block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1047, in destroy
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server destroy_disks)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1132, in cleanup
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server instance=instance)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server self.force_reraise()
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1119, in cleanup
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]:...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649951

Changed in nova:
assignee: pandatt (pandatt) → Lee Yarwood (lyarwood)
status: Confirmed → In Progress
Revision history for this message
Lee Yarwood (lyarwood) wrote :

^ This change works around the manual out-of-band removal of Libvirt secrets in my devstack environment and allows instances to restart.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Some more information:

When doing a `Hard Reboot` on an instance with encrypted volumes attached (the image volume is not encrypted), the system fails to boot, showing a "booting from hard disk" message on the console.

However, this scenario is recoverable by following this procedure:

1. Shut down instance from Horizon
2. Remove attached encrypted volume
3. Start instance (boots fine)
4. Reattach encrypted volume
5. Reboot from Horizon or from within instance works fine

`$ virsh secret-list` shows the same information before and after Hard Reboot:

UUID Usage
xxxxxxxxx ceph client.cinder secret
yyyyyyyyy volume <volume guid>
zzzzzzzzz ceph client.nova secret

Revision history for this message
Lee Yarwood (lyarwood) wrote :

^ This is a completely different issue to the VolumeEncryptionNotSupported exception from the initial description that leaves the instance in an ERROR state.

IMHO this should live in a separate bug as it suggests that your instance is attempting to boot from a non-bootable volume over the bootable image for some reason.

It would be useful to review the output of the following commands to determine what's going on here:

$ virsh domblklist ${instance-uuid}
$ virsh dumpxml ${instance-uuid}

Revision history for this message
Lee Yarwood (lyarwood) wrote :

FWIW I don't see any changes to the disk ordering that could cause this in my devstack env:

$ sudo virsh domblklist 8ddce596-52f7-4e60-b714-11f60a4ab8d8
Target Source
------------------------------------------------
vda vms/8ddce596-52f7-4e60-b714-11f60a4ab8d8_disk
vdb volumes/volume-ba7486b3-4ea5-4715-89f3-1ec86b0d9812

$ nova reboot --hard 8ddce596-52f7-4e60-b714-11f60a4ab8d8
Request to reboot server test (8ddce596-52f7-4e60-b714-11f60a4ab8d8) has been accepted.

$ sudo virsh domblklist 8ddce596-52f7-4e60-b714-11f60a4ab8d8
Target Source
------------------------------------------------
vda vms/8ddce596-52f7-4e60-b714-11f60a4ab8d8_disk
vdb volumes/volume-ba7486b3-4ea5-4715-89f3-1ec86b0d9812

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

When shutting down an instance from Horizon, and then running `docker stop` followed by `docker start` on:
- nova_ssh
- nova_compute
- nova_libvirt

`$ virsh secret-list` shows that the volume secret is gone:

UUID Usage
xxxxxxxxx ceph client.cinder secret
zzzzzzzzz ceph client.nova secret

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Okay, thanks. That is definitely a Kolla bug then: it needs to persist /etc/libvirt/secrets/ somewhere and ensure it is injected back into the container on restart.

FWIW TripleO maps /etc/libvirt directly from the host into the container to work around this:

https://github.com/openstack/tripleo-heat-templates/blob/756b689fc354cbf617cec587fa669443a60d7ab5/deployment/nova/nova-libvirt-container-puppet.yaml#L646

affects: nova → kolla
affects: kolla → nova
Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Verified that the patch works :clap: by:

1. bricking a machine as described in the issue
2. patching the code (Rocky release, around line 1384)
3. rebooting the instance

I am still getting a problem with the boot order (virsh domblklist shows that the order has changed), but I believe that is related to something in my environment, so it can be ignored.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

^ I'd be interested in seeing the ordering of the devices in the domain XML and any n-cpu log snippets showing the instance booting if you are able to share.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

So here is:

1. Information about the volume and image. I have experimented in our lab with setting SCSI properties on the image; I believe this is the cause of the Hard Reboot problem. Horizon reports the boot image attached to /dev/vda, but inside the instance:

[cloud-user@malo-test ~]$ sudo fdisk -l

Disk /dev/sda: 21.5 GB, 21474836480 bytes, 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b288e

   Device Boot Start End Blocks Id System
/dev/sda1 * 2048 41943006 20970479+ 83 Linux

Disk /dev/sdb: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xc182ca0f

   Device Boot Start End Blocks Id System
/dev/sdb1 63 209715199 104857568+ 83 Linux

2. virsh domblklist before the hard reboot

3. nova-compute logs from when the instance was booted

4. virsh domblklist after the hard reboot

https://pastebin.com/ndXJH0Ed

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

and virsh dumpxml before and after Hard Reboot

https://pastebin.com/16FDKqh7

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Removing the SCSI properties from the image resolved the "booting from hard disk" problem.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

^ Yeah, that's an interesting use case. I think sda always wins the boot race there unless you set bootindex=0 on the initial image-based volume you're booting from. Actually, even then I'm not sure that we update the boot device in the XML to point to anything other than hd. It might be worth creating a separate bug for this.

Mark Goddard (mgoddard)
affects: kolla → kolla-ansible
Revision history for this message
Mark Goddard (mgoddard) wrote :

Looks like there's been some good investigation on this already. We already map /etc/libvirt/qemu to a Docker volume; it looks like we need to do the same for /etc/libvirt/secrets. Are there any other subdirectories of /etc/libvirt we should persist?

Magnus, could you try this patch to kolla-ansible?

diff --git a/ansible/roles/nova/defaults/main.yml b/ansible/roles/nova/defaults/main.yml
index 8081b7a..d5eddaf 100644
--- a/ansible/roles/nova/defaults/main.yml
+++ b/ansible/roles/nova/defaults/main.yml
@@ -21,6 +21,7 @@ nova_services:
       - "{{ nova_instance_datadir_volume }}:/var/lib/nova/"
       - "{% if enable_shared_var_lib_nova_mnt | bool %}/var/lib/nova/mnt:/var/lib/nova/mnt:shared{% endif %}"
       - "nova_libvirt_qemu:/etc/libvirt/qemu"
+      - "nova_libvirt_secrets:/etc/libvirt/secrets"
       - "{{ kolla_dev_repos_directory ~ '/nova/nova:/var/lib/kolla/venv/lib/python2.7/site-packages/nova' if nova_dev_mode | bool else '' }}"
     dimensions: "{{ nova_libvirt_dimensions }}"
   nova-ssh:

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Sure can! Not for a few days, though. I am away on a business trip.

When I was analysing Kolla Ansible, I came across this config:

https://github.com/openstack/kolla-ansible/blob/stable/rocky/ansible/roles/nova/templates/nova-libvirt.json.j2

As I understand that configuration and the code at https://github.com/openstack/kolla/blob/stable/rocky/docker/base/set_configs.py:

- The contents of /etc/libvirt/secrets will be cleared on each restart, since `merge` is not specified, and the host content of `"{{ container_config_directory }}/secrets"` will be copied in place. This directory contains the secrets for accessing ceph but not the instance-specific secrets.

Could a more elegant solution be to just modify the template with `merge: true`? I might be misunderstanding how the configuration is copied, but it's just an idea.
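A rough sketch of the copy semantics I mean (illustration only, not the actual set_configs.py code):

import os
import shutil

def copy_config_dir(source, dest, merge=False):
    # Without merge, the destination is wiped and replaced with the
    # host-side content, so any secrets libvirt created at runtime inside
    # the container are lost. With merge, the host content is copied on top
    # of whatever already exists.
    if not merge and os.path.exists(dest):
        shutil.rmtree(dest)
    shutil.copytree(source, dest, dirs_exist_ok=True)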

I believe that it is correct for Nova to *not* assume that the ceph secrets are in place, so I like the patch in Nova above.

Revision history for this message
Mark Goddard (mgoddard) wrote :

I proposed the above solution here: https://review.openstack.org/#/c/650853.

Your suggestion re merge is a good one: it clears up for me why restarting the container causes the problem, which I was not sure about initially. I'll add that to the above patch.

Revision history for this message
Mark Goddard (mgoddard) wrote :

It seems that making /etc/libvirt/secrets a Docker volume on its own causes the container to fail:

http://logs.openstack.org/53/650853/1/check/kolla-ansible-ubuntu-source-ceph/4051ac0/primary/logs/docker_logs/nova_libvirt.txt.gz

I think adding merge will avoid that issue though, since the delete is skipped in that case.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Just to check my understanding of the situation here: is there some process creating additional secrets under /etc/libvirt/secrets after the container has started, which are then getting wiped out by the container restart?

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

> is there some process that is creating additional secrets under /etc/libvirt/secrets after the container has started, that are getting wiped out by the container restart?

To my understanding, when an instance with an encrypted volume is scheduled onto a compute host, Nova collects the encryption key for the volume from Barbican.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Yeah, correct. Since Queens, with recent Libvirt and QEMU versions, Nova stores the passphrase (it's actually an asymmetric key, but that's another story) as a Libvirt secret used to decrypt a LUKS-encrypted volume on the host. These Libvirt secrets need to persist across reboots.
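For illustration, defining such a persistent secret with the libvirt Python bindings looks roughly like this (a sketch; the UUID, volume ID and passphrase are placeholders reusing the earlier example output):

import libvirt

SECRET_XML = """
<secret ephemeral='no' private='no'>
  <uuid>6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4</uuid>
  <usage type='volume'>
    <volume>ba7486b3-4ea5-4715-89f3-1ec86b0d9812</volume>
  </usage>
</secret>
"""

conn = libvirt.open('qemu:///system')
secret = conn.secretDefineXML(SECRET_XML, 0)
# Because the secret is not ephemeral, libvirt persists it (and the value
# set below) under /etc/libvirt/secrets/, which is why that directory has
# to survive container restarts.
secret.setValue(b'luks-passphrase-bytes', 0)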

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Mark Goddard (mgoddard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/649951
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=56ca4d32ddf944b541b8a6c46f07275e7d8472bc
Submitter: Zuul
Branch: master

commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

    libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encrytors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
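In rough terms, the guard described above amounts to something like this (an illustrative sketch, not the exact Nova code; the encryptor helper is passed in rather than taken from the driver):

def detach_encryptor_sketch(connection_info, encryption, get_encryptor):
    # If the volume never had a local device_path (e.g. an rbd volume), the
    # os-brick encryptors can never have been used for it, so there is
    # nothing to detach and we return early. The real change also checks
    # that native QEMU LUKS decryption is available before doing so.
    if encryption and not connection_info['data'].get('device_path'):
        return
    encryptor = get_encryptor(connection_info, encryption)
    encryptor.detach_volume(**encryption)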

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/656464

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/656462
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c6432ac0212d15b6d8f1620b42937b2abcb66d46
Submitter: Zuul
Branch: stable/stein

commit c6432ac0212d15b6d8f1620b42937b2abcb66d46
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

    libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encrytors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)

Revision history for this message
Mark Goddard (mgoddard) wrote :

Magnus, have you had a chance to test the kolla-ansible patch yet? (https://review.opendev.org/#/c/650853)

Revision history for this message
Mark Goddard (mgoddard) wrote :

Magnus?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/656463
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2c6e59e835b123d6040e2a059aaa98bf9cced392
Submitter: Zuul
Branch: stable/rocky

commit 2c6e59e835b123d6040e2a059aaa98bf9cced392
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

    libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encrytors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)
    (cherry picked from commit c6432ac0212d15b6d8f1620b42937b2abcb66d46)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.1

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/656464
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ec43a1348d871b52473b46bd895b609dc16fe8fe
Submitter: Zuul
Branch: stable/queens

commit ec43a1348d871b52473b46bd895b609dc16fe8fe
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

    libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encrytors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)
    (cherry picked from commit c6432ac0212d15b6d8f1620b42937b2abcb66d46)
    (cherry picked from commit 2c6e59e835b123d6040e2a059aaa98bf9cced392)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.11

This issue was fixed in the openstack/nova 17.0.11 release.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: In Progress → Triaged
milestone: 8.0.0 → none
assignee: Mark Goddard (mgoddard) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/650853
Committed: https://opendev.org/openstack/kolla-ansible/commit/1c63eb20d92febabbcb0dacbc35b0c89771d7202
Submitter: "Zuul (22348)"
Branch: master

commit 1c63eb20d92febabbcb0dacbc35b0c89771d7202
Author: Mark Goddard <email address hidden>
Date: Mon Apr 8 12:18:52 2019 +0100

    Persist nova libvirt secrets in a Docker volume

    Libvirt may reasonably expect that its secrets directory
    (/etc/libvirt/secrets) is persistent. However, the nova_libvirt
    container does not map the secrets directory to a volume, so it will not
    survive a recreation of the container. Furthermore, if Cinder or Nova
    Ceph RBD integration is enabled, nova_libvirt's config.json includes an
    entry for /etc/libvirt/secrets which will wipe out the directory on a
    restart of the container.

    Previously, this appeared to cause an issue with encrypted volumes,
    which could fail to attach in certain situations as described in bug
    1821696. Nova has since made a related change, and the issue can no
    longer be reproduced. However, making the secret store persistent seems
    like a sensible thing to do, and may prevent hitting other corner cases.

    This change maps /etc/libvirt/secrets to a Docker volume in the
    nova_libvirt container. We also modify config.json for the nova_libvirt
    container to merge the /etc/libvirt/secrets directory, to ensure that
    secrets added in the container during runtime are not overwritten when
    the container restarts.

    Change-Id: Ia7e923dddb77ff6db3c9160af931354a2b305e8d
    Related-Bug: #1821696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/797151

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/797151
Committed: https://opendev.org/openstack/kolla-ansible/commit/1fc58e74d0a026192b64934b7b23b5ee752dedcc
Submitter: "Zuul (22348)"
Branch: master

commit 1fc58e74d0a026192b64934b7b23b5ee752dedcc
Author: Mark Goddard <email address hidden>
Date: Fri Jun 18 19:57:40 2021 +0100

    Fix up 'Persist nova libvirt secrets in a Docker volume'

    Follow up fix for Ia7e923dddb77ff6db3c9160af931354a2b305e8d, which
    broke the cephadm jobs.

    Change-Id: Ieb39b41a6f493bd00c687610ba043a1b4e5945e7
    Related-Bug: #1821696

Changed in kolla-ansible:
status: Triaged → Invalid
no longer affects: kolla-ansible/rocky
no longer affects: kolla-ansible/stein
Changed in kolla-ansible:
importance: High → Undecided