Bug #1821696 “Failed to start instances with encrypted volumes” : Bugs : OpenStack Compute (nova)

Magnus Lööf (magnus-loof) on 2019-03-26

description:

updated

Revision history for this message

Magnus Lööf (magnus-loof) wrote on 2019-03-26:

#1

Note that it was not possible to remove the encrypted volume attachment from the affected hosts - that would also yield a VolumeEncryptionNotSupported error.

Magnus Lööf (magnus-loof) on 2019-03-26

tags:	added: volumes
tags:	added: encryption

pandatt (pandatt) on 2019-03-29

Changed in nova:
status:	New → Confirmed
assignee:	nobody → pandatt (pandatt)

Lee Yarwood (lyarwood) wrote on 2019-04-02:

#2

The value of reraise should definitely be True when we hit VolumeEncryptionNotSupported.

Looking at the provided trace it's pretty clear that the LuksEncryptor encryptor class provided by os-brick is calling the following __init__ code within CryptsetupEncryptor, causing this mess:

https://github.com/openstack/os-brick/blob/00a4d96d2506bed5c5507282a774bc75df9f790f/os_brick/encryptors/cryptsetup.py#L47-L54

Lee Yarwood (lyarwood) wrote on 2019-04-02:

#3

> The value of reraise should definitely be True when we hit VolumeEncryptionNotSupported.

Apologies, to be clear I mean that it should be True when we hit VolumeEncryptionNotSupported in the context of starting an instance, reboots etc. where destroy_disks is also False.

Lee Yarwood (lyarwood) wrote on 2019-04-02:

#4

So the fact that Nova is even creating an encryptor object in the first place is incorrect. Reviewing our _detach_encryptor code suggests that the volume secrets have gone missing in order for this to happen:

https://github.com/openstack/nova/blob/9bb78d5765dab01e38327f57312583c189a352d5/nova/virt/libvirt/driver.py#L1381-L1382

Did you manually remove the associated libvirt secrets for this volume?

These should persist reboots as they are created with ephemeral=False by default.

Magnus Lööf (magnus-loof) wrote on 2019-04-02:

#5

> Did you manually remove the associated libvirt secrets for this volume?

No, we did no such thing. We followed the procedure as described in the issue description. Shut down the instances, and then Ceph and then Openstack services.

I could reproduce this in our lab, and also reproduce the issue by performing a `Hard Reboot` in Horizon.

> These should persist reboots as they are created with ephemeral=False by default.

Reboots from "within" the instance, as well as Live Migration and Soft Reboot, work well.

I believe (cannot verify from where I am right now...) that even `Shutdown Instance` followed by `Start Instance` works well.

Lee Yarwood (lyarwood) wrote on 2019-04-03:

#6

What process did you follow to restart the services being deployed as Kolla containers?

Can you run the following command as root within the nova-libvirt container before and after a restart when instances are running with encrypted volumes attached?

$ virsh secret-list

Magnus Lööf (magnus-loof) wrote on 2019-04-03:

#7

> What process did you follow to restart the services being deployed as Kolla containers?

Restart the physical services, then verify Service status in Horizon, `docker restart` if there are any errors.

Magnus Lööf (magnus-loof) wrote on 2019-04-03:

#8

> What process did you follow to restart the services being deployed as Kolla containers?

Also note that it is not necessary to restart any containers to reproduce the error. It is sufficient with a Hard Reboot from Horizon GUI.

Lee Yarwood (lyarwood) wrote on 2019-04-03:

#9

> Also note that it is not necessary to restart any containers to reproduce the error.
> It is sufficient with a Hard Reboot from Horizon GUI.

I can't reproduce this in a F29 devstack env using ceph.

I'm still convinced that you're hitting this due to the associated Libvirt volume secrets going AWOL in your environment before the instance hard reboots.

Again it would be super useful if you could run the following commands within the nova-libvirt container to confirm if the secrets are present before you hard reboot:

$ sudo virsh secret-list
$ sudo ls /etc/libvirt/secrets/

If they aren't then IMHO this is a kolla/kolla-ansible issue.

Lee Yarwood (lyarwood) wrote on 2019-04-03:

#10

FWIW I can reproduce this artificially in devstack by manually removing the associated volume secret:

$ sudo virsh secret-list
 UUID                                  Usage
--------------------------------------------------------------------------------
 6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4  volume ba7486b3-4ea5-4715-89f3-1ec86b0d9812
 e4897c8d-b271-44e8-b366-367ecddb8a3d  ceph client.cinder secret

$ nova stop test
Request to stop server test has been accepted.

$ virsh secret-undefine 6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4

$ nova start test

$ journalctl -u devstack@n-*
[..]
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2895, in start_instance
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     self._power_on(context, instance)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/compute/manager.py", line 2865, in _power_on
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2992, in power_on
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     self._hard_reboot(context, instance, network_info, block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2839, in _hard_reboot
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     block_device_info=block_device_info)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1047, in destroy
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     destroy_disks)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1132, in cleanup
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     instance=instance)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     self.force_reraise()
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1119, in cleanup
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     self._disconnect_volume(context, connection_info, instance)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1345, in _disconnect_volume
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     self._detach_encryptor(context, connection_info, encryption=encryption)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1460, in _detach_encryptor
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     encryption)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 1381, in _get_volume_encryptor
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     **encryption)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/encryptors/__init__.py", line 91, in get_volume_encryptor
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     **kwargs)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 44, in import_object
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     return import_class(import_str)(*args, **kwargs)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/encryptors/luks.py", line 61, in __init__
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     *args, **kwargs)
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/os_brick/encryptors/cryptsetup.py", line 54, in __init__
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server     volume_type=connection_info['driver_volume_type'])
Apr 03 11:25:13 localhost.localdomain nova-compute[94660]: ERROR oslo_messaging.rpc.server VolumeEncryptionNotSupported: Volume encryption is not supported for rbd volume ba7486b3-4ea5-4715-89f3-1ec86b0d9812.

OpenStack Infra (hudson-openstack) wrote on 2019-04-04: Fix proposed to nova (master)

#11

Fix proposed to branch: master
Review: https://review.openstack.org/649951

Changed in nova:
assignee:	pandatt (pandatt) → Lee Yarwood (lyarwood)
status:	Confirmed → In Progress

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#12

^ This change works around the manual out-of-band removal of Libvirt secrets in my devstack environment and allows instances to restart.

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#13

Some more information:

When doing a `Hard Reboot` on an instance with encrypted volumes attached (image volume not encrypted):

System fails to boot with "booting from hard disk" message on console.

However, this scenario is recoverable by following this procedure:

1. Shut down instance from Horizon
2. Remove attached encrypted volume
3. Start instance (boots fine)
4. Re attach encrypted volume
5. Reboot from Horizon or from within instance works fine

`$ virsh secret-list` shows the same information before and after Hard Reboot:

 UUID       Usage
 xxxxxxxxx  ceph client.cinder secret
 yyyyyyyyy  volume <volume guid>
 zzzzzzzzz  ceph client.nova secret

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#14

^ This is a completely different issue to the VolumeEncryptionNotSupported exception from the initial description that leaves the instance in an ERROR state.

IMHO this should live in a separate bug as it suggests that your instance is attempting to boot from a non-bootable volume over the bootable image for some reason.

It would be useful to have output from the following commands to review to determine what's going on here:

$ virsh domblklist ${instance-uuid}
$ virsh dumpxml ${instance-uuid}

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#15

FWIW I don't see any changes to the disk ordering that could cause this in my devstack env:

$ sudo virsh domblklist 8ddce596-52f7-4e60-b714-11f60a4ab8d8
 Target   Source
------------------------------------------------
 vda      vms/8ddce596-52f7-4e60-b714-11f60a4ab8d8_disk
 vdb      volumes/volume-ba7486b3-4ea5-4715-89f3-1ec86b0d9812

$ nova reboot --hard 8ddce596-52f7-4e60-b714-11f60a4ab8d8
Request to reboot server test (8ddce596-52f7-4e60-b714-11f60a4ab8d8) has been accepted.

$ sudo virsh domblklist 8ddce596-52f7-4e60-b714-11f60a4ab8d8
 Target   Source
------------------------------------------------
 vda      vms/8ddce596-52f7-4e60-b714-11f60a4ab8d8_disk
 vdb      volumes/volume-ba7486b3-4ea5-4715-89f3-1ec86b0d9812

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#16

When doing a shutdown of instance from Horizon, and then `docker stop` and then `docker start` on:
- nova_ssh
- nova_compute
- nova_libvirt

`$ virsh secret-list` shows that the volume secret is gone:

 UUID       Usage
 xxxxxxxxx  ceph client.cinder secret
 zzzzzzzzz  ceph client.nova secret
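
For anyone repeating this check, the before/after comparison can be scripted. The following is only a rough sketch: it assumes the plain-text `virsh secret-list` layout shown in this thread (a header row, a dashed separator, then one `UUID  usage` row per secret), and the helper names `parse_secret_list` and `missing_volume_secrets` are made up for illustration.

```python
def parse_secret_list(output):
    """Return {uuid: usage} parsed from `virsh secret-list` text output."""
    secrets = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Blank lines and the dashed separator yield fewer than two fields;
        # the header row starts with the literal word "UUID".
        if len(parts) != 2 or parts[0] == "UUID":
            continue
        secrets[parts[0]] = parts[1].strip()
    return secrets


def missing_volume_secrets(before, after):
    """List volume IDs whose secret is in `before` but gone from `after`."""
    old, new = parse_secret_list(before), parse_secret_list(after)
    return sorted(
        usage.split(None, 1)[1]
        for uuid, usage in old.items()
        if usage.startswith("volume ") and uuid not in new
    )


# Sample data taken from the devstack output earlier in this thread.
BEFORE = """\
 UUID                                  Usage
--------------------------------------------------------------------------------
 6713c0d1-7c30-4546-bbcf-c60ee9fcb9f4  volume ba7486b3-4ea5-4715-89f3-1ec86b0d9812
 e4897c8d-b271-44e8-b366-367ecddb8a3d  ceph client.cinder secret
"""
AFTER = """\
 UUID                                  Usage
--------------------------------------------------------------------------------
 e4897c8d-b271-44e8-b366-367ecddb8a3d  ceph client.cinder secret
"""

if __name__ == "__main__":
    print(missing_volume_secrets(BEFORE, AFTER))
```

Capture the command output before and after the container restart and pass both strings in; any volume ID returned has lost its Libvirt secret.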

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#17

Okay thanks so that is definitely a Kolla bug, it needs to persist /etc/libvirt/secrets/ somewhere and ensure it's injected back into the container on restart.

FWIW TripleO maps /etc/libvirt directly from the host into the container to work around this:

https://github.com/openstack/tripleo-heat-templates/blob/756b689fc354cbf617cec587fa669443a60d7ab5/deployment/nova/nova-libvirt-container-puppet.yaml#L646

affects:	nova → kolla
affects:	kolla → nova

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#18

Verified that the patch works :clap: by:

1. bricking a machine as described in the issue
2. patching the code (Rocky release, line 1384)
3. rebooting the image

I am still getting a problem with boot order (virsh domblklist shows that the order has changed), but I believe that is related to something in my environment, so it can be ignored.

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#19

^ I'd be interested in seeing the ordering of the devices in the domain XML and any n-cpu log snippets showing the instance booting if you are able to share.

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#20

So here is:

1. Information about the volume and image. I have experimented in our lab with setting SCSI properties on the image, and I believe this is the cause of the Hard Reboot problem. Horizon reports the boot image attached to /dev/vda, but inside the instance:

[cloud-user@malo-test ~]$ sudo fdisk -l

Disk /dev/sda: 21.5 GB, 21474836480 bytes, 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b288e

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048    41943006    20970479+  83  Linux

Disk /dev/sdb: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xc182ca0f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              63   209715199   104857568+  83  Linux

2. virsh domblklist before hard reboot

3. nova compute logs from when instance was booted

4. virsh domblklist after hard reboot

https://pastebin.com/ndXJH0Ed

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#21

and virsh dumpxml before and after Hard Reboot

https://pastebin.com/16FDKqh7

Magnus Lööf (magnus-loof) wrote on 2019-04-04:

#22

Removing the SCSI properties from the image resolved the "booting from hard disk" problems.

Lee Yarwood (lyarwood) wrote on 2019-04-04:

#23

^ Yeah that's an interesting use case, I think sda always wins the boot race there unless you set bootindex=0 on the initial image based volume you're booting from. Actually even then I'm not sure that we update the boot device in the XML to point to anything other than hd. Might be worth creating a separate bug for this.

Mark Goddard (mgoddard) on 2019-04-08

affects:

kolla → kolla-ansible

Mark Goddard (mgoddard) wrote on 2019-04-08:

#24

Looks like there's been some good investigation on this already. We already map /etc/libvirt/qemu to a Docker volume, it looks like we need to do the same for /etc/libvirt/secrets. Are there any other subdirectories of /etc/libvirt we should persist?

Magnus, could you try this patch to kolla-ansible?

diff --git a/ansible/roles/nova/defaults/main.yml b/ansible/roles/nova/defaults/main.yml
index 8081b7a..d5eddaf 100644
--- a/ansible/roles/nova/defaults/main.yml
+++ b/ansible/roles/nova/defaults/main.yml
@@ -21,6 +21,7 @@ nova_services:
       - "{{ nova_instance_datadir_volume }}:/var/lib/nova/"
       - "{% if enable_shared_var_lib_nova_mnt | bool %}/var/lib/nova/mnt:/var/lib/nova/mnt:shared{% endif %}"
       - "nova_libvirt_qemu:/etc/libvirt/qemu"
+      - "nova_libvirt_secrets:/etc/libvirt/secrets"
       - "{{ kolla_dev_repos_directory ~ '/nova/nova:/var/lib/kolla/venv/lib/python2.7/site-packages/nova' if nova_dev_mode | bool else '' }}"
     dimensions: "{{ nova_libvirt_dimensions }}"
   nova-ssh:

Magnus Lööf (magnus-loof) wrote on 2019-04-08:

#25

Sure can! Not for a few days, though. I am away on a business trip.

When I was analysing Kolla Ansible, I came across this config:

https://github.com/openstack/kolla-ansible/blob/stable/rocky/ansible/roles/nova/templates/nova-libvirt.json.j2

As I understand that configuration and the code @ https://github.com/openstack/kolla/blob/stable/rocky/docker/base/set_configs.py :

- The contents of /etc/libvirt/secrets will be cleared on each restart, since `merge` is not specified, and the host content of `"{{ container_config_directory }}/secrets"` will be copied in place. This directory contains the secrets for accessing ceph but not the instance-specific secrets.

Could a more elegant solution be to just modify the template with `merge: true`? I might be misunderstanding how the configuration is copied, but it's just an idea.
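
To illustrate the difference between the two behaviours, here is a small self-contained sketch. It is not Kolla's set_configs.py; `copy_config_dir` and `simulate_restart` are hypothetical helpers that only model the behaviour described above (destination wiped before the copy unless `merge` is set), and the file names are invented.

```python
import os
import shutil
import tempfile


def copy_config_dir(source, dest, merge=False):
    """Copy files from `source` into `dest`; wipe `dest` first unless merging."""
    if not merge and os.path.isdir(dest):
        shutil.rmtree(dest)
    os.makedirs(dest, exist_ok=True)
    for name in os.listdir(source):
        shutil.copy2(os.path.join(source, name), os.path.join(dest, name))


def simulate_restart(merge):
    """Return the files left in the secrets directory after a 'restart'."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "config", "secrets")
        dest = os.path.join(root, "etc", "libvirt", "secrets")
        os.makedirs(source)
        os.makedirs(dest)
        # Ceph secret shipped with the container configuration (invented name).
        open(os.path.join(source, "ceph-client-cinder.xml"), "w").close()
        # Per-volume secret created at runtime inside the container (invented name).
        open(os.path.join(dest, "volume-secret.xml"), "w").close()
        copy_config_dir(source, dest, merge=merge)
        return sorted(os.listdir(dest))


if __name__ == "__main__":
    print("without merge:", simulate_restart(merge=False))
    print("with merge:   ", simulate_restart(merge=True))
```

Without merging, only the shipped Ceph secret survives the simulated restart; with merging, the runtime-created volume secret is kept as well.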

I believe that it is correct for Nova to *not* assume that the ceph secrets are in place, so I like the patch in Nova above.

Mark Goddard (mgoddard) wrote on 2019-04-08:

#26

I proposed the above solution here: https://review.openstack.org/#/c/650853.

Your suggestion re merge is a good one - this clears up for me why the container restarting causes the problem, which I was not sure about initially. I'll add that to the above patch.

Mark Goddard (mgoddard) wrote on 2019-04-08:

#27

Seems that the above method of making /etc/libvirt/secrets a docker volume alone means the container fails:

http://logs.openstack.org/53/650853/1/check/kolla-ansible-ubuntu-source-ceph/4051ac0/primary/logs/docker_logs/nova_libvirt.txt.gz

I think adding merge will avoid that issue though, since the delete is skipped in that case.

Mark Goddard (mgoddard) wrote on 2019-04-08:

#28

Just to get my understanding of the situation here, is there some process that is creating additional secrets under /etc/libvirt/secrets after the container has started, that are getting wiped out by the container restart?

Magnus Lööf (magnus-loof) wrote on 2019-04-08:

#29

> is there some process that is creating additional secrets under /etc/libvirt/secrets after the container has started, that are getting wiped out by the container restart?

To my understanding, when an instance with an encrypted volume is scheduled to a compute host, Nova collects the encryption key for the volume from Barbican.

Lee Yarwood (lyarwood) wrote on 2019-04-08:

#30

Yeah correct, since Queens with recent Libvirt and QEMU versions Nova will now store the passphrase (it's actually a symmetric key but that's another story) as a Libvirt secret to decrypt a LUKS encrypted volume on the host. These Libvirt secrets need to persist across reboots.
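
For reference, a persistent volume secret definition of the kind discussed here can be sketched as below. This is an illustration based on the Libvirt secret XML format, not a copy of Nova's code: `build_volume_secret_xml` is a hypothetical helper, and the `private="no"` value is an assumption. The volume ID is the one from the devstack reproduction above.

```python
import xml.etree.ElementTree as ET


def build_volume_secret_xml(volume_id, ephemeral=False):
    """Build a Libvirt <secret> definition tied to a volume ID."""
    secret = ET.Element("secret", {
        # ephemeral="no" asks Libvirt to keep the secret on disk.
        "ephemeral": "yes" if ephemeral else "no",
        # Assumption for this sketch; check your deployment's actual value.
        "private": "no",
    })
    usage = ET.SubElement(secret, "usage", {"type": "volume"})
    ET.SubElement(usage, "volume").text = volume_id
    return ET.tostring(secret, encoding="unicode")


if __name__ == "__main__":
    print(build_volume_secret_xml("ba7486b3-4ea5-4715-89f3-1ec86b0d9812"))
```

A secret defined with `ephemeral="no"` is written under /etc/libvirt/secrets/, which is why that directory has to survive container restarts.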

Mark Goddard (mgoddard) on 2019-04-10

Changed in kolla-ansible:
status:	New → In Progress
importance:	Undecided → High
assignee:	nobody → Mark Goddard (mgoddard)

OpenStack Infra (hudson-openstack) wrote on 2019-04-29: Fix merged to nova (master)

#31

Reviewed: https://review.opendev.org/649951
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=56ca4d32ddf944b541b8a6c46f07275e7d8472bc
Submitter: Zuul
Branch: master

commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encryptors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
Closes-bug: #1821696
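
The decision described in the commit message can be modelled roughly as follows. This is a simplified sketch, not the merged Nova code: `detach_action` is a hypothetical function, the three return labels are invented, and looking for `device_path` under `connection_info['data']` is an assumption about the connection_info layout.

```python
def detach_action(connection_info, has_libvirt_secret, native_luks_available):
    """Return which path a _detach_encryptor-like function would take."""
    if has_libvirt_secret:
        # Native QEMU LUKS decryption was used: only the secret needs removing.
        return "delete_libvirt_secret"
    data = connection_info.get("data", {})
    if native_luks_available and not data.get("device_path"):
        # No local block device (e.g. rbd), so the os-brick encryptors can
        # never have been used for this volume: nothing to detach.
        return "skip"
    # A device_path is present, so fall back to the os-brick encryptor.
    return "os_brick_detach"


if __name__ == "__main__":
    rbd = {"driver_volume_type": "rbd", "data": {"name": "volumes/volume-x"}}
    iscsi = {"driver_volume_type": "iscsi", "data": {"device_path": "/dev/sdb"}}
    # The secret went missing out-of-band, as in this bug report.
    print(detach_action(rbd, has_libvirt_secret=False, native_luks_available=True))
    print(detach_action(iscsi, has_libvirt_secret=False, native_luks_available=True))
```

Before the fix, the rbd case with a missing secret fell through to building an os-brick encryptor, which is where VolumeEncryptionNotSupported was raised.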

Changed in nova:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-04-30: Fix proposed to nova (stable/queens)

#32

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/656464

Matt Riedemann (mriedem) on 2019-05-04

Changed in nova:
importance:	Undecided → Medium
tags:	added: libvirt

OpenStack Infra (hudson-openstack) wrote on 2019-05-04: Fix merged to nova (stable/stein)

#33

Reviewed: https://review.opendev.org/656462
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c6432ac0212d15b6d8f1620b42937b2abcb66d46
Submitter: Zuul
Branch: stable/stein

commit c6432ac0212d15b6d8f1620b42937b2abcb66d46
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encryptors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)

Mark Goddard (mgoddard) wrote on 2019-05-04:

#34

Magnus, have you had a chance to test the kolla-ansible patch yet? (https://review.opendev.org/#/c/650853)

Mark Goddard (mgoddard) wrote on 2019-05-22:

#35

Magnus?

OpenStack Infra (hudson-openstack) wrote on 2019-06-06: Fix included in openstack/nova 19.0.1

#36

This issue was fixed in the openstack/nova 19.0.1 release.

OpenStack Infra (hudson-openstack) wrote on 2019-06-11: Fix merged to nova (stable/rocky)

#37

Reviewed: https://review.opendev.org/656463
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2c6e59e835b123d6040e2a059aaa98bf9cced392
Submitter: Zuul
Branch: stable/rocky

commit 2c6e59e835b123d6040e2a059aaa98bf9cced392
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encryptors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present however as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)
    (cherry picked from commit c6432ac0212d15b6d8f1620b42937b2abcb66d46)

OpenStack Infra (hudson-openstack) wrote on 2019-06-18: Fix included in openstack/nova 18.2.1

#38

This issue was fixed in the openstack/nova 18.2.1 release.

OpenStack Infra (hudson-openstack) wrote on 2019-07-03: Fix merged to nova (stable/queens)

#39

Reviewed: https://review.opendev.org/656464
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ec43a1348d871b52473b46bd895b609dc16fe8fe
Submitter: Zuul
Branch: stable/queens

commit ec43a1348d871b52473b46bd895b609dc16fe8fe
Author: Lee Yarwood <email address hidden>
Date: Thu Apr 4 09:09:04 2019 +0100

libvirt: Avoid using os-brick encryptors when device_path isn't provided

    When disconnecting an encrypted volume the Libvirt driver uses the
    presence of a Libvirt secret associated with the volume to determine if
    the new style native QEMU LUKS decryption or original decryption method
    using os-brick encryptors is used.

    While this works well in most deployments some issues have been observed
    in Kolla based environments where the Libvirt secrets are not fully
    persisted between host reboots or container upgrades. This can lead to
    _detach_encryptor attempting to build an encryptor which will fail if
    the associated connection_info for the volume does not contain a
    device_path, such as in the case for encrypted rbd volumes.

    This change adds a simple conditional to _detach_encryptor to ensure we
    return when device_path is not present in connection_info and native
    QEMU LUKS decryption is available. This handles the specific use
    case where we are certain that the encrypted volume was never decrypted
    using the os-brick encryptors, as these require a local block device on
    the compute host and have thus never supported rbd.

    It is still safe to build an encryptor and call detach_volume when a
    device_path is present, however, as change I9f52f89b8466d036 made such
    calls idempotent within os-brick.

    Change-Id: Id670f13a7f197e71c77dc91276fc2fba2fc5f314
    Closes-bug: #1821696
    (cherry picked from commit 56ca4d32ddf944b541b8a6c46f07275e7d8472bc)
    (cherry picked from commit c6432ac0212d15b6d8f1620b42937b2abcb66d46)
    (cherry picked from commit 2c6e59e835b123d6040e2a059aaa98bf9cced392)
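
The conditional described in the commit message above can be sketched roughly as follows. This is a simplified, standalone illustration and not the actual nova code: the function name should_skip_os_brick_detach and the native_luks_available flag are hypothetical, and the nesting of device_path under a "data" key in connection_info is an assumption made for the example.

```python
def should_skip_os_brick_detach(connection_info, native_luks_available):
    """Return True when no os-brick encryptor should be built on detach.

    Hypothetical helper: models the guard described in the commit message,
    not the real _detach_encryptor implementation.
    """
    # Assumption: device_path, when present, lives under connection_info["data"].
    device_path = connection_info.get("data", {}).get("device_path")
    # Without a local block device the os-brick encryptors can never have
    # been used, so there is nothing for them to detach.
    return native_luks_available and not device_path


# Example connection_info dicts (values are made up for illustration).
rbd_info = {"data": {"name": "volumes/volume-1234"}}    # no device_path
iscsi_info = {"data": {"device_path": "/dev/sdb"}}      # local block device

print(should_skip_os_brick_detach(rbd_info, True))      # skip the encryptor
print(should_skip_os_brick_detach(iscsi_info, True))    # build it as before
```

When a device_path is present the guard does not trigger, which matches the note above that building an encryptor and calling detach_volume remains safe in that case.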

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-10: Fix included in openstack/nova 17.0.11

#40

This issue was fixed in the openstack/nova 17.0.11 release.

Mark Goddard (mgoddard) on 2019-08-07

Changed in kolla-ansible:
status:	Fix Committed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-27: Fix included in openstack/nova 20.0.0.0rc1

#41

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Mark Goddard (mgoddard) on 2020-03-11

Changed in kolla-ansible:
status:	In Progress → Triaged
milestone:	8.0.0 → none
assignee:	Mark Goddard (mgoddard) → nobody

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-17: Related fix merged to kolla-ansible (master)

#42

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/650853
Committed: https://opendev.org/openstack/kolla-ansible/commit/1c63eb20d92febabbcb0dacbc35b0c89771d7202
Submitter: "Zuul (22348)"
Branch: master

commit 1c63eb20d92febabbcb0dacbc35b0c89771d7202
Author: Mark Goddard <email address hidden>
Date: Mon Apr 8 12:18:52 2019 +0100

Persist nova libvirt secrets in a Docker volume

    Libvirt may reasonably expect that its secrets directory
    (/etc/libvirt/secrets) is persistent. However, the nova_libvirt
    container does not map the secrets directory to a volume, so it will not
    survive a recreation of the container. Furthermore, if Cinder or Nova
    Ceph RBD integration is enabled, nova_libvirt's config.json includes an
    entry for /etc/libvirt/secrets which will wipe out the directory on a
    restart of the container.

    Previously, this appeared to cause an issue with encrypted volumes,
    which could fail to attach in certain situations as described in bug
    1821696. Nova has since made a related change, and the issue can no
    longer be reproduced. However, making the secret store persistent seems
    like a sensible thing to do, and may prevent hitting other corner cases.

    This change maps /etc/libvirt/secrets to a Docker volume in the
    nova_libvirt container. We also modify config.json for the nova_libvirt
    container to merge the /etc/libvirt/secrets directory, to ensure that
    secrets added in the container during runtime are not overwritten when
    the container restarts.

    Change-Id: Ia7e923dddb77ff6db3c9160af931354a2b305e8d
    Related-Bug: #1821696
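
The difference between replacing and merging the secrets directory on container restart, as described in the commit message above, can be illustrated with a toy model. This is not Kolla's actual config-copy code: copy_config, surviving_secrets and the file names are hypothetical, and the sketch only assumes that a non-merging copy wipes the destination first while a merging copy leaves existing files in place.

```python
import shutil
import tempfile
from pathlib import Path


def copy_config(source, dest, merge):
    """Copy source into dest; without merge, dest is wiped first."""
    if not merge and dest.exists():
        shutil.rmtree(dest)
    # dirs_exist_ok (Python 3.8+) lets the copy land in an existing directory.
    shutil.copytree(source, dest, dirs_exist_ok=True)


def surviving_secrets(merge):
    """Return the file names left in the secrets directory after a 'restart'."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "config" / "secrets"
        dest = Path(tmp) / "etc" / "libvirt" / "secrets"
        source.mkdir(parents=True)
        dest.mkdir(parents=True)
        (source / "ceph.xml").write_text("<secret/>")        # shipped with the config
        (dest / "volume-luks.xml").write_text("<secret/>")   # added at runtime
        copy_config(source, dest, merge)
        return sorted(p.name for p in dest.iterdir())


print(surviving_secrets(merge=False))  # runtime secret is lost
print(surviving_secrets(merge=True))   # runtime secret survives
```

In this model only the merging copy keeps the secret that was added while the container was running, which is the behaviour the change aims for.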

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-18: Related fix proposed to kolla-ansible (master)

#43

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/797151

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-19: Related fix merged to kolla-ansible (master)

#44

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/797151
Committed: https://opendev.org/openstack/kolla-ansible/commit/1fc58e74d0a026192b64934b7b23b5ee752dedcc
Submitter: "Zuul (22348)"
Branch: master

commit 1fc58e74d0a026192b64934b7b23b5ee752dedcc
Author: Mark Goddard <email address hidden>
Date: Fri Jun 18 19:57:40 2021 +0100

Fix up 'Persist nova libvirt secrets in a Docker volume'

    Follow up fix for Ia7e923dddb77ff6db3c9160af931354a2b305e8d, which
    broke the cephadm jobs.

    Change-Id: Ieb39b41a6f493bd00c687610ba043a1b4e5945e7
    Related-Bug: #1821696

Radosław Piliszek (yoctozepto) on 2021-06-20

Changed in kolla-ansible:
status:	Triaged → Invalid
no longer affects:	kolla-ansible/rocky
no longer affects:	kolla-ansible/stein
Changed in kolla-ansible:
importance:	High → Undecided

Affects	Status	Importance	Assigned to
OpenStack Compute (nova)	Fix Released	Medium	Lee Yarwood
Queens	Fix Committed	Medium	Lee Yarwood
Rocky	Fix Released	Medium	Lee Yarwood
Stein	Fix Released	Medium	Lee Yarwood
kolla-ansible	Invalid	Undecided	Unassigned