block migration of config_drive_format=iso9660 doesn't take into account a dedicated live-migration network

Bug #1939869 reported by Nobuto Murata
Affects                    Status        Importance  Assigned to     Milestone
NULL Project               Invalid       Undecided   Unassigned
OpenStack Compute (nova)   In Progress   Low         Doug Szumski

Bug Description

Downstream issue: https://bugs.launchpad.net/charm-nova-compute/+bug/1939719

How to reproduce:
1. Prepare two underlying networks/subnets for Libvirt+KVM based OpenStack Nova deployment (one network as main, the other as dedicated live-migration network)
2. Distribute SSH public keys under authorized_keys, pre-populate known_hosts (with the live-migration network IP addresses), and make sure StrictHostKeyChecking is NOT "no"
3. Set live_migration_scheme=ssh and live_migration_inbound_addr to the ones in the live-migration network in nova.conf
4. Launch a VM with config-drive (iso9660 as the default)
5. Live-migrate the VM with `openstack server migrate --live-migration --block-migration`
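For reference, the settings from step 3 might look like the following in nova.conf (the address is a placeholder for this host's IP on the dedicated live-migration network):

```ini
[libvirt]
# Tunnel migration traffic over SSH rather than bare TCP.
live_migration_scheme = ssh
# Address on the dedicated live-migration network that peers should
# connect to when migrating an instance to this host.
live_migration_inbound_addr = 10.0.1.5
```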

Expected result:
Live migration works

Actual result:
Live migration fails with an error on the *destination* host at the point of:
https://opendev.org/openstack/nova/src/commit/370830e9445c9825d1e34e60cca01fdfe88d5d82/nova/virt/libvirt/driver.py#L10170-L10189
with:
Command: scp -r <source_host_fqdn>:/var/lib/nova/instances/a1cc19a2-2c34-49c7-b85a-bc4a96265fea/disk.config /var/lib/nova/instances/a1cc19a2-2c34-49c7-b85a-bc4a96265fea
Exit code: 1
Stdout: ''
Stderr: 'Host key verification failed.\r\n'

The source host FQDN is used, probably because the code relies on instance.host, and it resolves to an IP address on the main network instead of the live-migration network. Since the main-network IP addresses are not in known_hosts, the host key verification fails.
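The failure can be reproduced in miniature: a known_hosts file pre-populated only with live-migration IPs (step 2 above) has no entry for the FQDN that scp resolves. A simplified sketch, assuming plain (unhashed) known_hosts entries and placeholder addresses:

```python
def known_hosts_names(known_hosts_text):
    """Collect the host fields (first column) from a simplified,
    unhashed known_hosts file."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first column may be a comma-separated list of names/IPs.
        names.update(line.split()[0].split(","))
    return names

# known_hosts pre-populated with live-migration network IPs only
known_hosts = (
    "10.0.1.5 ssh-ed25519 AAAAC3Nza...\n"
    "10.0.1.6 ssh-ed25519 AAAAC3Nza...\n"
)
names = known_hosts_names(known_hosts)

assert "10.0.1.5" in names                   # live-migration address: trusted
assert "compute1.example.org" not in names   # FQDN scp uses: verification fails
```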

The current workaround is either using config_drive_format=vfat or adding the IP addresses of the main network to known_hosts. But I wonder whether copying the iso9660 config drive could instead be invoked on the source side, pushing the data to live_migration_inbound_addr, rather than pulling it from the destination host.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I agree that using live_migration_inbound_addr[1], if configured, over instance.host would be the right move here (falling back to instance.host if the config is missing).
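A minimal sketch of that selection logic, using a hypothetical helper name (the real change would live in the libvirt driver's migrate-data handling, not in a standalone function):

```python
def pick_copy_source_address(instance_host, live_migration_inbound_addr=None):
    """Choose the address the destination host should copy files from.

    Hypothetical helper sketching the proposal: prefer the source host's
    live_migration_inbound_addr when it is configured, so that scp traffic
    stays on the dedicated live-migration network, and fall back to
    instance.host otherwise.
    """
    return live_migration_inbound_addr or instance_host

# With the option set, scp would target the live-migration network address.
assert pick_copy_source_address("compute1.example.org", "10.0.1.5") == "10.0.1.5"
# Without it, behaviour is unchanged: the FQDN from instance.host is used.
assert pick_copy_source_address("compute1.example.org") == "compute1.example.org"
```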

Setting this to Triaged, as I think we have a way forward with it. Feel free to propose a patch to fix this on review.opendev.org.

[1] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_inbound_addr

Changed in nova:
status: New → Triaged
importance: Undecided → Low
tags: added: compute libvirt live-migration
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899458

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Jan Horstmann <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899458
Reason: Since the copy gets called on the target, this makes no sense

Nobuto Murata (nobuto)
Changed in nova:
status: In Progress → Confirmed
Doug Szumski (dszumski)
Changed in nova:
assignee: nobody → Doug Szumski (dszumski)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/906053

Revision history for this message
sean mooney (sean-k-mooney) wrote :

note that the current config option is explicitly only for the libvirt traffic, not nova's, so to me this is a new feature, not a bug.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

i agree the use case makes sense, but in my view it's not in the scope of the current config option as documented.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Considering that the original description of this bug comes from a charmed deployment of nova, the fix for bug [1] below, which I just marked this one as a duplicate of, is very likely to address this one as well.

[1] https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1969971

Changed in charm-nova-cloud-controller:
status: New → Fix Committed
Nobuto Murata (nobuto)
affects: charm-nova-cloud-controller → null-and-void
Changed in null-and-void:
status: Fix Committed → Invalid
Revision history for this message
Doug Szumski (dszumski) wrote :

To clarify: the issue reported here isn't limited to Charms; we see it in Kolla Ansible environments as well. Specifically this part:

```
The source host FQDN is used, probably because the code relies on instance.host, and it resolves to an IP address on the main network instead of the live-migration network. Since the main-network IP addresses are not in known_hosts, the host key verification fails.
```

Rather than allowing `scp` over the 'main network' for some limited operations (fetching missing Glance image / copying config drive), we want /all/ migration traffic to stay on the migration network for security / QoS reasons. Ideally the existing config option `live_migration_inbound_addr` would be used to control this.

Sean helpfully pointed out that we can remove the config drive copy, since the issue it addressed should now be fixed: https://review.opendev.org/c/openstack/nova/+/909122

That leaves just the corner case of copying Glance images which are no longer available in Glance. That's at least easier to work around (by hiding images instead of deleting them).

If there was interest, I could re-propose https://review.opendev.org/c/openstack/nova/+/906053 as a feature?
