Live migration of guests created with config-drive using non-standard "migration" binding fails

Bug #1939719 reported by Vladimir Grevtsev
This bug affects 1 person
Affects                                Status         Importance  Assigned to  Milestone
OpenStack Nova Cloud Controller Charm  Invalid        Undecided   Unassigned
OpenStack Nova Compute Charm           Fix Committed  Undecided   Unassigned

Bug Description

[Environment]
Focal/Ussuri, latest stable charms
juju show-application nova-compute-dpdk: https://paste.ubuntu.com/p/Ysdwy4YqZG/

[Description]
There are multiple nova-compute applications in this deployment (three, specifically: the "generic", "sriov" and "dpdk" landscapes). There were no problems with live migration on the generic hosts; however, when trying to live-migrate a DPDK machine, we got the following error in the Nova log:

in pre_live_migration
 migrate_data = self.driver.pre_live_migration(context,
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 9628, in pre_live_migration
 self._remotefs.copy_file(src, instance_dir)
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/remotefs.py", line 106, in copy_file
 self.driver.copy_file(src, dst, on_execute=on_execute,
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/remotefs.py", line 195, in copy_file
 processutils.execute('scp', '-r', src, dst,
File "/usr/lib/python3/dist-packages/oslo_concurrency/processutils.py", line 421, in execute
 raise ProcessExecutionError(exit_code=_returncode,
"oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: scp -r u0400s2entcomp02:/var/lib/nova/instances/a1cc19a2-2c34-49c7-b85a-bc4a96265fea/disk.config /var/lib/nova/instances/a1cc19a2-2c34-49c7-b85a-bc4a96265fea
Exit code: 1
Stdout: ''
Stderr: 'Host key verification failed.\r\n'

[Analysis]
This issue occurs because Nova makes an outgoing SSH connection to the remote hypervisor using its FQDN, which resolves to a different IP address than the ingress-address of the "migration" binding.

https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9616-L9628

if (configdrive.required_by(instance) and
        CONF.config_drive_format == 'iso9660'):
    # (upstream comment elided here)
    src = "%s:%s/disk.config" % (instance.host, instance_dir)
    self._remotefs.copy_file(src, instance_dir)

This scenario applies to any config-drive-enabled machine, which includes all of the DPDK guests (they have to use config-drive because of the missing overlay connectivity).

$ juju run --unit nova-compute-dpdk/leader 'network-get migration'
bind-addresses:
- mac-address: f4:a4:d6:f3:68:a1
  interface-name: bond1.811
  addresses:
  - hostname: ""
    address: 10.35.174.1 ##### looks correct
    cidr: 10.35.174.0/25
  macaddress: f4:a4:d6:f3:68:a1
  interfacename: bond1.811
egress-subnets:
- 10.35.174.1/32
ingress-addresses:
- 10.35.174.1

$ juju run --unit nova-compute-dpdk/0 'relation-ids cloud-compute'
cloud-compute:204

$ juju run --unit nova-compute-dpdk/leader -- relation-get -r cloud-compute:204 - nova-compute-dpdk/0
availability_zone: default
egress-subnets: 10.35.174.1/32
hostname: u0400s2entcomp02
ingress-address: 10.35.174.1 ### also looks good (as expected)

But:

ubuntu@u0400s2entcomp05:~$ sudo su nova
nova@u0400s2entcomp05:/home/ubuntu$ cd
nova@u0400s2entcomp05:~$ pwd
/var/lib/nova
nova@u0400s2entcomp05:~$ tail -n1 .ssh/known_hosts # checking if known_hosts is not empty
|1|uwlHwgyd2RN3m81oEWryINZoLAs=|ixOg6l1iHepfHd5uCKuMCUcdSCM= ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDCgyz/DBjrywQNuhx1x2/ueMF8kjag7p9AMHL027...

nova@u0400s2entcomp05:~$ ssh u0400s2entcomp02 # hostname resolves to the oam-space address
The authenticity of host 'u0400s2entcomp02 (10.35.81.249)' can't be established.

# same host, but using internal IP
nova@u0400s2entcomp05:~$ ssh 10.35.174.1 'whoami; hostname'
nova
u0400s2entcomp02
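
The output above suggests that the charm-managed known_hosts covers the "migration" binding address but not the hypervisor FQDN that Nova ends up using. A quick way to confirm this (a sketch only; it reuses the known_hosts path and hostnames from this environment) is to look both names up in the hashed known_hosts file:

ssh-keygen -F 10.35.174.1 -f /var/lib/nova/.ssh/known_hosts       # binding address: expected to match
ssh-keygen -F u0400s2entcomp02 -f /var/lib/nova/.ssh/known_hosts  # FQDN: expected to return nothing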

[Expected result]
Config-drive guests should live-migrate successfully, with Nova honouring the "migration" binding (or using the binding IP address instead of the hypervisor FQDN).

[Available workarounds]
1. Suppress the host key verification failure by adding /var/lib/nova/.ssh/config with the following content:

Host *
    StrictHostKeyChecking no
    UserKnownHostsFile=/dev/null

2. Use the oam-space for the "migration" binding, so that the generated known_hosts entries match the addresses Nova actually connects to (a sketch follows below).
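
For workaround 2, a minimal sketch of what the re-binding could look like, assuming the OAM space is named "oam-space" in this model and a Juju version that supports re-binding a deployed application (both are assumptions):

juju bind nova-compute-dpdk migration=oam-space
juju run --unit nova-compute-dpdk/0 'network-get migration'   # verify the new ingress address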

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

+ field-high, since this impacts an ongoing delivery (and this functionality is expected to work).

tags: added: field-high
Revision history for this message
Nobuto Murata (nobuto) wrote :

This part of the code is a bit tricky:
https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9616-L9628

The file copy goes from the live-migration source host to the destination host, but the command is run on the *destination* host rather than the source host. My gut feeling is that live_migration_inbound_addr only applies to live-migration operations run on the *source* host, after the destination host has passed its own live_migration_inbound_addr as the *inbound* address during the pre_live_migration phase.

Using instance.host instead of an explicit address probably doesn't reflect the operator's intention when setting live_migration_inbound_addr, so ultimately it's an upstream issue IMHO. However, config_drive_format=iso9660 is known to be tricky upstream, as the code comment suggests:
https://bugs.launchpad.net/nova/+bug/1246201

So the quickest workaround would be to use config_drive_format=vfat to avoid running that part of the code.

$ juju config nova-compute-* config-flags='config_drive_format=vfat'

Revision history for this message
Nobuto Murata (nobuto) wrote :

An upstream bug has been filed, just for future reference:
https://bugs.launchpad.net/nova/+bug/1939869

tags: removed: field-high
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

This is very likely a duplicate of [1]; the fix for it has just merged and probably addresses this bug as well.

[1] https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1969971

Changed in charm-nova-compute:
status: New → Fix Committed
Nobuto Murata (nobuto)
Changed in charm-nova-cloud-controller:
status: New → Invalid