Live migrations failing due to remote host identification change

Bug #1969971 reported by Paul Goins
This bug affects 8 people
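+---------------------------------------+---------------+------------+--------------------+-----------+
| Affects                               | Status        | Importance | Assigned to        | Milestone |
+---------------------------------------+---------------+------------+--------------------+-----------+
| OpenStack Compute (nova)              | Invalid       | Undecided  | Unassigned         |           |
| OpenStack Nova Cloud Controller Charm | Fix Committed | Undecided  | Edward Hope-Morley |           |
+---------------------------------------+---------------+------------+--------------------+-----------+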

Bug Description

I've encountered a cloud where, for some reason (maybe a redeploy of a compute; I'm not sure), I'm hitting this error in nova-compute.log on the source node for an instance migration:

2022-04-22 10:21:17.419 3776 ERROR nova.virt.libvirt.driver [-] [instance: <REDACTED INSTANCE UUID>] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:<REDACTED FINGERPRINT>.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending RSA key in /root/.ssh/known_hosts:97
  remove with:
  ssh-keygen -f "/root/.ssh/known_hosts" -R "<REDACTED IP>"
RSA host key for <REDACTED IP> has changed and you have requested strict checking.
Host key verification failed.: Connection reset by peer: libvirt.libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<REDACTED IP>/system: Cannot recv data: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

This interferes with instance migration.

There is a workaround:
* Manually ssh to the destination node, both as the root and nova users on the source node.
* Manually clear the offending known_hosts entries reported by the SSH command.
* Verify that once cleared, the root and nova users are able to successfully connect via SSH.
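
The manual cleanup can be scripted. The following is a minimal sketch, not part of the charm: it only assembles the `ssh-keygen -f FILE -R HOST` invocation that the SSH warning above suggests, once per known_hosts file and host. The helper names, the nova user's known_hosts path and the example address are assumptions for illustration.

```python
import subprocess


def removal_commands(known_hosts_files, hosts):
    """Build one `ssh-keygen -f FILE -R HOST` command per file and host."""
    return [
        ["ssh-keygen", "-f", path, "-R", host]
        for path in known_hosts_files
        for host in hosts
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# root's file is named in the log above; the nova user's path is an assumption.
FILES = ["/root/.ssh/known_hosts", "/var/lib/nova/.ssh/known_hosts"]

# 192.0.2.10 is a placeholder for the destination node's address.
run_all(removal_commands(FILES, ["192.0.2.10"]))  # dry run, nothing is executed
```

Running with `dry_run=False` would have to happen on the source node as a user allowed to modify both files, and the SSH connection still needs to be verified afterwards.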

Obviously, this is cumbersome in the case of clouds with high numbers of compute nodes. It'd be better if the charm was able to avoid this issue.

Alex Kavanagh (ajkavanagh) wrote :

Nova-cc has an action to redo all of the host keys when redeploying etc. Check out the "clear-unit-knownhost-cache" action. Also check whether hostname caching is on (config "cache-known-hosts=true"). If this is set to true (the default), then changes in hosts or DNS resolution will result in stale information on the nova-compute units.

If it's neither of those things, then we have a bug.
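
As a rough sketch of the two checks above, the snippet below assembles the corresponding Juju commands without running them. It assumes the Juju 2.x CLI (`juju config` and `juju run-action --wait`); the application name `nova-cloud-controller`, the unit `nova-cloud-controller/0` and the helper name are illustrative assumptions.

```python
import shlex


def knownhost_commands(application="nova-cloud-controller",
                       unit="nova-cloud-controller/0"):
    """Return Juju 2.x commands to read the cache setting and run the action."""
    return [
        # Prints the current value of the cache-known-hosts option.
        ["juju", "config", application, "cache-known-hosts"],
        # Runs the action on one unit and waits for its result.
        ["juju", "run-action", "--wait", unit, "clear-unit-knownhost-cache"],
    ]


for cmd in knownhost_commands():
    print(shlex.join(cmd))
```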

Paul Goins (vultaire) wrote :

Thanks Alex - I feel like I've seen that action before but forgot about it.

Confirmed that cache-known-hosts=true. I'll see if the action fixes things; probably it will.

Paul Goins (vultaire) wrote :

Hello Alex - I found this bug in the wake of an issue we had on another cloud, and while it manifested in a slightly different way this time, the end result is the same: migrations failing because of issues with the SSH known_hosts file not being fully prepared to allow prompt-less SSH access.

First, let me say: I think the problem is *partially* addressed by the config change and action you mention. However, it wasn't enough for this particular cloud; I have evidence that improvements may be needed.

On this cloud, after the migration problem was reported to us, we set cache-known-hosts=false to turn off hostname caching, and followed that by the clear-unit-knownhost-cache action. And it looks like that works as expected. Here is a sanitized version of the output from the clear-unit-knownhost-cache action:

$ juju show-action-output 12345
UnitId: nova-cloud-controller/1
id: "97791"
results:
  Stderr: |
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    [...]
  units-updated: '[{''nova-compute-kvm/1'': ''<REDACTED>''}, [...]
status: completed
timing:
  completed: 2023-08-29 17:20:20 +0000 UTC
  enqueued: 2023-08-29 17:19:39 +0000 UTC
  started: 2023-08-29 17:19:39 +0000 UTC

We can see clearly that the script pulled the private-address IP and also the hostname and created entries against both - which is exactly what we want.

However, here's the nuance: the hostname matches neither what's in "openstack hypervisor list" nor what's in "openstack host list".

# Again, sanitized
$ openstack hypervisor list
+----+-------------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname     | Hypervisor Type | Host IP       | State |
+----+-------------------------+-----------------+---------------+-------+
| 1  | site2-rack3-node15.maas | QEMU            | 10.1.2.15     | up    |
[...]
+----+-------------------------+-----------------+---------------+-------+

$ openstack compute service list --service nova-compute
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| ID | Binary       | Host                    | Zone                | Status   | State | Updated At                 |
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| 28 | nova-compute | site2-rack3-node15.maas | availability-zone-3 | enabled  | up    | 2023-08-30T20:07:11.000000 |
[...]
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+

As you can see above, there's a .maas domain suffix. That wouldn't have been pre-seeded - and indeed, instance migrations fail without those entries, since the hostname field in the relations doesn't match the hostnames used in OpenStack.

So - I think we have a bug here with regard to how hostnames are handled in the known_hosts file generation process.
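
The mismatch can be checked mechanically. The sketch below is hypothetical and not charm code: it takes the names seeded into known_hosts (assumed to be available in plain text, for example from the action output) and the host names reported by OpenStack, and returns the ones that were never seeded. The sample values are the sanitized ones from this comment.

```python
def unseeded_hosts(seeded_names, openstack_hosts):
    """Return the OpenStack host names that have no known_hosts entry, sorted."""
    seeded = {name.lower() for name in seeded_names}
    return sorted(h for h in openstack_hosts if h.lower() not in seeded)


seeded = ["10.1.2.15", "site2-rack3-node15"]
reported = ["site2-rack3-node15.maas"]
print(unseeded_hosts(seeded, reported))  # ['site2-rack3-node15.maas']
```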

Nishant Dash (dash3) wrote :

I can confirm I see this across multiple deployments.
From what I understand, n-c-c pulls the hostname from the relation data of the `--endpoint cloud-compute`, which has plain hostnames, whereas nova uses the FQDN when performing commands, for example during a resize.

1. n-c-c endpoint
nova-compute/x:
        in-scope: true
        data:
          availability_zone: zone2
          egress-subnets: ip/32
          hostname: hostname

2. performing resize fails with Hostkey verification failure as such
oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\nCommand: scp -C -r hostname.maas:/var/lib/nova/instances/_base/xyzw /var/lib/nova/instances/_base/abcdefgh\nExit code: 1\nStdout: \'\'\nStderr: \'Host key verification failed.

The scp command above uses the FQDN, which is exactly what Paul has outlined: both `hypervisor list` and `host list` use the FQDN.

Additionally, I see both focal ussuri and jammy yoga affected.

Giuseppe Petralia (peppepetra) wrote :

The issue described in comment #3 is affecting both:

- focal-ussuri (charm nova-cloud-controller ussuri/stable rev. 680)

- jammy-yoga (charm nova-cloud-controller yoga/stable rev. 634)

Nova-cloud-controller configures prompt-less SSH access only for the hostname and private IP of each compute node.

But nova then uses the FQDN for the "scp" needed by resizes and live migrations, so both fail.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Giuseppe Petralia (peppepetra) wrote :

We have verified that host in nova.conf is already using the FQDN.

Also, the issue occurs only when resizing or migrating VMs after the original image was deleted.

The error only occurs when the resize or migration includes the copy of the base file from the original host at

/var/lib/nova/instances/_base

When the original image is deleted from glance, nova falls back to copying it from the host:

https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9452

and it uses instance.host as source, which is the FQDN of the compute node:

https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9620

And the charm does not configure prompt-less SSH for the FQDN.
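
The code path described above can be pictured with the following simplified sketch. It is not Nova's implementation: `ImageNotFound`, `fetch_base_image` and the two callables are hypothetical stand-ins, and only the control flow (try Glance first, then fall back to copying from `instance.host`) reflects the description in this comment.

```python
class ImageNotFound(Exception):
    """Hypothetical stand-in for the error raised when Glance lacks the image."""


def fetch_base_image(image_id, instance_host, fetch_from_glance, copy_from_host):
    """Try Glance first; if the image is gone, copy the base file from instance_host."""
    try:
        return fetch_from_glance(image_id)
    except ImageNotFound:
        # instance_host is the FQDN recorded by Nova, so the copy can only
        # succeed if that exact name has a known_hosts entry.
        return copy_from_host(instance_host)


def glance_without_image(image_id):
    """Simulate Glance after the original image was deleted."""
    raise ImageNotFound(image_id)


result = fetch_base_image(
    "deleted-image",
    "my-host.maas",
    glance_without_image,
    lambda host: f"scp from {host}",
)
print(result)  # scp from my-host.maas
```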

Giuseppe Petralia (peppepetra) wrote :

Update on this issue.

n-c-c is configuring the host keys for each node for the following entries:

* private-address of the compute node on the cloud-compute relation, which in our env is the internal space

* hostname on cloud-compute relation data (which is the hostname w/o domain)

* reverse lookup entry of the private-address, which in maas environments returns the fqdn with the interface name at the beginning:

  ```
  >>> import charmhelpers.contrib.openstack.utils as ch_utils
  >>> print(ch_utils.get_hostname("192.168.52.50"))
  bond0.123.my-host.maas
  ```

  The correct entry is only returned for reverse lookup on the oam space which is the boot interface
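
The three entries can be reproduced with a small sketch. It uses the standard library's `socket.gethostbyaddr` for the reverse lookup rather than the charmhelpers call shown above, and the function name, the injectable `reverse_lookup` parameter and the sample data are assumptions for illustration.

```python
import socket


def known_hosts_names(private_address, relation_hostname, reverse_lookup=None):
    """Collect the names described above, in order and without duplicates."""
    if reverse_lookup is None:
        reverse_lookup = lambda ip: socket.gethostbyaddr(ip)[0]
    names = [private_address, relation_hostname]
    try:
        names.append(reverse_lookup(private_address))
    except OSError:
        pass  # no reverse record: only the address and short hostname remain
    return list(dict.fromkeys(names))


# Fake resolver mimicking the MAAS behaviour reported above.
fake_dns = {"192.168.52.50": "bond0.123.my-host.maas"}
names = known_hosts_names("192.168.52.50", "my-host", fake_dns.__getitem__)
print(names)                    # ['192.168.52.50', 'my-host', 'bond0.123.my-host.maas']
print("my-host.maas" in names)  # False
```

With this sample data the FQDN that Nova would use, `my-host.maas`, is absent from the list, which matches the failure described in this thread.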

Edward Hope-Morley (hopem) wrote :

The nova-cloud-controller charm will create hostname, fqdn and ip address entries for each compute host. It does so using the settings 'private-address' and 'hostname' on the cloud-compute relation. private-address will be the address resolvable from libvirt-migration-network (if configured), otherwise the unit private-address.

Here comes the problem; the hostname added to known_hosts will be from relation 'hostname' BUT the hostname fqdn will be resolved from private-address. This means that if Nova compute is configured to use network X for its management network and libvirt-migration-network is set to a different network, the fqdn in known_hosts will be from the latter. This is all good until nova-compute needs to do a vm resize and the image used to build the vm no longer exists in Glance. At which point Nova will use the instance.hostname from the database to perform an scp from source to destination and this fails because this hostname (fqdn from management network) is not in known_hosts.

This is something that Nova should ultimately have support for but in the interim the suggestion is that nova-cloud-controller always adds the management network fqdn to known_hosts.

Changed in charm-nova-cloud-controller:
assignee: nobody → Edward Hope-Morley (hopem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)
Changed in charm-nova-cloud-controller:
status: New → In Progress
Balazs Gibizer (balazs-gibizer) wrote :

SSH known_hosts file handling is not in scope for nova. I am glad to see that this is progressing in charms. Closing this for nova.

Changed in nova:
status: New → Invalid
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/898581
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/05b081bf5ffa24045a1fef3b18d2e28fec52d604
Submitter: "Zuul (22348)"
Branch: master

commit 05b081bf5ffa24045a1fef3b18d2e28fec52d604
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Nobuto Murata (nobuto) wrote :

> This is all good until nova-compute needs to do a vm resize and the
> image used to build the vm no longer exists in Glance. At which point
> Nova will use the instance.hostname from the database to perform an scp
> from source to destination and this fails because this hostname (fqdn
> from management network) is not in known_hosts.
>
> This is something that Nova should ultimately have support for but in
> the interim the suggestion is that nova-cloud-controller always adds
> the management network fqdn to known_hosts.

Ah, I previously reported the following bug upstream, where Nova doesn't respect the live migration network specified in nova.conf. It was about copying the content of the config drive using scp executed from the target node.
https://bugs.launchpad.net/nova/+bug/1939869

At that point, we had a clear workaround: using vfat instead of iso9660 avoids the scp code path in Nova. But it's good to know that scp is also used in the case mentioned above (image no longer available in Glance).

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/2023.2)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/2023.1)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/yoga)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/zed)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/xena)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/wallaby)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/victoria)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/ussuri)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/2023.2)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/908774
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/b6809e75e181a49c2209aa2c1be9e4d11442cbb3
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit b6809e75e181a49c2209aa2c1be9e4d11442cbb3
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971
    (cherry picked from commit 05b081bf5ffa24045a1fef3b18d2e28fec52d604)

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/908775
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/61a7dd0bbd62af3cf73635199cc1090154a5ebb9
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 61a7dd0bbd62af3cf73635199cc1090154a5ebb9
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Also adds requirements.txt to default tox [testenv] to get
    netaddr constraints applied to others that were removed as
    part of https://review.opendev.org/q/topic:%22batch-update%22.
    This is needed for -epy38 which is only run by gate.

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971
    (cherry picked from commit 05b081bf5ffa24045a1fef3b18d2e28fec52d604)
    (cherry picked from commit b6809e75e181a49c2209aa2c1be9e4d11442cbb3)

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/908777
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/e71b6209fcbe22667f416ff05cd14f6622631831
Submitter: "Zuul (22348)"
Branch: stable/zed

commit e71b6209fcbe22667f416ff05cd14f6622631831
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Also adds requirements.txt to default tox [testenv] to get
    netaddr constraints applied to others that were removed as
    part of https://review.opendev.org/q/topic:%22batch-update%22.
    This is needed for -epy38 which is only run by gate.

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971
    (cherry picked from commit 05b081bf5ffa24045a1fef3b18d2e28fec52d604)
    (cherry picked from commit b6809e75e181a49c2209aa2c1be9e4d11442cbb3)
    (cherry picked from commit 61a7dd0bbd62af3cf73635199cc1090154a5ebb9)

tags: added: in-stable-zed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/908776
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/fbaef937dc33bcbbc8a4ac76948f2f67166e8e5d
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit fbaef937dc33bcbbc8a4ac76948f2f67166e8e5d
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

tags: added: in-stable-yoga
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/909023
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/928cb7d75788b208beeb501e97723dfd9ba8a2f8
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 928cb7d75788b208beeb501e97723dfd9ba8a2f8
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/909024
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/58b7a5b934e8abdb30b45e32ae0888982ce3c60a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 58b7a5b934e8abdb30b45e32ae0888982ce3c60a
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/909026
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/8f2e9f94296538a30dac3ee36e28031f3996a028
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8f2e9f94296538a30dac3ee36e28031f3996a028
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

Nobuto Murata (nobuto) wrote :

Filed a charm follow-up bug as:
"Support migration_inbound_addr in addition to live_migration_inbound_addr"
https://bugs.launchpad.net/charm-nova-compute/+bug/2055350

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/909027
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/f2e57d0681c9ea34da2518834a6d5a198996ce6b
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit f2e57d0681c9ea34da2518834a6d5a198996ce6b
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 9 16:14:10 2023 +0100

    Ensure mgmt network hostname and fqdn in known_hosts

    The cloud-compute relation uses the private-address setting to
    reflect the hostname/address to be used for vm migrations. This
    can be the default management network or an alternate one. When
    this charm populates ssh known_hosts entries for compute hosts
    it needs to ensure hostname, address and fqdn for the mgmt network
    is included so that Nova resize operations can work if they use
    the hostname from the db (which will always be from the mgmt
    network).

    Change-Id: Ic9e4657453d8f53d1ecbee23475c7b11549ebc14
    Closes-Bug: #1969971

tags: added: in-stable-ussuri