Migrations fail (SSH host key verification failure) after setting libvirt-migration-network

Bug #1860743 reported by Edin S
This bug affects 4 people
Affects                                Status     Importance  Assigned to
OpenStack Nova Cloud Controller Charm  Confirmed  High        Alex Kavanagh
OpenStack Nova Compute Charm           Confirmed  High        Alex Kavanagh

Bug Description

Issue encountered on:
- OS: Xenial
- OpenStack version: Pike
- Charm: nova-compute revision 311

I can confirm that this issue arises after setting the "libvirt-migration-network" directive via juju.

Prior to setting the directive, migrations are successful.

After setting the directive, the very same migrations fail (I've tested with the same source/destination nodes, and same compute instances).

How I was able to replicate the issue:
1. Set `libvirt-migration-network`:
   juju config nova-compute-kvm libvirt-migration-network=10.35.102.0/24
2. Wait until the nova-compute-kvm units finish "executing" and return to an "active/idle" state. While in the "executing" state, I can confirm that there does appear to be an attempt to exchange SSH keys: "(config-changed) SSH key exchange"
3. Attempt the migration:
   openstack server migrate 3c70bf83-694d-4bff-a2c9-c6d50cf15c62 --block-migration --live openstack-11

But this results in the following error (taken from the source node's /var/log/nova/nova-compute.log):
2020-01-24 01:05:42.317 1214121 ERROR nova.virt.libvirt.driver [req-78337918-e580-4b64-ac73-227fb24c19b2 7a5e20f2d1fc4af18f959a4666c2265c b07f32d8f1f84ba7bbe821ee7fa4f09a - f750199c451f432f9d615a147744f4f5 f750199c451f432f9d615a147744f4f5] [instance: 3c70bf83-694d-4bff-a2c9-c6d50cf15c62] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://10.35.102.62/system: Cannot recv data: Host key verification failed.: Connection reset by peer: libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://10.35.102.62/system: Cannot recv data: Host key verification failed.: Connection reset by peer
2020-01-24 01:05:42.399 1214121 ERROR nova.virt.libvirt.driver [req-78337918-e580-4b64-ac73-227fb24c19b2 7a5e20f2d1fc4af18f959a4666c2265c b07f32d8f1f84ba7bbe821ee7fa4f09a - f750199c451f432f9d615a147744f4f5 f750199c451f432f9d615a147744f4f5] [instance: 3c70bf83-694d-4bff-a2c9-c6d50cf15c62] Migration operation has aborted

I can confirm that the source node can reach the destination node on port 22 (confirmed via netcat/telnet).

I can confirm that once the libvirt-migration-network directive is unset, migrations can be performed successfully again.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This *might* be due to host key caching that the nova-cloud-controller is now doing. Please could you try to clear the host key cache on the nova-cloud-controller(s) unit using the juju action:

    juju run-action nova-cloud-controller/0 clear-unit-knownhost-cache

This needs to be run on all the nova-cloud-controllers.

This will re-find all the SSH keys on all the nova-compute units and re-share them. If this fixes the problem, then the issue is that the cache is not cleared and the keys are not re-seeded when the migration network is changed.
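For reference, running the action across every nova-cloud-controller unit can be scripted as follows. This is only a sketch: the unit names passed in are placeholders for whatever `juju status` shows in your deployment, and it assumes a Juju 2.x client where `juju run-action --wait` is available.

```shell
# Run the clear-unit-knownhost-cache action on each named
# nova-cloud-controller unit in turn, waiting for each to finish.
clear_knownhost_caches() {
    for unit in "$@"; do
        juju run-action --wait "$unit" clear-unit-knownhost-cache
    done
}
```

Example invocation: `clear_knownhost_caches nova-cloud-controller/0 nova-cloud-controller/1 nova-cloud-controller/2`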

Revision history for this message
Xav Paice (xavpaice) wrote :

Running those actions resulted in:

2020-02-06 23:33:41 DEBUG clear-unit-knownhost-cache # 10.35.101.62:22 SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.3
  2020-02-06 23:33:41 DEBUG worker.uniter.jujuc server.go:182 running hook tool "juju-log"
  2020-02-06 23:33:41 INFO juju-log Adding SSH host key to known hosts for compute node at 10.35.101.62.
 (etc etc)

i.e. the incorrect (OAM network) address, not an address in the libvirt-migration-network range

The action itself also failed with "DEBUG clear-unit-knownhost-cache ERROR key "Units updated" must start and end with lowercase alphanumeric, and contain only lowercase alphanumeric, hyphens and periods".

I tried manually accepting the host keys, by ssh'ing as nova and root from each compute host to every other compute host using the 10.35.102.0/24 addresses. That did allow live migration to complete, but as a workaround it's a very painful process.
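The manual acceptance described above can be approximated non-interactively with `ssh-keyscan`. This is only a sketch of the workaround, not charm behaviour: the addresses and the known_hosts path are assumptions for your environment, and it would still need running for both the nova and root users on every compute host.

```shell
# Append each peer's host key (hashed, via -H) to the given known_hosts
# file, avoiding the interactive "accept host key" prompt per host.
seed_known_hosts() {
    known_hosts=$1
    shift
    for addr in "$@"; do
        ssh-keyscan -H "$addr" >> "$known_hosts"
    done
}
```

Example: `seed_known_hosts ~/.ssh/known_hosts 10.35.102.61 10.35.102.62`, run as each of the nova and root users. Note that anything Juju writes to known_hosts later may still overwrite these entries, which is the crux of this bug.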

I also note https://bugs.launchpad.net/charm-nova-compute/+bug/1823309 - in our case, the hosts all resolve to the address on 10.35.101.0/24, which is the one we don't want to use for live migration.

Revision history for this message
Xav Paice (xavpaice) wrote :

Subscribed field-medium - this affects a production cloud where we have a workaround in place, but because known_hosts is managed by Juju, we do not know whether the workaround is permanent.

Changed in charm-nova-cloud-controller:
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)

Fix proposed to branch: master
Review: https://review.opendev.org/706536

Changed in charm-nova-cloud-controller:
status: New → In Progress
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Note that the proposed fix only addresses the action failure. I'm looking into the rest of the issue.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Hi Xav

Please could you supply the bundle that this is occurring on, and a crashdump?

Thanks!

Changed in charm-nova-compute:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.opendev.org/706536
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=67dfc8d882995b11707b42f4758f4d9c303eb7b6
Submitter: Zuul
Branch: master

commit 67dfc8d882995b11707b42f4758f4d9c303eb7b6
Author: Alex Kavanagh <email address hidden>
Date: Fri Feb 7 14:57:02 2020 +0000

    Fix action replay for clear-knownhost-cache

    The return key was illegal in Juju, so this patchset makes
    it legal.

    Change-Id: I2ee633ba0b445025a789a77e62950cd572636c6c
    Partial-Bug: #1860743

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Just my 2 cents.

I'm having the same experience as Xav. On nova-compute revision 311, Nova tries to connect via qemu+tcp and gets connection refused as expected; with nova-compute revision 312, I'm seeing this problem with SSH keys.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/20.02)

Fix proposed to branch: stable/20.02
Review: https://review.opendev.org/713728

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/20.02)

Reviewed: https://review.opendev.org/713728
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=aa8670b427bb6f7dd6391919f47b083fdbb78325
Submitter: Zuul
Branch: stable/20.02

commit aa8670b427bb6f7dd6391919f47b083fdbb78325
Author: Alex Kavanagh <email address hidden>
Date: Fri Feb 7 14:57:02 2020 +0000

    Fix action replay for clear-knownhost-cache

    The return key was illegal in Juju, so this patchset makes
    it legal.

    Change-Id: I2ee633ba0b445025a789a77e62950cd572636c6c
    Partial-Bug: #1860743
    (cherry picked from commit 67dfc8d882995b11707b42f4758f4d9c303eb7b6)

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

I just hit it again. Migration fails because of the destination's host key, even though it worked fine with libvirt-migration-network configured to match the internal space for 2-3 re-deployments in a row before this one.

The workaround is to do SSH manually and accept the host key, but as Xav mentioned that's not very convenient for the customer deployment.

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Please disregard the previous comment. I just realized that I tried to migrate between units of different nova-compute application instances.

Changed in charm-nova-cloud-controller:
assignee: Alex Kavanagh (ajkavanagh) → nobody
status: In Progress → Confirmed
Revision history for this message
Xav Paice (xavpaice) wrote :

Have hit this again. Collecting a crashdump for the cloud, however it needs to remain confidential - please advise which logs etc you need to see.

Changed in charm-nova-compute:
status: Incomplete → New
Changed in charm-nova-compute:
status: New → Confirmed
Changed in charm-nova-cloud-controller:
importance: Undecided → High
Changed in charm-nova-compute:
importance: Undecided → Medium
importance: Medium → High
Changed in charm-nova-cloud-controller:
assignee: nobody → Alex Kavanagh (ajkavanagh)
Changed in charm-nova-compute:
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I think I've worked out what is going on:

1. The nova-cloud-controller charm caches hostnames by default.
2. When libvirt-migration-network is changed (in nova-compute), the cloud-compute relation-changed hook updates the private address to an address in the specified migration network. This is propagated to nova-cloud-controller.
3. With caching enabled, the hostname for ssh keys doesn't get updated and so isn't pushed out to the other nova-compute units.
4. When a migration is attempted, the hosts are still on the old network.

In order to resolve this, the hostname caching needs to be overridden when the cloud-compute relation-changed hook fires - but caching was added precisely to speed this up in large clouds.

In order to verify that this is what is happening, the action clear-unit-knownhost-cache should be run on nova-cloud-controller, which should clear all the hosts and then re-populate them with the new migration network private-address as set from nova-compute.

If that doesn't clear it, then something else is stopping the correct addresses from being propagated.

Revision history for this message
Marco Marino (marino-mrc) wrote :

Hi Alex,
thanks for your work.

I hit this again on Focal/Ussuri but I have one doubt:

I have a dedicated network for live migration but checking the docs, I don't see any mention of the fact that nova-cloud-controller units must have an IP on the same live migration network.

In nova-compute you can configure this in two ways:
1. Using a bind named "migration"
2. Using the configuration param "libvirt-migration-network"

After doing this, I see that the "clear-unit-knownhost-cache" action is trying to connect to compute nodes through the live migration network but n-c-cs don't have an IP on that network. Also, n-c-c doesn't offer any bind for this.

So, the question is:
Should we add a binding to n-c-c? Or at least specify in the docs that when a dedicated live-migration network is used on nova-compute, a constraint (space=livemigrationnet) is needed on n-c-c?

Thank you.
Regards,
Marco

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Just circling back to this one; I know it has been a while, but it's still a pertinent issue.

> After doing this, I see that the "clear-unit-knownhost-cache" action is trying to connect to compute nodes through the live migration network but n-c-cs don't have an IP on that network. Also, n-c-c doesn't offer any bind for this.

Yes, the action ultimately uses `ssh-keyscan` to fetch the SSH host keys of the nova-compute hosts so that they can be populated across the cluster. In order to do this, the nova-cc unit running the ssh-keyscan needs to be able to reach the nova-compute hosts on some network.

Whether it needs to be on the libvirt-migration-network is more questionable. nova-cloud-controller has bindings for internal, public and admin, and nova-compute has bindings for internal and migration. As internal is used for console access (on nova-compute), maybe that same binding could be used for nova-cloud-controller?

Obviously, there could be a limitation that I've not considered here, so please feel free to criticise it! It may be that we should add the migration binding to nova-cc as the cleanest solution?
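Whichever binding is chosen, a quick sanity check from an n-c-c unit is to confirm it can actually reach each compute host's SSH port before expecting ssh-keyscan to succeed. A sketch, using the addresses from this report (assumptions for any other deployment):

```shell
# Report whether each address accepts TCP connections on port 22,
# i.e. whether ssh-keyscan run from this unit could plausibly succeed.
check_ssh_reachable() {
    for addr in "$@"; do
        if nc -z -w 3 "$addr" 22 2>/dev/null; then
            echo "$addr: port 22 reachable"
        else
            echo "$addr: port 22 UNREACHABLE"
        fi
    done
}
```

Example: `check_ssh_reachable 10.35.102.61 10.35.102.62` run on each nova-cloud-controller unit; any UNREACHABLE line means the unit lacks a route to the migration network, matching Marco's observation above.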
