live-migration fails with ssh due to offending key for ip in known_hosts

Bug #1628216 reported by Eric Desrochers
This bug affects 6 people
Affects                                          Status         Importance   Assigned to   Milestone
OpenStack Nova Cloud Controller Charm            Fix Released   High         James Page    18.08
nova-cloud-controller (Juju Charms Collection)   Invalid        Medium       Unassigned

Bug Description

It has been brought to my attention that live migration between compute nodes fails with "Host key verification failed", even when the following has been set:

$juju set nova-compute-kvm enable-live-migration=True
$juju set nova-compute-kvm migration-auth-type=ssh

In this case it is an Autopilot/Landscape deployment (where the charm versions are pinned); the nova-compute-kvm node has two NICs: eth0 on the x.x.x.x/x subnet and juju-br0 (eth4) on the y.y.y.y/y subnet.

The same problem also occurs when adding new compute-node machines/units.

Live migration doesn't work:

"... ERROR nova.virt.libvirt.driver .... Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://<HOST>/system: Cannot recv data: Host key verification failed.: Connection reset by peer"

$ juju run --unit nova-cloud-controller/0 "unit-get private-address"
<IP_OF_JUJU_BR0_SUBNET>

$ juju run --unit nova-cloud-controller/0 "unit-get public-address"
<IP_OF_JUJU_BR0_SUBNET>

$ juju run --unit nova-compute-kvm/0 "unit-get private-address"
<IP_OF_JUJU_BR0_SUBNET>

$ juju run --unit nova-compute-kvm/0 "unit-get public-address"
<IP_OF_JUJU_BR0_SUBNET>

Testing ssh by hand as user 'root':

$ ssh root@<IP_OF_ETH0_SUBNET> works, without complaining about an offending key for the IP.

$ ssh root@<IP_OF_JUJU_BR0_SUBNET> does not work; it complains about an offending key for the IP.

Some workarounds:
* Manually remove the offending entries from /root/.ssh/known_hosts (e.g. with ssh-keygen -R <address>).
* Set "StrictHostKeyChecking no" and restart ssh.
* ...

Related src code:
https://github.com/openstack/charm-nova-cloud-controller/blob/fbd0d368c3700b3ef7beaa63d0afd48126e53206/hooks/charmhelpers/contrib/network/ip.py#L433
https://github.com/openstack/charm-nova-cloud-controller/blob/master/hooks/nova_cc_utils.py#L749

Changed in nova-cloud-controller (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Eric Desrochers (slashd) wrote :

[Services]
NAME STATUS EXPOSED CHARM
base-machine unknown false cs:trusty/ubuntu-0
ceilometer active false cs:trusty/ceilometer-237
ceilometer-agent false cs:trusty/ceilometer-agent-233
ceph-mon active false cs:trusty/ceph-mon-3
ceph-osd active false cs:trusty/ceph-osd-236
ceph-radosgw active false cs:trusty/ceph-radosgw-242
cinder active false cs:trusty/cinder-255
glance active false cs:trusty/glance-251
glance-simplestreams-sync active false cs:~landscape-charmers/trusty/glance-simplestreams-sync-12
hacluster-ceilometer false cs:trusty/hacluster-29
hacluster-ceph-radosgw false cs:trusty/hacluster-29
hacluster-cinder false cs:trusty/hacluster-29
hacluster-glance false cs:trusty/hacluster-29
hacluster-keystone false cs:trusty/hacluster-29
hacluster-mysql false cs:trusty/hacluster-29
hacluster-neutron-api false cs:trusty/hacluster-29
hacluster-nova-cloud-controller false cs:trusty/hacluster-29
hacluster-openstack-dashboard false cs:trusty/hacluster-29
keystone active false cs:trusty/keystone-256
landscape-client false cs:trusty/landscape-client-14
mongodb unknown false cs:trusty/mongodb-37
mysql active false cs:trusty/percona-cluster-244
nagios unknown false cs:trusty/nagios-10
neutron-api active false cs:trusty/neutron-api-244
neutron-gateway active false cs:trusty/neutron-gateway-229
neutron-openvswitch false cs:trusty/neutron-openvswitch-236
nova-cloud-controller blocked false cs:trusty/nova-cloud-controller-287

description: updated
Eric Desrochers (slashd)
description: updated
Revision history for this message
Eric Desrochers (slashd) wrote :

Additional information:

The charm revision numbers in Landscape Autopilot are pinned.
As of today, upgrading the charms would require redeploying from scratch.

tags: added: sts
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Eric, do you think you could provide the output of:

  juju run --unit nova-compute/0 "relation-get -r ${rid} - nova-compute/0"

For:

  rid=`juju run --unit nova-compute/0 "relation-ids cloud-compute"`

And

  rid=`juju run --unit nova-compute/0 "relation-ids compute-peer"`

Revision history for this message
Eric Desrochers (slashd) wrote :

@hopem,

Here's the requested output:

$ rid1=`juju run --unit nova-compute-kvm/0 "relation-ids compute-peer"`
$ rid2=`juju run --unit nova-compute-kvm/0 "relation-ids cloud-compute"`

$ echo $rid1
compute-peer:1

$ echo $rid2
cloud-compute:94

$ juju run --unit nova-compute-kvm/0 "relation-get -r ${rid1} - nova-compute-kvm/0"
private-address: <HOSTNAME>.maas

$ juju run --unit nova-compute-kvm/0 "relation-get -r ${rid2} - nova-compute-kvm/0"
hostname: <HOSTNAME>
migration_auth_type: ssh
private-address: <HOSTNAME>.maas
ssh_public_key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBPrmRwb80u349754XiO6tqaUtIZpAALlRn2WC55ivsChbBs/8go2WFsUN7+DqPqAymq6cYvrU9uRhUeqTag8AHXoOq0WaPFdJFVCrLItzrF/ty2Vi+SCn1eijC36q4x+r7nycFNojHa4d2i43RmN5rENlqSeuc4Dwn4M5m4keTVHKzRjvLwnz3mJpwA2HCbJxqM9FEgfrNQmIkzHw85eJRoRyP9H9Vtfuh6p+TGi7m/qzYYUt4U6lOUox/RsQ9kJUcLAFqylEST3gE4HLktvFpcXFHfQfGN4TA2S9TTFjzSKMnvGcJ9xI+VjTPNQyr7tAF0511XCj9cb+RmZBZ0k9

private-address is the same hostname FQDN in both relation-get queries.

Revision history for this message
James Page (james-page) wrote :

This is a little tricky as migration ties into DNS hostname resolution (as nova/libvirt use an unqualified hostname as the target for a migration).

Can you confirm which of the IP addresses on the host the <HOST> name resolves to? That might give us a clue as to what's happening here.

FTR "unit-get private-address" is not consistent, so we're moving the charms over to using network-get when deployed with Juju 2.0, which should ensure we at least get some sane behaviour in this respect.

Revision history for this message
James Page (james-page) wrote :

Summary (I think)

<HOST> resolves to <IP_OF_ETH0_SUBNET>
unit-get private-address -> <IP_OF_JUJU_BR0_SUBNET>

or I might have some of that the wrong way around. Either way, having multiple configured NICs is definitely the root cause; we just need to figure out a way forward on this with Juju 2.0 network spaces.

tags: added: networking
Changed in nova-cloud-controller (Juju Charms Collection):
status: Confirmed → Triaged
importance: Undecided → Medium
James Page (james-page)
Changed in charm-nova-cloud-controller:
importance: Undecided → Medium
status: New → Triaged
Changed in nova-cloud-controller (Juju Charms Collection):
status: Triaged → Invalid
Revision history for this message
Kevin Metz (pertinent) wrote :

The same issue is still occurring with Juju 2.0.2.

Mar 17 01:00:46 FLEX-1 nova-compute[14773]: libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://FLEX-4/system: Cannot recv data: Host key verification failed.: Connection reset by peer
Mar 17 01:00:46 FLEX-1 nova-compute[14773]: 2017-03-17 01:00:46.618 14773 ERROR nova.virt.libvirt.driver [req-c1edf902-a66f-40f8-8e63-643af301dc40 3eb635280d9841d8bf57b3df015e476a 19e02021261a4233ade31a03203d4dea - - -] [instance: d826a1ed-7378-4ee9-821f-7f63e4cf7b40] Migration operation has aborted

tags: added: canonical-bootstack
Revision history for this message
Jill Rouleau (jillrouleau) wrote :

Hitting this in the same environment Kevin was referring to, redeployed on 2.1.x and just upgraded to 2.2.1.1. In the output below, 172.31/16 is the private/OS mgmt network and 172.30.60/22 is the IPMI/PXE network.

jujumanage@maas:~$ juju run --unit=compute/4 'unit-get private-address'
172.31.1.195
FLEX-6.example.com. 30 IN A 172.31.1.195

jujumanage@maas:~$ juju run --unit nova-compute/24 "relation-ids cloud-compute"
cloud-compute:159
jujumanage@maas:~$ juju run --unit nova-compute/24 "relation-get -r cloud-compute:159 - nova-compute/24"
hostname: FLEX-6
migration_auth_type: ssh
nova_ssh_public_key: ssh-rsa xxxxkeyxxxx
  root@FLEX-6
private-address: 172.30.63.59
ssh_public_key: ssh-rsa aaaakeyaaaa root@FLEX-6

jujumanage@maas:~$ juju run --unit=nova-compute/24 'unit-get private-address'
172.31.1.195

Trying to migrate from FLEX-6 to:
FLEX-5 (nova-compute/21)
inet 172.31.1.189/16 brd 172.31.255.255 scope global bond0
inet 172.30.63.58/22 brd 172.30.63.255 scope global bond1

root@FLEX-6:~# ssh -vvvv -o BatchMode=yes 172.31.1.189 mkdir -p /var/lib/nova/instances/16981efa-f453-46ef-ab74-2ac8f77b503e
---snip---
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug3: receive packet: type 31
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:XXXXXXXXXXXXXXXXXXXAcJE
debug3: hostkeys_foreach: reading file "/root/.ssh/known_hosts"
Host key verification failed.

root@FLEX-6:~# ssh -vvvv -o BatchMode=yes 172.30.63.58 mkdir -p /var/lib/nova/instances/16981efa-f453-46ef-ab74-2ac8f77b503e
Succeeds

Worked around for now by adding the following to /var/lib/nova/.ssh/config on all hypervisors:
Host $relevant_host_pattern
    StrictHostKeyChecking no

Revision history for this message
James Page (james-page) wrote :

Grokking the code:

@hooks.hook('cloud-compute-relation-joined')
def compute_joined(rid=None):
    # NOTE(james-page) in MAAS environments the actual hostname is a CNAME
    # record so won't get scanned based on private-address which is an IP
    # add the hostname configured locally to the relation.
    settings = {
        'hostname': gethostname(),
        'private-address': get_relation_ip(
            'cloud-compute', cidr_network=config('os-internal-network')),
    }

the get_relation_ip call will (see the sketch after this list):

a) use os-internal-network if set - the legacy approach to multiple networks
b) use the network space binding of the 'cloud-compute' relation
c) fall back to 'unit-get private-address'
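
A minimal, self-contained sketch of that precedence (illustrative only; the three address sources are passed in as plain arguments instead of being looked up via the real charmhelpers/Juju tools):

def pick_relation_address(cidr_address, space_binding_address, unit_private_address):
    """Mirror the a/b/c precedence described above."""
    # a) os-internal-network is set: use the local address in that CIDR
    if cidr_address:
        return cidr_address
    # b) Juju >= 2.0: use the address of the network space the
    #    'cloud-compute' relation is bound to (network-get)
    if space_binding_address:
        return space_binding_address
    # c) last resort: 'unit-get private-address', which is not guaranteed
    #    to pick a consistent NIC on a multi-homed host
    return unit_private_address


# With no os-internal-network and no space binding, the juju-br0 address
# wins, which is what produces the known_hosts mismatch reported here:
print(pick_relation_address(None, None, '<IP_OF_JUJU_BR0_SUBNET>'))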

Revision history for this message
James Page (james-page) wrote :

c) should not be hit in a Juju 2.0 or later environment where the network-get tool is provided.

Revision history for this message
James Page (james-page) wrote :

Might be fixed under bug 1686882

James Page (james-page)
Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
importance: Medium → High
assignee: nobody → James Page (james-page)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)

Fix proposed to branch: master
Review: https://review.openstack.org/582611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.openstack.org/582611
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=36ccf4ee97d6848b79aca67f64c32eb3115fd269
Submitter: Zuul
Branch: master

commit 36ccf4ee97d6848b79aca67f64c32eb3115fd269
Author: James Page <email address hidden>
Date: Fri Jul 13 15:52:48 2018 +0100

    Use lowercase hostnames for SSH known hosts

    OpenSSH will lowercase any hostname; ensure that hostnames
    for known_host entries are also lower case, avoiding any
    authenticity of host type issues during live migration and
    resize operations.

    Change-Id: Ie5ab0774b49fc0d753ff1c26eb041f1ceb35e8fb
    Closes-Bug: 1628216
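
For context, a minimal sketch of the idea behind the fix (not the charm's actual code): per the commit message, OpenSSH lowercases host names, so the name written into known_hosts is normalised the same way.

def known_hosts_line(hostname, ip, key_type, key):
    # Lowercase the recorded name so it matches what OpenSSH looks up,
    # avoiding the mismatch when the compute host reports e.g. 'FLEX-6'
    # but ssh searches for 'flex-6'.
    return '{},{} {} {}'.format(hostname.lower(), ip, key_type, key)


print(known_hosts_line('FLEX-6', '172.31.1.195', 'ssh-rsa', 'AAAA...'))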

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/18.05)

Fix proposed to branch: stable/18.05
Review: https://review.openstack.org/582933

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-cloud-controller (stable/18.05)

Change abandoned by James Page (<email address hidden>) on branch: stable/18.05
Review: https://review.openstack.org/582933
Reason: Will be superseded by the 18.08 release.

David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: none → 18.08
James Page (james-page)
Changed in charm-nova-cloud-controller:
status: Fix Committed → Fix Released