live-migration fails if nova-compute is added to a machine after ceph-osd has been added

Bug #1677707 reported by Drew Freiberger
This bug affects 3 people.

Affects: OpenStack Nova Compute Charm
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

While digging into a live-migration issue, I found that all of my previously-live nova-compute units are unable to connect (because of host-key verification failures) to qemu+ssh://newsystem/. Error in the log:

Failed to connect to remote libvirt URI qemu+ssh://compute-storage-002/system: Cannot recv data: Host key verification failed.

I was able to work around this by connecting to the source nova hypervisor, becoming root, ssh-ing to the destination, and accepting the host key; after that, live-migration to the new hypervisor succeeded.
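
For reference, a minimal sketch of that manual workaround (hostnames are illustrative; compute-storage-001 stands in for the source hypervisor):

# On the source hypervisor, as root, accept the destination's host key
# so libvirt's qemu+ssh transport stops failing verification:
ssh compute-storage-001
sudo -i
ssh compute-storage-002 true    # answer "yes" at the host-key prompt

# Non-interactive alternative (note: this trusts the scanned key blindly):
ssh-keyscan -H compute-storage-002 >> /root/.ssh/known_hosts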

Four of the 64 installed nova-compute instances had the ceph-osd charm installed and functioning before nova-compute was installed. Some nodes that had the ceph-osd charm installed first, but not functioning (because the ceph-osd install hook failed on missing OSD path mounts), do not exhibit this live-migration failure.

I think there is a problem in the interaction between the ssh host-key sharing that both the ceph-osd and nova-compute charms perform: nova-compute's cross-host pollination of ssh host keys breaks if ceph-osd pollinates host keys before nova-compute does.
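
One way to test this hypothesis is to check which compute units actually hold the new hypervisor's host key. This is a sketch assuming the distributed keys end up in /root/.ssh/known_hosts on the compute hosts (the charm's actual target path may differ); ssh-keygen -F also matches hashed known_hosts entries:

# Units that return no match are the ones that will fail
# host-key verification during live-migration:
juju run --application nova-compute 'sudo ssh-keygen -F compute-storage-002 -f /root/.ssh/known_hosts'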

This was noticed on xenial (16.04.2 LTS) units under Juju 2.1.2.1, with nova-compute charm revision 135 and ceph-osd charm revision 17.

Any recommendations to troubleshoot or resolve this would be helpful.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'm still digging into the issue to determine what's different about these nodes, but I think this may be a safeguard in the nova-compute charm.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Hello Xavi,

I deployed 3 new units of nova-compute on a xenial/mitaka cloud, on top of my
current osd units (--to ceph/{0,2}) in an environment that already had 2 nova-compute
units, and I was unable to reproduce the issue you presented in this case:
the live migration between an already existing host and the new one finished correctly.
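
A sketch of that placement (assuming, as in the status output below, that ceph/0 through ceph/2 sit on machines 1 through 3):

juju add-unit nova-compute -n 3 --to 1,2,3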

Unit Workload Agent Machine Public address Ports Message
ceph/0* active idle 1 10.5.2.49 Unit is ready and clustered
ceph/1 active idle 2 10.5.2.51 Unit is ready and clustered
ceph/2 active idle 3 10.5.2.52 Unit is ready and clustered
nova-compute/3 active executing 1 10.5.2.49 (update-status) Unit is ready
  ceilometer-agent/4 active idle 10.5.2.49 Unit is ready
  neutron-openvswitch/4 active idle 10.5.2.49 Unit is ready
nova-compute/4 active executing 2 10.5.2.51 (update-status) Unit is ready
  ceilometer-agent/5 active idle 10.5.2.51 Unit is ready
  neutron-openvswitch/5 active idle 10.5.2.51 Unit is ready
nova-compute/5 active executing 3 10.5.2.52 (update-status) Unit is ready
  ceilometer-agent/3 active idle 10.5.2.52 Unit is ready
  neutron-openvswitch/3 active idle 10.5.2.52 Unit is ready

On the NCC side I can see all the authorized hosts/keys, and the authorized keys and
known hosts on the nova-compute side seem right.

ubuntu@niedbalski-xenial-bastion:~/openstack-charm-testing$ juju run --application nova-cloud-controller "sudo wc -l /etc/nova/compute_ssh/nova-compute/authorized_keys /etc/nova/compute_ssh/nova-compute/known_hosts"
- Stdout: |2
        6 /etc/nova/compute_ssh/nova-compute/authorized_keys
       18 /etc/nova/compute_ssh/nova-compute/known_hosts
       24 total
  UnitId: nova-cloud-controller/0
- Stdout: |2
        6 /etc/nova/compute_ssh/nova-compute/authorized_keys
       18 /etc/nova/compute_ssh/nova-compute/known_hosts
       24 total
  UnitId: nova-cloud-controller/1
- Stdout: |2
        6 /etc/nova/compute_ssh/nova-compute/authorized_keys
       18 /etc/nova/compute_ssh/nova-compute/known_hosts
       24 total
  UnitId: nova-cloud-controller/2

ubuntu@niedbalski-xenial-bastion:~$ juju run --application nova-compute 'relation-get -r `relation-ids cloud-compute` - nova-cloud-controller/0'|grep nova | grep index
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"
    nova_authorized_keys_max_index: "6"
    nova_known_hosts_max_index: "18"

Is there any detail on this reproducer that I might have been...


Changed in charm-nova-compute:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack nova-compute charm because there has been no activity for 60 days.]

Changed in charm-nova-compute:
status: Incomplete → Expired