Live migration fails with ceph rbd attached block device between different nova-compute instances

Bug #1606344 reported by Peter Sabaini
This bug affects 4 people
Affects                                 Status    Importance  Assigned to  Milestone
OpenStack Nova Compute Charm            Triaged   Medium      Unassigned
nova-compute (Juju Charms Collection)   Invalid   Medium      Unassigned

Bug Description

We have two nova-compute services with distinct names to support different types of hardware.

When live migrating instances with attached volumes across those services, we run into authentication issues against ceph.

ERROR nova.virt.libvirt.driver [req-...] [instance: ...] Live Migration failure: internal error: process exited while connecting to monitor: 2016-07-20T08:16:15.614465Z qemu-system-x86_64: -drive file=rbd:cinder/volume-7ecfac34-afdb-4944-8670-154e5631671c:id=nova-compute:key=Abarkey==:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789,format=raw,if=none,id=drive-virtio-disk1,serial=7ecfac34-afdb-4944-8670-154e5631671c,cache=none: error connecting

This is, afaict, because the ceph charm sets up credentials per service and considers those two nova-compute deployments to be separate services:

$ sudo ceph auth list
...
client.compute-only
key: Afookey==
caps: [mon] allow rw
caps: [osd] allow rwx
...
client.nova-compute
key: Abarkey==
caps: [mon] allow rw
caps: [osd] allow rwx

Also cf. the ceph charm, hooks/ceph_hooks.py:L597 (the client-relation-joined hook function). This function sets up client authentication per service and determines the service name by splitting the peer unit name, as sketched below.
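
For illustration, a minimal sketch (not the actual charm code) of the naming behaviour described above, where the cephx client name is derived by splitting the remote unit name:

# Minimal sketch, assuming the hook splits the remote unit name on '/'
# to obtain the service name and creates one cephx user per service.
def ceph_client_name(remote_unit):
    service = remote_unit.split('/')[0]
    return 'client.' + service

print(ceph_client_name('nova-compute/0'))  # client.nova-compute
print(ceph_client_name('compute-only/3'))  # client.compute-only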

James Page (james-page)
affects: ceph (Juju Charms Collection) → nova-compute (Juju Charms Collection)
Changed in nova-compute (Juju Charms Collection):
importance: Undecided → Medium
summary: - Live migration fails for named services
+ Live migration fails with ceph rbd attached block device between
+ different nova-compute instances
tags: added: live-migration
Revision history for this message
James Page (james-page) wrote:

As the key is encoded in the data, this looks odd to me:

"key=Abarkey==:auth_supported=cephx\;none:mon_host=1.1.1.1\:6789\;1.1.1.2\:6789\;1.1.1.3\:6789"

The key is normally encoded using a libvirt secret - I'm not 100% sure of the designed behaviour in libvirt for this particular use case (migrating between compute hosts with different keys stored in the ceph-key secret).
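
For comparison, a minimal sketch (my own illustration, not nova's actual code) of the secret-based auth element libvirt normally uses for an rbd disk, where the domain XML carries only a username and a secret UUID (e.g. nova's rbd_secret_uuid setting) rather than the key itself:

# Sketch of the libvirt <auth> element referencing a ceph secret; the UUID
# below is a placeholder for whatever secret is defined on the hypervisor.
import xml.etree.ElementTree as ET

def rbd_auth_element(ceph_user, secret_uuid):
    auth = ET.Element('auth', username=ceph_user)
    ET.SubElement(auth, 'secret', type='ceph', uuid=secret_uuid)
    return ET.tostring(auth).decode()

print(rbd_auth_element('nova-compute', '00000000-0000-0000-0000-000000000000'))
# e.g. <auth username="nova-compute"><secret type="ceph" uuid="..." /></auth>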

Revision history for this message
James Page (james-page) wrote:

I think having different keys is correct - we'll need to dig into the underlying migration code to figure out why this actually happens.

Revision history for this message
James Page (james-page) wrote:

@Peter

Please can you confirm which OpenStack release you are using, as this will determine the libvirt/qemu versions as well.

Marking 'Incomplete' pending that information - please set back to 'New' once provided.

Changed in nova-compute (Juju Charms Collection):
status: New → Incomplete
Revision history for this message
Peter Sabaini (peter-sabaini) wrote:

Thanks James -- this is Mitaka running on Trusty

ii libvirt-bin 1.3.1-1ubuntu10.1~cloud0
ii nova-compute 2:13.0.0-0ubuntu5~cloud0
ii qemu-system 1:2.5+dfsg-5ubuntu10.2~cloud0

Changed in nova-compute (Juju Charms Collection):
status: Incomplete → New
James Page (james-page)
Changed in charm-nova-compute:
importance: Undecided → Medium
Changed in nova-compute (Juju Charms Collection):
status: New → Invalid
Revision history for this message
Billy Olsen (billy-olsen) wrote:

I've looked into this previously. As Peter mentioned, when a ceph rbd device is attached, the ceph user is included in the volume attachment properties in the libvirt domain XML generated by nova. When this is transferred across to the target hypervisor, the ceph user information must exist on the target side as well.

The ceph-mon creates users based on the name of the remote application that connects. Thus, when two nova-compute applications are deployed with different names, they will have unique ceph credentials to access the cluster, and live migration fails because of this.

To fix this, we should probably add a field that allows the remote service to specify a ceph user name (a rough sketch follows). I've looked at this before, and it was somewhat non-trivial to change.
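
A very rough sketch of that idea (the 'ceph-user' field name and the fallback logic are assumptions, not an agreed interface): the client application would request a ceph user name over the relation, and the ceph charm would only fall back to the unit-derived name when none is given.

# Hypothetical sketch; 'ceph-user' is an assumed relation field name.
def ceph_client_name(remote_unit, relation_data=None):
    requested = (relation_data or {}).get('ceph-user')
    service = requested or remote_unit.split('/')[0]
    return 'client.' + service

# Both compute applications could then share one ceph user, so the volume
# attachment in the libvirt domain XML stays valid on either hypervisor:
print(ceph_client_name('nova-compute/0', {'ceph-user': 'nova-compute'}))  # client.nova-compute
print(ceph_client_name('compute-only/3', {'ceph-user': 'nova-compute'}))  # client.nova-compute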

Changed in charm-nova-compute:
status: New → Triaged