Comment 5 for bug 2019190

melanie witt (melwitt) wrote : Re: [RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)

Generally, nova gets the volume locations from cinder as a field called 'connection_info' which belongs to a volume attachment.

The way retype usually works is that cinder creates a new, empty volume with the destination volume type and then calls the nova swap_volume API [1] to swap from the original source volume to the new destination volume. Nova calls the cinder API to create a new attachment for the destination volume. Then nova gathers the nova-compute host connector and calls the cinder API to update the attachment with that connector. The cinder API returns the new connection_info from this call. Nova calls down into the libvirt driver to connect the new volume and copy the volume data from the old volume to the new one, using the new connection_info for the destination libvirt XML. Finally, nova disconnects the old volume.
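To make the sequence above easier to follow, here is a rough sketch of it in Python. This is not nova or cinder code: FakeCinder, FakeGuest, swap_volume and the "location" key are all made-up stand-ins for illustration, and the real connection_info layout depends on the volume driver.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCinder:
    """Hypothetical stand-in for the cinder API (not the real client)."""
    attachments: dict = field(default_factory=dict)

    def attachment_create(self, volume_id, instance_id):
        # New attachment for the destination volume; no connection_info yet.
        attachment_id = f"attachment-{len(self.attachments) + 1}"
        self.attachments[attachment_id] = {
            "volume_id": volume_id,
            "instance_id": instance_id,
            "connection_info": None,
        }
        return attachment_id

    def attachment_update(self, attachment_id, connector):
        # Updating with the host connector is what yields connection_info.
        attachment = self.attachments[attachment_id]
        attachment["connector"] = connector
        # "location" is an illustrative key, not the real schema.
        attachment["connection_info"] = {"location": attachment["volume_id"]}
        return attachment["connection_info"]


@dataclass
class FakeGuest:
    """Hypothetical stand-in for the libvirt driver's view of one guest."""
    disks: dict = field(default_factory=dict)  # volume_id -> connection_info
    data: dict = field(default_factory=dict)   # volume_id -> contents

    def connect(self, volume_id, connection_info):
        self.disks[volume_id] = connection_info
        self.data.setdefault(volume_id, b"")

    def copy(self, src_volume_id, dst_volume_id):
        self.data[dst_volume_id] = self.data[src_volume_id]

    def disconnect(self, volume_id):
        del self.disks[volume_id]


def swap_volume(cinder, guest, instance_id, old_volume_id, new_volume_id,
                connector):
    # 1. Create a new attachment for the destination volume.
    attachment_id = cinder.attachment_create(new_volume_id, instance_id)
    # 2. Update it with the host connector; cinder returns connection_info.
    new_info = cinder.attachment_update(attachment_id, connector)
    # 3. Connect the new volume and copy the data across, using the new
    #    connection_info for the destination disk.
    guest.connect(new_volume_id, new_info)
    guest.copy(old_volume_id, new_volume_id)
    # 4. Disconnect the old volume.
    guest.disconnect(old_volume_id)
    return new_info


cinder = FakeCinder()
guest = FakeGuest(disks={"vol-old": {"location": "vol-old"}},
                  data={"vol-old": b"boot data"})
swap_volume(cinder, guest, "instance-1", "vol-old", "vol-new",
            connector={"host": "compute-1"})
print(guest.disks)
```

The point of the sketch is that the guest only ends up with the new location because the swap path hands it the new connection_info. If that path is never invoked, the guest side keeps whatever location it had before.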

However, from what I can tell from reading the code, in the case of the RBD driver on the cinder side, nova is not called at all as part of the retype process, so nova doesn't know about the new volume location when it goes to generate the guest XML.

I also found a recent mention of this issue on the ceph-users mailing list:

https://<email address hidden>/thread/TJO6YBJFHCY743UPQDY4D4PENZDQFAHH

which pointed to these posts on the openstack-discuss mailing list:

https://lists.openstack.org/pipermail/openstack-discuss/2023-June/034160.html

https://lists.openstack.org/pipermail/openstack-discuss/2023-June/034165.html

According to the second post, the retype of attached RBD volumes was working in Victoria as long as the [nova] section of cinder.conf was configured, and then it stopped working in Wallaby. The second post noted https://bugs.launchpad.net/cinder/+bug/1886543 as the only change around retype for Wallaby, so is it possible that change is related?

I think this bug is Critical given it's a regression and has potential for data loss. Please let me know if I've got anything wrong here and/or if anything is needed on the nova side.

[1] https://github.com/openstack/cinder/blob/5728d3899f13140203d44259ca8dfb7ae132e192/cinder/volume/manager.py#L2429