Static Ceph mon IP addresses in connection_info can prevent VM startup

Bug #1452641 reported by Arne Wiebalck
This bug affects 20 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Medium       Unassigned
nova (Ubuntu)              Medium       Unassigned

Bug Description

The Cinder rbd driver extracts the IP addresses of the Ceph mon servers from the Ceph mon map when the instance/volume connection is established. This info is then stored in nova's block-device-mapping table and is never re-validated down the line.
Changing the Ceph mon servers' IP addresses will prevent the instance from booting, as the stale connection info will enter the instance's XML. One idea to fix this would be to use the information from ceph.conf directly, which should be an alias or a load balancer.

Revision history for this message
Josh Durgin (jdurgin) wrote :

Nova stores the volume connection info in its db, so updating that
would be a workaround to allow restart/migration of vms to work.
Otherwise running vms shouldn't be affected, since they'll notice any
new or deleted monitors through their existing connection to the
monitor cluster.

Perhaps the most general way to fix this would be for cinder to return
any monitor hosts listed in ceph.conf (as they are listed, so they may
be hostnames or ips) in addition to the ips from the current monmap
(the current behavior).

That way an out of date ceph.conf is less likely to cause problems,
and multiple clusters could still be used with the same nova node.
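
The proposal above (return the ceph.conf monitor entries as listed, plus the IPs from the current monmap) could be sketched roughly as follows. This is only an illustration, not the Cinder rbd driver code: the function names `parse_mon_host` and `merge_mon_hosts` are hypothetical, and the sketch assumes the ceph.conf monitor list has already been read as a single comma-separated string.

```python
def parse_mon_host(value):
    """Split a comma-separated monitor list (assumed ceph.conf format)."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def merge_mon_hosts(conf_hosts, monmap_hosts):
    """Return ceph.conf entries first, then monmap IPs, without duplicates.

    Entries are kept exactly as listed, so hostnames stay hostnames.
    """
    merged = []
    seen = set()
    for host in list(conf_hosts) + list(monmap_hosts):
        if host not in seen:
            seen.add(host)
            merged.append(host)
    return merged


# Hypothetical values for illustration only.
conf_hosts = parse_mon_host("ceph-mon.example.org, 192.168.200.12")
hosts = merge_mon_hosts(conf_hosts, ["192.168.200.12", "192.168.200.14"])
```

Putting the ceph.conf entries first means a stable alias, if one is configured, is the first address a client would see.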

Changed in cinder:
importance: Undecided → Medium
status: New → Confirmed
Eric Harney (eharney)
tags: added: ceph
Revision history for this message
Dan van der Ster (dan-vanderster) wrote :

The problem with adding hosts to the list in Cinder is that those previous mon hosts might be re-used in another Ceph cluster, thereby causing an authentication error when a VM tries an incorrect mon host at boot time. (This is due to the Ceph client behaviour of not trying another monitor after an authentication error... which I think is rather sane).

Bin Zhou (binzhou)
Changed in cinder:
assignee: nobody → Bin Zhou (binzhou)
Revision history for this message
Sean McGinnis (sean-mcginnis) wrote : Owner Expired

Unassigning due to no activity.

Changed in cinder:
assignee: Bin Zhou (binzhou) → nobody
Eric Harney (eharney)
tags: added: drivers
Changed in cinder:
assignee: nobody → Jon Bernard (jbernard)
Revision history for this message
Kevin Fox (kevpn) wrote :

How are you supposed to deal with needing to re-IP mons?

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote : Bug Assignee Expired

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee: Jon Bernard (jbernard) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

Talked about this at the queens ptg, notes are in here:

https://etherpad.openstack.org/p/cinder-ptg-queens

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
no longer affects: cinder
tags: added: volumes
removed: drivers
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Walt Boring (walter-boring) wrote :

I have a customer that is seeing something similar to this. I thought about filing a new bug, but it might be sufficient to just piggyback on this one.

They have running VMs that boot from a ceph volume and also have ceph volumes attached.
He adds a new monitor to his ceph cluster and updates ceph.conf on all of the openstack nodes to reflect the new monitor IP.

He does a live migration to try and get nova to update the libvirt.xml and it seems that only the volumes section is updated, not the vms section.

He added a patch to migration.py to fix this, but wasn't sure it was the right thing to do. I have added his patch as an attachment here.
Let me know if this might be ok, and I can submit the patch to gerrit.

This is a copy of xml after the live migrate.

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <auth username='nova'>
        <secret type='ceph' uuid='820ccd0b-b180-4528-93ed-76ae82edf832'/>
      </auth>
      <source protocol='rbd' name='vms/3b97914e-3f9b-410a-b3d9-6c1a83244136_disk'> <-- this one is NOT changed, old ips
        <host name='192.168.200.12' port='6789'/>
        <host name='192.168.200.14' port='6789'/>
        <host name='192.168.200.24' port='6789'/>
        <host name='192.168.240.17' port='6789'/>
        <host name='192.168.240.23' port='6789'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <auth username='nova'>
        <secret type='ceph' uuid='820ccd0b-b180-4528-93ed-76ae82edf832'/>
      </auth>
      <source protocol='rbd' name='volumes/volume-6d04520d-0029-499c-af81-516a7ba37a54'> <-- this one is changed, new ips
        <host name='192.168.200.12' port='6789'/>
        <host name='192.168.200.14' port='6789'/>
        <host name='192.168.200.24' port='6789'/>
        <host name='192.168.210.15' port='6789'/>
        <host name='192.168.240.17' port='6789'/>
        <host name='192.168.240.23' port='6789'/>
      </source>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <serial>6d04520d-0029-499c-af81-516a7ba37a54</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>

Revision history for this message
Matt Riedemann (mriedem) wrote :

That patch is way too rbd specific I think. Here is a more detailed conversation we had in IRC, which also goes over some of what was discussed at the Queens PTG:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-01-04.log.html#t2018-01-04T22:26:24

Revision history for this message
Lee Yarwood (lyarwood) wrote :

~~~
<source protocol='rbd' name='vms/3b97914e-3f9b-410a-b3d9-6c1a83244136_disk'> <-- this one is NOT changed, old ips
        <host name='192.168.200.12' port='6789'/>
        <host name='192.168.200.14' port='6789'/>
        <host name='192.168.200.24' port='6789'/>
        <host name='192.168.240.17' port='6789'/>
        <host name='192.168.240.23' port='6789'/>
</source>
~~~

For ephemeral rbd images we fetch the mon ips during the initial instance creation but don't refresh this during LM [1]. IMHO this is a separate issue to the volume connection_info refresh problem being discussed in this bug.

[1] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/storage/rbd_utils.py#L163

Revision history for this message
Walt Boring (walter-boring) wrote :

Thanks Lee,
  I filed a separate bug for updating the rbd images here:
https://bugs.launchpad.net/nova/+bug/1741364

Xav Paice (xavpaice)
tags: added: canonical-bootstack
Revision history for this message
Xav Paice (xavpaice) wrote :

This manifested itself again on a Mitaka cloud. We had moved the Ceph mons; existing running instances were fine and fresh new instances were fine, but when we stopped instances via nova and then started them again, they failed to start. Editing the xml didn't fix anything, of course, because Nova overwrites the xml on machine start.

I ended up fixing the nova db:

update block_device_mapping set connection_info = replace(connection_info, '"a.b.c.d", "a.b.c.e", "a.b.c.f"', '"a.b.c.foo", "a.b.c.bar", "a.b.c.baz"') where connection_info like '%a.b.c.d%'
 and deleted_at is NULL;

The select query could have been better (don't copy me!) but you get the point.
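
A more selective check than a LIKE match could parse connection_info as JSON and compare the listed hosts against the current monitors. The sketch below is only an illustration: the function name `has_stale_mons` is hypothetical, and it assumes connection_info carries a `driver_volume_type` key and a `data.hosts` list for rbd volumes.

```python
import json


def has_stale_mons(connection_info, current_mons):
    """Return True if an rbd connection_info lists a host not in current_mons."""
    info = json.loads(connection_info)
    # Assumption: rbd mappings are marked with driver_volume_type == "rbd".
    if info.get("driver_volume_type") != "rbd":
        return False  # other volume types are left alone
    hosts = info.get("data", {}).get("hosts", [])
    return any(host not in current_mons for host in hosts)
```

Rows for which this returns True are the ones that would need their connection_info updated.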

Subscribing field-high because this is something that will continue to bite folks every time ceph-mon hosts are moved around.

Revision history for this message
James Page (james-page) wrote :

I guess the alternative is to update the mapping for the block device on a stop/start nova operation.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "virt/libvirt/migration.py patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Just to summarize my understanding, and perhaps clarify for others, this bug is focused on stale connection_info for rbd volumes (not rbd images). rbd images have a related issue during live migration that is being handled in a separate bug (see comment 12 above).

Focusing on connection_info for rbd volumes now (and thanks to Matt Riedemann's comments for the tips here). connection_info appears to be properly refreshed for live migration in pre_live_migration() where _get_instance_block_device_info() is called with refresh_conn_info=True (see comment 9 above and https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L5977).

Is the fix as simple as flipping refresh_conn_info=False to True for some of the other calls to _get_instance_block_device_info()? Below is an audit of the _get_instance_block_device_info() calls.

Calls to _get_instance_block_device_info() with refresh_conn_info=False:
  _destroy_evacuated_instances()
  _init_instance()
  _resume_guests_state()
  _shutdown_instance()
  _power_on()
  _do_rebuild_instance()
  reboot_instance()
  revert_resize()
  _resize_instance()
  resume_instance()
  shelve_offload_instance()
  check_can_live_migrate_source()
  _do_live_migration()
  _post_live_migration()
  post_live_migration_at_destination()
  rollback_live_migration_at_destination()

Calls to _get_instance_block_device_info() with refresh_conn_info=True:
  finish_revert_resize()
  _finish_resize()
  pre_live_migration()

Based on xavpaice's comments (see comment 13 above -- "... existing, running, instances were fine, fresh new instances were fine, but when we stopped instances via nova, then started them again, they failed to start ..."), it would seem that the following should also have refresh_conn_info=True:
  _power_on() # solves xavpaice's scenario?
  _do_rebuild_instance()
  reboot_instance()

Revision history for this message
Xav Paice (xavpaice) wrote :

FWIW, in the cloud we saw this, migrating the (stopped) instance also updated the connection info - it was just that migrating hundreds of instances wasn't practical.

Changed in nova:
assignee: nobody → Corey Bryant (corey.bryant)
Changed in nova (Ubuntu):
assignee: nobody → Corey Bryant (corey.bryant)
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I did some initial testing with the default parameter value for refresh_conn_info set to True in _get_instance_block_device_info() and unfortunately an instance with rbd volume attached does not successfully stop/start after ceph-mon's are moved to new IP addresses.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/579004

Changed in nova:
status: Confirmed → In Progress
Changed in nova (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Xav Paice (xavpaice) wrote :

Just a clarification on the process to 'move' ceph-mon units. I added ceph mons to the cluster, and removed the old ones - in this case it was a 'juju add-unit' and 'juju remove-unit' but any process to achieve the same thing would have the same result - the mons are now all on different addresses.

Changed in nova:
assignee: Corey Bryant (corey.bryant) → Seyeong Kim (xtrusia)
Changed in nova:
assignee: Seyeong Kim (xtrusia) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Seyeong Kim (xtrusia)
Changed in nova:
assignee: Seyeong Kim (xtrusia) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Seyeong Kim (xtrusia)
Changed in nova (Ubuntu):
assignee: Corey Bryant (corey.bryant) → nobody
Changed in nova (Ubuntu):
assignee: nobody → Seyeong Kim (xtrusia)
Seyeong Kim (seyeongkim)
Changed in nova:
assignee: Seyeong Kim (xtrusia) → nobody
Changed in nova (Ubuntu):
assignee: Seyeong Kim (xtrusia) → nobody
James Page (james-page)
Changed in nova (Ubuntu):
status: In Progress → Triaged
Revision history for this message
Paul Peereboom (peereb) wrote :

We've changed our ceph-monitor IPs and unfortunately we're running into this issue. It is fixed on the cinder side, but still broken on the nova side. We could fix it with Walt's patch (comment #9), but there must be more users running into this issue unknowingly.

Revision history for this message
Tyler Stachecki (tstachecki) wrote :

We have also been bitten by this. Apologies if this does not help solve the bug, but this issue has been floating around for quite a while and the following may help future cloud operators...

In our case, we were trying to re-IP ALL of our Ceph Mons. As Corey mentioned, this bug report is for *Cinder volumes*... but note that all of our instances were observed to make use of RBD-backed configuration drives, which suffered the same problem as the images... so you may suffer from both problems even if you exclusively boot all instances from volume!

* RBD config drives AND Glance/image-based RBD volumes DID NOT have their Ceph Mon addresses updated as part of a live-migration, even with the patch in #9. The Ceph Mon addresses for these types of volumes IN PARTICULAR are NOT stored anywhere in a database and rather seem to be derived as needed when certain actions occur and otherwise carted around from hyp to hyp by way of the libvirt domain XML. Again, see the other LP bug for this.

* Trying to 'fix up' the Ceph Mon addresses via 'virsh edit' or comparable and then trying to live-migrate an instance to have those changes reflected is futile, because the Ceph Mon address changes are not reflected until a hard bounce of the VMM for that instance AND nova-compute uses the running copy of libvirt domain XML when shipping a copy to a destination hypervisor, NOT the copy on disk.

What we may end up doing (that worked in a lab environment) is to respin a patch off #9 that is applied to all worker nodes. It searches for all instances of './devices/disk/source' in the XML document which have an 'rbd' protocol. For each entry, we replace the current host subelements with our new Ceph Mon addresses. Then live-migrate every VM exactly once.

This works for all kinds of RBD volumes and, unlike 'virsh edit', works because the in-memory libvirt domain XML is rewritten prior to the VMM starting up on the destination host. Note that while you are doing the LMs and updating the domain XMLs, you must keep at least one of the old and new Ceph Mons accessible at all times.
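
The XML rewrite described above can be sketched with the standard library's ElementTree. This is not the actual patch: the function name `replace_rbd_mon_hosts` is hypothetical, and the sketch assumes every monitor listens on the same port.

```python
import xml.etree.ElementTree as ET


def replace_rbd_mon_hosts(domain_xml, new_mons, port="6789"):
    """Return domain_xml with every rbd disk source pointing at new_mons."""
    root = ET.fromstring(domain_xml)
    for source in root.findall("./devices/disk/source"):
        if source.get("protocol") != "rbd":
            continue  # leave file-backed and other network disks alone
        # findall() returns a list, so removing while looping is safe.
        for host in source.findall("host"):
            source.remove(host)
        for mon in new_mons:
            ET.SubElement(source, "host", {"name": mon, "port": port})
    return ET.tostring(root, encoding="unicode")
```

Because the match is on the source protocol rather than on the pool name, the same pass covers the vms/ disks, config drives and volumes/ disks alike.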

Revision history for this message
Peter (fazy) wrote :

We have the same issue with Rocky.
One of my SQL-wizard colleagues helped me with some queries that change the block_device_mapping table, including the RBD hosts/username/ports (if you change the number of ceph monitors, you'll need it).

Since we have multiple Zones, and our change will only affect Zone1, and since we have iSCSI storage too, we needed a bit more precise query.

Also, my colleague pointed out that connection_info is JSON, and since MariaDB 10.2.3 has support for JSON, he used the JSON functions, just to be sure not to mess up the syntax.

So the three queries (use with caution, and - of course - at your own risk!):

update block_device_mapping as b set connection_info = json_replace(connection_info, '$.data.auth_username', 'dev-r1z1-c4e') where instance_uuid in (select i.uuid from instances as i where i.deleted_at is null and i.availability_zone = 'Zone1') AND JSON_EXISTS(b.connection_info, '$.data.hosts') = 1 and b.deleted_at is NULL;

update block_device_mapping as b set connection_info = json_replace(connection_info, '$.data.hosts', JSON_ARRAY("10.1.58.156", "10.1.58.157", "10.1.58.158")) where instance_uuid in (select i.uuid from instances as i where i.deleted_at is null and i.availability_zone = 'Zone1') AND JSON_EXISTS(b.connection_info, '$.data.hosts') = 1 and b.deleted_at is NULL;

update block_device_mapping as b set connection_info = json_replace(connection_info, '$.data.ports', JSON_ARRAY("6789", "6789", "6789")) where instance_uuid in (select i.uuid from instances as i where i.deleted_at is null and i.availability_zone = 'Zone1') AND JSON_EXISTS(b.connection_info, '$.data.hosts') = 1 and b.deleted_at is NULL;
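
For previewing what the hosts and ports queries above do to a single connection_info value before touching the database, the same change can be sketched in Python. The function name `rewrite_rbd_connection_info` is hypothetical; like the queries, it only touches entries that have a `data.hosts` key, and it assumes all monitors use the same port.

```python
import json


def rewrite_rbd_connection_info(connection_info, new_mons, port="6789"):
    """Return connection_info with data.hosts/data.ports set to new_mons."""
    info = json.loads(connection_info)
    data = info.get("data", {})
    if "hosts" not in data:
        return connection_info  # no data.hosts (e.g. iSCSI): leave untouched
    data["hosts"] = list(new_mons)
    # Keep one port per host so both lists stay the same length.
    data["ports"] = [port] * len(new_mons)
    return json.dumps(info)
```

Other keys, such as auth_username, are carried over unchanged.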
