Shared storage providers are not supported and will break things if used

Bug #1784020 reported by Matt Riedemann on 2018-07-27
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Unassigned

Bug Description

https://review.openstack.org/#/c/560459/ in Rocky changed the libvirt driver such that if the compute node provider is in a shared storage provider aggregate relationship (in the same aggregate with a resource provider that has DISK_GB inventory and the MISC_SHARES_VIA_AGGREGATE trait), the compute node provider won't report DISK_GB inventory.
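
As a rough illustration of the condition described above, here is a self-contained sketch that models providers as plain Python objects. The `Provider` class and the `shares_disk_via_aggregate` and `reported_inventory` helpers are hypothetical names made up for this example; they are not nova or placement code. Only the `DISK_GB` resource class and the `MISC_SHARES_VIA_AGGREGATE` trait come from the report.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Toy stand-in for a placement resource provider (hypothetical)."""
    name: str
    aggregates: set = field(default_factory=set)
    traits: set = field(default_factory=set)
    inventory: dict = field(default_factory=dict)  # resource class -> total


def shares_disk_via_aggregate(compute_node, providers):
    """Return True if another provider in one of the compute node's
    aggregates has DISK_GB inventory and the sharing trait."""
    for rp in providers:
        if rp is compute_node:
            continue
        if not (rp.aggregates & compute_node.aggregates):
            continue
        if "DISK_GB" in rp.inventory and "MISC_SHARES_VIA_AGGREGATE" in rp.traits:
            return True
    return False


def reported_inventory(compute_node, providers):
    """Inventory the compute node would report under the behaviour
    described above: DISK_GB is dropped when a sharing provider exists."""
    inv = dict(compute_node.inventory)
    if shares_disk_via_aggregate(compute_node, providers):
        inv.pop("DISK_GB", None)
    return inv


# Example topology (all names and numbers are invented).
cn = Provider("compute1", {"agg1"}, set(),
              {"VCPU": 8, "MEMORY_MB": 16384, "DISK_GB": 100})
nfs = Provider("nfs-share", {"agg1"}, {"MISC_SHARES_VIA_AGGREGATE"},
               {"DISK_GB": 2000})
lone = Provider("compute2", {"agg2"}, set(), {"VCPU": 8, "DISK_GB": 100})
everyone = [cn, nfs, lone]
```

In this toy topology `compute1` stops reporting `DISK_GB` because it shares `agg1` with `nfs-share`, while `compute2` keeps reporting its local disk.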

There are at least two major issues with this:

1. On upgrade from Queens, any existing allocations against the compute node provider's DISK_GB inventory will not allow removal of the DISK_GB inventory from the compute node provider during the update_available_resource periodic task. In other words, we have no data migration routine in place to move DISK_GB allocations from the compute node provider to the shared storage provider in Rocky.

2. During a move operation, we move the instance's allocations from the source compute node provider to the migration record, then go through the scheduler to pick a dest host for the instance and allocate resources against the dest host (and optionally shared storage provider). So:

a) The DISK_GB allocation from the instance to the shared storage provider is deleted for a short window of time during scheduling until we pick a dest host.

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/migrate.py#L57

b) If cold migrate fails or is reverted, we delete the allocations (created by the scheduler) and move the allocations from the migration record (against the source node provider) back to the instance, but because we failed to move the DISK_GB allocation against the sharing provider for the instance to the migration record, we've lost that DISK_GB allocation when copying it back to the instance on revert/failure:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L4155
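
To make the sequence in 2a and 2b concrete, here is a toy simulation. It assumes allocations can be modelled as a nested dict of consumer, then provider, then resource class; the function names, provider names, and amounts are all hypothetical and this is not how nova or placement actually store or move allocations.

```python
import copy

# consumer -> provider -> resource class -> amount (toy model only)
allocations = {
    "instance": {
        "source-node": {"VCPU": 2, "MEMORY_MB": 4096},
        "shared-storage": {"DISK_GB": 20},
    },
}


def move_to_migration(allocs, source_node):
    """Move only the source node's allocations from the instance to the
    migration record, mirroring the behaviour described in the report."""
    instance_allocs = allocs.pop("instance")
    allocs["migration"] = {source_node: instance_allocs[source_node]}
    # Allocations against any other provider (here, the sharing provider)
    # are not carried over and are lost at this point.


def schedule(allocs, dest_node, sharing_provider, disk_gb):
    """The scheduler claims resources for the instance on the destination
    node and, optionally, on the shared storage provider."""
    allocs["instance"] = {
        dest_node: {"VCPU": 2, "MEMORY_MB": 4096},
        sharing_provider: {"DISK_GB": disk_gb},
    }


def revert(allocs):
    """Delete the scheduler-created allocations and move the migration
    record's allocations back to the instance."""
    allocs.pop("instance", None)
    allocs["instance"] = allocs.pop("migration")


before = copy.deepcopy(allocations["instance"])
move_to_migration(allocations, "source-node")
in_window = copy.deepcopy(allocations)  # state during scheduling, case (a)
schedule(allocations, "dest-node", "shared-storage", 20)
revert(allocations)
after = allocations["instance"]  # state after revert, case (b)
```

Comparing `before` and `after` shows the instance ending up with only its source node allocations: the `DISK_GB` entry against the sharing provider never made it into the migration record, so there is nothing to copy back.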

--

Forced live migrate and evacuate could also be a problem: both bypass the scheduler altogether, so we're potentially not handling shared provider allocations there either.

Forced live migrate:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/live_migrate.py#L109

Evacuate:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L868

Also, we don't have *any* shared storage provider CI jobs set up. A start to that is here:

https://review.openstack.org/#/c/586363/

But that's just a single-node job at the moment and we'd need a multi-node shared storage CI job to really say we support shared storage providers as a feature in nova.

Reviewed: https://review.openstack.org/586614
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a451686263c655c79158a0c2c96c10a4d323ab18
Submitter: Zuul
Branch: master

commit a451686263c655c79158a0c2c96c10a4d323ab18
Author: Eric Fried <email address hidden>
Date: Fri Jul 27 11:20:35 2018 -0500

    libvirt: Revert non-reporting DISK_GB if sharing

    Change Iea283322124cb35fc0bc6d25f35548621e8c8c2f went part way toward
    implementing proper accounting for sharing providers, but the gaps it
    left are untenable - see the associated bug report.

    This removes the main functional piece of that change with TODOs to
    reinstate it once the broader issues have been resolved.

    Change-Id: I245a16315f97d0c2ca69c6ca9727a55a8ceb75ab
    Related-Bug: #1784020

melanie witt (melwitt) wrote :

I think we can consider this bug "things will break" resolved now that the patch landed to disable the bit that makes shared storage providers affect allocations. The work to finish proper support for shared storage providers will be tracked on its blueprint.

Changed in nova:
status: Triaged → Fix Released

Matt Riedemann (mriedem) wrote :

With the is_bfv fixes and the disk usage reporting fixed in Rocky, this is likely less important.

Changed in nova:
importance: High → Medium

Matt Riedemann (mriedem) wrote :

Another use case for shared storage providers is noted in this spec:

https://review.openstack.org/#/c/551927/

The aim of that spec is to avoid the need for ssh access across compute hosts during a resize if those hosts are using the same shared storage, which we could model with shared storage resource providers in placement.
