Nova does not track shared ceph pools across multiple nodes

Bug #1908133 reported by Rodrigo Barbieri
Affects: OpenStack Compute (nova) | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Environment:
- tested in focal-victoria and bionic-stein

======================

Steps to reproduce:
1) Deploy OpenStack having 2 nova-compute nodes
2) Configure both compute nodes to have a RBD backend pointing to the same pool in ceph as below:

[libvirt]
images_type = rbd
images_rbd_pool = nova

3) run "openstack hypervisor show" on each node. Both will show the full pool capacity:

local_gb | 29
local_gb_used | 0
free_disk_gb | 29
disk_available_least | 15

4) create a 20GB instance and run "openstack hypervisor show" again on the node where it landed:

local_gb | 29
local_gb_used | 20
free_disk_gb | 9
disk_available_least | 15

5) create another 20GB instance. It will land on the other hypervisor.
6) try to create a third 20GB instance; it will fail because placement will not return an allocation candidate. This is correct.
7) Now ssh into both instances and fill their disks (in fact, since disk_available_least is read from "ceph df", filling only one of them may be enough)
8) I/O for all instances will freeze as the ceph pool runs out of space, and the nova-compute service hangs in "create_image" whenever a new instance creation is attempted there, causing the service to be reported as "down".
9) disk_available_least will be updated to 0, but that doesn't prevent new instances from being scheduled.

This is the first problem: both compute nodes track "free_disk_gb" and "local_gb_used" independently of the shared ceph pool, and the scheduler does not use "disk_available_least" to prevent over-commit while disk_allocation_ratio is 1.0 (live-migration does use it appropriately, though).
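To make the double-counting concrete, here is a toy illustration (not Nova or placement code; the numbers match the reproduction above, the scheduling logic is a deliberately simplified sketch):

```python
# Toy model of why two compute nodes sharing one ceph pool over-commit:
# each node reports the whole pool as its own local_gb, so per-node
# accounting believes total disk is 2x the real pool size.

POOL_SIZE_GB = 30  # the single shared ceph pool

nodes = {
    "compute1": {"local_gb": POOL_SIZE_GB, "local_gb_used": 0},
    "compute2": {"local_gb": POOL_SIZE_GB, "local_gb_used": 0},
}

def schedulable(nodes, request_gb, allocation_ratio=1.0):
    """Return the first node whose own accounting still has room."""
    for name, n in nodes.items():
        if n["local_gb_used"] + request_gb <= n["local_gb"] * allocation_ratio:
            return name
    return None

# Two 20 GB instances "fit" (one per node), even though the real
# backing pool only holds 30 GB in total.
for _ in range(2):
    target = schedulable(nodes, 20)
    assert target is not None
    nodes[target]["local_gb_used"] += 20

real_usage = sum(n["local_gb_used"] for n in nodes.values())
print(real_usage)  # 40 GB placed against a 30 GB pool
```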

Alternatively (as a possible solution/fix/workaround), I followed the steps in [0] and [1] to make placement a centralized accounting point for the shared ceph pool. I ran the following steps:

10) openstack resource provider create ceph_nova_pool

11) openstack resource provider inventory set --os-placement-api-version 1.19 --resource DISK_GB=30 <ceph_nova_pool_uuid>

12) openstack resource provider trait set --os-placement-api-version 1.19 <ceph_nova_pool_uuid> --trait MISC_SHARES_VIA_AGGREGATE

13) openstack resource provider aggregate set <ceph_nova_pool_uuid> --aggregate <resource_provider1_uuid> --aggregate <resource_provider2_uuid> --generation 2 --os-placement-api-version 1.19

14) Deleted all instances and repeated steps 4, 5 and 6, but with the same result

15) openstack resource provider set --name <resource_provider1_name> --parent-provider <ceph_nova_pool_uuid> <resource_provider1_uuid> --os-placement-api-version 1.19

16) openstack resource provider set --name <resource_provider2_name> --parent-provider <ceph_nova_pool_uuid> <resource_provider2_uuid> --os-placement-api-version 1.19

17) Deleted all instances and repeated steps 4, 5 and 6. Now I was able to create 3 instances, one of which had allocations from the ceph_nova_pool resource provider. The created resource provider is being treated as an "extra" resource provider, rather than replacing the per-compute disk inventory.

18) Deleted 2 instances that had allocations from the compute nodes

19) openstack resource provider inventory delete <resource_provider1_uuid> --resource-class DISK_GB

20) openstack resource provider inventory delete <resource_provider2_uuid> --resource-class DISK_GB

21) watch openstack allocation candidate list --resource DISK_GB=20 --os-placement-api-version 1.19

Now the list is empty, until nova-compute's periodic task updates the inventory with its local_gb value again and we return to the state at step 17.

======================

Expected result:
- For the first approach, scheduling is expected to take the disk_available_least value into account (together with disk_allocation_ratio) so that instances cannot be created when there is no space left.
- For the second approach, there is expected to be a way to prevent nova-compute from periodically overwriting a specific inventory, or to guarantee that its inventory is shared with another resource provider instead of being treated as an "extra" one.

[0] https://github.com/openstack/placement/blob/c02a073c523d363d7136677ab12884dc4ec03e6f/placement/objects/research_context.py#L1107
[1] https://docs.openstack.org/placement/latest/user/provider-tree.html

Tags: sts
Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

This is a well-known issue. Closing as a duplicate.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Update on this, incorporating feedback from the mailing list [0]:

From the instructions and link provided by Sean Mooney [1], I realized my previous instructions in the description are partially incorrect.

I tested in Queens, Rocky and Stein, and was able to get it to work on Rocky+. There is no need for parent providers or a tree hierarchy (available only in Stein+). The most important thing is to set up the aggregates correctly. For this specific scenario, it is necessary to create one aggregate per compute resource provider, containing the compute resource provider itself and the ceph nova pool shared resource provider. In the end, the shared resource provider will be in every aggregate, each alongside one compute resource provider.

The workaround that worked best for me was to create a fake allocation of the same size as the pool against each compute resource provider (command #4 below) to cancel out the disk inventory the computes report; then only the shared pool inventory is consumed. The alternative "reserved_host_disk_mb" approach requires setting it to (total_size - 1) and restarting the compute service, so it is less ideal.
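A toy model of why the fake allocation works (again not placement code; a simplified sketch using the same sizes as the reproduction above):

```python
# Each compute's reported DISK_GB is cancelled by an equal fake
# allocation, so real requests can only be satisfied by the shared
# ceph_nova_pool resource provider.

providers = {
    "compute1": {"total": 29, "used": 29},        # 29 GB fake allocation
    "compute2": {"total": 29, "used": 29},        # 29 GB fake allocation
    "ceph_nova_pool": {"total": 30, "used": 0},   # shared pool inventory
}

def claim(request_gb: int) -> bool:
    """Claim disk from the first provider with enough free capacity."""
    for rp in providers.values():
        if rp["total"] - rp["used"] >= request_gb:
            rp["used"] += request_gb
            return True
    return False

print(claim(20))  # True: 20 GB claimed from the shared pool
print(claim(20))  # False: only 10 GB left, so the request is refused
```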

The caveats are that this approach does not account for the ceph overhead (meaning you cannot really use 30GB of disks if your ceph total size is 30GB), and most values (inventory, fake allocations) will likely need to be readjusted whenever the ceph backend size changes. Alternatively, adjusting the shared resource provider's allocation ratio can compensate for the ceph overhead and/or size changes. This could be tied to a cron job that monitors ceph status, especially to handle OSDs going down and reducing the total size.
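As a rough sketch of such a cron job, something like the following could derive a new DISK_GB figure from "ceph df --format json" output (the exact JSON layout, the pool name, and the 10% headroom factor are assumptions for illustration, not values from this report):

```python
# Hypothetical helper for a cron job: compute the DISK_GB inventory to
# set on the shared resource provider from `ceph df --format json`
# output, holding back some headroom for ceph overhead.
import json

HEADROOM = 0.9  # keep 10% back for ceph overhead (assumed policy)

def shared_disk_gb(ceph_df_json: str, pool_name: str = "nova") -> int:
    """Derive the shared-RP inventory from ceph df JSON output."""
    data = json.loads(ceph_df_json)
    for pool in data["pools"]:
        if pool["name"] == pool_name:
            # max_avail is reported in bytes; convert to whole GB.
            max_avail_gb = pool["stats"]["max_avail"] // (1024 ** 3)
            return int(max_avail_gb * HEADROOM)
    raise ValueError(f"pool {pool_name!r} not found")

# Example with canned output instead of shelling out to ceph:
sample = json.dumps({
    "pools": [
        {"name": "nova", "stats": {"max_avail": 30 * 1024 ** 3}},
    ]
})
print(shared_disk_gb(sample))  # 27 with the assumed 10% headroom
```

The resulting number could then be applied with "openstack resource provider inventory set --resource DISK_GB=<n> <ceph_nova_pool_uuid>" (command #2 below).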

Here is the updated list of commands:

1) openstack resource provider create ceph_nova_pool

2) openstack resource provider inventory set --resource DISK_GB=30 <ceph_nova_pool_uuid> --os-placement-api-version 1.28

3) openstack resource provider list

4) openstack resource provider allocation set --allocation rp=<resource_provider_uuid>,<resource_class_name>=<amount> [--allocation ...] <ceph_nova_pool_uuid>

5) openstack resource provider aggregate set <resource_provider1_uuid> --aggregate <resource_provider1_uuid> --os-placement-api-version 1.28 --generation <rp1_generation_#>

6) openstack resource provider aggregate set <ceph_nova_pool_uuid> --aggregate <resource_provider1_uuid> [--aggregate ...] --os-placement-api-version 1.28 --generation <ceph_nova_pool_uuid_generation_#>

Still not the ideal solution, but it makes the problem much more manageable, at least on Rocky+.

[0] http://lists.openstack.org/pipermail/openstack-discuss/2021-January/019646.html

[1] https://github.com/openstack/placement/blob/master/placement/tests/functional/gabbits/shared-resources.yaml
