The value of allocated_capacity_gb is incorrect when there is more than one replica of cinder-volume and all of them are configured with the same backend

Bug #1927186 reported by jiaohaolin
This bug affects 7 people
Affects: Cinder
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

The value of allocated_capacity_gb is incorrect when more than one replica of cinder-volume is running and all of them are configured with the same backend.

We set up more than one cinder-volume service, all of them using the same config file and the same storage backend.
When all cinder-volume services complete initialization, the value of 'allocated_capacity_gb' returned by 'cinder get-pools --detail' is correct. But once any volume has been created or deleted, the value of 'allocated_capacity_gb' returned by 'cinder get-pools --detail' becomes incorrect, and it changes several times in quick succession.

The reason for this behaviour is that each cinder-volume service maintains its own allocated_capacity_gb value. The scheduler distributes requests across the cinder-volume services, and the service that receives a request updates its own allocated_capacity_gb but does not synchronize the value to the other cinder-volume services. So the allocated_capacity_gb reported by the different cinder-volume services diverges, and the value returned by 'cinder get-pools --detail' is incorrect. 'cinder get-pools --detail' can even return a negative number when one or a few cinder-volume services keep receiving the delete requests, so the allocated_capacity_gb of those services keeps being decremented by the sizes of the deleted volumes.
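
A minimal toy sketch (plain Python, not actual Cinder code; the names are made up) of how independent per-service counters drift apart and can even go negative when one service absorbs most of the delete requests:

    # Toy model, not Cinder code: each cinder-volume service keeps its own
    # in-memory counter that is never synchronized with the others.
    class VolumeService:
        def __init__(self, name):
            self.name = name
            self.allocated_capacity_gb = 0   # local value only

        def create_volume(self, size_gb):
            self.allocated_capacity_gb += size_gb

        def delete_volume(self, size_gb):
            self.allocated_capacity_gb -= size_gb

    a, b, c = VolumeService("A"), VolumeService("B"), VolumeService("C")

    # The scheduler spreads 30 create requests of 10 GB: 15 to A, 10 to B, 5 to C.
    for _ in range(15): a.create_volume(10)
    for _ in range(10): b.create_volume(10)
    for _ in range(5):  c.create_volume(10)

    # Later, 20 delete requests of 10 GB happen to land mostly on C.
    for _ in range(2):  a.delete_volume(10)
    for _ in range(3):  b.delete_volume(10)
    for _ in range(15): c.delete_volume(10)

    # Real usage on the backend is (30 - 20) * 10 = 100 GB, but each service
    # reports only its own number; C ends up negative.
    for s in (a, b, c):
        print(s.name, s.allocated_capacity_gb)   # A 130, B 70, C -100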

Tags: volume
Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hello jiaohaolin, hope this message finds you well.
Would you mind clarifying:
- cinder version / release you're using?
- the backend you are using?
- are you using multipath?
Cheers
Sofia

tags: added: volume
Changed in cinder:
importance: Undecided → Low
status: New → Incomplete
Revision history for this message
jiaohaolin (jiaohaolin) wrote :

Hi Sofia,

The version we used is 'rocky', and the backend we used is 'inspur-instorage-iscsi' with multipath.

The cause of this problem has nothing to do with the backend we use or with multipath:
when we create a bunch of volumes, the requests are scheduled to different cinder-volume services, and allocated_capacity_gb becomes incorrect because each cinder-volume service maintains its own allocated_capacity_gb.

For example:
We set up three cinder-volume services named A/B/C, and all three services have the same config. Then we create 100 volumes of 10GB. Perhaps 50 of the 100 create requests are scheduled to A, 30 to B and 20 to C, so their allocated_capacity_gb values would be 500GB, 300GB and 200GB respectively. The value returned by 'cinder get-pools --detail' then keeps fluctuating, depending on which cinder-volume service reported to the cinder-scheduler most recently.
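
A quick numerical sketch of that split (illustrative Python, not Cinder code): which of the three values 'cinder get-pools --detail' shows at any given moment depends only on which service reported to the scheduler last.

    # 100 creates of 10 GB, scheduled 50/30/20 across services A/B/C.
    creates = {"A": 50, "B": 30, "C": 20}
    local_allocated = {name: count * 10 for name, count in creates.items()}
    print(local_allocated)                # {'A': 500, 'B': 300, 'C': 200}
    print(sum(local_allocated.values()))  # 1000 GB actually allocated on the backend

    # The scheduler only keeps the most recent report per backend/pool,
    # so get-pools shows 500, 300 or 200 depending on who reported last.
    for last_reporter in ("A", "B", "C"):
        print("get-pools would show:", local_allocated[last_reporter])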

Revision history for this message
Gorka Eguileor (gorka) wrote :

I believe this is probably a duplicate of the old local scheduler data issue and is not related to an active/active deployment.
This was discussed in the PTG.

This issue will happen even if you have a single cinder-volume, as long as you have multiple cinder-scheduler services running. Moreover, if you create a volume and then make N get-pools requests while there are no additional cinder requests (where N is the number of schedulers), you will most likely see that you don't get the same values.

Revision history for this message
Ilya Popov (ilya-p) wrote (last edit ):

Looks like there are two possible options to change in the code (a rough sketch of option 1 follows after this list):

1. Put the shared values into a common key-value store like Redis. On start, cinder-volume will check for a value there (if there is one, it reads it from Redis; if there is no value, it recalculates and puts it there).

2. Listen to RPC messages (something like notifications) and change the internal values accordingly. This is the less attractive option, as we only get something like diffs of the values (e.g. we see that a volume was created and have to increase the local value). This calculation will be less accurate.

Or we have to store the value in the DB.
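
A rough sketch of option 1, assuming a Redis key per pool and using atomic increments so concurrent replicas cannot lose updates (all function and key names here are hypothetical; nothing like this exists in Cinder today):

    # Hypothetical sketch: shared allocated_capacity_gb per pool kept in Redis.
    # Not actual Cinder code; helper and key names are made up.
    import redis

    r = redis.Redis(host="controller", port=6379)

    def init_pool_counter(pool, recalculated_gb):
        # Only the first cinder-volume replica to start seeds the value;
        # later replicas see the existing key and reuse it (option 1 above).
        r.set(f"allocated_capacity_gb:{pool}", int(recalculated_gb), nx=True)

    def on_volume_created(pool, size_gb):
        # Atomic increment shared by all replicas.
        return r.incrby(f"allocated_capacity_gb:{pool}", size_gb)

    def on_volume_deleted(pool, size_gb):
        return r.decrby(f"allocated_capacity_gb:{pool}", size_gb)

    def current_allocated_gb(pool):
        value = r.get(f"allocated_capacity_gb:{pool}")
        return int(value) if value is not None else 0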

Changed in cinder:
status: Incomplete → New
Changed in cinder:
status: New → Confirmed
Revision history for this message
Ilya Popov (ilya-p) wrote :

Well, Rajat asked me at the Cinder team meeting where in the cinder-volume source code we calculate allocated_capacity_gb:
https://meetings.opendev.org/meetings/cinder/2023/cinder.2023-06-14-14.00.log.html

So there are three places:

1. On cinder-volume startup:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L403

2. When we destroy a volume:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L1074

3. During the volume creation process:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L759 when calling _update_allocated_capacity
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L3717

So each cinder-volume instance has its own local value of allocated_capacity_gb for each pool it serves.
When a cinder-volume instance starts, it recalculates allocated_capacity_gb for each pool it serves based on the volumes in that pool.
Each time an instance of cinder-volume gets a task to create a volume, it increases its local value.
When a cinder-volume instance fetches a task to delete a volume, it decreases this value.
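
A heavily simplified sketch of the behaviour described above (illustrative only; the real logic lives in cinder/volume/manager.py and tracks a full stats dict per pool, not a single number):

    # Illustrative only: mirrors the three code paths described above,
    # not the actual VolumeManager implementation.
    class ToyVolumeManager:
        def __init__(self, db_volumes_by_pool):
            # 1. On startup: recalculate from the volumes that exist in the DB.
            self.allocated_capacity_gb = {
                pool: sum(v["size"] for v in volumes)
                for pool, volumes in db_volumes_by_pool.items()
            }

        def create_volume(self, pool, size_gb):
            # 3. On volume creation: bump the local counter for that pool.
            self.allocated_capacity_gb[pool] = (
                self.allocated_capacity_gb.get(pool, 0) + size_gb
            )

        def delete_volume(self, pool, size_gb):
            # 2. On volume deletion: decrement the local counter.
            self.allocated_capacity_gb[pool] -= size_gb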

This works more or less well for the independent cinder-volume deployment case, because in that case we have one pool for each instance of cinder-volume.

When we have an Active-Active cinder-volume setup, we have only ONE pool with allocated_capacity_gb, and each instance of cinder-volume reports its own local (and different for each instance) value to the scheduler. If the first instance of cinder-volume reports 1, you will see 1 in allocated_capacity_gb (cinder get-pools --detail) until the next report from the second cinder-volume, which reports 2. Just after the scheduler receives 2, you will see 2 in allocated_capacity_gb (cinder get-pools --detail). When the scheduler gets the next report from the first instance of cinder-volume, it will show 1 again (until the next report from the second instance of cinder-volume, which reports 2), and so on.

Here is a case from my lab:

3 instances of cinder-volume in one cluster with one (and the same) Ceph backend, so these cinder-volume services are all in the same cluster.
I created 200 volumes of 50GB each and then deleted one volume. The total allocated capacity should be 9950GB. The volume-creation tasks were spread across the cinder-volume instances, roughly 200/3 each.

2023-06-14 18:45:59.240 7 DEBUG cinder.scheduler.host_manager [req-bcbf17da-59b6-44a8-9985-c4337aef53f5 - - - - -] Received volume service update from Cluster: os_lab@ceph_hdd - Host: os_lab-vct02@ceph_hdd: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 125821.54, 'free_capacity_gb': 125804.88, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:0252f788-fb05-11ec-bf1d-0117d320bc05:cl1ceph1_os_lab_cinder:cl1ceph1_os_lab_volumes', 'backend_state': 'up', 'volume_backend_name': 'cinder_ceph_hdd', 'replication_enabled': False, 'allocated_capacity_gb': 3350, 'filter_function': None, 'goodness_function': None}Cluster: os_lab@ceph_hdd - Host: update_service_capabilities /var/lib/kolla/venv/lib/python3.8/site-packages/cinder/scheduler/host_manager.py:575
20...


Revision history for this message
Bartosz Bezak (bbezak) wrote :

I'm wondering about the real impact of this miscalculation of allocated_capacity_gb. By the look of it, the scheduler will incorrectly schedule volumes to the less busy cinder-volume service. However, free_capacity_gb will still be reported correctly, so there is no risk of overcommitting the backend. Therefore, for a setup with one backend managed by multiple HA cinder-volume services, this is not a huge issue?

Furthermore, quota management is probably not impacted here either.

Revision history for this message
Ilya Popov (ilya-p) wrote :

Not exactly, not in all cases.

For example, the cinder RBD driver doesn't report provisioned_capacity_gb:

    2022-11-26 17:23:15.746 8 DEBUG cinder.scheduler.host_manager [req-f39ed266-c6c4-415b-b5a8-2ec2170c5fc4 - - - - -] Received volume service update from compute0.ipo-region@rbd-1:
    {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 27.24, 'free_capacity_gb': 27.23, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True,
    'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:587501de-69b3-11ed-bdd6-dd57b05661dd:cinder:volumes', 'backend_state': 'up', 'volume_backend_name': 'rbd-1', 'replication_enabled': False,
    'allocated_capacity_gb': 0, 'filter_function': None, 'goodness_function': None} update_service_capabilities /var/lib/kolla/venv/lib/python3.8/site-packages/cinder/scheduler/host_manager.py:575
    2022-11-26 17:23:15.752 8 DEBUG cinder.scheduler.host_manager [req-f39ed266-c6c4-415b-b5a8-2ec2170c5fc4 - - - - -]

In this case the host manager will set provisioned_capacity_gb based on allocated_capacity_gb:

https://github.com/openstack/cinder/blob/master/cinder/scheduler/host_manager.py#L434

And, finally, provisioned_capacity_gb is used in capacity filter:

https://github.com/openstack/cinder/blob/master/cinder/scheduler/filters/capacity_filter.py#L148

So, if you have many thin volumes that don't hold much data, free_capacity_gb will still look sufficient to deploy additional volumes, but the actual oversubscription will be three times higher than what the filter calculates. As a result, we will have three times more thin volumes than planned, so we lose oversubscription control. Over time, thin volumes on Ceph become thick, and we will exhaust Ceph much more easily than with a correct calculation of provisioned_capacity_gb (based on allocated_capacity_gb).
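
A hedged sketch of that effect, loosely modelled on the thin-provisioning check in the capacity filter linked above (the numbers and the simplified formula are illustrative, not the exact filter code):

    # Simplified version of the thin-provisioning check, for illustration only.
    def passes_capacity_filter(requested_gb, total_gb, provisioned_gb,
                               max_over_subscription_ratio):
        # Roughly: the provisioned (not physically used) capacity must stay
        # under total * max_over_subscription_ratio.
        return provisioned_gb + requested_gb <= total_gb * max_over_subscription_ratio

    total_gb = 100
    ratio = 20.0            # allows up to 2000 GB of thin provisioning

    # Real provisioned capacity: 1800 GB. With three cinder-volume replicas
    # each reporting only its own ~1/3 share (600 GB), the scheduler believes
    # there is plenty of headroom and keeps admitting volumes.
    print(passes_capacity_filter(300, total_gb, 1800, ratio))  # False (correct view)
    print(passes_capacity_filter(300, total_gb, 600, ratio))   # True  (what the scheduler sees)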
