The value of allocated_capacity_gb is incorrect when there is more than one replica of cinder-volume and all of them are configured with the same backend

Bug #1927186 reported by jiaohaolin
This bug affects 7 people
Affects: Cinder
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

The value of allocated_capacity_gb is incorrect when more than one replica of cinder-volume is running and all of them are configured with the same backend.

We set up more than one cinder-volume service, all of them using the same config file and the same storage backend.
When all cinder-volume services complete initialization, the value of 'allocated_capacity_gb' returned by 'cinder get-pools --detail' is correct. But once any volume has been created or deleted, the value of 'allocated_capacity_gb' returned by 'cinder get-pools --detail' becomes incorrect, and it changes several times in quick succession.

The reason for this behaviour is that each cinder-volume service maintains its own allocated_capacity_gb value. The scheduler distributes requests across the cinder-volume services, and the service that receives a request updates its own allocated_capacity_gb but does not synchronize the value to the other cinder-volume services. So the allocated_capacity_gb reported by the different cinder-volume services diverges, and the value returned by 'cinder get-pools --detail' is incorrect. 'cinder get-pools --detail' can even return a negative number when one or a few cinder-volume services keep receiving the delete requests, so the allocated_capacity_gb of those services keeps being decremented by the sizes of the deleted volumes.
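
A minimal toy sketch (plain Python, not actual Cinder code; the names are made up) of how independent per-service counters drift apart and can even go negative when one service absorbs most of the delete requests:

    # Toy model, not Cinder code: each cinder-volume service keeps its own
    # in-memory counter that is never synchronized with the others.
    class VolumeService:
        def __init__(self, name):
            self.name = name
            self.allocated_capacity_gb = 0   # local value only

        def create_volume(self, size_gb):
            self.allocated_capacity_gb += size_gb

        def delete_volume(self, size_gb):
            self.allocated_capacity_gb -= size_gb

    a, b, c = VolumeService("A"), VolumeService("B"), VolumeService("C")

    # The scheduler spreads 30 create requests of 10 GB: 15 to A, 10 to B, 5 to C.
    for _ in range(15): a.create_volume(10)
    for _ in range(10): b.create_volume(10)
    for _ in range(5):  c.create_volume(10)

    # Later, 20 delete requests of 10 GB happen to land mostly on C.
    for _ in range(2):  a.delete_volume(10)
    for _ in range(3):  b.delete_volume(10)
    for _ in range(15): c.delete_volume(10)

    # Real usage on the backend is (30 - 20) * 10 = 100 GB, but each service
    # reports only its own number; C ends up negative.
    for s in (a, b, c):
        print(s.name, s.allocated_capacity_gb)   # A 130, B 70, C -100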

Tags: volume
Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hello jiaohaolin, hope this message finds you well.
Would you mind clarifying:
- cinder version / release you're using?
- the backend you are using?
- are you using multipath?
Cheers
Sofia

tags: added: volume
Changed in cinder:
importance: Undecided → Low
status: New → Incomplete
Revision history for this message
jiaohaolin (jiaohaolin) wrote :

Hi Sofia,

The version we used is 'rocky', and the backend we used is 'inspur-instorage-iscsi' with multipath.

The cause of this problem has nothing to do with the backend we use or with multipath:
when we create a bunch of volumes, the requests are scheduled to different cinder-volume services, and allocated_capacity_gb becomes incorrect because each cinder-volume service maintains its own allocated_capacity_gb.

For example:
We set up three cinder-volume services named A/B/C, and all three services have the same config. Then we create 100 volumes of 10GB. Perhaps 50 of the 100 create requests are scheduled to A, 30 to B and 20 to C, so their allocated_capacity_gb values would be 500GB, 300GB and 200GB respectively. The value returned by 'cinder get-pools --detail' then keeps fluctuating, depending on which cinder-volume service reported to the cinder-scheduler most recently.
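
A quick numerical sketch of that split (illustrative Python, not Cinder code): which of the three values 'cinder get-pools --detail' shows at any given moment depends only on which service reported to the scheduler last.

    # 100 creates of 10 GB, scheduled 50/30/20 across services A/B/C.
    creates = {"A": 50, "B": 30, "C": 20}
    local_allocated = {name: count * 10 for name, count in creates.items()}
    print(local_allocated)                # {'A': 500, 'B': 300, 'C': 200}
    print(sum(local_allocated.values()))  # 1000 GB actually allocated on the backend

    # The scheduler only keeps the most recent report per backend/pool,
    # so get-pools shows 500, 300 or 200 depending on who reported last.
    for last_reporter in ("A", "B", "C"):
        print("get-pools would show:", local_allocated[last_reporter])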

Revision history for this message
Gorka Eguileor (gorka) wrote :

I believe this is probably a duplicate of the old local scheduler data issue and is not related to an active/active deployment.
This was discussed in the PTG.

This issue will happen even if you have a single cinder-volume, as long as you have multiple cinder-scheduler services running. Moreover, if you create a volume and then make N get-pools requests while there are no additional cinder requests (where N is the number of schedulers), you will most likely see that you don't get the same values.

Revision history for this message
Ilya Popov (ilya-p) wrote (last edit ):

Looks like there are two possible options to change in the code (a rough sketch of option 1 follows after this list):

1. Put the shared values into a common key-value store like Redis. On start, cinder-volume will check for a value there (if there is one, it reads it from Redis; if there is no value, it recalculates and puts it there).

2. Listen to RPC messages (something like notifications) and change the internal values accordingly. This is the less attractive option, as we only get something like diffs of the values (e.g. we see that a volume was created and have to increase the local value). This calculation will be less accurate.

Or we have to store the value in the DB.
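
A rough sketch of option 1, assuming a Redis key per pool and using atomic increments so concurrent replicas cannot lose updates (all function and key names here are hypothetical; nothing like this exists in Cinder today):

    # Hypothetical sketch: shared allocated_capacity_gb per pool kept in Redis.
    # Not actual Cinder code; helper and key names are made up.
    import redis

    r = redis.Redis(host="controller", port=6379)

    def init_pool_counter(pool, recalculated_gb):
        # Only the first cinder-volume replica to start seeds the value;
        # later replicas see the existing key and reuse it (option 1 above).
        r.set(f"allocated_capacity_gb:{pool}", int(recalculated_gb), nx=True)

    def on_volume_created(pool, size_gb):
        # Atomic increment shared by all replicas.
        return r.incrby(f"allocated_capacity_gb:{pool}", size_gb)

    def on_volume_deleted(pool, size_gb):
        return r.decrby(f"allocated_capacity_gb:{pool}", size_gb)

    def current_allocated_gb(pool):
        value = r.get(f"allocated_capacity_gb:{pool}")
        return int(value) if value is not None else 0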

Changed in cinder:
status: Incomplete → New
Changed in cinder:
status: New → Confirmed
Revision history for this message
Ilya Popov (ilya-p) wrote :

Well, Rajat asked me at the Cinder team meeting where in the cinder-volume source code we calculate allocated_capacity_gb:
https://meetings.opendev.org/meetings/cinder/2023/cinder.2023-06-14-14.00.log.html

So there are three places:

1. On cinder-volume startup:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L403

2. When we destroy a volume:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L1074

3. During the volume creation process:
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L759 when calling _update_allocated_capacity
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L3717

So each cinder-volume instance has its own local value of allocated_capacity_gb for each pool it serves.
When a cinder-volume instance starts, it recalculates allocated_capacity_gb for each pool it serves based on the volumes in that pool.
Each time an instance of cinder-volume gets a task to create a volume, it increases its local value.
When a cinder-volume instance fetches a task to delete a volume, it decreases this value.
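
A heavily simplified sketch of the behaviour described above (illustrative only; the real logic lives in cinder/volume/manager.py and tracks a full stats dict per pool, not a single number):

    # Illustrative only: mirrors the three code paths described above,
    # not the actual VolumeManager implementation.
    class ToyVolumeManager:
        def __init__(self, db_volumes_by_pool):
            # 1. On startup: recalculate from the volumes that exist in the DB.
            self.allocated_capacity_gb = {
                pool: sum(v["size"] for v in volumes)
                for pool, volumes in db_volumes_by_pool.items()
            }

        def create_volume(self, pool, size_gb):
            # 3. On volume creation: bump the local counter for that pool.
            self.allocated_capacity_gb[pool] = (
                self.allocated_capacity_gb.get(pool, 0) + size_gb
            )

        def delete_volume(self, pool, size_gb):
            # 2. On volume deletion: decrement the local counter.
            self.allocated_capacity_gb[pool] -= size_gb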

This works more or less well for the independent cinder-volume deployment case, because in that case we have one pool for each instance of cinder-volume.

When we have an Active-Active cinder-volume setup, we have only ONE pool with allocated_capacity_gb, and each instance of cinder-volume reports its own local (and different for each instance) value to the scheduler. If the first instance of cinder-volume reports 1, you will see 1 in allocated_capacity_gb (cinder get-pools --detail) until the next report from the second cinder-volume, which reports 2. Just after the scheduler receives 2, you will see 2 in allocated_capacity_gb (cinder get-pools --detail). When the scheduler gets the next report from the first instance of cinder-volume, it will show 1 again (until the next report from the second instance of cinder-volume, which reports 2), and so on.

Here is a case from my lab:

3 instances of cinder-volume in one cluster with one (and the same) Ceph backend, so these cinder-volume services are all in the same cluster.
I created 200 volumes of 50GB each and then deleted one volume. The total allocated capacity should be 9950GB. The volume-creation tasks were spread across the cinder-volume instances, roughly 200/3 each.

2023-06-14 18:45:59.240 7 DEBUG cinder.scheduler.host_manager [req-bcbf17da-59b6-44a8-9985-c4337aef53f5 - - - - -] Received volume service update from Cluster: os_lab@ceph_hdd - Host: os_lab-vct02@ceph_hdd: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 125821.54, 'free_capacity_gb': 125804.88, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:0252f788-fb05-11ec-bf1d-0117d320bc05:cl1ceph1_os_lab_cinder:cl1ceph1_os_lab_volumes', 'backend_state': 'up', 'volume_backend_name': 'cinder_ceph_hdd', 'replication_enabled': False, 'allocated_capacity_gb': 3350, 'filter_function': None, 'goodness_function': None}Cluster: os_lab@ceph_hdd - Host: update_service_capabilities /var/lib/kolla/venv/lib/python3.8/site-packages/cinder/scheduler/host_manager.py:575
20...


Revision history for this message
Bartosz Bezak (bbezak) wrote :

I'm wondering about the real impact of this miscalculation of allocated_capacity_gb. By the look of it, the scheduler will incorrectly schedule volumes to the less busy cinder-volume service. However, free_capacity_gb will still be reported correctly, so there is no risk of overcommitting the backend. Therefore, for a setup with one backend managed by multiple HA cinder-volume services, this is not a huge issue?

Furthermore, quota management is probably not impacted here either.

Revision history for this message
Ilya Popov (ilya-p) wrote :

Not exactly, not in all cases.

For example, the cinder RBD driver doesn't report provisioned_capacity_gb:

    2022-11-26 17:23:15.746 8 DEBUG cinder.scheduler.host_manager [req-f39ed266-c6c4-415b-b5a8-2ec2170c5fc4 - - - - -] Received volume service update from compute0.ipo-region@rbd-1:
    {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 27.24, 'free_capacity_gb': 27.23, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True,
    'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:587501de-69b3-11ed-bdd6-dd57b05661dd:cinder:volumes', 'backend_state': 'up', 'volume_backend_name': 'rbd-1', 'replication_enabled': False,
    'allocated_capacity_gb': 0, 'filter_function': None, 'goodness_function': None} update_service_capabilities /var/lib/kolla/venv/lib/python3.8/site-packages/cinder/scheduler/host_manager.py:575
    2022-11-26 17:23:15.752 8 DEBUG cinder.scheduler.host_manager [req-f39ed266-c6c4-415b-b5a8-2ec2170c5fc4 - - - - -]

In this case the host manager will set provisioned_capacity_gb based on allocated_capacity_gb:

https://github.com/openstack/cinder/blob/master/cinder/scheduler/host_manager.py#L434

And, finally, provisioned_capacity_gb is used in capacity filter:

https://github.com/openstack/cinder/blob/master/cinder/scheduler/filters/capacity_filter.py#L148

So, if you have many thin volumes that don't hold much data, free_capacity_gb will still look sufficient to deploy additional volumes, but the actual oversubscription will be three times higher than what the filter calculates. As a result, we will have three times more thin volumes than planned, so we lose oversubscription control. Over time, thin volumes on Ceph become thick, and we will exhaust Ceph much more easily than with a correct calculation of provisioned_capacity_gb (based on allocated_capacity_gb).
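
A hedged sketch of that effect, loosely modelled on the thin-provisioning check in the capacity filter linked above (the numbers and the simplified formula are illustrative, not the exact filter code):

    # Simplified version of the thin-provisioning check, for illustration only.
    def passes_capacity_filter(requested_gb, total_gb, provisioned_gb,
                               max_over_subscription_ratio):
        # Roughly: the provisioned (not physically used) capacity must stay
        # under total * max_over_subscription_ratio.
        return provisioned_gb + requested_gb <= total_gb * max_over_subscription_ratio

    total_gb = 100
    ratio = 20.0            # allows up to 2000 GB of thin provisioning

    # Real provisioned capacity: 1800 GB. With three cinder-volume replicas
    # each reporting only its own ~1/3 share (600 GB), the scheduler believes
    # there is plenty of headroom and keeps admitting volumes.
    print(passes_capacity_filter(300, total_gb, 1800, ratio))  # False (correct view)
    print(passes_capacity_filter(300, total_gb, 600, ratio))   # True  (what the scheduler sees)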
