rbd backend reports wrong 'local_gb_used' for compute node

OpenStack Compute (nova)

When an instance's disk is in the rbd backend, the compute node reports the status of the whole ceph cluster, which makes sense. The local_gb usage comes from https://github.com/openstack/nova/blob/master/nova/virt/libvirt/storage/rbd_utils.py#L313

    def get_pool_info(self):
        with RADOSClient(self) as client:
            stats = client.cluster.get_cluster_stats()
            return {'total': stats['kb'] * units.Ki,
                    'free': stats['kb_avail'] * units.Ki,
                    'used': stats['kb_used'] * units.Ki}

This reports the same disk usage as the 'ceph -s' command, for example:
[root@node-1 ~]# ceph -s
    cluster e598930a-0807-491b-b191-d57244d3c8e2
     health HEALTH_OK
     monmap e1: 1 mons at {node-1=}, election epoch 1, quorum 0 node-1
     osdmap e28: 2 osds: 2 up, 2 in
      pgmap v3985: 576 pgs, 5 pools, 295 MB data, 57 objects
            21149 MB used, 76479 MB / 97628 MB avail
                 576 active+clean

[root@node-1 ~]# rbd -p compute ls
[root@node-1 ~]# rbd -p compute info 45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
rbd image '45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk':
 size 20480 MB in 2560 objects
 order 23 (8192 kB objects)
 block_name_prefix: rbd_data.39ab250fe98b
 format: 2
 features: layering
 parent: compute/944d9028-ac59-45fd-9be3-69066c8bc4e5@snap
 overlap: 40162 kB

In the example above we have two compute nodes, and we can create 4 instances with a 20G disk on each compute node. The interesting thing is that the total local_gb is 95G, yet 160G is allocated to instances.
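
To make the mismatch concrete, here is the arithmetic as a small Python sketch. The capacity and usage figures are copied from the 'ceph -s' output above; the variable names are only illustrative.

```python
# Figures taken from the 'ceph -s' output and the scenario described above.
MB_PER_GB = 1024

total_mb = 97628            # total cluster capacity
used_mb = 21149             # space actually consumed in the cluster
compute_nodes = 2
instances_per_node = 4
disk_gb_per_instance = 20

total_gb = total_mb // MB_PER_GB    # what ends up as local_gb
used_gb = used_mb // MB_PER_GB      # what ends up as local_gb_used
provisioned_gb = compute_nodes * instances_per_node * disk_gb_per_instance

print("local_gb:", total_gb)
print("local_gb_used:", used_gb)
print("provisioned to instances:", provisioned_gb)
print("provisioned beyond capacity:", provisioned_gb - total_gb)
```

With these numbers the instances are promised 65G more than the cluster can hold.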

The root cause is that client.cluster.get_cluster_stats() returns the space actually used, which means a 20G instance disk may occupy only about 200MB. This is dangerous once instances fill their disks, because the cluster could then run out of space.

An alternative solution would be to calculate local_gb_used in some way from the disk sizes of all instances.
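
A rough sketch of what that could look like, not a proposed patch. The helper names provisioned_gb and pool_image_sizes are hypothetical. It assumes the Python rbd bindings (rbd.RBD().list() and rbd.Image.size()) and an already opened ioctx for the pool, and it counts every image in the pool, which is shared by all compute nodes.

```python
GiB = 1024 ** 3


def provisioned_gb(image_sizes_bytes):
    """Sum provisioned image sizes (bytes) and round up to whole GiB."""
    total = sum(image_sizes_bytes)
    return -(-total // GiB)  # ceiling division


def pool_image_sizes(ioctx):
    """Yield the provisioned size in bytes of every image in the pool.

    Hypothetical helper: 'ioctx' is assumed to be an open rados I/O
    context for the pool that holds the instance disks.
    """
    import rbd  # imported lazily so the arithmetic above works without ceph

    for name in rbd.RBD().list(ioctx):
        with rbd.Image(ioctx, name, read_only=True) as image:
            yield image.size()


# Example with made-up sizes: four instances with a 20G disk each.
print(provisioned_gb([20 * GiB] * 4))
```

On a real cluster the call would be something like provisioned_gb(pool_image_sizes(ioctx)). Whether the result should be reported per compute node or per pool is left open here.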

Mark Doffma wrote:

If you look at https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4628
you can see the three different functions used for getting available disk space.

'get_pool_info' and

All of these methods are going to return the ACTUAL disk space used, rather than the theoretical maximum of all the instance sizes. This is because disks stored locally will be stored as sparse qcow images, and LVM disks are sparse volumes.
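
The gap between apparent and actual size can be illustrated with an ordinary sparse file. This is only an analogy to a qcow image or an rbd volume, written as a minimal sketch. It assumes a POSIX system, where st_blocks is counted in 512-byte units, and a filesystem that supports sparse files. The function name sparse_file_sizes is made up for the example.

```python
import os
import tempfile


def sparse_file_sizes(apparent_bytes):
    """Create a sparse temp file and return (apparent size, allocated bytes)."""
    with tempfile.NamedTemporaryFile() as f:
        f.truncate(apparent_bytes)   # extend the file without writing any data
        f.flush()
        st = os.stat(f.name)
        # st_size is the size the file claims to have; st_blocks counts the
        # 512-byte blocks that are really allocated on disk.
        return st.st_size, st.st_blocks * 512


apparent, allocated = sparse_file_sizes(20 * 1024 * 1024)
print("apparent bytes:", apparent)
print("allocated bytes:", allocated)
```

The apparent size plays the role of the sum of instance disk sizes, while the allocated size corresponds to what the methods above report.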

I believe that the intention of 'local_gb_used' is to report the actual disk space.

