rbd backend reports wrong 'local_gb_used' for compute node

Bug #1493760 reported by ChangBo Guo(gcb) on 2015-09-09
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

When an instance's disk is on the rbd backend, the compute node reports the status of the whole Ceph cluster, which makes sense. The local_gb usage comes from https://github.com/openstack/nova/blob/master/nova/virt/libvirt/storage/rbd_utils.py#L313:

    def get_pool_info(self):
        with RADOSClient(self) as client:
            stats = client.cluster.get_cluster_stats()
            return {'total': stats['kb'] * units.Ki,
                    'free': stats['kb_avail'] * units.Ki,
                    'used': stats['kb_used'] * units.Ki}

This reports the same disk usage as the command 'ceph -s', for example:
[root@node-1 ~]# ceph -s
    cluster e598930a-0807-491b-b191-d57244d3c8e2
     health HEALTH_OK
     monmap e1: 1 mons at {node-1=192.168.0.1:6789/0}, election epoch 1, quorum 0 node-1
     osdmap e28: 2 osds: 2 up, 2 in
      pgmap v3985: 576 pgs, 5 pools, 295 MB data, 57 objects
            21149 MB used, 76479 MB / 97628 MB avail
                 576 active+clean

[root@node-1 ~]# rbd -p compute ls
45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
8c6a5555-394f-4c24-b7ff-e05fdf322155_disk
944d9028-ac59-45fd-9be3-69066c8bc4e5
9ea375dc-f0b8-472e-ba53-4d83e5721771_disk
9fce4606-6871-40ca-bf8f-6146c05068e6_disk
cedce585-8747-4798-885f-0c47337f0f6f_disk
e17c9391-2032-4144-8fa1-85b092239e66_disk
e19143c7-228c-4f89-9735-c27c333adce4_disk
f9caf4a7-2b62-46c2-b2e1-f99cb4ce3f57_disk
[root@node-1 ~]# rbd -p compute info 45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
rbd image '45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk':
 size 20480 MB in 2560 objects
 order 23 (8192 kB objects)
 block_name_prefix: rbd_data.39ab250fe98b
 format: 2
 features: layering
 parent: compute/944d9028-ac59-45fd-9be3-69066c8bc4e5@snap
 overlap: 40162 kB

In the example above, we have two compute nodes and can create 4 instances with a 20G disk on each compute node. The interesting thing is that the total local_gb is 95G, yet 160G is allocated to instances.
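
The arithmetic behind these numbers can be checked with a short sketch. It assumes the MB figures printed by 'ceph -s' are multiples of 1024 kB and that all eight instances use the full 20G flavor disk; the function name is hypothetical.

```python
MB_PER_GB = 1024  # assumption: 'ceph -s' MB values are binary megabytes


def overcommit_summary(total_mb, nodes, instances_per_node, disk_gb):
    """Return (cluster total in GB, GB promised to instances, ratio)."""
    total_gb = total_mb / MB_PER_GB
    allocated_gb = nodes * instances_per_node * disk_gb
    return total_gb, allocated_gb, allocated_gb / total_gb


# Figures from the 'ceph -s' output: 97628 MB total,
# two compute nodes, 4 instances each, 20G per instance disk.
total_gb, allocated_gb, ratio = overcommit_summary(97628, 2, 4, 20)
print(f"cluster total: {total_gb:.1f} GB")
print(f"allocated:     {allocated_gb} GB")
print(f"overcommit:    {ratio:.2f}x")
```

So the disks promised to instances add up to roughly 1.7 times the capacity of the whole cluster.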

The root cause is that client.cluster.get_cluster_stats() returns the actual used size, which means a 20G instance disk may occupy only 200M. This is dangerous when instances use all of their disk.

An alternative solution is to calculate the total of all instances' disk sizes in some way and report it as local_gb_used.
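
A minimal sketch of that idea, not nova code: it assumes the provisioned size of every image in the pool has already been collected into a dict (for instance through the rbd Python bindings), and that instance disks can be told apart from base images by the '_disk' suffix seen in the listing above. The function name, image names and sizes are hypothetical.

```python
GIB = 1024 ** 3


def provisioned_gb_used(image_sizes_bytes, suffix="_disk"):
    """Sum the provisioned (virtual) sizes of instance disks.

    image_sizes_bytes maps image name -> provisioned size in bytes.
    Only names ending in `suffix` are counted, so base images that
    instance disks are cloned from are skipped. The result is
    rounded up to whole GiB.
    """
    total = sum(size for name, size in image_sizes_bytes.items()
                if name.endswith(suffix))
    return -(-total // GIB)  # ceiling division


# Hypothetical pool contents: eight 20G instance disks plus one base image.
sizes = {f"instance-{i}_disk": 20 * GIB for i in range(8)}
sizes["base-image"] = 40 * 1024 ** 2

print(provisioned_gb_used(sizes))
```

With this accounting the pool reports the space instances could eventually consume rather than what they occupy right now.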

tags: added: ceph libvirt
Mark Doffman (mjdoffma) wrote :

If you look at https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4628
you can see the three different functions used for getting available disk space.

'get_volume_group_info'
'get_pool_info' and
'get_fs_info'

All of these methods are going to return the ACTUAL disk space used, rather than the theoretical maximum of all the instance sizes. This is because disks stored locally will be stored as sparse qcow images. LVM disks are sparse volumes.

I believe that the intention of 'local_gb_used' is to report the actual disk space.

Changed in nova:
status: New → Invalid