nova hypervisor-stats shows wrong disk usage with shared storage

Bug #1414432 reported by ChangBo Guo(gcb)
This bug affects 9 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

When using Ceph as the backend, each compute node reports the total disk usage of the backend pool as its own disk usage. This makes sense for calculating incoming resource requests, such as booting a new instance or migrating an instance to this compute node.

See https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4334

But when getting hypervisor-stats, we sum the disk usage of all compute nodes, which doesn't make sense for a Ceph backend.
The result is disk usage = {actual disk usage} * {number of compute nodes}

https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L648

How should we handle this case?

 Take the average disk usage across the compute nodes instead when using Ceph?
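
To illustrate the over-counting described above, here is a minimal sketch. It does not use Nova's code: the node records and the `summed_stats` / `averaged_stats` helpers are hypothetical, and it assumes every node reports the same shared pool.

```python
# Hypothetical per-node reports: every node sees the same 1000 GB Ceph pool.
NODES = [
    {"host": "compute-%d" % i, "local_gb": 1000, "local_gb_used": 40}
    for i in range(3)
]


def summed_stats(nodes):
    """Sum each field over all nodes, so a shared pool is counted per node."""
    return {
        "local_gb": sum(n["local_gb"] for n in nodes),
        "local_gb_used": sum(n["local_gb_used"] for n in nodes),
    }


def averaged_stats(nodes):
    """Average each field; only correct if all nodes share one pool."""
    count = len(nodes)
    return {key: value // count for key, value in summed_stats(nodes).items()}


print(summed_stats(NODES))    # pool counted once per node
print(averaged_stats(NODES))  # pool counted once
```

Averaging only recovers the real figure when every node reports the same pool, so it would not cover deployments that mix shared and local storage.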

ChangBo Guo(gcb) (glongwave) wrote :

I think this also affects other shared storage such as NFS.

summary: - nova hypervisor-stats shows wrong disk usage with ceph backend
+ nova hypervisor-stats shows wrong disk usage with shared storage
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/149878

Changed in nova:
assignee: nobody → ChangBo Guo(gcb) (glongwave)
status: New → In Progress
Mark Wu (wudx05) wrote :

IMHO, whether it's a bug depends on how 'local_gb' is defined and interpreted. If 'local_gb' is defined as the local non-shared storage on each compute node, then the statistics are wrong. But in that case the problem is caused not by the stats calculation but by the wrong usage of the rbd backend. If 'local_gb' is interpreted as the available storage space seen by each compute node, covering both shared and non-shared storage, then there's no bug in the Nova code, and the client should not add up 'local_gb' to get the total space when the storage is shared. Actually, it makes more sense to collect total usage of shared storage from storage-specific administration software.

ChangBo Guo(gcb) (glongwave) wrote :

Hi Mark,
I think 'local_gb' should be interpreted as the available storage space seen by each compute node; that makes sense for scheduling. nova-client doesn't add up 'local_gb', but Nova does, see
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L648
so we need to change the Nova code :-)

More thoughts:

 Do we support having some compute nodes with shared storage and others without in one OpenStack deployment?
 If yes, we face a more complicated case; maybe we need the admin to describe the deployment and pass that information
to Nova.
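
One way to picture the mixed case, as a sketch only: suppose each node's report carried an identifier for the storage it sits on. The `pool_id` field and the `aggregate_disk` helper below are hypothetical (Nova does not report such a field in this thread); admin-supplied deployment information could play that role. The aggregate then counts each shared pool once and each local disk separately.

```python
def aggregate_disk(nodes):
    """Sum local_gb, counting each shared pool only once.

    Nodes whose pool_id is None are treated as local, non-shared disks.
    """
    total = 0
    seen_pools = set()
    for node in nodes:
        pool = node.get("pool_id")
        if pool is None:
            # Local disk: always contributes its own capacity.
            total += node["local_gb"]
        elif pool not in seen_pools:
            # First node seen on this shared pool: count the pool once.
            seen_pools.add(pool)
            total += node["local_gb"]
    return total


nodes = [
    {"host": "c1", "local_gb": 1000, "pool_id": "ceph-a"},
    {"host": "c2", "local_gb": 1000, "pool_id": "ceph-a"},
    {"host": "c3", "local_gb": 200, "pool_id": None},
]
print(aggregate_disk(nodes))  # 1200
```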

Sébastien Han (sebastien-han) wrote :
ChangBo Guo(gcb) (glongwave) wrote :

Here is an example.
We have 19 compute nodes using Ceph in a test environment; hypervisor-stats shows the wrong disk usage.
# nova hypervisor-stats
+----------------------+---------+
| Property             | Value   |
+----------------------+---------+
| count                | 19      |
| current_workload     | 2       |
| disk_available_least | 984873  | -------------------------------> stats value
| free_disk_gb         | 987045  |
| free_ram_mb          | 2417935 |
| local_gb             | 987725  |
| local_gb_used        | 680     |
| memory_mb            | 2451215 |
| memory_mb_used       | 33280   |
| running_vms          | 34      |
| vcpus                | 152     |
| vcpus_used           | 46      |
+----------------------+---------+
# rados df
pool name  category        KB  objects  clones  degraded  unfound       rd     rd KB       wr     wr KB
backups    -                0        0       0         0        0        0         0        0         0
compute    -         13551959     2022      18         0        0  5097899  58466364  7653287  55951741
data       -                0        0       0         0        0        0         0        0         0
images     -                0        0       0         0        0     8244  33705999    16442  33636361
metadata   -                0        0       0         0        0        0         0        0         0
rbd        -                0        0       0         0        0        0         0        0         0
volumes    -                1        5       0         0        0       66        50       14         2
  total used     152963616  2027
  total avail  54091224296 ------------------------------> actual value
  total space  54244187912
# python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> avail = 54091224296/(1024*1024)*19  # stats value = actual value * 19
>>> avail
980115
>>>
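
The same check can be written for Python 3, where `/` is no longer integer division. This is a sketch using only the figures shown above; the variable names are illustrative.

```python
KB_PER_GB = 1024 * 1024

total_avail_kb = 54091224296   # "total avail" from rados df
node_count = 19                # "count" from nova hypervisor-stats
reported_gb = 984873           # "disk_available_least" from nova hypervisor-stats

# Floor division reproduces the Python 2 integer arithmetic used above.
pool_avail_gb = total_avail_kb // KB_PER_GB
expected_gb = pool_avail_gb * node_count

print(pool_avail_gb)              # 51585
print(expected_gb)                # 980115
print(reported_gb - expected_gb)  # 4758
```

The reported value is close to, though not exactly, 19 times the pool's available space, which is consistent with the pool being counted once per compute node.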

Sébastien Han (sebastien-han) wrote :

I think I read too fast :). Sorry for the noise, this makes perfect sense!

Changed in nova:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by ChangBo Guo(gcb) (<email address hidden>) on branch: master
Review: https://review.openstack.org/149878

Changed in nova:
status: In Progress → Confirmed
Fu Guang Ping (fuguangping) wrote :

@hudson-openstack, I think it may not work if we use multiple backends, or if some nodes use shared storage but others do not.

Vikram Hosakote (vhosakot) wrote :

Is this bug fixed? I see it in Kilo.

"nova hypervisor-stats" reports that there is ~401 TB of disk space on the compute nodes when there is actually just 1 TB of storage space on the compute nodes.

# nova hypervisor-stats | grep disk
| disk_available_least | 401542 |
| free_disk_gb | 401846 |

melanie witt (melwitt)
tags: added: ceph
Sujitha (sujitha-neti) wrote :

There have been no open reviews for this bug report for a long time.
To signal that to other contributors who might provide patches for
this bug, I'm removing the assignee.

Feel free to add yourself as assignee and push a review for it.

Changed in nova:
assignee: ChangBo Guo(gcb) (glongwave) → nobody
Tao Li (eric-litao)
Changed in nova:
assignee: nobody → Tao Li (eric-litao)
Sean Dague (sdague)
Changed in nova:
assignee: Tao Li (eric-litao) → nobody
Changed in nova:
assignee: nobody → huanhongda (hongda)
status: Confirmed → In Progress
sean mooney (sean-k-mooney) wrote :

Note that the resolution of this bug involves a change in API behavior and therefore requires a spec.
The currently planned path to correct this bug is to track local and shared storage in placement, and either proxy hypervisor API calls to placement to retrieve the actual usage for allocations against the host, or deprecate part or all of the hypervisors API.

tags: added: api placement shared-storage
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/149878
Reason: As noted this is extremely latent behavior and is an API change.

I've linked the bug to blueprint https://blueprints.launchpad.net/nova/+spec/support-shared-storage-resource-provider so those that care should make sure the use case here is covered in the spec for that blueprint:

https://review.opendev.org/#/c/650188/

Matt Riedemann (mriedem)
Changed in nova:
status: In Progress → Confirmed
importance: Low → Wishlist
assignee: huanhongda (hongda) → nobody