Bug #1414432 “nova hypervisor-stats shows wrong disk usage with...” : Bugs : OpenStack Compute (nova)

Revision history for this message

ChangBo Guo(gcb) (glongwave) wrote on 2015-01-25:

#1

I think this also hits other shared storage such as NFS.

summary:

- nova hypervisor-stats shows wrong disk usage with ceph backend
+ nova hypervisor-stats shows wrong disk usage with shared storage

OpenStack Infra (hudson-openstack) wrote on 2015-01-25: Fix proposed to nova (master)

#2

Fix proposed to branch: master
Review: https://review.openstack.org/149878

Changed in nova:
assignee:	nobody → ChangBo Guo(gcb) (glongwave)
status:	New → In Progress

Mark Wu (wudx05) wrote on 2015-01-26:

#3

IMHO, whether this is a bug depends on how 'local_gb' is defined and interpreted. If 'local_gb' is defined as the local, non-shared storage on each compute node, then the statistics are wrong. But in that case the problem is caused not by the stats calculation but by the wrong usage of the rbd backend. If 'local_gb' is instead interpreted as the available storage space seen by each compute node, covering both shared and non-shared storage, then there is no bug in the Nova code, and the client should not add up 'local_gb' to get the total space when the storage is shared. Actually, it makes more sense to collect the total usage of shared storage from storage-specific administration software.

ChangBo Guo(gcb) (glongwave) wrote on 2015-01-26:

#4

Hi Mark,
I think 'local_gb' should be interpreted as the available storage space seen by each compute node, which makes sense for scheduling. nova-client doesn't add up 'local_gb', but Nova does; see
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L648
so we need to change the Nova code :-)

More thoughts:

Do we support having some compute nodes with shared storage and others without in one OpenStack deployment?
If yes, we face a more complicated case; maybe we need the admin to describe the deployment and pass that information
to Nova.
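
To make the double counting concrete, here is a minimal Python sketch. It is not Nova's code: the `NodeStats` record, the `shared_pool_id` field, and both function names are hypothetical, and it assumes the operator has somehow told Nova which nodes share a backend. It contrasts a plain sum of 'local_gb' with one that counts each shared pool once, including the mixed case where only some nodes use shared storage.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeStats:
    """Hypothetical per-node record; not Nova's ComputeNode model."""
    host: str
    local_gb: int
    # Assumed operator-supplied label; None means node-local disk.
    shared_pool_id: Optional[str] = None


def naive_total_gb(nodes: Iterable[NodeStats]) -> int:
    """Sum local_gb across nodes, as a plain aggregate would."""
    return sum(n.local_gb for n in nodes)


def deduplicated_total_gb(nodes: Iterable[NodeStats]) -> int:
    """Count node-local disks individually and each shared pool once."""
    total = 0
    seen_pools = set()
    for n in nodes:
        if n.shared_pool_id is None:
            total += n.local_gb
        elif n.shared_pool_id not in seen_pools:
            seen_pools.add(n.shared_pool_id)
            total += n.local_gb
    return total


if __name__ == "__main__":
    nodes = [
        NodeStats("node1", 1000, "ceph-a"),
        NodeStats("node2", 1000, "ceph-a"),
        NodeStats("node3", 1000, "ceph-a"),
        NodeStats("node4", 200),
    ]
    print(naive_total_gb(nodes))         # 3200
    print(deduplicated_total_gb(nodes))  # 1200
```

The sketch takes the first reported size for each pool as that pool's capacity; a real fix would also have to decide how to aggregate used and free space, which this example does not attempt.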

Sébastien Han (sebastien-han) wrote on 2015-01-27:

#5

What about this? https://review.openstack.org/#/c/102064/

ChangBo Guo(gcb) (glongwave) wrote on 2015-01-28:

#6

This is an example:
We have 19 compute nodes using Ceph in a test environment, and hypervisor-stats shows the wrong disk usage.
# nova hypervisor-stats
+----------------------+---------+
| Property             | Value   |
+----------------------+---------+
| count                | 19      |
| current_workload     | 2       |
| disk_available_least | 984873  | -------------------------------> stats value
| free_disk_gb         | 987045  |
| free_ram_mb          | 2417935 |
| local_gb             | 987725  |
| local_gb_used        | 680     |
| memory_mb            | 2451215 |
| memory_mb_used       | 33280   |
| running_vms          | 34      |
| vcpus                | 152     |
| vcpus_used           | 46      |
+----------------------+---------+
# rados df
pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
backups         -                          0            0            0            0           0            0            0            0            0
compute         -                   13551959         2022           18            0           0      5097899     58466364      7653287     55951741
data            -                          0            0            0            0           0            0            0            0            0
images          -                          0            0            0            0           0         8244     33705999        16442     33636361
metadata        -                          0            0            0            0           0            0            0            0            0
rbd             -                          0            0            0            0           0            0            0            0            0
volumes         -                          1            5            0            0           0           66           50           14            2
  total used       152963616         2027
  total avail    54091224296             ------------------------------> actual  value 
  total space    54244187912
# python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> avail = 54091224296/(1024*1024)*19                  -------------------------->   stats value  =    actual  value * 19 
>>> avail
980115
>>>
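
The interpreter session above relies on Python 2 integer division. A small sketch that reproduces the same arithmetic with explicit floor division, so it also runs on Python 3 (the constant and variable names are illustrative only):

```python
# Figures copied from the `rados df` and `nova hypervisor-stats` output above.
TOTAL_AVAIL_KB = 54091224296      # rados df "total avail", in KB
NODE_COUNT = 19                   # hypervisor-stats "count"
DISK_AVAILABLE_LEAST = 984873     # hypervisor-stats "disk_available_least"

KB_PER_GB = 1024 * 1024

# Floor division mirrors the Python 2 `/` used in the original session.
actual_avail_gb = TOTAL_AVAIL_KB // KB_PER_GB
summed_avail_gb = actual_avail_gb * NODE_COUNT

print(actual_avail_gb)   # 51585
print(summed_avail_gb)   # 980115

# The reported value is roughly NODE_COUNT times the real free space.
ratio = DISK_AVAILABLE_LEAST / actual_avail_gb
print(round(ratio, 1))   # 19.1
```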

Sébastien Han (sebastien-han) wrote on 2015-01-28:

#7

I think I read too fast :). Sorry for the noise, this makes perfect sense!

Davanum Srinivas (DIMS) (dims-v) on 2015-02-18

Changed in nova:
importance:	Undecided → Low

OpenStack Infra (hudson-openstack) wrote on 2015-02-28: Change abandoned on nova (master)

#8

Change abandoned by ChangBo Guo(gcb) (<email address hidden>) on branch: master
Review: https://review.openstack.org/149878

Davanum Srinivas (DIMS) (dims-v) on 2015-03-04

Changed in nova:
status:	In Progress → Confirmed

Fu Guang Ping (fuguangping) wrote on 2016-06-07:

#9

@hudson-openstack, I think it may not work if we use multiple backends, or if some nodes use shared storage but others do not.

Vikram Hosakote (vhosakot) wrote on 2016-06-08:

#10

Is this bug fixed? I see it in Kilo.

"nova hypervisor-stats" reports that there is ~401 TB disk space on the compute nodes when there is actually just 1 TB storage space on the compute nodes.

melanie witt (melwitt) on 2016-07-01

tags:

added: ceph

Sujitha (sujitha-neti) wrote on 2016-07-08:

#11

There have been no open reviews for this bug report for a long time.
To signal this to other contributors who might provide patches for
this bug, I'm removing the assignee.

Feel free to add yourself as assignee and push a review for it.

Changed in nova:
assignee:	ChangBo Guo(gcb) (glongwave) → nobody

Tao Li (eric-litao) on 2017-06-23

Changed in nova:
assignee:	nobody → Tao Li (eric-litao)

Sean Dague (sdague) on 2017-06-23

Changed in nova:
assignee:	Tao Li (eric-litao) → nobody

OpenStack Infra (hudson-openstack) on 2018-07-23

Changed in nova:
assignee:	nobody → huanhongda (hongda)
status:	Confirmed → In Progress

sean mooney (sean-k-mooney) wrote on 2019-03-04:

#12

Note: the resolution of this bug involves a change in API behavior and therefore requires a spec.
The currently planned path to correct this bug is to track local and shared storage in placement and either proxy hypervisor API calls to placement to retrieve the actual usage for allocations against the host, or deprecate part or all of the hypervisors API.

tags:

added: api placement shared-storage

OpenStack Infra (hudson-openstack) wrote on 2019-06-13:

#13

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/149878
Reason: As noted this is extremely latent behavior and is an API change.

I've linked the bug to blueprint https://blueprints.launchpad.net/nova/+spec/support-shared-storage-resource-provider so those that care should make sure the use case here is covered in the spec for that blueprint:

https://review.opendev.org/#/c/650188/

Matt Riedemann (mriedem) on 2019-06-13

Changed in nova:
status:	In Progress → Confirmed
importance:	Low → Wishlist
assignee:	huanhongda (hongda) → nobody
