hypervisor stats aggregates resources from deleted and existing services if they share the same hostname

Bug #1719770 reported by Drew Freiberger
This bug affects 2 people
Affects                       Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)      Invalid  Undecided   Unassigned
OpenStack Nova Compute Charm  Invalid  Undecided   Unassigned
nova (Ubuntu)                 Invalid  Medium      Unassigned

Bug Description

In an environment with 592 physical threads (lscpu | grep '^CPU.s' and 'openstack hypervisor show -f value -c vcpus' both show the correct counts), I am seeing 712 vcpus in the hypervisor stats. The memory_mb and other aggregate stats are likely inflated by the same issue.

Querying the nova services DB table, I see: http://pastebin.ubuntu.com/25624553/

It appears that, of the six machines marked deleted in the services table, only one is also marked disabled.

Digging through the nova/db/sqlalchemy/api.py code, the hypervisor stats query filters on Service.disabled == false() and Service.binary == 'nova-compute', but I don't see it filtering on deleted == 0.

I'm not certain of the exact timeline of the uninstall and reinstall of the nova-compute units on the 6 x 24-vcpu servers (see the *-ST-{1,2} nova-compute services) that caused these services not to get disabled, but the nova API for hypervisor stats might be well served to filter out deleted services as well as disabled ones; alternatively, if a deleted service should always be disabled, nova service-delete should also set the disabled flag for the service.
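
For illustration, here is a minimal sketch of the first suggestion. It is hypothetical, not merged code: it reuses 'aggregates' as a stand-in for the column tuple and '_filter' for the OR condition from the compute_node_statistics() excerpt quoted later in this thread, and simply adds a Service.deleted filter.

# Hypothetical sketch only: the Mitaka-era compute_node_statistics()
# query (full excerpt below) with one extra filter. 'aggregates' stands
# for the func.count()/func.sum() tuple and '_filter' for the
# host/service_id OR condition shown in that excerpt.
result = model_query(context, models.ComputeNode, aggregates,
                     read_deleted="no").\
                     filter(models.Service.disabled == false()).\
                     filter(models.Service.binary == "nova-compute").\
                     filter(models.Service.deleted == 0).\
                     filter(_filter).\
                     first()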

These services and compute_nodes do not show up in openstack hypervisor list.

Site is running up-to-date Xenial/Mitaka on openstack-charmers 17.02.

Revision history for this message
James Page (james-page) wrote :

That's odd, as most queries do:

  model_query(context, models.Service, read_deleted="no")
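
For context, a simplified sketch of how that helper behaves; this is illustrative only and assumes the broad shape of nova/db/sqlalchemy/api.py (the real model_query also handles sessions, project scoping and joins):

# Simplified, assumption-laden sketch of nova's model_query helper.
def model_query(context, model, fields=None, read_deleted="no"):
    query = context.session.query(*fields) if fields \
        else context.session.query(model)
    if read_deleted == "no":
        # Note: this constrains only the *primary* model's own deleted
        # column; tables pulled in via joins are not filtered.
        query = query.filter(model.deleted == 0)
    elif read_deleted == "only":
        query = query.filter(model.deleted != 0)
    # read_deleted == "yes": no soft-delete filter at all
    return query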

Revision history for this message
James Page (james-page) wrote :

Including the stats call:

@pick_context_manager_reader
def compute_node_statistics(context):
    """Compute statistics over all compute nodes."""

    # TODO(sbauza): Remove the service_id filter in a later release
    # once we are sure that all compute nodes report the host field
    _filter = or_(models.Service.host == models.ComputeNode.host,
                  models.Service.id == models.ComputeNode.service_id)

    result = model_query(context,
                         models.ComputeNode, (
                             func.count(models.ComputeNode.id),
                             func.sum(models.ComputeNode.vcpus),
                             func.sum(models.ComputeNode.memory_mb),
                             func.sum(models.ComputeNode.local_gb),
                             func.sum(models.ComputeNode.vcpus_used),
                             func.sum(models.ComputeNode.memory_mb_used),
                             func.sum(models.ComputeNode.local_gb_used),
                             func.sum(models.ComputeNode.free_ram_mb),
                             func.sum(models.ComputeNode.free_disk_gb),
                             func.sum(models.ComputeNode.current_workload),
                             func.sum(models.ComputeNode.running_vms),
                             func.sum(models.ComputeNode.disk_available_least),
                         ), read_deleted="no").\
                         filter(models.Service.disabled == false()).\
                         filter(models.Service.binary == "nova-compute").\
                         filter(_filter).\
                         first()

    # Build a dict of the info--making no assumptions about result
    fields = ('count', 'vcpus', 'memory_mb', 'local_gb', 'vcpus_used',
              'memory_mb_used', 'local_gb_used', 'free_ram_mb', 'free_disk_gb',
              'current_workload', 'running_vms', 'disk_available_least')
    return {field: int(result[idx] or 0)
            for idx, field in enumerate(fields)}

Revision history for this message
James Page (james-page) wrote :

This code has had some significant refactoring since Mitaka, so it's possible this only impacts older OpenStack releases.

Either way, this is not a charm problem but a Nova issue AFAICT - raising a distro task.

Changed in charm-nova-compute:
status: New → Invalid
Revision history for this message
Drew Freiberger (afreiberger) wrote :

In models.py, both in Mitaka and on master, I've found that the relation between ComputeNode and Service uses the following join in the Service context:

        primaryjoin='and_(Service.host == Instance.host,'
                    'Service.binary == "nova-compute",'
                    'Instance.deleted == 0)',

In my case, since I redeployed a deleted node under the same hostname (Service.host), this join relates a deleted ComputeNode.host entry to the non-deleted Service.host entry.

Looking at both my compute_nodes and services tables, it appears they should potentially be joined on the "id" field rather than the "host" field, at least for this specific query, but that could break the Service object relation model for other query contexts, such as listing the instances running on a hypervisor.
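
A hypothetical sketch of that stricter join, reusing the _filter name from the compute_node_statistics() excerpt above (not merged code; note the TODO in that excerpt says the service_id path was meant to be removed once all compute nodes report the host field, so this trades one compatibility problem for another):

# Hypothetical: match services to compute nodes strictly by service_id,
# so a live service can no longer pick up a soft-deleted compute_nodes
# row that merely shares its hostname (sketch only).
_filter = models.Service.id == models.ComputeNode.service_id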

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I cannot reproduce the problem; the generated SQL uses 'compute_nodes.deleted = 0' to filter deleted records, as the following debug info shows (Mitaka).

(Pdb) <oslo_db.sqlalchemy.orm.Query object at 0x7f97672b9810>
(Pdb) 'SELECT count(compute_nodes.id) AS count_1, sum(compute_nodes.vcpus) AS sum_1,
       sum(compute_nodes.memory_mb) AS sum_2, sum(compute_nodes.local_gb) AS sum_3,
       sum(compute_nodes.vcpus_used) AS sum_4, sum(compute_nodes.memory_mb_used) AS sum_5,
       sum(compute_nodes.local_gb_used) AS sum_6, sum(compute_nodes.free_ram_mb) AS sum_7,
       sum(compute_nodes.free_disk_gb) AS sum_8, sum(compute_nodes.current_workload) AS sum_9,
       sum(compute_nodes.running_vms) AS sum_10, sum(compute_nodes.disk_available_least) AS sum_11
       FROM compute_nodes, services
       WHERE compute_nodes.deleted = :deleted_1 AND services.disabled = 0
         AND services."binary" = :binary_1
         AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id)'

This is the result of running the above SQL directly in MySQL; everything looks OK.

mysql> SELECT count(compute_nodes.id) AS count_1, sum(compute_nodes.vcpus) AS sum_1, sum(compute_nodes.memory_mb) AS sum_2, sum(compute_nodes.local_gb) AS sum_3, sum(compute_nodes.vcpus_used) AS sum_4, sum(compute_nodes.memory_mb_used) AS sum_5, sum(compute_nodes.local_gb_used) AS sum_6, sum(compute_nodes.free_ram_mb) AS sum_7, sum(compute_nodes.free_disk_gb) AS sum_8, sum(compute_nodes.current_workload) AS sum_9, sum(compute_nodes.running_vms) AS sum_10, sum(compute_nodes.disk_available_least) AS sum_11 FROM compute_nodes, services WHERE compute_nodes.deleted = 0 AND services.disabled = 0 AND services.binary = 'nova-compute' AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id);
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
| count_1 | sum_1 | sum_2 | sum_3 | sum_4 | sum_5 | sum_6 | sum_7 | sum_8 | sum_9 | sum_10 | sum_11 |
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
|       2 |     4 |  7902 |    76 |     0 |  1024 |     0 |  6878 |    76 |     0 |      0 |     72 |
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
1 row in set (0.00 sec)

mysql> SELECT sum(compute_nodes.vcpus) FROM compute_nodes, services WHERE compute_nodes.deleted = 0 AND services.disabled = 0 AND services.binary = 'nova-compute' AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id);
+--------------------------+
| sum(compute_nodes.vcpus) |
+--------------------------+
|                        4 |
+--------------------------+
1 row in set (0.00 sec)

Below are steps I used to create test env:

1, There are 3 nova-compute nodes initially.

2, Then use 'openstack compute service delete 10' to delete one compute service.

3, 'select * from services where id=10' shows that the deleted field of this record is no longer 0.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

Marking 'Incomplete' for now until we get a reproducer figured out.

Changed in nova (Ubuntu):
status: Confirmed → Incomplete
importance: Undecided → Medium
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'm still working on reproducing. While attempting reproduction, I had an environment with 3 hosts, a dummy ubuntu charm on each, and then added nova-compute to all 3. I removed the nova-compute unit from the third host and still saw its stats in hypervisor-stats. There may be some cleanup missing in the charm-nova-compute relation-departed hooks to disable/remove the service.

http://pastebin.ubuntu.com/25824195/

I think reproducing it might require a full remove-machine, then add-unit ubuntu --to <machine-name>, then add-unit nova-compute --to <new machine number>.

Ed (~dosaboy) may be working on this reproduction.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi @afreiberger, I've run a test using our charms and Xenial Mitaka; here are the results - http://pastebin.ubuntu.com/25879833/

A few points for clarification:

When you run 'juju remove-unit nova-compute/0', the charm performs no cleanup at all, i.e. it does not notify the cloud controller that the node is to be disabled or deleted. So what you are left with following a remove-unit is a compute node whose database state represents it as "State:down", "disabled:0" and "deleted:0". Nova is therefore correctly still counting the compute resources associated with that node as available.

If, after removing the unit, you then issue an 'openstack compute service delete <id>', the node is marked as "deleted:<id>" and its resources are no longer counted as available (as can be clearly seen in the pastebin output I provided above).

I am still verifying whether entries with the same hostname but different deleted status are skewing the stats and will report back once I've confirmed. In any case, the findings so far hopefully show that you always need to manually delete a compute host using the API after removing it with Juju.
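
As a rough illustration of that manual cleanup, a hedged python-novaclient sketch (the keystoneauth1 session setup is assumed and omitted, and treating every 'down' nova-compute service as stale is an illustrative simplification):

# Hedged sketch: after 'juju remove-unit', delete stale nova-compute
# services so their resources stop being counted. 'sess' is an assumed
# pre-built keystoneauth1 session with admin credentials.
from novaclient import client

nova = client.Client('2.1', session=sess)
for svc in nova.services.list(binary='nova-compute'):
    if svc.state == 'down':
        print('deleting stale service %s on %s' % (svc.id, svc.host))
        nova.services.delete(svc.id)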

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Bug Confirmed - http://paste.ubuntu.com/25880271/

Deploying nova-compute to a host that previously had nova-compute deployed to it (i.e. a recycled hostname) will result in nova hypervisor stats reporting resources from both the deleted and active entries of that service.
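
The double counting can be reproduced outside nova with a minimal sqlite3 sketch of the Mitaka-style query over a recycled hostname (simplified, assumed schema; the query mirrors the one quoted earlier in this thread):

# Minimal sqlite3 reproduction of the host-join double counting.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services (
    id INTEGER PRIMARY KEY, host TEXT, "binary" TEXT,
    disabled INTEGER, deleted INTEGER);
CREATE TABLE compute_nodes (
    id INTEGER PRIMARY KEY, service_id INTEGER, host TEXT,
    vcpus INTEGER, deleted INTEGER);
-- 'node1' was service-deleted and then redeployed under the same hostname
INSERT INTO services VALUES (1, 'node1', 'nova-compute', 0, 1); -- deleted
INSERT INTO services VALUES (2, 'node1', 'nova-compute', 0, 0); -- live
-- the live 24-vcpu compute node belongs to the live service
INSERT INTO compute_nodes VALUES (1, 2, 'node1', 24, 0);
""")

mitaka_style = """
SELECT sum(compute_nodes.vcpus) FROM compute_nodes, services
WHERE compute_nodes.deleted = 0 AND services.disabled = 0
  AND services."binary" = 'nova-compute'
  AND (services.host = compute_nodes.host
       OR services.id = compute_nodes.service_id)
"""
print(conn.execute(mitaka_style).fetchone())  # (48,): counted twice
print(conn.execute(mitaka_style +
                   " AND services.deleted = 0").fetchone())  # (24,)

The live compute node matches both the deleted and the live service row through the shared hostname, so its 24 vcpus are summed twice; adding a services.deleted filter collapses the sum back to 24.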

In terms of the nova-compute charm, I think the topic of how it behaves when removing units has come up before, i.e. that it should somehow mark the service/host as deleted when a nova-compute unit is removed. The problem with doing this is that the compute service would need to be provided with admin credentials, since it no longer has direct access to the db. This has been raised previously and that bug is still pending - https://bugs.launchpad.net/charms/+source/nova-compute/+bug/1317560.

Changed in nova (Ubuntu):
status: Incomplete → Confirmed
Changed in nova:
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/518520

Changed in nova:
status: New → In Progress
summary: - hypervisor stats issue after charm removal if nova-compute service not
- disabled first
+ hypervisor stats aggregates resources from deleted and existing services
+ if they share the same hostname
tags: added: sts
James Page (james-page)
Changed in nova (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I appear to have made an oversight here: the problem only exists up to Mitaka, after which it was fixed as part of https://bugs.launchpad.net/nova/+bug/1692397; that fix was backported all the way down to stable/newton (and released to Ubuntu) back in August of this year. So the remaining question is whether we can backport it to Mitaka, which I will now investigate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Edward Hope-Morley (<email address hidden>) on branch: master
Review: https://review.openstack.org/518520
Reason: This is already fixed as part of bug 1692397 and backported as far as Newton. I need this fix for Mitaka so I will look into an SRU for the Ubuntu packages.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Moving to bug 1692397, where I will perform the SRU (and closing this one).

Changed in nova:
status: In Progress → Invalid
Changed in nova (Ubuntu):
status: Triaged → Invalid
Changed in nova:
assignee: Edward Hope-Morley (hopem) → nobody