hypervisor stats aggregates resources from deleted and existing services if they share the same hostname

Bug #1719770 reported by Drew Freiberger
This bug affects 2 people
Affects                       Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)      Invalid  Undecided   Unassigned
OpenStack Nova Compute Charm  Invalid  Undecided   Unassigned
nova (Ubuntu)                 Invalid  Medium      Unassigned

Bug Description

In an environment with 592 physical threads (lscpu | grep '^CPU.s' and 'openstack hypervisor show -f value -c vcpus' both show the correct counts), I am seeing 712 vcpus in the hypervisor stats. The memory_mb and other aggregate stats are likely inflated by the same issue.

Querying the nova services DB table, I see: http://pastebin.ubuntu.com/25624553/

It appears that, of the six machines marked deleted in the services table, only one is also marked disabled.

Digging through the nova/db/sqlalchemy/api.py code, the hypervisor stats query filters on Service.disabled == false() and Service.binary == 'nova-compute', but I don't see it filtering on deleted == 0.

I'm not certain of the exact timeline of the uninstall and reinstall of the nova-compute units on the 6 x 24-vcpu servers (see the *-ST-{1,2} nova-compute services) that caused these services not to get disabled, but the nova API for hypervisor stats might be well served to filter out deleted services as well as disabled ones; alternatively, if a deleted service should always be disabled, nova service-delete should also set the disabled flag for the service.
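
For illustration, here is a minimal sketch of the first suggestion. It is hypothetical, not merged code: it reuses 'aggregates' as a stand-in for the column tuple and '_filter' for the OR condition from the compute_node_statistics() excerpt quoted later in this thread, and simply adds a Service.deleted filter.

# Hypothetical sketch only: the Mitaka-era compute_node_statistics()
# query (full excerpt below) with one extra filter. 'aggregates' stands
# for the func.count()/func.sum() tuple and '_filter' for the
# host/service_id OR condition shown in that excerpt.
result = model_query(context, models.ComputeNode, aggregates,
                     read_deleted="no").\
                     filter(models.Service.disabled == false()).\
                     filter(models.Service.binary == "nova-compute").\
                     filter(models.Service.deleted == 0).\
                     filter(_filter).\
                     first()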

These services and compute_nodes do not show up in openstack hypervisor list.

Site is running up-to-date Xenial/Mitaka on openstack-charmers 17.02.

Revision history for this message
James Page (james-page) wrote :

That's odd, as most queries do:

  model_query(context, models.Service, read_deleted="no")
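
For context, a simplified sketch of how that helper behaves; this is illustrative only and assumes the broad shape of nova/db/sqlalchemy/api.py (the real model_query also handles sessions, project scoping and joins):

# Simplified, assumption-laden sketch of nova's model_query helper.
def model_query(context, model, fields=None, read_deleted="no"):
    query = context.session.query(*fields) if fields \
        else context.session.query(model)
    if read_deleted == "no":
        # Note: this constrains only the *primary* model's own deleted
        # column; tables pulled in via joins are not filtered.
        query = query.filter(model.deleted == 0)
    elif read_deleted == "only":
        query = query.filter(model.deleted != 0)
    # read_deleted == "yes": no soft-delete filter at all
    return query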

Revision history for this message
James Page (james-page) wrote :

Including the stats call:

@pick_context_manager_reader
def compute_node_statistics(context):
    """Compute statistics over all compute nodes."""

    # TODO(sbauza): Remove the service_id filter in a later release
    # once we are sure that all compute nodes report the host field
    _filter = or_(models.Service.host == models.ComputeNode.host,
                  models.Service.id == models.ComputeNode.service_id)

    result = model_query(context,
                         models.ComputeNode, (
                             func.count(models.ComputeNode.id),
                             func.sum(models.ComputeNode.vcpus),
                             func.sum(models.ComputeNode.memory_mb),
                             func.sum(models.ComputeNode.local_gb),
                             func.sum(models.ComputeNode.vcpus_used),
                             func.sum(models.ComputeNode.memory_mb_used),
                             func.sum(models.ComputeNode.local_gb_used),
                             func.sum(models.ComputeNode.free_ram_mb),
                             func.sum(models.ComputeNode.free_disk_gb),
                             func.sum(models.ComputeNode.current_workload),
                             func.sum(models.ComputeNode.running_vms),
                             func.sum(models.ComputeNode.disk_available_least),
                         ), read_deleted="no").\
                         filter(models.Service.disabled == false()).\
                         filter(models.Service.binary == "nova-compute").\
                         filter(_filter).\
                         first()

    # Build a dict of the info--making no assumptions about result
    fields = ('count', 'vcpus', 'memory_mb', 'local_gb', 'vcpus_used',
              'memory_mb_used', 'local_gb_used', 'free_ram_mb', 'free_disk_gb',
              'current_workload', 'running_vms', 'disk_available_least')
    return {field: int(result[idx] or 0)
            for idx, field in enumerate(fields)}

Revision history for this message
James Page (james-page) wrote :

This code has had some significant refactoring since Mitaka, so it's possible this only impacts older OpenStack releases.

Either way, this is not a charm problem but a Nova issue AFAICT - raising a distro task.

Changed in charm-nova-compute:
status: New → Invalid
Revision history for this message
Drew Freiberger (afreiberger) wrote :

In models.py, both in Mitaka and on master, I've found that the relation between ComputeNode and Service uses the following join in the Service context:

        primaryjoin='and_(Service.host == Instance.host,'
                    'Service.binary == "nova-compute",'
                    'Instance.deleted == 0)',

In my case, since I redeployed a deleted node under the same hostname (Service.host), this join relates a deleted ComputeNode.host entry to the non-deleted Service.host entry.

Looking at both my compute_nodes and services tables, it appears they should potentially be joined on the "id" field rather than the "host" field, at least for this specific query, but that could break the Service object relation model for other query contexts, such as listing the instances running on a hypervisor.
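
A hypothetical sketch of that stricter join, reusing the _filter name from the compute_node_statistics() excerpt above (not merged code; note the TODO in that excerpt says the service_id path was meant to be removed once all compute nodes report the host field, so this trades one compatibility problem for another):

# Hypothetical: match services to compute nodes strictly by service_id,
# so a live service can no longer pick up a soft-deleted compute_nodes
# row that merely shares its hostname (sketch only).
_filter = models.Service.id == models.ComputeNode.service_id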

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I cannot reproduce the problem; the generated SQL uses 'compute_nodes.deleted = 0' to filter deleted records, as the following debug info shows (Mitaka).

(Pdb) <oslo_db.sqlalchemy.orm.Query object at 0x7f97672b9810>
(Pdb) 'SELECT count(compute_nodes.id) AS count_1, sum(compute_nodes.vcpus) AS sum_1,
       sum(compute_nodes.memory_mb) AS sum_2, sum(compute_nodes.local_gb) AS sum_3,
       sum(compute_nodes.vcpus_used) AS sum_4, sum(compute_nodes.memory_mb_used) AS sum_5,
       sum(compute_nodes.local_gb_used) AS sum_6, sum(compute_nodes.free_ram_mb) AS sum_7,
       sum(compute_nodes.free_disk_gb) AS sum_8, sum(compute_nodes.current_workload) AS sum_9,
       sum(compute_nodes.running_vms) AS sum_10, sum(compute_nodes.disk_available_least) AS sum_11
       FROM compute_nodes, services
       WHERE compute_nodes.deleted = :deleted_1 AND services.disabled = 0
         AND services."binary" = :binary_1
         AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id)'

This is the result of running the above SQL directly in MySQL; everything looks OK.

mysql> SELECT count(compute_nodes.id) AS count_1, sum(compute_nodes.vcpus) AS sum_1, sum(compute_nodes.memory_mb) AS sum_2, sum(compute_nodes.local_gb) AS sum_3, sum(compute_nodes.vcpus_used) AS sum_4, sum(compute_nodes.memory_mb_used) AS sum_5, sum(compute_nodes.local_gb_used) AS sum_6, sum(compute_nodes.free_ram_mb) AS sum_7, sum(compute_nodes.free_disk_gb) AS sum_8, sum(compute_nodes.current_workload) AS sum_9, sum(compute_nodes.running_vms) AS sum_10, sum(compute_nodes.disk_available_least) AS sum_11 FROM compute_nodes, services WHERE compute_nodes.deleted = 0 AND services.disabled = 0 AND services.binary = 'nova-compute' AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id);
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
| count_1 | sum_1 | sum_2 | sum_3 | sum_4 | sum_5 | sum_6 | sum_7 | sum_8 | sum_9 | sum_10 | sum_11 |
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
|       2 |     4 |  7902 |    76 |     0 |  1024 |     0 |  6878 |    76 |     0 |      0 |     72 |
+---------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
1 row in set (0.00 sec)

mysql> SELECT sum(compute_nodes.vcpus) FROM compute_nodes, services WHERE compute_nodes.deleted = 0 AND services.disabled = 0 AND services.binary = 'nova-compute' AND (services.host = compute_nodes.host OR services.id = compute_nodes.service_id);
+--------------------------+
| sum(compute_nodes.vcpus) |
+--------------------------+
|                        4 |
+--------------------------+
1 row in set (0.00 sec)

Below are steps I used to create test env:

1, There are 3 nova-compute nodes initially.

2, Then use 'openstack compute service delete 10' to delete one compute service.

3, 'select * from services where id=10' shows that the deleted field of this record is no longer 0.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

Marking 'Incomplete' for now until we get a reproducer figured out.

Changed in nova (Ubuntu):
status: Confirmed → Incomplete
importance: Undecided → Medium
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I'm still working on reproducing. While attempting reproduction, I had an environment with 3 hosts, a dummy ubuntu charm on each, and then added nova-compute to all 3. I removed the nova-compute unit from the third host and still saw its stats in hypervisor-stats. There may be some cleanup missing in the charm-nova-compute relation-departed hooks to disable/remove the service.

http://pastebin.ubuntu.com/25824195/

I think reproducing it might require a full remove-machine, then add-unit ubuntu --to <machine-name>, then add-unit nova-compute --to <new machine number>.

Ed (~dosaboy) may be working on this reproduction.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi @afreiberger, I've run a test using our charms and Xenial Mitaka; here are the results - http://pastebin.ubuntu.com/25879833/

A few points for clarification:

When you run 'juju remove-unit nova-compute/0', the charm performs no cleanup at all, i.e. it does not notify the cloud controller that the node is to be disabled or deleted. So what you are left with following a remove-unit is a compute node whose database state represents it as "State:down", "disabled:0" and "deleted:0". Nova is therefore correctly still counting the compute resources associated with that node as available.

If, after removing the unit, you then issue an 'openstack compute service delete <id>', the node is marked as "deleted:<id>" and its resources are no longer counted as available (as can be clearly seen in the pastebin output I provided above).

I am still verifying whether entries with the same hostname but different deleted status are skewing the stats and will report back once I've confirmed. In any case, the findings so far hopefully show that you always need to manually delete a compute host using the API after removing it with Juju.
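
As a rough illustration of that manual cleanup, a hedged python-novaclient sketch (the keystoneauth1 session setup is assumed and omitted, and treating every 'down' nova-compute service as stale is an illustrative simplification):

# Hedged sketch: after 'juju remove-unit', delete stale nova-compute
# services so their resources stop being counted. 'sess' is an assumed
# pre-built keystoneauth1 session with admin credentials.
from novaclient import client

nova = client.Client('2.1', session=sess)
for svc in nova.services.list(binary='nova-compute'):
    if svc.state == 'down':
        print('deleting stale service %s on %s' % (svc.id, svc.host))
        nova.services.delete(svc.id)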

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Bug Confirmed - http://paste.ubuntu.com/25880271/

Deploying nova-compute to a host that previously had nova-compute deployed to it (i.e. a recycled hostname) will result in nova hypervisor stats reporting resources from both the deleted and active entries of that service.
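
The double counting can be reproduced outside nova with a minimal sqlite3 sketch of the Mitaka-style query over a recycled hostname (simplified, assumed schema; the query mirrors the one quoted earlier in this thread):

# Minimal sqlite3 reproduction of the host-join double counting.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services (
    id INTEGER PRIMARY KEY, host TEXT, "binary" TEXT,
    disabled INTEGER, deleted INTEGER);
CREATE TABLE compute_nodes (
    id INTEGER PRIMARY KEY, service_id INTEGER, host TEXT,
    vcpus INTEGER, deleted INTEGER);
-- 'node1' was service-deleted and then redeployed under the same hostname
INSERT INTO services VALUES (1, 'node1', 'nova-compute', 0, 1); -- deleted
INSERT INTO services VALUES (2, 'node1', 'nova-compute', 0, 0); -- live
-- the live 24-vcpu compute node belongs to the live service
INSERT INTO compute_nodes VALUES (1, 2, 'node1', 24, 0);
""")

mitaka_style = """
SELECT sum(compute_nodes.vcpus) FROM compute_nodes, services
WHERE compute_nodes.deleted = 0 AND services.disabled = 0
  AND services."binary" = 'nova-compute'
  AND (services.host = compute_nodes.host
       OR services.id = compute_nodes.service_id)
"""
print(conn.execute(mitaka_style).fetchone())  # (48,): counted twice
print(conn.execute(mitaka_style +
                   " AND services.deleted = 0").fetchone())  # (24,)

The live compute node matches both the deleted and the live service row through the shared hostname, so its 24 vcpus are summed twice; adding a services.deleted filter collapses the sum back to 24.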

In terms of the nova-compute charm, I think the topic of how it behaves when removing units has come up before, i.e. that it should somehow mark the service/host as deleted when a nova-compute unit is removed. The problem with doing this is that the compute service would need to be provided with admin credentials, since it no longer has direct access to the db. This has been raised previously and that bug is still pending - https://bugs.launchpad.net/charms/+source/nova-compute/+bug/1317560.

Changed in nova (Ubuntu):
status: Incomplete → Confirmed
Changed in nova:
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/518520

Changed in nova:
status: New → In Progress
summary: - hypervisor stats issue after charm removal if nova-compute service not
- disabled first
+ hypervisor stats aggregates resources from deleted and existing services
+ if they share the same hostname
tags: added: sts
James Page (james-page)
Changed in nova (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I appear to have made an oversight here: the problem only exists up to Mitaka, after which it was fixed as part of https://bugs.launchpad.net/nova/+bug/1692397; that fix was backported all the way down to stable/newton (and released to Ubuntu) back in August of this year. So the remaining question is whether we can backport it to Mitaka, which I will now investigate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Edward Hope-Morley (<email address hidden>) on branch: master
Review: https://review.openstack.org/518520
Reason: This is already fixed as part of bug 1692397 and backported as far as Newton. I need this fix for Mitaka so I will look into an SRU for the Ubuntu packages.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Moving to bug 1692397, where I will perform the SRU (and closing this one).

Changed in nova:
status: In Progress → Invalid
Changed in nova (Ubuntu):
status: Triaged → Invalid
Changed in nova:
assignee: Edward Hope-Morley (hopem) → nobody