placement allocation_ratio initialized with 0.0

Bug #1789654 reported by Thomas Goirand
This bug affects 2 people
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  High        Matt Riedemann
Pike                      In Progress   High        Tony Breeds
Queens                    In Progress   High        Tony Breeds
Rocky                     Fix Released  High        Matt Riedemann

Bug Description

After I just finished packaging Rocky, I wanted to test it with puppet-openstack. Then I couldn't boot VMs after the puppet run, because allocation_ratio in placement is set to 0.0 by default:

# openstack resource provider list
+--------------------------------------+-------------------+------------+
| uuid                                 | name              | generation |
+--------------------------------------+-------------------+------------+
| f9716941-356f-4a2e-b5ea-31c3c1630892 | poi.infomaniak.ch | 2          |
+--------------------------------------+-------------------+------------+
# openstack resource provider show f9716941-356f-4a2e-b5ea-31c3c1630892
+------------+--------------------------------------+
| Field      | Value                                |
+------------+--------------------------------------+
| uuid       | f9716941-356f-4a2e-b5ea-31c3c1630892 |
| name       | poi.infomaniak.ch                    |
| generation | 2                                    |
+------------+--------------------------------------+
# openstack resource provider inventory list f9716941-356f-4a2e-b5ea-31c3c1630892
+----------------+----------+-----------+------------------+-------+----------+----------+
| resource_class | reserved | step_size | allocation_ratio | total | min_unit | max_unit |
+----------------+----------+-----------+------------------+-------+----------+----------+
| VCPU           | 0        | 1         | 0.0              | 4     | 1        | 4        |
| DISK_GB        | 0        | 1         | 0.0              | 19    | 1        | 19       |
| MEMORY_MB      | 512      | 1         | 0.0              | 7987  | 1        | 7987     |
+----------------+----------+-----------+------------------+-------+----------+----------+

Later on, setting a correct allocation_ratio fixed the problem:
# openstack resource provider inventory class set --allocation_ratio 16.0 --total 4 f9716941-356f-4a2e-b5ea-31c3c1630892 VCPU
+------------------+------------+
| Field            | Value      |
+------------------+------------+
| max_unit         | 2147483647 |
| min_unit         | 1          |
| step_size        | 1          |
| reserved         | 0          |
| allocation_ratio | 16.0       |
| total            | 4          |
+------------------+------------+
# openstack resource provider inventory list f9716941-356f-4a2e-b5ea-31c3c1630892
+----------------+------------------+----------+------------+-----------+----------+-------+
| resource_class | allocation_ratio | reserved | max_unit   | step_size | min_unit | total |
+----------------+------------------+----------+------------+-----------+----------+-------+
| DISK_GB        | 0.0              | 0        | 19         | 1         | 1        | 19    |
| MEMORY_MB      | 0.0              | 512      | 7987       | 1         | 1        | 7987  |
| VCPU           | 16.0             | 0        | 2147483647 | 1         | 1        | 4     |
+----------------+------------------+----------+------------+-----------+----------+-------+

So, after this, I could boot VMs normally. Clearly, though, allocation_ratio should not be zero by default.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Logs for the failed debian rocky run:

http://logs.openstack.org/75/597175/1/check/puppet-openstack-integration-4-scenario001-tempest-debian-stable-luminous/fd38fcf/logs/

This was also reported by the xenserver CI:

http://lists.openstack.org/pipermail/openstack-dev/2018-August/133896.html

My guess is that the local ProviderTree cache of inventory thinks nothing has changed when we set allocation_ratio on the provider inventory, so it never actually updates the inventory remotely and we're left with the initial values of 0.0.
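
For readers unfamiliar with that cache, here is a minimal, hypothetical sketch of the failure mode being described; the names and structure are illustrative only and do not match nova's real ProviderTree / SchedulerReportClient API:

class CachedInventoryReporter:
    def __init__(self):
        self._cache = {}  # our local view of what placement already has

    def set_inventory(self, provider_uuid, inventory):
        if self._cache.get(provider_uuid) == inventory:
            # The cache says nothing changed, so no PUT is sent to
            # placement. If the cache is wrong (stale), the values stored
            # in placement -- e.g. the initial 0.0 ratios -- never get fixed.
            return False
        print('PUT inventory for %s: %s' % (provider_uuid, inventory))
        self._cache[provider_uuid] = inventory
        return True


reporter = CachedInventoryReporter()
reporter._cache['rp-uuid'] = {'VCPU': {'allocation_ratio': 16.0}}  # stale local view
# Placement actually still has allocation_ratio=0.0, but this call is a no-op:
reporter.set_inventory('rp-uuid', {'VCPU': {'allocation_ratio': 16.0}})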

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/597553

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/597560

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/597613/ should get the nova debug log patch into the xen CI testing.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Got some debug logs from a failed xen CI run:

http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/13/597613/2/check/dsvm-tempest-neutron-network/cc81140/logs/screen-n-cpu.txt.gz

We see that the inventory is updated properly on startup with the correct allocation ratios:

Aug 29 16:56:06.926641 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: INFO nova.compute.resource_tracker [None req-99d9d496-6720-4837-a2ee-560605fd1afe None None] Using cpu_allocation_ratio 16.0 for node: 2f5a2e04-1b61-4437-ab6e-8dbbf797dc07

Aug 29 16:56:07.057208 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: DEBUG nova.compute.provider_tree [None req-99d9d496-6720-4837-a2ee-560605fd1afe None None] Updating resource provider 2f5a2e04-1b61-4437-ab6e-8dbbf797dc07 generation from 0 to 1 during operation: update_inventory {{(pid=24436) _update_generation /opt/stack/new/nova/nova/compute/provider_tree.py:161}}

Aug 29 16:56:07.057499 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: INFO nova.compute.provider_tree [None req-99d9d496-6720-4837-a2ee-560605fd1afe None None] Updating inventory in ProviderTree for provider 2f5a2e04-1b61-4437-ab6e-8dbbf797dc07 with inventory: {'VCPU': {'allocation_ratio': 16.0, 'total': 8, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 8}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}}

But then a bit later the cpu_allocation_ratio is set to 0.0 in the _normalize_inventory_from_cn_obj method:

Aug 29 16:58:05.483508 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: INFO nova.compute.resource_tracker [None req-a869fa19-aa9d-4335-9816-42ff29b64d48 None None] Using cpu_allocation_ratio 0.0 for node: 2f5a2e04-1b61-4437-ab6e-8dbbf797dc07

Aug 29 16:58:05.614421 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: DEBUG nova.scheduler.client.report [None req-a869fa19-aa9d-4335-9816-42ff29b64d48 None None] Updated inventory for 2f5a2e04-1b61-4437-ab6e-8dbbf797dc07 at generation 2: {'VCPU': {'allocation_ratio': 0.0, 'total': 8, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 8}, 'MEMORY_MB': {'allocation_ratio': 0.0, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 0.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}} {{(pid=24436) _update_inventory_attempt /opt/stack/new/nova/nova/scheduler/client/report.py:965}}

So it looks like at some random point the ComputeNode.cpu_allocation_ratio is being set to 0.0, but I don't see any of my debug logging for that in the logs:

https://review.openstack.org/#/c/597560/3/nova/objects/compute_node.py

From the startup logs when it dumps nova-cpu.conf, we see the cpu_allocation_ratio option is not set (default is 0.0):

Aug 29 16:56:03.687759 dsvm-devstack-citrix-mia-nodepool-1379368 nova-compute[24436]: DEBUG oslo_service.service [None req-e4bc9238-0584-459a-aa2c-dc5d425d198a None None] cpu_allocation_ratio = 0.0 {...

Read more...

Revision history for this message
Matt Riedemann (mriedem) wrote :

My working theory is that the ResourceTracker._copy_resources method is setting the ComputeNode.cpu_allocation_ratio (and other alloc ratios) to 0.0 based on config here:

https://github.com/openstack/nova/blob/f534495a427d1683bc536cf003ec02edbf6d8a45/nova/compute/resource_tracker.py#L622

And because of https://review.openstack.org/#/c/520024/ we are no longer calling ComputeNode.save() via ResourceTracker._update (from ResourceTracker._init_compute_node). Without that save(), ComputeNode._from_db_object never runs, and that is what would otherwise fix the 0.0 allocation ratio values up to the hard-coded defaults:

https://github.com/openstack/nova/blob/f534495a427d1683bc536cf003ec02edbf6d8a45/nova/objects/compute_node.py#L188
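
As a simplified, illustrative sketch of the ComputeNode "facade" referred to above (not nova's actual code, just the shape of the behavior): when a record is loaded, as _from_db_object does after create() or save(), 0.0 allocation ratios are replaced with hard-coded defaults.

HARDCODED_DEFAULTS = {
    'cpu_allocation_ratio': 16.0,
    'ram_allocation_ratio': 1.5,
    'disk_allocation_ratio': 1.0,
}

def apply_ratio_facade(node):
    for field, default in HARDCODED_DEFAULTS.items():
        if not node.get(field):  # 0.0 (or unset) means "use the default"
            node[field] = default
    return node

# Without the save() that the periodic path no longer makes, this fix-up
# never runs, so the 0.0 values copied from config stay on the in-memory
# object and end up in the inventory reported to placement.
print(apply_ratio_facade({'cpu_allocation_ratio': 0.0,
                          'ram_allocation_ratio': 0.0,
                          'disk_allocation_ratio': 0.0}))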

I've added more debug logging to the debug patch and will get another xenserver CI run.

Having said this, I'm not sure why this wouldn't be breaking us in the "normal" gate, i.e. tempest-full job.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/598176

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

I also wonder if https://review.openstack.org/#/c/518294/ is somehow contributing to the problem, but that was in Queens. Maybe that, combined with https://review.openstack.org/#/c/520024/, is causing a side effect.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We have a new xen CI run with the more detailed debug logs now, and it basically confirms what I think the problem is; apparently this is a race, though, because this time the xen CI job passed.

Starting with the logs here:

http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/13/597613/2/check/dsvm-tempest-neutron-network/621833d/logs/screen-n-cpu.txt.gz

This is the initial inventory update for the newly created compute node and resource provider on start of nova-compute:

Aug 30 08:02:12.144029 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: DEBUG nova.scheduler.client.report [None req-f4b08178-4b7a-4fba-b57a-91612721f970 None None] Updated inventory for 9c58942c-d183-455a-a760-991e4430e816 at generation 1: {'VCPU': {'allocation_ratio': 16.0, 'total': 4, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 4}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}} {{(pid=24292) _update_inventory_attempt /opt/stack/new/nova/nova/scheduler/client/report.py:967}}

Then when the update_available_resource periodic task runs, we see that _copy_resources updates the in-memory ComputeNode.*_allocation_ratio values to 0.0 from the config:

Aug 30 08:03:09.458344 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] ComputeNode.cpu_allocation_ratio changed from 16.0 to 0.0 in _copy_resources.

And the _resource_change method, called from RT._update, confirms the change:

Aug 30 08:03:09.549234 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Compute node resources have changed.
Aug 30 08:03:09.549407 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: Old: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='{"model": "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz", "vendor": "GenuineIntel", "features": ["fpu", "de", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mca", "cmov", "pat", "clflush", "mmx", "fxsr", "sse", "sse2", "syscall", "nx", "lm", "constant_tsc", "rep_good", "nopl", "pni", "pclmulqdq", "ssse3", "cx16", "sse4_1", "sse4_2", "movbe", "popcnt", "aes", "rdrand", "hypervisor", "lahf_lm", "abm", "fsgsbase", "bmi1", "bmi2", "erms"], "topology": {"cores": 1, "threads": 1, "sockets": 4}}',created_at=2018-08-30T15:02:11Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=27,free_disk_gb=47,free_ram_mb=12283,host='localhost',host_ip=192.168.33.1,hypervisor_hostname='localhost',hypervisor_type='XenServer',hypervisor_version=7001000,id=1,local_gb=47,local_gb_used=0,mapped=0,memory_mb=12795,memory_mb_used=512,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.5,running_vms=0,service_id=None,stats={failed_builds='0'},supported_hv_specs=[HVSpec,HVSpec],updated_at=None,uuid=9c58942c-d183-455a-a760-991e4430e816,vcpus=4,vcpus_used=0)

...

Read more...
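
For context on the "Compute node resources have changed." message above, here is a simplified, self-contained sketch (illustrative only, not nova's _resource_change implementation) of how a field-by-field comparison turns the 0.0 copy into a ComputeNode.save():

TRACKED_FIELDS = ('cpu_allocation_ratio', 'ram_allocation_ratio',
                  'disk_allocation_ratio', 'vcpus', 'memory_mb')


def resource_change(old, new):
    """Return True if any tracked field differs between the cached and the
    freshly copied compute node."""
    return any(old.get(f) != new.get(f) for f in TRACKED_FIELDS)


cached = {'cpu_allocation_ratio': 16.0, 'ram_allocation_ratio': 1.5,
          'disk_allocation_ratio': 1.0, 'vcpus': 4, 'memory_mb': 12795}
copied = dict(cached, cpu_allocation_ratio=0.0, ram_allocation_ratio=0.0,
              disk_allocation_ratio=0.0)  # what _copy_resources produced

if resource_change(cached, copied):
    # In nova this is where ComputeNode.save() runs, persisting the 0.0
    # ratios even though the object facade fixes them up again in memory.
    print('Compute node resources have changed.')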

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/598365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/598176
Reason: Let's hope this works:

https://review.openstack.org/#/c/598365/

Revision history for this message
Eric Fried (efried) wrote :

Anecdotal reports say we were failing PowerVM CI 90% of the time. I'm going to run https://review.openstack.org/#/c/598365/ through a couple of times, but initial results look positive.

Revision history for this message
Matt Riedemann (mriedem) wrote :

According to Rado in https://review.openstack.org/#/c/270116/ this also impacts the vmware CI.

Revision history for this message
Matt Riedemann (mriedem) wrote :

BTW, if the xenserver, powervm and vmware CIs are all failing on this, it's not only intermittent but also seems to be isolated to virt drivers that don't implement the update_provider_tree method, for whatever reason. It must be something in that alternative flow through the scheduler report client for reporting inventory updates to placement. Likely because in the case of the libvirt driver, which does implement update_provider_tree, we're passing the same provider tree that the driver worked on to the scheduler report client, whereas in the non-libvirt case we're passing an inventory dict to the scheduler report client, and the scheduler report client is updating its own view of the provider tree, which might be stale somehow.
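
To make the two flows contrasted above a bit more concrete, here is a rough, hypothetical sketch; the helper names and signatures are assumptions, not nova's exact API:

def report_inventory(driver, report_client, provider_tree, nodename):
    if hasattr(driver, 'update_provider_tree'):
        # libvirt-style path: the driver mutates the very ProviderTree
        # instance the report client will flush, so both sides share a
        # single, consistent view.
        driver.update_provider_tree(provider_tree, nodename)
        report_client.flush_provider_tree(provider_tree)
    else:
        # xenserver/powervm/vmware-style path: the driver hands back a
        # plain inventory dict and the report client reconciles it against
        # its own cached view of the provider tree, which may be stale.
        inventory = driver.get_inventory(nodename)
        report_client.set_inventory(nodename, inventory)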

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/599670

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/599672

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/599673

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/598365
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2588af87c862cfd02d860f6b860381e907b279ff
Submitter: Zuul
Branch: master

commit 2588af87c862cfd02d860f6b860381e907b279ff
Author: Matt Riedemann <email address hidden>
Date: Thu Aug 30 17:57:24 2018 -0400

    Don't persist zero allocation ratios in ResourceTracker

    The ComputeNode object itself has a facade which provides the
    actual default values for cpu, disk and ram allocation ratios
    when the config option values are left at the default (0.0).

    When we initially create a ComputeNode in the ResourceTracker,
    the *_allocation_ratio values in the object are going to be
    unset, and then _copy_resources, called from _init_compute_node,
    will set the values to the config option values, again, defaulted
    to 0.0, but the ComputeNode object, after calling create() or
    save(), will change those *on the object* to the non-0.0 we
    actually know and love (16.0 for cpu, 1.5 for ram, and 1.0 for disk).

    During the update_available_resource periodic, we'll again go
    through _init_compute_node and _copy_resources in the ResourceTracker
    which will set the configured values (default of 0.0) onto the
    ComputeNode object, which makes the _resource_change method, called
    from _update, return True and trigger a ComputeNode.save() call
    from the _update method. At that point we're *persisting* the 0.0
    allocation ratios in the database, even though ComputeNode.save
    will change them to their non-0.0 default values *on the object*
    because of the _from_db_object call at the end of ComputeNode.save.

    So even if the ComputeNode object allocation ratio values are the
    non-0.0 defaults, we'll *always* update the database on every
    periodic even if nothing else changed in inventory.

    This change modifies the _copy_resource method to only update the
    ComputeNode fields if the configured ratios are not the 0.0 default.

    Change-Id: I43a23a3290db0c835fed01b8d6a38962dc61adce
    Related-Bug: #1789654
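
A condensed sketch of the guard this commit describes (illustrative only, not the literal nova diff): a configured ratio is copied onto the ComputeNode only when it is non-zero, so the 0.0 sentinel never overwrites the defaults the object facade already applied.

import dataclasses


@dataclasses.dataclass
class FakeComputeNode:
    cpu_allocation_ratio: float = 16.0
    ram_allocation_ratio: float = 1.5
    disk_allocation_ratio: float = 1.0


CONF_RATIOS = {  # 0.0 is the config default, meaning "not set"
    'cpu_allocation_ratio': 0.0,
    'ram_allocation_ratio': 0.0,
    'disk_allocation_ratio': 0.0,
}


def copy_resources(node):
    for field, configured in CONF_RATIOS.items():
        if configured != 0.0:  # the new guard
            setattr(node, field, configured)
        # else: leave the existing (defaulted) value alone, so nothing
        # changes and no spurious ComputeNode.save() is triggered.


node = FakeComputeNode()
copy_resources(node)
print(node)  # ratios stay 16.0 / 1.5 / 1.0 instead of being reset to 0.0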

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/599670
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c45adaca5dd241408f1e29b657fe6ed42c908b8b
Submitter: Zuul
Branch: master

commit c45adaca5dd241408f1e29b657fe6ed42c908b8b
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 4 12:11:57 2018 -0400

    Document unset/reset wrinkle for *_allocation_ratio options

    This is a follow up to I43a23a3290db0c835fed01b8d6a38962dc61adce
    which makes the cpu/disk/ram_allocation_ratio config "sticky" in
    that once set to a non-default value, it is not possible to reset
    back to the default behavior (when config is 0.0) on an existing
    compute node record by unsetting the option from nova.conf. To
    reset back to the defaults, the non-0.0 default would have to be
    explicitly put into config, so cpu_allocation_ratio=16.0 for example.

    Alternatively operators could delete the nova-compute service
    record via the DELETE /os-services/{service_id} REST API and
    restart the nova-compute service to get a new compute_nodes record,
    but that workaround is messy and left undocumented in config.

    Change-Id: I908615d82ead0f70f8e6d2d78d5dcaed8431084d
    Related-Bug: #1789654
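
A tiny, self-contained sketch of the "sticky" behavior this change documents (illustrative only, not nova's actual code): with the guard in place, a 0.0 config value never overrides whatever ratio is already persisted, so getting back to the defaults requires setting them explicitly in config.

def effective_ratio(persisted_value, configured_value):
    # 0.0 in config means "not set", so keep whatever is already persisted.
    return configured_value if configured_value != 0.0 else persisted_value

print(effective_ratio(persisted_value=32.0, configured_value=0.0))   # 32.0 (sticky)
print(effective_ratio(persisted_value=32.0, configured_value=16.0))  # explicit reset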

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/597560

Revision history for this message
Eric Fried (efried) wrote :
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/599672
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=01265b98c4cd2b1377e891a06ce748fc6f8f3425
Submitter: Zuul
Branch: stable/rocky

commit 01265b98c4cd2b1377e891a06ce748fc6f8f3425
Author: Matt Riedemann <email address hidden>
Date: Thu Aug 30 17:57:24 2018 -0400

    Don't persist zero allocation ratios in ResourceTracker

    The ComputeNode object itself has a facade which provides the
    actual default values for cpu, disk and ram allocation ratios
    when the config option values are left at the default (0.0).

    When we initially create a ComputeNode in the ResourceTracker,
    the *_allocation_ratio values in the object are going to be
    unset, and then _copy_resources, called from _init_compute_node,
    will set the values to the config option values, again, defaulted
    to 0.0, but the ComputeNode object, after calling create() or
    save(), will change those *on the object* to the non-0.0 we
    actually know and love (16.0 for cpu, 1.5 for ram, and 1.0 for disk).

    During the update_available_resource periodic, we'll again go
    through _init_compute_node and _copy_resources in the ResourceTracker
    which will set the configured values (default of 0.0) onto the
    ComputeNode object, which makes the _resource_change method, called
    from _update, return True and trigger a ComputeNode.save() call
    from the _update method. At that point we're *persisting* the 0.0
    allocation ratios in the database, even though ComputeNode.save
    will change them to their non-0.0 default values *on the object*
    because of the _from_db_object call at the end of ComputeNode.save.

    So even if the ComputeNode object allocation ratio values are the
    non-0.0 defaults, we'll *always* update the database on every
    periodic even if nothing else changed in inventory.

    This change modifies the _copy_resource method to only update the
    ComputeNode fields if the configured ratios are not the 0.0 default.

    Change-Id: I43a23a3290db0c835fed01b8d6a38962dc61adce
    Related-Bug: #1789654
    (cherry picked from commit 2588af87c862cfd02d860f6b860381e907b279ff)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/599673
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a039f8397702d15718ebcec0fdb9cfeb6155f6a1
Submitter: Zuul
Branch: stable/rocky

commit a039f8397702d15718ebcec0fdb9cfeb6155f6a1
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 4 12:11:57 2018 -0400

    Document unset/reset wrinkle for *_allocation_ratio options

    This is a follow up to I43a23a3290db0c835fed01b8d6a38962dc61adce
    which makes the cpu/disk/ram_allocation_ratio config "sticky" in
    that once set to a non-default value, it is not possible to reset
    back to the default behavior (when config is 0.0) on an existing
    compute node record by unsetting the option from nova.conf. To
    reset back to the defaults, the non-0.0 default would have to be
    explicitly put into config, so cpu_allocation_ratio=16.0 for example.

    Alternatively operators could delete the nova-compute service
    record via the DELETE /os-services/{service_id} REST API and
    restart the nova-compute service to get a new compute_nodes record,
    but that workaround is messy and left undocumented in config.

    Change-Id: I908615d82ead0f70f8e6d2d78d5dcaed8431084d
    Related-Bug: #1789654
    (cherry picked from commit c45adaca5dd241408f1e29b657fe6ed42c908b8b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/597553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9c842d1aa68d67383114f783c946492282832e5b
Submitter: Zuul
Branch: master

commit 9c842d1aa68d67383114f783c946492282832e5b
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 29 10:41:25 2018 -0400

    Log the operation when updating generation in ProviderTree

    Today when a provider generation is updated we get a mostly
    unhelpful message in the logs like:

      Updating resource provider bc0f4c8e-96e2-4e68-a06e-f7e0ac9aac6b
      generation from 0 to 1

    What we really want to know with that is in what context did the
    generation change, i.e. did inventory or traits change?

    This adds the actual operation to the log message when generation
    changes.

    Change-Id: I9b61f1dfb8db06e02ff79e19c055c673094a4ed2
    Related-Bug: #1789654

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/597560
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ade240b392563914c1bddc228a3eb00585177d4c
Submitter: Zuul
Branch: master

commit ade240b392563914c1bddc228a3eb00585177d4c
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 29 10:56:09 2018 -0400

    Add debug logs for when provider inventory changes

    Because of the ProviderTree caching involved in both
    the ResourceTracker and SchedulerReportClient, it can
    be hard to know if inventory changes are being correctly
    reported to the placement service as expected, especially
    with things like configurable allocation ratios. This
    adds debug logs to the SchedulerReportClient and
    ProviderTree to determine if inventory has changed and
    when it's flushed back to placement.

    Change-Id: Ia6ab3ab4dea08b479bb6b794f408fd2e6f678c50
    Related-Bug: #1789654

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/613263

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/613271

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/647291

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/647292

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/647291
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7a3b072442db74418fa7fe626c637627d219c71f
Submitter: Zuul
Branch: stable/queens

commit 7a3b072442db74418fa7fe626c637627d219c71f
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 4 12:11:57 2018 -0400

    Document unset/reset wrinkle for *_allocation_ratio options

    This is a follow up to I43a23a3290db0c835fed01b8d6a38962dc61adce
    which makes the cpu/disk/ram_allocation_ratio config "sticky" in
    that once set to a non-default value, it is not possible to reset
    back to the default behavior (when config is 0.0) on an existing
    compute node record by unsetting the option from nova.conf. To
    reset back to the defaults, the non-0.0 default would have to be
    explicitly put into config, so cpu_allocation_ratio=16.0 for example.

    Alternatively operators could delete the nova-compute service
    record via the DELETE /os-services/{service_id} REST API and
    restart the nova-compute service to get a new compute_nodes record,
    but that workaround is messy and left undocumented in config.

    Change-Id: I908615d82ead0f70f8e6d2d78d5dcaed8431084d
    Related-Bug: #1789654
    (cherry picked from commit c45adaca5dd241408f1e29b657fe6ed42c908b8b)
    (cherry picked from commit a039f8397702d15718ebcec0fdb9cfeb6155f6a1)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/647292
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f8c0c4671c0b4967d9ccf9d84df575475fe3dddc
Submitter: Zuul
Branch: stable/pike

commit f8c0c4671c0b4967d9ccf9d84df575475fe3dddc
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 4 12:11:57 2018 -0400

    Document unset/reset wrinkle for *_allocation_ratio options

    This is a follow up to I43a23a3290db0c835fed01b8d6a38962dc61adce
    which makes the cpu/disk/ram_allocation_ratio config "sticky" in
    that once set to a non-default value, it is not possible to reset
    back to the default behavior (when config is 0.0) on an existing
    compute node record by unsetting the option from nova.conf. To
    reset back to the defaults, the non-0.0 default would have to be
    explicitly put into config, so cpu_allocation_ratio=16.0 for example.

    Alternatively operators could delete the nova-compute service
    record via the DELETE /os-services/{service_id} REST API and
    restart the nova-compute service to get a new compute_nodes record,
    but that workaround is messy and left undocumented in config.

    Change-Id: I908615d82ead0f70f8e6d2d78d5dcaed8431084d
    Related-Bug: #1789654
    (cherry picked from commit c45adaca5dd241408f1e29b657fe6ed42c908b8b)
    (cherry picked from commit a039f8397702d15718ebcec0fdb9cfeb6155f6a1)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/613271
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=705284e18c3a9f2c93ecb15bd1bfee5d7710b684
Submitter: Zuul
Branch: stable/queens

commit 705284e18c3a9f2c93ecb15bd1bfee5d7710b684
Author: Matt Riedemann <email address hidden>
Date: Thu Aug 30 17:57:24 2018 -0400

    Don't persist zero allocation ratios in ResourceTracker

    The ComputeNode object itself has a facade which provides the
    actual default values for cpu, disk and ram allocation ratios
    when the config option values are left at the default (0.0).

    When we initially create a ComputeNode in the ResourceTracker,
    the *_allocation_ratio values in the object are going to be
    unset, and then _copy_resources, called from _init_compute_node,
    will set the values to the config option values, again, defaulted
    to 0.0, but the ComputeNode object, after calling create() or
    save(), will change those *on the object* to the non-0.0 we
    actually know and love (16.0 for cpu, 1.5 for ram, and 1.0 for disk).

    During the update_available_resource periodic, we'll again go
    through _init_compute_node and _copy_resources in the ResourceTracker
    which will set the configured values (default of 0.0) onto the
    ComputeNode object, which makes the _resource_change method, called
    from _update, return True and trigger a ComputeNode.save() call
    from the _update method. At that point we're *persisting* the 0.0
    allocation ratios in the database, even though ComputeNode.save
    will change them to their non-0.0 default values *on the object*
    because of the _from_db_object call at the end of ComputeNode.save.

    So even if the ComputeNode object allocation ratio values are the
    non-0.0 defaults, we'll *always* update the database on every
    periodic even if nothing else changed in inventory.

    This change modifies the _copy_resource method to only update the
    ComputeNode fields if the configured ratios are not the 0.0 default.

    Change-Id: I43a23a3290db0c835fed01b8d6a38962dc61adce
    Related-Bug: #1789654
    (cherry picked from commit 2588af87c862cfd02d860f6b860381e907b279ff)
    (cherry picked from commit 01265b98c4cd2b1377e891a06ce748fc6f8f3425)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/pike)

Change abandoned by melanie witt (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/613263
Reason: Abandoning per rado.
