Comment 9 for bug 1789654

Revision history for this message
Matt Riedemann (mriedem) wrote :

We have a new xen CI run with the more detailed debug logs now and it's basically confirming what I think the problem was, but apparently this is a race because this time the xen CI job passed.

Starting with the logs here:

http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/13/597613/2/check/dsvm-tempest-neutron-network/621833d/logs/screen-n-cpu.txt.gz

This is the initial inventory update for the newly created compute node and resource provider on start of nova-compute:

Aug 30 08:02:12.144029 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: DEBUG nova.scheduler.client.report [None req-f4b08178-4b7a-4fba-b57a-91612721f970 None None] Updated inventory for 9c58942c-d183-455a-a760-991e4430e816 at generation 1: {'VCPU': {'allocation_ratio': 16.0, 'total': 4, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 4}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}} {{(pid=24292) _update_inventory_attempt /opt/stack/new/nova/nova/scheduler/client/report.py:967}}

Then when the update_available_resource periodic task runs, we see that _copy_resources updates the in-memory ComputeNode.*_allocation_ratio values to 0.0 from the config:

Aug 30 08:03:09.458344 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] ComputeNode.cpu_allocation_ratio changed from 16.0 to 0.0 in _copy_resources.

And the _resource_change method, called from RT._update, confirms the change:

Aug 30 08:03:09.549234 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Compute node resources have changed.
Aug 30 08:03:09.549407 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: Old: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='{"model": "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz", "vendor": "GenuineIntel", "features": ["fpu", "de", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mca", "cmov", "pat", "clflush", "mmx", "fxsr", "sse", "sse2", "syscall", "nx", "lm", "constant_tsc", "rep_good", "nopl", "pni", "pclmulqdq", "ssse3", "cx16", "sse4_1", "sse4_2", "movbe", "popcnt", "aes", "rdrand", "hypervisor", "lahf_lm", "abm", "fsgsbase", "bmi1", "bmi2", "erms"], "topology": {"cores": 1, "threads": 1, "sockets": 4}}',created_at=2018-08-30T15:02:11Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=27,free_disk_gb=47,free_ram_mb=12283,host='localhost',host_ip=192.168.33.1,hypervisor_hostname='localhost',hypervisor_type='XenServer',hypervisor_version=7001000,id=1,local_gb=47,local_gb_used=0,mapped=0,memory_mb=12795,memory_mb_used=512,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.5,running_vms=0,service_id=None,stats={failed_builds='0'},supported_hv_specs=[HVSpec,HVSpec],updated_at=None,uuid=9c58942c-d183-455a-a760-991e4430e816,vcpus=4,vcpus_used=0)

Aug 30 08:03:09.549912 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: New: ComputeNode(cpu_allocation_ratio=0.0,cpu_info='{"model": "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz", "vendor": "GenuineIntel", "features": ["fpu", "de", "tsc", "msr", "pae", "mce", "cx8", "apic", "sep", "mca", "cmov", "pat", "clflush", "mmx", "fxsr", "sse", "sse2", "syscall", "nx", "lm", "constant_tsc", "rep_good", "nopl", "pni", "pclmulqdq", "ssse3", "cx16", "sse4_1", "sse4_2", "movbe", "popcnt", "aes", "rdrand", "hypervisor", "lahf_lm", "abm", "fsgsbase", "bmi1", "bmi2", "erms"], "topology": {"cores": 1, "threads": 1, "sockets": 4}}',created_at=2018-08-30T15:02:11Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=0.0,disk_available_least=27,free_disk_gb=47,free_ram_mb=12283,host='localhost',host_ip=192.168.33.1,hypervisor_hostname='localhost',hypervisor_type='XenServer',hypervisor_version=7001000,id=1,local_gb=47,local_gb_used=0,mapped=0,memory_mb=12795,memory_mb_used=512,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=0.0,running_vms=0,service_id=None,stats={failed_builds='0'},supported_hv_specs=[HVSpec,HVSpec],updated_at=2018-08-30T15:02:11Z,uuid=9c58942c-d183-455a-a760-991e4430e816,vcpus=4,vcpus_used=0)

Now while that happens, I see a ProviderTree.update_inventory call, and I'm not sure from where:

Aug 30 08:03:09.614402 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.provider_tree [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Updating inventory in ProviderTree for provider 9c58942c-d183-455a-a760-991e4430e816 with inventory: {u'VCPU': {u'allocation_ratio': 16.0, u'total': 4, u'reserved': 0, u'step_size': 1, u'min_unit': 1, u'max_unit': 4}, u'MEMORY_MB': {u'allocation_ratio': 1.5, u'total': 12795, u'reserved': 512, u'step_size': 1, u'min_unit': 1, u'max_unit': 12795}, u'DISK_GB': {u'allocation_ratio': 1.0, u'total': 47, u'reserved': 0, u'step_size': 1, u'min_unit': 1, u'max_unit': 47}}

Back to the update_available_resource periodic thread, because RT._resource_changed returned True, we called ComputeNode.save() which will call ComputeNode._from_db_object which will reset the ComputeNode.cpu_allocation_ratio from 0.0 to 16.0, and then we pass that ComputeNode object to the RT._normalize_inventory_from_cn_obj method:

Aug 30 08:03:10.089986 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Begin _normalize_inventory_from_cn_obj for node 9c58942c-d183-455a-a760-991e4430e816
Aug 30 08:03:10.090256 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] Using cpu_allocation_ratio 16.0 for node: 9c58942c-d183-455a-a760-991e4430e816
Aug 30 08:03:10.090512 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.compute.resource_tracker [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] RT: Sending compute node inventory changes back to placement for node: 9c58942c-d183-455a-a760-991e4430e816

And in there we are setting cpu_allocation_ratio in the VCPU inventory dict properly to 16.0.

And because of that, we don't update the inventory and change the cpu_allocation_ratio in placement from 16.0 to 0.0:

Aug 30 08:03:10.170157 dsvm-devstack-citrix-lon-nodepool-1379396 nova-compute[24292]: INFO nova.scheduler.client.report [None req-9b1b9924-b89e-4a03-9a69-c9fff17594e3 None None] (Local) inventory has not changed for provider 9c58942c-d183-455a-a760-991e4430e816 based on inventory data: {'VCPU': {'allocation_ratio': 16.0, 'total': 4, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 4}, 'MEMORY_MB': {'allocation_ratio': 1.5, 'total': 12795, 'reserved': 512, 'step_size': 1, 'min_unit': 1, 'max_unit': 12795}, 'DISK_GB': {'allocation_ratio': 1.0, 'total': 47, 'reserved': 0, 'step_size': 1, 'min_unit': 1, 'max_unit': 47}}

But there is definitely something going on here with shared ProviderTree cache, I'm just not sure where.