numatopology filter incorrectly returns no resources

Bug #1519878 reported by Serguei Bezverkhi
This bug affects 1 person
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid  Undecided   Unassigned

Bug Description

When launching a new instance, the NUMATopologyFilter in some cases returns no available compute nodes, even though the numa_topology column in the compute_nodes table shows sufficient resources to satisfy the request.

I started three instances; the attached log shows the resulting changes in numa_topology. When I try to start a 4th instance, which requests 4 vCPUs, the NUMATopologyFilter incorrectly returns 0 hosts, even though numa_topology shows 8 vCPUs left. If I delete the existing instances, I can launch the failed one without any modification.

rpm -qa | grep nova
openstack-nova-conductor-12.0.0-1.el7.noarch
python-novaclient-2.30.1-1.el7.noarch
openstack-nova-console-12.0.0-1.el7.noarch
openstack-nova-common-12.0.0-1.el7.noarch
openstack-nova-scheduler-12.0.0-1.el7.noarch
openstack-nova-compute-12.0.0-1.el7.noarch
python-nova-12.0.0-1.el7.noarch
openstack-nova-novncproxy-12.0.0-1.el7.noarch
openstack-nova-api-12.0.0-1.el7.noarch
openstack-nova-cert-12.0.0-1.el7.noarch

Serguei Bezverkhi (sbezverk) wrote :
Serguei Bezverkhi (sbezverk) wrote :

Nova-api and nova-scheduler logs

Serguei Bezverkhi (sbezverk) wrote :
Serguei Bezverkhi (sbezverk) wrote :
Serguei Bezverkhi (sbezverk) wrote :
Nikola Đipanov (ndipanov) wrote :

I am looking into what might be happening here. Just for clarity, since I pinged the reporter on IRC: the relevant and most complete dumps of DB data are in comment #3 (compute nodes) and comment #5 (instances). Feel free to disregard the other attachments, as they were trial and error to make sure we have the needed data.

Nikola Đipanov (ndipanov) wrote :

Hey, so I was not able to reproduce this. The Nova pinning logic seems to pass with this data (I was testing with the 12.0.0 tag checked out); see the attached patch.

Are you sure that it's the NUMA topology filter that was failing? It might be best to turn on debug logging in the scheduler and make sure that it's the NUMATopologyFilter that is returning 0 hosts.

Nikola Đipanov (ndipanov) wrote :
Changed in nova:
status: New → Incomplete
Serguei Bezverkhi (sbezverk) wrote : RE: [Bug 1519878] Re: numatopology filter incorrectly returns no resources

Check out this log; it clearly states that the issue is the NUMA topology requirements.

2015-11-26 10:08:39.604 2899 DEBUG nova.scheduler.filters.numa_topology_filter [req-9688c526-0f54-43f3-ba1a-7fe1d1bc63d4 d525cf27fd9c4782a20363f65bed9795 f77fb93ac01c488f8cfd1eb4ebe7c2f0 - - -] sbezverk-osp-4.sbezverk.cisco.com, sbezverk-osp-4.sbezverk.cisco.com fails NUMA topologyrequirements. The instance does not fit on this host. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/numa_topology_filter.py:48

2015-11-26 10:08:39.605 2899 INFO nova.filters [req-9688c526-0f54-43f3-ba1a-7fe1d1bc63d4 d525cf27fd9c4782a20363f65bed9795 f77fb93ac01c488f8cfd1eb4ebe7c2f0 - - -] Filter NUMATopologyFilter returned 0 hosts
2015-11-26 10:08:39.605 2899 DEBUG nova.filters [req-9688c526-0f54-43f3-ba1a-7fe1d1bc63d4 d525cf27fd9c4782a20363f65bed9795 f77fb93ac01c488f8cfd1eb4ebe7c2f0 - - -] Filtering removed all hosts for the request with reservation ID 'r-tfsr0m79' and instance ID 'd4643825-0893-45b9-904d-b5a6bbd1ec30'. Filter results: [('RetryFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('AvailabilityZoneFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('RamFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('ComputeFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('ComputeCapabilitiesFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('ImagePropertiesFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('ServerGroupAntiAffinityFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('ServerGroupAffinityFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('PciPassthroughFilter', [(u'sbezverk-osp-4.sbezverk.cisco.com', u'sbezverk-osp-4.sbezverk.cisco.com')]), ('NUMATopologyFilter', None)] get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:122
2015-11-26 10:08:39.605 2899 INFO nova.filters [req-9688c526-0f54-43f3-ba1a-7fe1d1bc63d4 d525cf27fd9c4782a20363f65bed9795 f77fb93ac01c488f8cfd1eb4ebe7c2f0 - - -] Filtering removed all hosts for the request with reservation ID 'r-tfsr0m79' and instance ID 'd4643825-0893-45b9-904d-b5a6bbd1ec30'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)', 'RamFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 'ServerGroupAffinityFilter: (start: 1, end: 1)', 'PciPassthroughFilter: (start: 1, end: 1)', 'NUMATopologyFilter: (start: 1, end: 0)']
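The INFO summary above lists, for each filter, how many hosts entered and left it. A minimal sketch for picking out the filter that eliminated the last host from such a line follows; the helper name `first_eliminating_filter` and the regular expression are illustrative assumptions, not part of Nova, and the sample line is abbreviated from the log above.

```python
import re

# Matches entries such as 'RamFilter: (start: 1, end: 1)' in the summary line.
ENTRY = re.compile(r"(\w+): \(start: (\d+), end: (\d+)\)")


def first_eliminating_filter(log_line):
    """Return (filter_name, hosts_before) for the first filter that
    received at least one host and returned none, or None if no
    filter in the line did so. Hypothetical helper, not a Nova API."""
    for name, start, end in ENTRY.findall(log_line):
        if int(start) > 0 and int(end) == 0:
            return name, int(start)
    return None


# Abbreviated copy of the summary line from the scheduler log.
line = (
    "Filter results: ['RetryFilter: (start: 1, end: 1)', "
    "'PciPassthroughFilter: (start: 1, end: 1)', "
    "'NUMATopologyFilter: (start: 1, end: 0)']"
)
print(first_eliminating_filter(line))
```

This only confirms which filter rejected the host; it says nothing about why the filter rejected it.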

-----Original Message-----
From: <email address hidden> [mailto:<email address hidden>] On Behalf Of Nikola Ðipanov
Sent: Thursday, November 26, 2015 11:04 AM
To: Serguei Bezverkhi (sbezverk) <email address hidden>
Subject: [Bug 1519878] Re: numatopology filter incorrectly returns no resources

Hey so I was not able to reproduce this - the Nova pin...

Serguei Bezverkhi (sbezverk) wrote :

The attached patch is for "test" components, which are not installed or used in my environment. Could you build a patch for "production"?

Changed in nova:
status: Incomplete → New
Nikola Đipanov (ndipanov) wrote :

Hey Serguei - so the patch I linked is only a test. It was the quickest way for me to confirm that the NUMA fitting logic itself is not flawed with the exact data it would use in your case. It does not seem to be flawed on my fresh checkout of 12.0.0.

If you look at the test added in the patch, it does not make the exact same call to numa_fit_instance_to_host that the filter does: the filter considers overcommit ratios and requested PCI devices in the general case. That should be fine, though, since overcommit is not considered for CPU pinning anyway.

The only thing left to confirm is whether you are requesting any PCI devices with your instance. You don't seem to be doing so from the flavor, but it is possible that you are requesting a Neutron port that results in Nova needing to request a PCI device (--vnic-type=direct). This would fail because, if you request both CPU pinning and a PCI device, Nova will refuse to pin an instance to CPUs that are on a different NUMA node than the available PCI device.

Could you confirm whether you are in fact requesting a PCI NIC? The quickest way would be to paste the contents of the pci_requests column of the instance_extra table (so SELECT pci_requests FROM instance_extra WHERE deleted=0;).
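To illustrate the constraint described in this comment, here is a deliberately simplified model of why a pinned instance can fail to fit even when enough CPUs are free on the host as a whole. It is not Nova's numa_fit_instance_to_host; the function name `fit_pinned_instance`, the data layout, and the host CPU numbers are all hypothetical.

```python
def fit_pinned_instance(host_cells, vcpus, pci_node=None):
    """Return the id of a NUMA cell that can host the instance, or None.

    host_cells: dict mapping NUMA cell id -> set of free host CPUs.
    vcpus:      number of dedicated CPUs the instance requests.
    pci_node:   NUMA node of the requested PCI device, or None when
                the instance requests no PCI device.

    Simplified model (assumption): the whole instance must fit in one
    cell, and that cell must be the PCI device's cell if one is given.
    """
    for cell_id, free_cpus in sorted(host_cells.items()):
        # With a PCI device requested, only the device's own cell qualifies.
        if pci_node is not None and cell_id != pci_node:
            continue
        if len(free_cpus) >= vcpus:
            return cell_id
    return None


# Hypothetical host: cell 0 has 2 free CPUs, cell 1 has 8 free CPUs.
host = {0: {10, 11}, 1: set(range(12, 20))}

# Without a PCI device, 4 vCPUs fit on cell 1.
print(fit_pinned_instance(host, 4))
# With a PCI device attached to cell 0, no cell qualifies.
print(fit_pinned_instance(host, 4, pci_node=0))
```

In this model the second call fails even though 8 CPUs are free on cell 1, which mirrors the reported symptom of a 4 vCPU request being rejected with 8 vCPUs apparently available.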

Changed in nova:
status: New → Incomplete
Nikola Đipanov (ndipanov) wrote :

One other thing is the log in your previous comment:

"... fails NUMA topologyrequirements. The instance does not fit on this host."

I don't see that logging call in the code - did you patch the code to add it?

Serguei Bezverkhi (sbezverk) wrote :

Yes, I added this logging. I was hoping it would provide more information than just stating that the instance does not fit. I took the logging code from this change: Change in openstack/nova[master]: trivial: Add some logs to 'numa_topology_filter' submitted by '<email address hidden>'.

I will get you pci_requests in a bit.

Serguei Bezverkhi (sbezverk) wrote :
Serguei Bezverkhi (sbezverk) wrote :

I ran some tests and you were right. When I start an instance that does not require any PCI devices, just CPU pinning from the second socket, it works fine. As soon as I try to launch an instance that requests a PCI device which happens to be bound to socket 0 while the CPU pinning happens to be on socket 1, the NUMATopologyFilter fails the request.

Nikola Đipanov (ndipanov) wrote :

Yes, as discussed, that is to be expected. Closing the bug for now. Feel free to reopen if you feel it needs more looking into.

Changed in nova:
status: Incomplete → Invalid