numa_fit_instance_to_host() algorithm is highly inefficient with a higher number of NUMA nodes

Bug #1978372 reported by Igor Raits
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Balazs Gibizer

Bug Description

Description
===========
In the Nova scheduler, numa_fit_instance_to_host() takes ~5 minutes when executed for an instance with 8 NUMA nodes against a host object whose NUMA topology includes 16 NUMA nodes (3 cores × 2 threads each) and the first half of the host's NUMA nodes is already occupied.

This makes scheduling a 48-core flavor extremely slow.
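For scale: as the fix analysis later in this report notes, the permutation-based search can try up to n! / (n-k)! host-cell orderings. A quick, purely illustrative back-of-the-envelope check:
```
import math

# n = host NUMA cells, k = instance NUMA cells (values from this report)
n, k = 16, 8
# Upper bound on host-cell orderings the search may walk: n! / (n - k)!
print(math.perm(n, k))  # 518918400, i.e. roughly 500 million
```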

Output of reproducer:
```
InstanceNUMATopology(cells=[InstanceNUMACell(8),InstanceNUMACell(9),InstanceNUMACell(10),InstanceNUMACell(11),InstanceNUMACell(12),InstanceNUMACell(13),InstanceNUMACell(14)],emulator_threads_policy=None,id=<?>,instance_uuid=<?>)

________________________________________________________
Executed in 269.13 secs fish external
   usr time 268.60 secs 0.00 micros 268.60 secs
   sys time 0.07 secs 595.00 micros 0.07 secs
```

Steps to reproduce
==================
1. Add host with 16 NUMA nodes (3 cores × 2 threads each) to the OpenStack
2. Create a flavor with 48 vCPUs that takes exactly half of the host:
```
openstack flavor create sh4a-c48r488e20 \
--ram $((488*1024)) \
--vcpus 48 \
--ephemeral 20 \
--disk 20 \
--swap 0 \
--property 'hw:mem_page_size=1GB' \
--property 'hw:cpu_policy=dedicated' \
--property 'hw:cpu_thread_policy=prefer' \
--property 'hw:cpu_max_sockets=8' \
--property 'hw:cpu_sockets=8' \
--property 'hw:numa_mempolicy=strict' \
--property 'hw:numa_nodes=8' \
--property 'hw:numa_cpus.0=0,1,2,3,4,5' \
--property 'hw:numa_cpus.1=6,7,8,9,10,11' \
--property 'hw:numa_cpus.2=12,13,14,15,16,17' \
--property 'hw:numa_cpus.3=18,19,20,21,22,23' \
--property 'hw:numa_cpus.4=24,25,26,27,28,29' \
--property 'hw:numa_cpus.5=30,31,32,33,34,35' \
--property 'hw:numa_cpus.6=36,37,38,39,40,41' \
--property 'hw:numa_cpus.7=42,43,44,45,46,47' \
--property 'hw:numa_mem.0=62464' \
--property 'hw:numa_mem.1=62464' \
--property 'hw:numa_mem.2=62464' \
--property 'hw:numa_mem.3=62464' \
--property 'hw:numa_mem.4=62464' \
--property 'hw:numa_mem.5=62464' \
--property 'hw:numa_mem.6=62464' \
--property 'hw:numa_mem.7=62464' \
--property 'hw:cpu_threads=2' \
--property 'hw:cpu_max_threads=2'
```
3. Create an instance with this flavor (so that it lands on that host). The exact command is omitted since it differs between installations.
4. Wait for the first instance to spawn (this part is fast, as it takes the first 8 NUMA nodes).
5. Create a second instance with the same flavor.

Wait 5+ minutes until nova-scheduler is done with its work.

Expected result
===============
NUMA nodes selected within 10-15 seconds.

Actual result
=============
The algorithm is slow enough that it takes ~5 minutes to schedule the instance.

Environment
===========
1. OpenStack Nova 23.2.0-1.el8. NOTE: I am able to reproduce this on the master branch with a 20-line reproducer.
commit 4939318649650b60dd07d161b80909e70d0e093e (HEAD -> master, upstream/master)
Merge: c6e0f4f551 4c339c10e3
Author: Zuul <email address hidden>
Date: Tue May 17 00:01:41 2022 +0000

    Merge "Drop lower-constraints.txt and its testing"

2. Libvirt + KVM (although it is not relevant here)
libvirt-8.0.0-6.module_el8.7.0+1140+ff0772f9.x86_64
qemu-kvm-6.2.0-12.module_el8.7.0+1140+ff0772f9.x86_64

3. LVM storage (not relevant either)
lvm2-2.03.14-3.el8.x86_64

4. Neutron with L2 (not relevant)

Logs & Configs
==============
Check the reproducer and try it with the DEBUG lines uncommented (it will be attached here too).

Revision history for this message
Igor Raits (ignatenkobrain) wrote :
Revision history for this message
Igor Raits (ignatenkobrain) wrote :

I'll not be attaching the DEBUG log from the reproducer as it contains gigabytes of the same lines.

Revision history for this message
Igor Raits (ignatenkobrain) wrote :

```
InstanceNUMATopology(cells=[InstanceNUMACell(8),InstanceNUMACell(9),InstanceNUMACell(10),InstanceNUMACell(11),InstanceNUMACell(12),InstanceNUMACell(13),InstanceNUMACell(14)],emulator_threads_policy=None,id=<?>,instance_uuid=<?>)
2036.89user 113.99system 35:56.78elapsed 99%CPU (0avgtext+0avgdata 98420maxresident)k
0inputs+29185272outputs (0major+22041minor)pagefaults 0swaps
```

With debug logging it takes 36 minutes and produces 18 GB of logs.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

Thank you for the reproducer. I really appreciate it.

I was able to reproduce it locally. It took 6 mins to run the simple reproducer on master. This is definitely not good performance. And the log is excessive as well.

I mark this as Confirmed.

If you have time and ideas on how to improve this, feel free to propose a patch and add me (gibi) to the review.

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: numa performance scheduler
Revision history for this message
sean mooney (sean-k-mooney) wrote :

This is kind of a known issue.
The current implementation depends on itertools.permutations, so in the worst case it is roughly O(n!/(n-k)!), where n is the number of host NUMA nodes and k the number of guest NUMA nodes.

I had investigated reimplementing this with a different approach a few years ago, but
that effort was paused as we wanted to track NUMA nodes in placement:
https://specs.openstack.org/openstack/nova-specs/specs/victoria/approved/numa-topology-with-rps.html

The current approach really does not scale well beyond 4-8 NUMA nodes;
it was developed assuming up to 4.
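For illustration, a minimal sketch of such a permutation-based search (not Nova's actual code; fits() is a hypothetical per-cell predicate):
```
import itertools

def fit_instance_to_host(host_cells, instance_cells, fits):
    """Try every ordering of host cells and check whether each instance
    cell fits the host cell it is paired with (illustrative sketch only)."""
    for ordering in itertools.permutations(host_cells, len(instance_cells)):
        if all(fits(h, g) for h, g in zip(ordering, instance_cells)):
            return ordering  # first ordering where every cell fits
    return None  # no valid placement found
```
In the worst case this walks all n! / (n-k)! orderings before giving up, which matches the runtimes reported in this bug.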

On my laptop your reproducer takes 12 minutes:

```
InstanceNUMATopology(cells=[InstanceNUMACell(8),InstanceNUMACell(9),InstanceNUMACell(10),InstanceNUMACell(11),InstanceNUMACell(12),InstanceNUMACell(13),InstanceNUMACell(14)],emulator_threads_policy=None,id=<?>,instance_uuid=<?>)

real 12m32.814s
user 12m15.340s
sys 0m2.704s
```

Quickly replicating that in my alternative PoC:

https://github.com/SeanMooney/cpu-pinning/blob/master/pinning.py#L378-L415=

```
~/repos/cpu-pinning on  master [?] via 🐍 v3.8.13 (.venv)
[11:50:39]➜ ./run.sh
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
bug repoducer
----------------------------------------
0:21,1:69,2:22,3:70,4:23,5:71,6:24,7:72,8:25,9:73,10:26,11:74,12:27,13:75,14:28,15:76,16:29,17:77,18:30,19:78,20:31,21:79,22:32,23:80,24:33,25:81,26:34,27:82,28:35,29:83,30:36,31:84,32:37,33:85,34:38,35:86,36:39,37:87,38:40,39:88,40:41,41:89,42:42,43:90,44:43,45:91,46:44,47:92

real 0m1.074s
user 0m0.936s
sys 0m0.115s
```

There is just a slight performance increase using generators instead.

numa_fit_instance_to_host is a pure function of its inputs.

In theory my prototype could be refactored to use the real Nova objects, and we could extend it to the other use cases that it does not currently support, like mixed CPUs and PCI devices.

Or we could keep both implementations and provide a second NUMA filter that initially implements only a subset of the features, i.e. only run the existing one if the less strict generative one finds a possible match.

Dropping in a completely new implementation and backporting it to older branches, even if we achieved feature parity, is a risk that I'm not sure we would want to take for stable branches. But if it was opt-in and had such a large performance improvement for operators, then maybe it is something we should consider.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

So with a slight change to your reproducer to enable our NUMA balancing feature,
introduced by https://github.com/openstack/nova/commit/d13412648d011994a146dac1e7214ead3b82b31b,
the issue is resolved.

Backports of this have started, going back to Xena:
https://review.opendev.org/c/openstack/nova/+/829804

23.2.0-1.el8 is Wallaby, so if we backport that change to Wallaby it would provide a workaround for this issue.
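For reference, a nova.conf sketch of the spreading behaviour mentioned above (option name taken from the related change quoted later in this bug; treat this as an example, not authoritative documentation):
```
[compute]
# Spread instances across host NUMA cells instead of packing them.
packing_host_numa_cells_allocation_strategy = False
```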

Revision history for this message
sean mooney (sean-k-mooney) wrote :

So this is at least partly related to, or a partial duplicate of, https://bugs.launchpad.net/nova/+bug/1940668
and this older bug: https://bugs.launchpad.net/nova/+bug/1893121

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/845896

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Igor: I proposed an optimization[1] that significantly reduced the runtime of the reproducer (from 6 minutes to 3 seconds) locally. If you have a chance, please test it in your environment and report back to us.

[1] https://review.opendev.org/c/openstack/nova/+/845896

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/846169

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/846169
Committed: https://opendev.org/openstack/nova/commit/e76ec7af4d30604e3df9423226ebda90035f30b2
Submitter: "Zuul (22348)"
Branch: master

commit e76ec7af4d30604e3df9423226ebda90035f30b2
Author: Sean Mooney <email address hidden>
Date: Thu Jun 16 14:36:05 2022 +0100

    update default numa allocation strategy

    This change updated the default of
    [compute]/packing_host_numa_cells_allocation_strategy
    to False, making Nova spread VMs across NUMA nodes by default.

    This should significantly improve scheduling performance
    when there are a large number of host and guest NUMA nodes
    and non-empty hosts. See bug 1978372 for details.

    Related-Bug: #1978372
    Change-Id: I6fcd2c6b58dd36674be57eee70894ce04335955a

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/845896
Committed: https://opendev.org/openstack/nova/commit/099a6f63af7805440d91976ba0ea03bc6c278280
Submitter: "Zuul (22348)"
Branch: master

commit 099a6f63af7805440d91976ba0ea03bc6c278280
Author: Balazs Gibizer <email address hidden>
Date: Wed Jun 15 09:28:27 2022 +0200

    Optimize numa_fit_instance_to_host

    The numa_fit_instance_to_host algorithm tries all the possible
    host cell permutations to fit the instance cells. So in the worst case
    it does n! / (n-k)! _numa_fit_instance_cell calls
    (n=len(host_cells), k=len(instance_cells)) to find out whether the
    instance can fit on the host. With a 16 NUMA node host and an 8 NUMA
    node guest this means roughly 500 million calls to
    _numa_fit_instance_cell. This takes excessive time.

    However, going through these permutations produces many repeated
    host_cell, instance_cell pairs to try to fit.
    E.g.
      host_cells=[H1, H2, H3]
      instance_cells=[G1, G2]

    Produces pairings:

    * H1 <- G1 and H2 <- G2
    * H1 <- G1 and H3 <- G2
    ...

    Here G1 is checked against H1 twice. But if it does not fit the first
    time, then we know that it will not fit the second time either. So we
    can cache the result of the first check and use that cache for the later
    permutations.

    This patch adds two caches to the algorithm: a fit_cache to hold the
    host_cell.id, instance_cell.id pairs that we know fit, and a
    no_fit_cache for those pairs that we already know do not fit.

    This change significantly boosts the performance of the algorithm. The
    reproducer provided in bug 1978372 took 6 minutes to run on my local
    machine without the optimization. With the optimization it ran in
    3 seconds.

    This change increases the memory usage of the algorithm with the two
    caches. Those caches are sets of integer two-tuples, and their total
    size is bounded by the number of possible host_cell, instance_cell
    pairs, which is len(host_cells) * len(instance_cells). So from the above
    example (16 host, 8 instance NUMA cells) that is 128 pairs of integers
    in the caches. That will not cause a significant memory increase.

    Closes-Bug: #1978372
    Change-Id: Ibcf27d741429a239d13f0404348c61e2668b4ce4
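For illustration, a minimal sketch of the caching idea described in the commit message above (not the actual patch; fits() and the cells' .id attributes are assumptions for the example):
```
import itertools

def fit_with_caches(host_cells, instance_cells, fits):
    """Permutation search with fit/no-fit caches (illustrative sketch only)."""
    fit_cache, no_fit_cache = set(), set()

    def cached_fits(host_cell, instance_cell):
        key = (host_cell.id, instance_cell.id)
        if key in fit_cache:
            return True
        if key in no_fit_cache:
            return False
        result = fits(host_cell, instance_cell)
        # Remember the outcome so later permutations skip the expensive check.
        (fit_cache if result else no_fit_cache).add(key)
        return result

    for ordering in itertools.permutations(host_cells, len(instance_cells)):
        if all(cached_fits(h, g) for h, g in zip(ordering, instance_cells)):
            return ordering
    return None
```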

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.
