Activity log for bug #1792985

Date Who What changed Old value New value Message
2018-09-17 16:50:33 Chris Friesen bug added bug
2018-09-17 17:26:10 Chris Friesen description updated. This edit appended the XML proposal and the numa_maps example to the original report; the resulting description reads:

We've seen a case on a resource-constrained compute node where booting multiple instances succeeded, but led to the following error messages from the host kernel:

[ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or sacrifice child
[ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB

The problem appears to be that, with libvirt, an instance which does not specify a NUMA topology (which implies "shared" CPUs and the default memory pagesize) is currently allowed to float across the whole compute node. As such, we do not know which host NUMA node its memory will be allocated from, and therefore we do not know how much memory remains on each host NUMA node.

If we have a similar instance which *is* limited to a particular NUMA node (due to adding a PCI device, for example, or in the future by specifying dedicated CPUs), then that allocation will currently use "strict" NUMA affinity. This allocation can fail if there isn't enough memory available on that NUMA node (because it was "stolen" by a floating instance, for example).

I think this means that we cannot use "strict" affinity for the default page size even when we do have a numa_topology, since we can't do accurate per-NUMA-node accounting: we don't know which NUMA node floating instances allocated their memory from.

As such, I propose we change the XML for a VM from something like this:

    <numatune>
      <memory mode='strict' nodeset='0'/>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>

to something like this (without the overall "memory" mode):

    <numatune>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>

The end result is that the 4K memory policy in /proc/<pid>/numa_maps will show 'default' instead of 'bind:<x>' for everything except the qemu hugepage memory backing. The numa_maps will look like this:

7fe3a3400000 default anon=64 dirty=64 N0=64 kernelpagesize_kB=4
7fe3a5600000 default stack:219058 anon=4 dirty=4 N0=4 kernelpagesize_kB=4
7fe3a5e00000 bind:1 file=/mnt/huge-2048kB/libvirt/qemu/qemu_back_mem._objects_ram-node1.JhY5SD\040(deleted) huge dirty=256 mapmax=2 N1=256 kernelpagesize_kB=2048
7fe3c6000000 bind:0 file=/mnt/huge-2048kB/libvirt/qemu/qemu_back_mem._objects_ram-node0.Z0GkXp\040(deleted) huge dirty=256 mapmax=3 N0=256 kernelpagesize_kB=2048
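The shape of that change can be sketched in a few lines of Python (illustrative only, not nova's actual libvirt driver code; the helper name and its cell-to-host-node mapping argument are invented for this example):

    import xml.etree.ElementTree as ET

    def build_numatune(cell_to_host_nodes, include_global_strict=False):
        # cell_to_host_nodes maps guest NUMA cell id -> set of host node ids,
        # e.g. {0: {'0'}} for a single-cell guest placed on host node 0.
        numatune = ET.Element('numatune')
        if include_global_strict:
            # Current behaviour: a process-wide strict policy covering all
            # host nodes used by the guest (this is what the proposal drops).
            all_nodes = sorted({n for nodes in cell_to_host_nodes.values()
                                for n in nodes})
            ET.SubElement(numatune, 'memory', mode='strict',
                          nodeset=','.join(all_nodes))
        for cellid in sorted(cell_to_host_nodes):
            # Per-cell hugepage backing stays strictly bound in both variants.
            ET.SubElement(numatune, 'memnode', cellid=str(cellid),
                          mode='strict',
                          nodeset=','.join(sorted(cell_to_host_nodes[cellid])))
        return ET.tostring(numatune, encoding='unicode')

    # Proposed form: only the per-cell <memnode> elements are emitted.
    print(build_numatune({0: {'0'}}))
    # <numatune><memnode cellid="0" mode="strict" nodeset="0" /></numatune>

With include_global_strict left False, only the hugepage-backed guest cells stay strictly bound and ordinary 4K allocations are left to the kernel's default policy.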
2018-09-17 17:41:27 Chris Friesen description updated. This edit added the following sentences after the paragraph on per-NUMA-node accounting: "Logically speaking we want to use a NUMA mode of 'preferred', but that only allows us to specify a single NUMA node. So we need to remove the specification entirely to allow the kernel's default allocation rules to take over." The rest of the description is unchanged.
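For context, libvirt's memory modes map roughly onto kernel NUMA memory policies as follows (an approximate summary for this discussion, not text from the bug report or the libvirt documentation):

    # Rough summary of how libvirt's <numatune><memory mode=...> values map to
    # kernel NUMA memory policies for the qemu process (approximate, for
    # context only).
    LIBVIRT_MEMORY_MODE_TO_KERNEL_POLICY = {
        'strict': 'MPOL_BIND',            # never allocate outside the nodeset
        'preferred': 'MPOL_PREFERRED',    # one preferred node, falls back elsewhere
        'interleave': 'MPOL_INTERLEAVE',  # round-robin across the nodeset
    }
    # Omitting the <memory> element altogether leaves MPOL_DEFAULT ("local
    # allocation"): pages come from the node of the CPU doing the allocation,
    # with fallback to other nodes under memory pressure, which is the
    # behaviour the proposal relies on for 4K pages.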
2018-09-17 17:41:47 Chris Friesen description updated. This edit reworded "the kernel's default allocation rules" to "the kernel's default 'local allocation' rules"; the rest of the description is unchanged.
2018-09-17 17:45:57 Chris Friesen description updated. This edit added, after the sentence about numa_maps showing 'default' instead of 'bind:<x>': "This will cause them to try to allocate from the 'local' NUMA node, but they'll fall back to others if the request can't be satisfied locally." It also narrowed the numa_maps example to an instance with 2MB pages on host node 0:

7fe3a3400000 default anon=64 dirty=64 N0=64 kernelpagesize_kB=4
7fe3a5600000 default stack:219058 anon=4 dirty=4 N0=4 kernelpagesize_kB=4
7fe3c6000000 bind:0 file=/mnt/huge-2048kB/libvirt/qemu/qemu_back_mem._objects_ram-node0.Z0GkXp\040(deleted) huge dirty=256 mapmax=3 N0=256 kernelpagesize_kB=2048
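One way to check that end state on a running guest would be to summarise the per-mapping policies from /proc/<pid>/numa_maps, e.g. with a small Python sketch like the following (not part of nova; the function name is invented):

    import collections

    def summarize_numa_maps(pid):
        # Count mappings per (backing, policy); 'huge' appears as a bare token
        # in numa_maps lines for hugepage-backed regions, as in the example above.
        counts = collections.Counter()
        with open('/proc/%d/numa_maps' % pid) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 2:
                    continue
                policy = fields[1]              # e.g. 'default' or 'bind:0'
                backing = 'huge' if 'huge' in fields[2:] else '4K'
                counts[(backing, policy)] += 1
        return counts

    # For a guest using the proposed XML we would expect roughly:
    #   ('4K', 'default')  -> all ordinary mappings (local allocation, can fall back)
    #   ('huge', 'bind:0') -> the per-cell hugepage backing, still strictly bound
    # whereas with the current XML the 4K mappings show 'bind:<x>' as well.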
2019-01-14 16:58:36 Stephen Finucane marked as duplicate 1439247