OOM seen on worker node after fresh install - mem available but out of order 0 pages

Bug #1827258 reported by Maria Yousaf
This bug affects 5 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bin Yang

Bug Description

Brief Description
-----------------
Out-of-memory killer seen on worker node after fresh system install. From the logs, the node appears to have memory available, but it is out of order-0 pages.

Severity
--------
Major

Steps to Reproduce
------------------
1. Install system
2. Observe the worker node console; the following is seen:

compute-1 login: [ 604.915006] Out of memory: Kill process 43143 (nova-compute) score 1003 or sacrifice child
[ 604.928673] Killed process 43143 (nova-compute) total-vm:2300792kB, anon-rss:119128kB, file-rss:0kB, shmem-rss:0kB
[ 1387.186936] Out of memory: Kill process 49810 (nova-compute) score 1003 or sacrifice child
[ 1387.201406] Killed process 49810 (nova-compute) total-vm:2300964kB, anon-rss:119164kB, file-rss:0kB, shmem-rss:0kB
[ 1863.595215] Out of memory: Kill process 54492 (/var/lib/openst) score 1002 or sacrifice child
[ 1863.616508] Killed process 54492 (/var/lib/openst) total-vm:288872kB, anon-rss:105316kB, file-rss:0kB, shmem-rss:0kB
[ 2265.008223] Out of memory: Kill process 64088 (/var/lib/openst) score 1002 or sacrifice child
[ 2265.021867] Killed process 64088 (/var/lib/openst) total-vm:288404kB, anon-rss:104712kB, file-rss:0kB, shmem-rss:0kB
[ 2694.819180] Out of memory: Kill process 59143 (nova-compute) score 1002 or sacrifice child
[ 2694.875767] Killed process 59143 (nova-compute) total-vm:302420kB, anon-rss:101464kB, file-rss:0kB, shmem-rss:0kB

Expected Behavior
------------------
OOM not seen

Actual Behavior
----------------
OOM seen

Reproducibility
---------------
Seen once

System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
master load: 20190501T013000Z

Last Pass
---------
N/A

Timestamp/Logs
--------------
Kernel log:

2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914665] calico-node invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=999
2019-05-01T17:36:25.530 compute-1 kernel: info [ 604.914669] calico-node cpuset=7a16228d33a348cde5189f3dc1fa5a4dd9daaf1100e7f8d5a795d239535110c4 mems_allowed=0
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914671] CPU: 0 PID: 49567 Comm: calico-node Kdump: loaded Tainted: G O ------------ T 3.10.0-957.1.3.el7.1.tis.x86_64 #1
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914673] Hardware name: Intel Corporation S2600IP/S2600IP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914674] Call Trace:
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914680] [<ffffffffadc07861>] dump_stack+0x19/0x1b
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914683] [<ffffffffadc02c9e>] dump_header+0x8e/0x23f
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914687] [<ffffffffad4e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914691] [<ffffffffad52ff9f>] ? delayacct_end+0x8f/0xb0
2019-05-01T17:36:25.530 compute-1 kernel: warning [ 604.914694] [<ffffffffad58846e>] oom_kill_process+0x24e/0x3d0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914697] [<ffffffffad489d65>] ? has_ns_capability_noaudit+0x35/0x50
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914698] [<ffffffffad489d97>] ? has_capability_noaudit+0x17/0x20
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914700] [<ffffffffad588cd3>] out_of_memory+0x4d3/0x510
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914702] [<ffffffffad58f046>] __alloc_pages_nodemask+0xa86/0xb80
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914705] [<ffffffffad5d97f8>] alloc_pages_current+0x98/0x110
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914707] [<ffffffffad5841d7>] __page_cache_alloc+0x97/0xb0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914708] [<ffffffffad586fe8>] filemap_fault+0x278/0x460
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914737] [<ffffffffc237f15e>] __xfs_filemap_fault+0x7e/0x200 [xfs]
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914749] [<ffffffffc237f38c>] xfs_filemap_fault+0x2c/0x30 [xfs]
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914752] [<ffffffffad5b161a>] __do_fault.isra.70+0x8a/0x100
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914754] [<ffffffffad5b1c3c>] do_read_fault.isra.72+0x4c/0x1b0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914756] [<ffffffffad5b8267>] handle_mm_fault+0x557/0xc30
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914759] [<ffffffffad45fcdf>] __do_page_fault+0x1ef/0x460
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914761] [<ffffffffad45ffc5>] do_page_fault+0x35/0x90
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914763] [<ffffffffadc16638>] page_fault+0x28/0x30
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914765] Mem-Info:
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914769] active_anon:282208 inactive_anon:2098 isolated_anon:0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914769] active_file:10201 inactive_file:16368 isolated_file:0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914769] unevictable:109522 dirty:30 writeback:0 unstable:0
2019-05-01T17:36:25.531 compute-1 kernel: warning [ 604.914769] slab_reclaimable:22794 slab_unreclaimable:108933
2019-05-01T17:36:25.532 compute-1 kernel: warning [ 604.914769] mapped:10965 shmem:3176 pagetables:5571 bounce:0
2019-05-01T17:36:25.532 compute-1 kernel: warning [ 604.914769] free:545174 free_pcp:477 free_cma:3135

Memory usage fairly stable:

memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:19:56.557 64217.5 61916.4 2039.7 132.1 34.3 537.7 3762.3 2425.7 1.3 0.0 1537.0 2301.1 321.3 28544.0 1979.0 28774.0
done
compute-1:~# memtop
memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:20:01.581 64217.5 61925.4 2024.1 138.8 34.3 543.4 3864.9 2425.7 1.3 0.0 1538.0 2292.1 319.3 28544.0 1973.3 28774.0
done
compute-1:~# memtop
memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:20:05.181 64217.5 61929.9 2017.7 140.5 34.3 547.5 3964.0 2425.7 1.3 0.0 1538.8 2287.5 318.8 28544.0 1969.3 28774.0
done

Top:

top - 18:20:26 up 54 min, 1 user, load average: 15.41, 15.54, 14.93
Tasks: 414 total, 11 running, 403 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.4 us, 6.2 sy, 0.0 ni, 80.2 id, 3.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65758700 total, 2066240 free, 62952448 used, 740012 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 1862636 avail Mem

   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 33738 root 10 -10 5522860 435328 12244 S 200.0 0.7 103:33.86 ovs-vswitchd
   183 root 20 0 0 0 0 R 100.0 0.0 39:59.85 kswapd0 <--- ??
 39926 root 20 0 280512 98020 0 R 6.2 0.1 0:45.15 /var/lib/openst
 41020 root 20 0 280568 96848 60 R 6.2 0.1 0:42.67 /var/lib/openst
 87231 root 20 0 301500 100624 0 R 6.2 0.2 0:07.66 nova-compute
 96462 root 20 0 80068 4348 0 R 6.2 0.0 0:01.57 python
 97522 root 20 0 11692 276 0 R 6.2 0.0 0:00.08 bash
     1 root 20 0 126852 5084 2292 S 0.0 0.0 0:13.87 systemd

Test Activity
-------------
Install

Revision history for this message
Maria Yousaf (myousaf) wrote :

Some additional info:

compute-1:~$ sudo lsof -nP -a +L1
Password:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
ovsdb-ser 31924 root 7u REG 0,37 159 0 109669 /tmp/tmpfHlfhxc (deleted)

compute-1:~$ sudo systemd-cgtop -m -b -n1 --depth=6

Path Tasks %CPU Memory Input/s Output/s

/ 302 - 2.8G - -
/user.slice - - 1.1G - -
/user.slice/user-0.slice - - 1.1G - -
/k8s-infra - - 971.0M - -
/k8s-infra/kubepods - - 971.0M - -
/k8s-infra/kubepods/besteffort - - 952.6M - -
/system.slice - - 876.8M - -
/system.slice/ovs-vswitchd.service 1 - 416.4M - -
/k8s-infra/kubepods/besteffort/pod7aa85fa3-6c30-11e9-b5e2-001e67680cba - - 219.4M - -
/k8s-infra/kubepods/beste...efcf34189968c7d19744022e6d03780c33a1b0284a5bc7af98 5 - 199.7M - -
/system.slice/docker.service 20 - 162.6M - -
/k8s-infra/kubepods/besteffort/pod7a58b2e8-6c30-11e9-b5e2-001e67680cba - - 157.0M - -
/k8s-infra/kubepods/besteffort/pod7a928ea9-6c30-11e9-b5e2-001e67680cba - - 151.4M - -
/k8s-infra/kubepods/beste...3f681d0ea155a876292f94041a3581f623639b6ca71c464792 2 - 149.3M - -
/k8s-infra/kubepods/besteffort/pod7b223017-6c30-11e9-b5e2-001e67680cba - - 147.7M - -
/k8s-infra/kubepods/beste...27cc220c99d696a85d46f2f1c5b1ea415a3ce2097b1509dd01 11 - 143.7M - -
/k8s-infra/kubepods/beste...ae615c4797728d533eb040a58eeec70c0faf9587742dafd636 1 - 128.9M - -
/k8s-infra/kubepods/besteffort/pod7ad7ef3b-6c30-11e9-b5e2-001e67680cba - - 117.0M - -
/k8s-infra/kubepods/besteffort/pod7a42d449-6c30-11e9-b5e2-001e67680cba - - 110.9M - -
/k8s-infra/kubepods/beste...36ef8b375482506848519026ce48f537c344e43eaa4e2245c6 1 - 109.5M -

compute-1:~$ ipcs

------ Message Queues --------
key msqid owner perms used-bytes messages

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 0 root 644 80 2
0x00000000 32769 root 644 16384 2 ...

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Maria Yousaf (myousaf) wrote :

I'm seeing this again on a worker node in IP33-36 (worker-0) running load: 20190505T233000Z:

compute-0:~$ [59763.882249] Out of memory: Kill process 635670 (python) score 0 or sacrifice child
[59764.055873] Killed process 635741 (smart) total-vm:17480kB, anon-rss:2328kB, file-rss:716kB, shmem-rss:0kB

Looks like it might be the order=0 memory issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Maria, was the stx-openstack application running at the time the oom error was reported?

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Cindy Xie (xxie1) wrote :

@Maria, do you think the issue is similar to this one? https://bugs.launchpad.net/starlingx/+bug/1826308
That one was marked as "Fix Released", but we think it is still there.

Revision history for this message
Maria Yousaf (myousaf) wrote :

I'm not sure, Cindy. Someone will need to look at the logs and triage it to see whether it is the same issue or a different one.

Revision history for this message
Maria Yousaf (myousaf) wrote :

@Ghada, I believe application-apply was done when the original issue was seen but I'm not 100% sure. Jim Gauld helped me investigate. He may have some additional information.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as gating for now; will monitor for a recurrence.

Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: nobody → Gerry Kopec (gerry-kopec)
tags: added: stx.2.0 stx.distro.other
Revision history for this message
Bin Yang (byangintel) wrote :

@Maria

from compute-0_20190507.124154/var/log/kern.log
===============================================

    [59763.881672] sm_debug invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
    ...
    [59763.882079] Node 0 DMA free:15848kB min:60kB low:72kB high:88kB
    [59763.882083] lowmem_reserve[]: 0 2800 15838 15838
    [59763.882085] Node 0 DMA32 free:63564kB min:11480kB low:14348kB high:17220kB
    [59763.882088] lowmem_reserve[]: 0 0 13038 13038
    [59763.882090] Node 0 Normal free:53384kB min:53464kB low:66828kB high:80196kB
    [59763.882092] lowmem_reserve[]: 0 0 0 0
    [59763.882094] Node 1 Normal free:66048kB min:66060kB low:82572kB high:99088kB
    [59763.882097] lowmem_reserve[]: 0 0 0 0
    ...
    [59763.882249] Out of memory: Kill process 635670 (python) score 0 or sacrifice child

gfp_mask=0x201da: this means the page allocation is from ZONE_NORMAL.

But:
    Node 0 Normal, free:53384kB < min:53464kB: not enough free space
    Node 1 Normal, free:66048kB < min:66060kB: not enough free space
Since free < min, the kernel starts the OOM killer.

from compute-1_20190507.124154/var/log/kern.log
===============================================
[ 1515.471830] calico-node invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=999
...
[ 1515.471950] Node 0 DMA free:15848kB min:60kB low:72kB high:88kB
[ 1515.471954] lowmem_reserve[]: 0 2800 15838 15838
[ 1515.471956] Node 0 DMA32 free:63620kB min:11480kB low:14348kB high:17220kB
[ 1515.471959] lowmem_reserve[]: 0 0 13038 13038
[ 1515.471961] Node 0 Normal free:53148kB min:53464kB low:66828kB high:80196kB
[ 1515.471964] lowmem_reserve[]: 0 0 0 0
...
[ 1515.472142] Out of memory: Kill process 60356 (kubernetes-entr) score 1000 or sacrifice child

gfp_mask=0x201da: this means the page allocation is from ZONE_NORMAL.

But:
    Node 0 Normal, free:53148kB < min:53464kB: not enough free space
Since free < min, the kernel starts the OOM killer.

Conclusion:
===========
The kernel does not have enough memory for the page allocation; "order=0" means a single 4K page is required.
The actual condition is "free memory below the min watermark", which triggers the OOM killer.
So this is not a case of "mem available but out of order 0 pages".
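
For illustration only (not kernel code): a minimal sketch of the order-0 watermark test described above, applied to the zone values from the kern.log excerpts; the helper name is hypothetical.

    # Sketch of the order-0 watermark test described above; not kernel code.
    def order0_watermark_ok(free_kb, min_kb, lowmem_reserve_kb=0):
        # The allocation is refused if free pages do not exceed the min
        # watermark plus the lowmem reserve for the requesting class.
        return free_kb > min_kb + lowmem_reserve_kb

    # Values from the kern.log excerpts above (lowmem_reserve[] is 0 for these zones):
    print(order0_watermark_ok(53384, 53464))  # Node 0 Normal -> False, OOM killer runs
    print(order0_watermark_ok(66048, 66060))  # Node 1 Normal -> False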

Revision history for this message
Bin Yang (byangintel) wrote :

@Maria

from compute-1_20190507.124154/var/log/kern.log
===============================================
kernel: info [ 1515.776621] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
...
kernel: info [ 1515.776671] [40564] 0 40564 34049530 71961 64718 0 0 ovs-vswitchd

from: include/linux/mm_types.h
==============================
unsigned long total_vm; /* Total pages mapped */

What workload did you run? The number of pages mapped by ovs-vswitchd looks huge.
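
For scale, a quick back-of-the-envelope conversion of that task-dump row (the kernel counts 4 KiB pages):

    # Conversion of the ovs-vswitchd row above from 4 KiB pages.
    total_vm_pages, rss_pages = 34049530, 71961
    print(total_vm_pages * 4 / 1024 / 1024)  # ~129.9 GiB of mapped virtual address space
    print(rss_pages * 4 / 1024)              # ~281 MiB resident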

Revision history for this message
Cindy Xie (xxie1) wrote :

assign to @Bin Yang as he is looking into this bug already.

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Bin Yang (byangintel)
Revision history for this message
Bin Yang (byangintel) wrote :

Based on the logs, it should not be a bug.

Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Can you elaborate on why this is not an issue? The system ran out of 4k pages. This is not expected under any condition.

Revision history for this message
Bin Yang (byangintel) wrote :

As I mentioned above, when free < min, the kernel starts the OOM killer.

And you can find the kernel code for this logic:
mm/page_alloc.c:
__zone_watermark_ok()
=============================

        /*
         * Check watermarks for an order-0 allocation request. If these
         * are not met, then a high-order request also cannot go ahead
         * even if a suitable page happened to be free.
         */
        if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                return false;

        /* If this is an order-0 request then the watermark is fine */
        if (!order)
                return true;

And here is a related document:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
min_free_kbytes:
================

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

The watermarks are set up based on this kernel parameter:
========================
mm/page_alloc.c
init_per_zone_wmark_min()

The default setting is calculated from the total system memory size. I think a min watermark of 53464kB is reasonable.
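
As a rough illustration of the proportional split described in the vm.txt excerpt above (a sketch only: the zone sizes below are made-up round numbers, and the ~128MB min_free_kbytes is inferred from the sum of the per-zone min values in the log, not read from the system):

    # Sketch only: how min_free_kbytes is split across lowmem zones in
    # proportion to their size, per the vm.txt excerpt above.
    def per_zone_min_watermarks(min_free_kbytes, zone_sizes_kb):
        total = sum(zone_sizes_kb.values())
        return {name: min_free_kbytes * size // total
                for name, size in zone_sizes_kb.items()}

    zones_kb = {                        # hypothetical managed zone sizes (kB)
        "Node0 DMA":    16 * 1024,
        "Node0 DMA32":  3 * 1024 * 1024,
        "Node0 Normal": 13 * 1024 * 1024,
        "Node1 Normal": 16 * 1024 * 1024,
    }
    print(per_zone_min_watermarks(128 * 1024, zones_kb))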

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

The issue is we ran out of free kernel memory. We should never run out under any circumstances.
This LP needs to determine why we ran out and address the issue.

Changed in starlingx:
status: Invalid → Confirmed
Revision history for this message
Bin Yang (byangintel) wrote :

from compute-1_20190507.124154/var/log/kern.log
    2019-05-06T16:24:12.266 localhost kernel: debug [ 0.000000] On node 0 totalpages: 4174118
    ...
    2019-05-06T17:28:56.749 compute-1 kernel: info [ 1515.471986] Node 0 hugepages_total=1 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    2019-05-06T17:28:56.749 compute-1 kernel: info [ 1515.471987] Node 0 hugepages_total=6807 hugepages_free=6807 hugepages_surp=0 hugepages_size=2048kB
    ...

from hieradata/192.168.204.77.yaml:
    platform::compute::hugepage::params::vm_2M_pages: '"7024,7172"'
    ...
    platform::compute::params::worker_base_reserved: ("node0:8000MB:1" "node1:2000MB:1")

from puppet.log
    ...
    Exec[Allocate 7024 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages]
    ...
    Exec[Allocate 7172 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages]
    ...

The total memory on node 0 is 16GB.

From the kernel log, the 2M hugepage count is 6807, which is smaller than 7024. It looks like the system could not allocate all 7024 2M pages.
The total expected hugepage size is 7024*2M + 1G = 14.7GB.

That is not reasonable, since we have several memory reservations, as below:
    1. 8GB reserved by worker_reserved.conf:
        WORKER_BASE_RESERVED=("node0:8000MB:1" "node1:2000MB:1")
    2. 10% reserved by below code:
        sysinv host.py: vm_hugepages_nr_2M = int(m.vm_hugepages_possible_2M * 0.9)

vm_hugepages_possible_2M is calculated by the _inode_get_memory_hugepages() function using the logic below (a small sketch follows this breakdown):

    node_total_kb = total_hp_mb * SIZE_KB + free_kb + pss_mb * SIZE_KB
        total_hp_mb is 0 since 2M hugepage is not reserved by kernel command line
        free_kb is from /sys/devices/system/node/node0/meminfo
        pss_mb is collected from /proc/*/smaps

    vm_hugepages_possible_2M: node_total_kb - base_mem_mb - vswitch_mem_kb
        base_mem_mb is 8GB from WORKER_BASE_RESERVED in worker_reserved.conf
        vswitch_mem_kb is 1GB from COMPUTE_VSWITCH_MEMORY in worker_reserved.conf
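
A small sketch of that calculation as described above (illustrative only; the function name, unit handling, and input numbers are assumptions, not the exact sysinv code):

    # Illustrative sketch of the vm_hugepages_possible_2M calculation described
    # above; not the exact sysinv code. SIZE_KB converts MiB to kB.
    SIZE_KB = 1024

    def possible_2m_pages(total_hp_mb, free_kb, pss_mb, base_mem_mb, vswitch_mem_kb):
        node_total_kb = total_hp_mb * SIZE_KB + free_kb + pss_mb * SIZE_KB
        usable_kb = node_total_kb - base_mem_mb * SIZE_KB - vswitch_mem_kb
        return max(usable_kb // 2048, 0)   # one 2M huge page = 2048 kB

    # Hypothetical node 0 inputs: no kernel-reserved huge pages, ~15.5 GiB free,
    # 500 MiB PSS, 8000 MiB platform reserved, 1 GiB vswitch memory.
    print(possible_2m_pages(0, 15_900_000, 500, 8000, 1024 * 1024))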

So far, vm_hugepages_possible_2M has always been correct in the Shanghai bare metal tests.

Could the reporters help provide more info when this bug is triggered?
1. run system host-memory-list <compute node>
2. run cat /proc/sys/vm/overcommit_* #the mode will impact the free_kb calculation
3. run cat /proc/*/smaps 2>/dev/null | awk '/^Pss:/ {a += $2;} END {printf "%d\n", a/1024.0;}' on compute nodes
4. run cat /sys/devices/system/node/node*/meminfo on compute nodes

thanks,
Bin

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667811

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Bin Yang (byangintel) wrote :

After code review, the hugepage allocation check code has a problem.

It only checks whether the total requested allocation is bigger than the total
node memory size. If the pending hugepage size falls between the maximum
possible size and the node total size, the check passes.

The hugepage allocation then overflows into memory the platform needs, normal
4K pages run short, and the OOM min watermark is hit.
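
A minimal sketch of that gap, assuming hypothetical helper names and using roughly the node 0 figures mentioned earlier in the thread (16GB total, 8000MB platform reserved); this is an illustration, not the actual sysinv semantic check:

    # Illustrative sketch of the gap described above; hypothetical helpers,
    # not the actual sysinv semantic check. Sizes in MiB.
    def old_style_check(requested_hp_mib, node_total_mib):
        # Only rejects requests larger than the whole node -- too permissive.
        return requested_hp_mib <= node_total_mib

    def tighter_check(requested_hp_mib, node_total_mib, platform_reserved_mib):
        # Also leaves room for the platform reservation (and hence 4K pages).
        return requested_hp_mib <= node_total_mib - platform_reserved_mib

    node_total_mib, reserved_mib = 16 * 1024, 8000   # node 0 figures from above
    requested_mib = 7024 * 2 + 1024                  # 7024 x 2M pages + one 1G page
    print(old_style_check(requested_mib, node_total_mib))              # True  -> slips through
    print(tighter_check(requested_mib, node_total_mib, reserved_mib))  # False -> would be rejected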

Revision history for this message
Bin Yang (byangintel) wrote :

By default, if there is no pending hugepage request, _update_huge_pages() will allocate
(m.vm_hugepages_possible_2M * 0.9) 2M hugepages. However, m.vm_hugepages_nr_2M_pending
takes priority when it is set. If the user manually configures 2M hugepages with a wrong
size, this issue will be triggered (see the sketch below).
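
A minimal sketch of that selection order (illustrative; the function and the numbers are hypothetical, not the actual _update_huge_pages() code):

    # Sketch of the selection order described above; hypothetical helper.
    def select_2m_pages(nr_2m_pending, possible_2m):
        if nr_2m_pending is not None:
            # A manually configured value wins, even if it is too large.
            return nr_2m_pending
        # Default path: 90% of what the agent reported as possible.
        return int(possible_2m * 0.9)

    print(select_2m_pages(None, 3202))  # 2881 -> default path
    print(select_2m_pages(7024, 3202))  # 7024 -> manual value taken as-is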

Could the reporter help double-check the test scripts?

Revision history for this message
Bart Wensley (bartwensley) wrote :

I am seeing the same problem in a 2+2 lab. There are two compute hosts in this lab, with identical hardware and 32G of memory each, yet they are reporting different numbers of available 2M pages:

2019-06-25 17:58:53.971 109784 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(5): {'vm_hugepages_nr_4K': 0, 'vm_hugepages_nr_2M': 3330, 'vswitch_hugepages_nr': 1}
2019-06-25 17:58:53.989 109784 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(6): {'vm_hugepages_nr_4K': 94976, 'vm_hugepages_nr_2M': 6264, 'vswitch_hugepages_nr': 1}

2019-06-25 17:59:03.741 109782 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-1) node(3): {'vm_hugepages_nr_4K': 0, 'vm_hugepages_nr_2M': 7025, 'vswitch_hugepages_nr': 1}
2019-06-25 17:59:03.762 109782 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-1) node(4): {'vm_hugepages_nr_4K': 0, 'vm_hugepages_nr_2M': 7169, 'vswitch_hugepages_nr': 1}

The platform is configuring an incorrect number of 2M pages for compute-1:

platform::compute::hugepage::params::nr_hugepages_1G: ("node0:1048576kB:1" "node1:1048576kB:1")
platform::compute::hugepage::params::vm_2M_pages: '"6783,7169"'
platform::compute::params::worker_base_reserved: ("node0:8000MB:1" "node1:2000MB:1")

This would require 6783*2M + 1G ≈ 14.2G. That leaves almost no memory on node 0 (which has 16G) for the platform - there is supposed to be 8G reserved. The kernel isn't even able to allocate the 6783 2M pages.

I don't understand why puppet thinks that the 2M page allocation was successful - it tries to allocate 7025 pages on node0:

2019-06-25T19:48:09.660 Notice: 2019-06-25 19:48:08 +0000 /Stage[main]/Platform::Compute::Allocate/Allocate_pages[Start node0 2048kB]/Exec[Allocate 7025 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages]/returns: executed successfully
2019-06-25T19:48:29.051 Notice: 2019-06-25 19:48:29 +0000 /Stage[main]/Platform::Compute::Allocate/Allocate_pages[Start node1 2048kB]/Exec[Allocate 7169 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages]/returns: executed successfully

But it appears that only 6758 were allocated:

2019-06-27T02:53:41.333 compute-1 kernel: info [46730.474219] Node 0 hugepages_total=1 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
2019-06-27T02:53:41.333 compute-1 kernel: info [46730.474221] Node 0 hugepages_total=6758 hugepages_free=6758 hugepages_surp=0 hugepages_size=2048kB
2019-06-27T02:53:41.333 compute-1 kernel: info [46730.474222] Node 1 hugepages_total=1 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
2019-06-27T02:53:41.333 compute-1 kernel: info [46730.474223] Node 1 hugepages_total=7169 hugepages_free=7169 hugepages_surp=0 hugepages_size=2048kB

This results in compute-1 being unusable. It essentially has no free memory:
compute-1:~# free -h
              total used free shared buff/cache available
Mem: 31G 30G 238M 12M 365M 7.7M
Swap: 0B 0B 0B
compute-1:~# memtop
memtop 0.1 -- selected options: delay ...


Revision history for this message
Bart Wensley (bartwensley) wrote :

The above comment was for a designer load built on June 25, 2019.

Revision history for this message
Bin Yang (byangintel) wrote :

Some new findings from the database dump:
=================================
node_memtotal_mib  platform_reserved_mib  vm_hugepages_possible_2M  vm_hugepages_nr_2M_pending  vm_hugepages_avail_2M  vm_hugepages_nr_4K
15428              8000                   3202                      \N                          6795                   0
15562              2000                   6269                      \N                          7177                   0
15643              8000                   3309                      \N                          6799                   0
15446              2000                   6211                      \N                          7172                   0

1. "vm_hugepages_possible_2M" < node_memtotal_mib
    It is reasonable. The hugepage possible size calculated in agent/node.py: inode_get_memory_hugepages() is correct

2. "vm_hugepages_avail_2M" > "vm_hugepages_possible_2M"
    It is not reasonable; something is wrong in v1/host.py: _update_huge_pages()

3. "vm_hugepages_nr_4K" == 0
    _update_huge_pages()
            vm_hugepages_4K = \
                (m.node_memtotal_mib - m.platform_reserved_mib)
            vm_hugepages_4K -= \
                (vs_hugepages_nr * m.vswitch_hugepages_size_mib)
            vm_hugepages_4K -= \
                (constants.MIB_2M * vm_hugepages_nr_2M)
            vm_hugepages_4K -= \
                (constants.MIB_1G * vm_hugepages_nr_1G)
            vm_hugepages_4K = \
                (constants.NUM_4K_PER_MiB * vm_hugepages_4K)

            # Clip 4K pages
            min_4K = 32 * constants.Ki / 4
            if vm_hugepages_4K < min_4K:
                vm_hugepages_4K = 0

    node_memtotal_mib, the vswitch hugepages, and the VM hugepages are calculated by agent/node.py: _inode_get_memory_hugepages().
    It should not simply set vm_hugepages_4K = 0 when vm_hugepages_4K < min_4K. If the requested hugepage size is too big, vm_hugepages_4K will be negative; that is the last chance we have to catch and correct the wrong setting (see the sketch below).
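
One possible guard, sketched below for illustration only; this is not the fix that was eventually merged (see the commit later in this thread), it simply shows why clipping a negative value to zero hides an over-allocation:

    # Illustrative guard only; the merged fix took a different approach.
    # Sizes in MiB; the constants mirror those referenced above.
    MIB_2M, MIB_1G, NUM_4K_PER_MiB = 2, 1024, 256
    MIN_4K_PAGES = 32 * 1024 // 4   # 32 MiB expressed as 4K pages

    def vm_4k_pages(memtotal_mib, reserved_mib, vs_nr, vs_size_mib, nr_2m, nr_1g):
        remaining_mib = (memtotal_mib - reserved_mib
                         - vs_nr * vs_size_mib
                         - MIB_2M * nr_2m
                         - MIB_1G * nr_1g)
        pages = NUM_4K_PER_MiB * remaining_mib
        if pages < 0:
            # Huge pages were over-allocated; reject instead of clipping to 0.
            raise ValueError("huge page request exceeds available node memory")
        return 0 if pages < MIN_4K_PAGES else pages

    print(vm_4k_pages(15428, 8000, 1, 1024, 3000, 0))   # positive -> OK
    # vm_4k_pages(15428, 8000, 1, 1024, 7024, 0)        # would raise: over-allocated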

I will add more logs in v1/host.py and try to reproduce it.

Revision history for this message
Bin Yang (byangintel) wrote :

The patch (https://review.opendev.org/#/c/667811/) has been updated based on the discussion with Tao.

So far, I still cannot reproduce this issue. Could the reporter help test this patch?
The issue should be fixed if it is caused by the vm_hugepages_possible_2M calculation.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Raising the priority of this bug to high as it has been seen multiple times recently:
https://bugs.launchpad.net/starlingx/+bug/1836240
https://bugs.launchpad.net/starlingx/+bug/1835941

The above bugs will be marked as duplicates.

Changed in starlingx:
importance: Medium → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Updating the tag from stx.distro.other to stx.config as the issue is in the stx hugepage allocation, not with a 3rd party package.

tags: added: stx.config
removed: stx.distro.other
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/667811
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=5e42f69989433b7ac4c0aba6a4dd62900a45f905
Submitter: Zuul
Branch: master

commit 5e42f69989433b7ac4c0aba6a4dd62900a45f905
Author: Bin Yang <email address hidden>
Date: Thu Jun 27 16:36:41 2019 +0800

    fix hugepage allocation overflow

    From the database dump, it shows current "vm_hugepages_nr_2M", which is
    derived from the previous "vm_hugepages_possible_2M" is much bigger
    than current "vm_hugepages_possible_2M".

    "vm_hugepages_possible_2M" reported by sysinv-agent is calculated using
    the platform reserved value from the "worker_reserved.conf", while
    "worker_reserved.conf" is updated by puppet after first unlock.

    This patch calculates the initial huge pages based on the current view of
    the host memory instead of using the vm_hugepages_possible_2M value reported
    from the sysinv-agent, which could be calculated based on the default platform
    reserved value. In addition, it also takes consideration of the platform memory
    reserved changes (by user) when performing huge pages semantic check.

    Change-Id: I686b99728ed2b3572ace39469d479176a6ae55ff
    Closes-Bug: 1827258
    Signed-off-by: Liu, Tao <email address hidden>
    Signed-off-by: Bin Yang <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

closing based on verification in LP-1832647
2019-08-07_20-59-00

tags: removed: stx.retestneeded