OOM seen on worker node after fresh install - mem available but out of order 0 pages
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | High | Bin Yang | |
Bug Description
Brief Description
-----------------
The out-of-memory (OOM) killer was triggered on a worker node after a fresh system install. The logs suggest the node still has memory available but has run out of order-0 pages.
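One way to check the order-0 claim on a live node is to read /proc/buddyinfo, which lists free block counts per allocation order for each zone. The following is a minimal sketch, not the method used in this report: the function names are hypothetical, the sample text is illustrative rather than captured from compute-1, and a 4 KiB base page size is assumed.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, where index = order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <order 0> <order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def order0_free_kib(counts, page_kib=4):
    """Free memory held in order-0 (single page) blocks, assuming 4 KiB pages."""
    return counts[0] * page_kib


# Hypothetical sample in /proc/buddyinfo layout, NOT taken from the affected node.
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone   Normal      0      0      0      3      5      2      1      0      0      0      0
"""

if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print(node, zone, "order-0 free KiB:", order0_free_kib(counts))
```

On a real node the input would come from reading /proc/buddyinfo directly; a zone whose first column is at or near zero has no single free pages left even if higher orders are populated.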
Severity
--------
Major
Steps to Reproduce
------------------
1. Install system
2. Observe the worker node console; the following is seen:
compute-1 login: [ 604.915006] Out of memory: Kill process 43143 (nova-compute
score 1003 or sacrifice child
[ 604.928673] Killed process 43143 (nova-compute) total-vm:2300792kB, anon-rss
119128kB, file-rss:0kB, shmem-rss:0kB
[ 1387.186936] Out of memory: Kill process 49810 (nova-compute) score 1003 or s
crifice child
[ 1387.201406] Killed process 49810 (nova-compute) total-vm:2300964kB, anon-rss
119164kB, file-rss:0kB, shmem-rss:0kB
[ 1863.595215] Out of memory: Kill process 54492 (/var/lib/openst) score 1002 o
sacrifice child
[ 1863.616508] Killed process 54492 (/var/lib/openst) total-vm:288872kB, anon-r
s:105316kB, file-rss:0kB, shmem-rss:0kB
mem[ 2265.008223] Out of memory: Kill process 64088 (/var/lib/openst) score 100
or sacrifice child
[ 2265.021867] Killed process 64088 (/var/lib/openst) total-vm:288404kB, anon-r
s:104712kB, file-rss:0kB, shmem-rss:0kB
[ 2694.819180] Out of memory: Kill process 59143 (nova-compute) score 1002 or s
crifice child
[ 2694.875767] Killed process 59143 (nova-compute) total-vm:302420kB, anon-rss:
01464kB, file-rss:0kB, shmem-rss:0kB
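To tabulate the kills above, the "Killed process" lines can be parsed with a regular expression. This is a rough sketch under assumptions: the console output above is wrapped and has lost characters at the wrap column, so the sample line below is the first kill record rejoined on the assumption that it follows the usual kernel format (`anon-rss:` with a colon). The function and field names are illustrative.

```python
import re

# Matches the kernel's "Killed process" record once the wrapped line is rejoined.
KILLED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
    r" total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_killed(line):
    """Return a dict describing one OOM kill, or None if the line does not match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "uptime_s": float(match.group("ts")),
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "total_vm_kib": int(match.group("total_vm")),
        "anon_rss_kib": int(match.group("anon_rss")),
    }


# First kill record from the console, rejoined; the ':' after anon-rss is assumed.
LINE = (
    "[ 604.928673] Killed process 43143 (nova-compute) "
    "total-vm:2300792kB, anon-rss:119128kB, file-rss:0kB, shmem-rss:0kB"
)

if __name__ == "__main__":
    print(parse_killed(LINE))
```

Lines whose digits were lost at the wrap column (for example the last record's anon-rss value) cannot be recovered this way and are better read from the unwrapped kernel log.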
Expected Behavior
------------------
OOM not seen
Actual Behavior
----------------
OOM seen
Reproducibility
---------------
Seen once
System Configuration
-------
Storage system
Branch/Pull Time/Commit
-------
master load: 20190501T013000Z
Last Pass
---------
N/A
Timestamp/Logs
--------------
Kernel log:
2019-05-... (kernel log entries truncated in the source; only the date prefix survives)
Memory usage fairly stable:
memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:19:56.557 64217.5 61916.4 2039.7 132.1 34.3 537.7 3762.3 2425.7 1.3 0.0 1537.0 2301.1 321.3 28544.0 1979.0 28774.0
done
compute-1:~# memtop
memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:20:01.581 64217.5 61925.4 2024.1 138.8 34.3 543.4 3864.9 2425.7 1.3 0.0 1538.0 2292.1 319.3 28544.0 1973.3 28774.0
done
compute-1:~# memtop
memtop 0.1 -- selected options: delay = 1.000s, repeat = 1, period = 1.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-05-01 18:20:05.181 64217.5 61929.9 2017.7 140.5 34.3 547.5 3964.0 2425.7 1.3 0.0 1538.8 2287.5 318.8 28544.0 1969.3 28774.0
done
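The memtop samples are easier to compare once each row is keyed by its column name. The sketch below pairs the header with the first sample above and sums the per-node HFree columns. It assumes that HFree reports free huge page memory per NUMA node, which is an interpretation of the column name rather than something stated in this report; the function names are illustrative.

```python
HEADER = (
    "yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack "
    "Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree"
)
ROW = (
    "2019-05-01 18:19:56.557 64217.5 61916.4 2039.7 132.1 34.3 537.7 3762.3 "
    "2425.7 1.3 0.0 1537.0 2301.1 321.3 28544.0 1979.0 28774.0"
)


def parse_memtop_row(header, row):
    """Pair column names with values; skips the date and time columns."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return {name: float(value) for name, value in zip(names[2:], values[2:])}


def hugepage_free_mib(sample):
    """Sum of the per-node HFree columns (assumed: free huge page memory, MiB)."""
    return sum(value for name, value in sample.items() if name.endswith(":HFree"))


if __name__ == "__main__":
    sample = parse_memtop_row(HEADER, ROW)
    hfree = hugepage_free_mib(sample)
    print("HFree total MiB:", hfree)
    print("share of Tot: %.1f%%" % (100.0 * hfree / sample["Tot"]))
```

Under that assumption, the two HFree columns account for roughly 89% of Tot in this sample, which would explain why Used is so high while no process in the top output holds a comparable amount.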
Top:
top - 18:20:26 up 54 min, 1 user, load average: 15.41, 15.54, 14.93
Tasks: 414 total, 11 running, 403 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.4 us, 6.2 sy, 0.0 ni, 80.2 id, 3.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65758700 total, 2066240 free, 62952448 used, 740012 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 1862636 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
33738 root 10 -10 5522860 435328 12244 S 200.0 0.7 103:33.86 ovs-vswitchd
183 root 20 0 0 0 0 R 100.0 0.0 39:59.85 kswapd0 <--- ??
39926 root 20 0 280512 98020 0 R 6.2 0.1 0:45.15 /var/lib/openst
41020 root 20 0 280568 96848 60 R 6.2 0.1 0:42.67 /var/lib/openst
87231 root 20 0 301500 100624 0 R 6.2 0.2 0:07.66 nova-compute
96462 root 20 0 80068 4348 0 R 6.2 0.0 0:01.57 python
97522 root 20 0 11692 276 0 R 6.2 0.0 0:00.08 bash
1 root 20 0 126852 5084 2292 S 0.0 0.0 0:13.87 systemd
Test Activity
-------------
Install
tags: added: stx.retestneeded
Some additional info:
compute-1:~$ sudo lsof -nP -a +L1
Password:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
ovsdb-ser 31924 root 7u REG 0,37 159 0 109669 /tmp/tmpfHlfhxc (deleted)
compute-1:~$ sudo systemd-cgtop -m -b -n1 --depth=6
Path Tasks %CPU Memory Input/s Output/s
/ 302 - 2.8G - -
/user.slice - - 1.1G - -
/user.slice/user-0.slice - - 1.1G - -
/k8s-infra - - 971.0M - -
/k8s-infra/kubepods - - 971.0M - -
/k8s-infra/kubepods/besteffort - - 952.6M - -
/system.slice - - 876.8M - -
/system.slice/ovs-vswitchd.service 1 - 416.4M - -
/k8s-infra/kubepods/besteffort/pod7aa85fa3-6c30-11e9-b5e2-001e67680cba - - 219.4M - -
/k8s-infra/kubepods/beste...efcf34189968c7d19744022e6d03780c33a1b0284a5bc7af98 5 - 199.7M - -
/system.slice/docker.service 20 - 162.6M - -
/k8s-infra/kubepods/besteffort/pod7a58b2e8-6c30-11e9-b5e2-001e67680cba - - 157.0M - -
/k8s-infra/kubepods/besteffort/pod7a928ea9-6c30-11e9-b5e2-001e67680cba - - 151.4M - -
/k8s-infra/kubepods/beste...3f681d0ea155a876292f94041a3581f623639b6ca71c464792 2 - 149.3M - -
/k8s-infra/kubepods/besteffort/pod7b223017-6c30-11e9-b5e2-001e67680cba - - 147.7M - -
/k8s-infra/kubepods/beste...27cc220c99d696a85d46f2f1c5b1ea415a3ce2097b1509dd01 11 - 143.7M - -
/k8s-infra/kubepods/beste...ae615c4797728d533eb040a58eeec70c0faf9587742dafd636 1 - 128.9M - -
/k8s-infra/kubepods/besteffort/pod7ad7ef3b-6c30-11e9-b5e2-001e67680cba - - 117.0M - -
/k8s-infra/kubepods/besteffort/pod7a42d449-6c30-11e9-b5e2-001e67680cba - - 110.9M - -
/k8s-infra/kubepods/beste...36ef8b375482506848519026ce48f537c344e43eaa4e2245c6 1 - 109.5M -
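To compare the cgroup totals with the memtop figures, the human-readable sizes printed by systemd-cgtop need converting to a common unit. The helper below is a small sketch with a hypothetical name; it assumes the suffixes are 1024-based (K, M, G, T), which matches how systemd usually formats byte counts but should be confirmed for the version in use.

```python
# Multipliers to MiB, assuming 1024-based suffixes as printed by systemd-cgtop.
_UNIT_TO_MIB = {
    "B": 1.0 / (1024 * 1024),
    "K": 1.0 / 1024,
    "M": 1.0,
    "G": 1024.0,
    "T": 1024.0 * 1024,
}


def cgtop_size_to_mib(text):
    """Convert a cgtop Memory cell such as '2.8G' to MiB; '-' means no data."""
    text = text.strip()
    if text == "-":
        return None
    number, unit = text[:-1], text[-1].upper()
    if unit not in _UNIT_TO_MIB:
        raise ValueError("unrecognised size suffix in %r" % text)
    return float(number) * _UNIT_TO_MIB[unit]


if __name__ == "__main__":
    root_mib = cgtop_size_to_mib("2.8G")   # root cgroup from the output above
    used_mib = 61916.4                     # 'Used' from the first memtop sample
    print("root cgroup MiB: %.1f" % root_mib)
    print("not attributed to cgroups MiB: %.1f" % (used_mib - root_mib))
```

With these inputs the root cgroup accounts for only about 2.8 GiB of the roughly 60 GiB reported as used, so most of the used memory is not charged to any process cgroup.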
compute-1:~$ ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 0 root 644 80 2
0x00000000 32769 root 644 16384 2 ...