OOM on worker node (order zero OOM)

Bug #1832647 reported by Wendy Mitchell
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Bin Yang

Bug Description

Brief Description
-----------------
OOM in xfs_filemap_fault
Order-zero OOM (not a cgroup OOM): the kernel could not satisfy even an order-zero (4 KB) allocation.

Severity
--------
Major

Steps to Reproduce
------------------
1. Installed stx but did not apply the stx-openstack application (Note: labels existed on the worker nodes).
2. Performed system application-apply (versioned tarfile for stx-openstack); the apply was reported as successfully completed.

3. Attempted system application-update to centos-stable-latest:
2019-06-12T18:34:57.000 controller-0 -sh: info HISTORY: PID=1670384 UID=1875 system application-update stx-openstack-1.0-14-centos-stable-latest.tgz
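
For reference (not from the original report): the apply/update status quoted in these steps, and the status tables further down, are produced by the standard sysinv CLI:

system application-list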

Expected Behaviour
------------------
The application-update completes successfully, with no OOM on any worker node.
Actual Behaviour
----------------
Order-zero OOM (Note: it is not a cgroup OOM) on compute-0; the allocator could not satisfy even an order-zero allocation.

See the attached memtop and top output.

In step 3, the application-update fails (and the subsequent recovery fails as well):
...

| application   | version                        | manifest name   | manifest file      | status       | progress                                                                                                                                                                                                            |
| stx-openstack | 1.0-14-centos-stable-versioned | armada-manifest | stx-openstack.yaml | apply-failed | Application update from version 1.0-14-centos-stable-versioned to version 1.0-14-centos-stable-latest aborted. Application recover to version 1.0-14-centos-stable-versioned aborted. Please check logs for detail. |

See the CrashLoopBackOff status of the following pods:
kubectl get pods -n openstack

NAME                                         READY   STATUS             RESTARTS   AGE
neutron-ovs-agent-compute-0-9c041f23-hsx4x   0/1     CrashLoopBackOff   14         61m
neutron-ovs-agent-compute-1-eae26dba-rcd2n   1/1     Running            0          60m
...
nova-compute-compute-0-75ea0372-787m7        1/2     CrashLoopBackOff   13         61m
nova-compute-compute-1-eae26dba-2zk4k        2/2     Running            0          61m
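
Not captured in the original report, but the usual next step for a pod stuck in CrashLoopBackOff is to pull its events and the logs of the previous (crashed) container, e.g.:

kubectl describe pod neutron-ovs-agent-compute-0-9c041f23-hsx4x -n openstack
kubectl logs neutron-ovs-agent-compute-0-9c041f23-hsx4x -n openstack --previous

(These pod logs were attached in later comments.)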

Reproducibility
---------------
TBD

System Configuration
--------------------
standard system

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190612T013000Z"

Timestamp/Logs
--------------
See kern.log from compute-0, also attached.

2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237895] python invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=1000
2019-06-12T18:48:59.180 compute-0 kernel: info [12644.237901] python cpuset=80af688d281bc86ecf6cc924f480c1d292b507e64b737cdb31f287f1ba16a6f8 mems_allowed=0
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237905] CPU: 0 PID: 316722 Comm: python Kdump: loaded Tainted: G O ------------ T 3.10.0-957.12.2.el7.1.tis.x86_64 #1
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237907] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237909] Call Trace:
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237921] [<ffffffff8a608991>] dump_stack+0x19/0x1b
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237928] [<ffffffff8a603dce>] dump_header+0x8e/0x23f
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237934] [<ffffffff8a617516>] ? retint_kernel+0x26/0x30
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237941] [<ffffffff89f886de>] oom_kill_process+0x24e/0x3d0
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237944] [<ffffffff89f2df9b>] ? rcu_read_unlock_special+0x1ab/0x1b0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237947] [<ffffffff89f88f43>] out_of_memory+0x4d3/0x510
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237952] [<ffffffff89f8f2b5>] __alloc_pages_nodemask+0xa85/0xb80
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237972] [<ffffffff89fd9ad8>] alloc_pages_current+0x98/0x110
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237975] [<ffffffff89f84447>] __page_cache_alloc+0x97/0xb0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237978] [<ffffffff89f87258>] filemap_fault+0x278/0x460
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238014] [<ffffffffc059215e>] __xfs_filemap_fault+0x7e/0x200 [xfs]
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238029] [<ffffffffc059238c>] xfs_filemap_fault+0x2c/0x30 [xfs]
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238047] [<ffffffff89fb18fa>] __do_fault.isra.70+0x8a/0x100
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238050] [<ffffffff89fb1f1c>] do_read_fault.isra.72+0x4c/0x1b0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238053] [<ffffffff89fb8547>] handle_mm_fault+0x557/0xc30
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238059] [<ffffffff89e5fd43>] __do_page_fault+0x1e3/0x440
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238061] [<ffffffff89e60015>] do_page_fault+0x35/0x90
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238064] [<ffffffff8a617648>] page_fault+0x28/0x30
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238066] Mem-Info:
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] active_anon:390090 inactive_anon:2108 isolated_anon:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] active_file:29532 inactive_file:42331 isolated_file:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] unevictable:1356 dirty:39 writeback:0 unstable:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] slab_reclaimable:37926 slab_unreclaimable:115835
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] mapped:19639 shmem:3472 pagetables:9901 bounce:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] free:759204 free_pcp:273 free_cma:0
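
For context when reading this trace: order=0 means a single 4 KiB page, and the filemap_fault -> __xfs_filemap_fault path shows the failure happened while faulting in page cache from an XFS-backed file, i.e. the allocator had no free 4 KiB page to hand out. The hugepage pool is accounted separately from these counters, which is how a node can OOM on a 4 KiB request while tens of GB sit idle in hugepages. Standard interfaces for checking this on a live node (not taken from the attached logs):

grep -i huge /proc/meminfo   # HugePages_Total/HugePages_Free: size of the reserved pool
cat /proc/buddyinfo          # free pages per order; order 0 is the first column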

Revision history for this message
Cindy Xie (xxie1) wrote :

@Wendy, can you please check whether you're seeing the same issue as https://bugs.launchpad.net/starlingx/+bug/1827258?

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

stx-openstack-apply.log (from controller-0)
wcp 63-66 hardware

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Attaching the neutron-ovs-agent-compute-0 pod logs as well.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Attaching the nova-compute-compute-0 pod logs as well.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Configuration was vswitch_type=None (OVS running in a container).
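
As an aside (an assumption, not something captured in this report): on builds of this vintage the configured vswitch type is a field of the system record in the sysinv CLI, so it can be confirmed with:

system show | grep vswitch_type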

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Bin Yang (byangintel)
Changed in starlingx:
assignee: nobody → Bin Yang (byangintel)
Revision history for this message
Bin Yang (byangintel) wrote :

It is highlighted that an "order zero" (4 KB) allocation caused the OOM. It looks unreasonable for a 4 KB allocation to fail.

Here we have two problems:
1. Why does a 4 KB memory allocation trigger the OOM?
2. Why does the system not have enough memory?

For #1, please refer to my comment in another bug: https://bugs.launchpad.net/starlingx/+bug/1827258/comments/15

For #2, I see some unreasonable hugepages info when the OOM is triggered:
Node 0 hugepages_total=14104 hugepages_free=14104 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=14381 hugepages_free=14381 hugepages_surp=0 hugepages_size=2048kB

It looks like hugepages occupy most of the system memory, and they are not used at all.

I will continue to look into it.
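
A quick way to confirm the per-NUMA-node reservation on a live node (standard kernel sysfs paths, not from the attached logs):

# free == total for a node means the pool is reserved but completely unused
grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*_hugepages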

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 release gating; the system should not run out of memory under any condition.

tags: added: stx.2.0 stx.distro.other
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Bin Yang (byangintel) wrote :

2019-06-12T15:18:39.345 compute-0 kernel: debug [ 0.000000] On node 0 totalpages: 8363883

2019-06-12T18:48:59.204 compute-0 kernel: info [12644.238828] Node 0 hugepages_total=14104 hugepages_free=14104 hugepages_surp=0 hugepages_size=2048kB
2019-06-12T18:48:59.204 compute-0 kernel: info [12644.238829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

total (node 0): 8363883 pages * 4 KiB ≈ 32 GB
hugepages total (node 0): 14104 * 2 MiB ≈ 27.5 GB

The hugepage reservation is unreasonable: unused hugepages tie up almost all of node 0's memory. Same issue as LP 1827258.
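
For anyone reproducing this, the platform-side view of the same numbers is available from the sysinv CLI (assumed standard command, not shown in the original logs):

system host-memory-list compute-0   # per-NUMA-node memory and hugepage totals as sysinv sees them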

Ghada Khalil (gkhalil)
tags: added: stx.config
removed: stx.distro.other
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fixed by https://review.opendev.org/667811
Merged on 2019-07-12

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Performed the following verification and did not see this OOM issue.
PV-1
BUILD_ID="2019-08-07_20-59-00"

Note: Platform CPU threshold alarms were reported against each compute, however.

Thu Aug 8 19:29:01 2019

| application         | version                     | manifest name                 | manifest file      | status  | progress                                                                                                          |
| platform-integ-apps | 1.0-7                       | platform-integration-manifest | manifest.yaml      | applied | completed                                                                                                         |
| stx-openstack       | 1.0-17-centos-stable-latest | armada-manifest               | stx-openstack.yaml | applied | Application update from version 1.0-17-centos-stable-versioned to version 1.0-17-centos-stable-latest completed. |

100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%

2019-08-08T19:19:09.795 [332552.00967] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-2 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:09.795 [332552.00968] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-2 degraded (seq:105)
2019-08-08T19:19:14.873 [332552.00969] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-7 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:14.873 [332552.00970] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-7 degraded (seq:104)
2019-08-08T19:19:18.949 [332552.00971] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-3 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:18.949 [332552.00972] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-3 degraded (seq:107)
2019-08-08T19:19:28.697 [332552.00973] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-4 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:28.697 [332552.00974] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-4 degraded (seq:106)
2019-08-08T19:19:32.547 [332552.00975] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-5 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:32.547 [332552.00976] co...


tags: removed: stx.retestneeded