OOM on worker node (order zero OOM)

Bug #1832647 reported by Wendy Mitchell
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Bin Yang

Bug Description

Brief Description
-----------------
OOM in xfs_filemap_fault
Order-zero OOM (not a cgroup OOM): the kernel could not satisfy even an order-zero (4 KB) allocation.

Severity
--------
Major

Steps to Reproduce
------------------
1. Installed stx but did not apply the stx-openstack application (Note: labels existed on the worker nodes).
2. Performed system application-apply (versioned tarfile for stx-openstack); the apply was reported as successfully completed.

3. Attempted system application-update to centos-stable-latest:
2019-06-12T18:34:57.000 controller-0 -sh: info HISTORY: PID=1670384 UID=1875 system application-update stx-openstack-1.0-14-centos-stable-latest.tgz
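
For reference (not from the original report): the apply/update status quoted in these steps, and the status tables further down, are produced by the standard sysinv CLI:

system application-list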

Expected Behaviour
------------------
The application-update completes successfully, with no OOM on any worker node.
Actual Behaviour
----------------
Order-zero OOM (Note: it is not a cgroup OOM) on compute-0; the allocator could not satisfy even an order-zero allocation.

See the attached memtop and top output.

In step 3, the application-update fails (and the subsequent recovery fails as well):
...

| application   | version                        | manifest name   | manifest file      | status       | progress                                                                                                                                                                                                            |
| stx-openstack | 1.0-14-centos-stable-versioned | armada-manifest | stx-openstack.yaml | apply-failed | Application update from version 1.0-14-centos-stable-versioned to version 1.0-14-centos-stable-latest aborted. Application recover to version 1.0-14-centos-stable-versioned aborted. Please check logs for detail. |

See the CrashLoopBackOff status of the following pods:
kubectl get pods -n openstack

NAME                                         READY   STATUS             RESTARTS   AGE
neutron-ovs-agent-compute-0-9c041f23-hsx4x   0/1     CrashLoopBackOff   14         61m
neutron-ovs-agent-compute-1-eae26dba-rcd2n   1/1     Running            0          60m
...
nova-compute-compute-0-75ea0372-787m7        1/2     CrashLoopBackOff   13         61m
nova-compute-compute-1-eae26dba-2zk4k        2/2     Running            0          61m
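
Not captured in the original report, but the usual next step for a pod stuck in CrashLoopBackOff is to pull its events and the logs of the previous (crashed) container, e.g.:

kubectl describe pod neutron-ovs-agent-compute-0-9c041f23-hsx4x -n openstack
kubectl logs neutron-ovs-agent-compute-0-9c041f23-hsx4x -n openstack --previous

(These pod logs were attached in later comments.)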

Reproducibility
---------------
TBD

System Configuration
--------------------
standard system

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190612T013000Z"

Timestamp/Logs
--------------
See kern.log from compute-0, also attached.

2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237895] python invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=1000
2019-06-12T18:48:59.180 compute-0 kernel: info [12644.237901] python cpuset=80af688d281bc86ecf6cc924f480c1d292b507e64b737cdb31f287f1ba16a6f8 mems_allowed=0
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237905] CPU: 0 PID: 316722 Comm: python Kdump: loaded Tainted: G O ------------ T 3.10.0-957.12.2.el7.1.tis.x86_64 #1
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237907] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237909] Call Trace:
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237921] [<ffffffff8a608991>] dump_stack+0x19/0x1b
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237928] [<ffffffff8a603dce>] dump_header+0x8e/0x23f
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237934] [<ffffffff8a617516>] ? retint_kernel+0x26/0x30
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237941] [<ffffffff89f886de>] oom_kill_process+0x24e/0x3d0
2019-06-12T18:48:59.180 compute-0 kernel: warning [12644.237944] [<ffffffff89f2df9b>] ? rcu_read_unlock_special+0x1ab/0x1b0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237947] [<ffffffff89f88f43>] out_of_memory+0x4d3/0x510
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237952] [<ffffffff89f8f2b5>] __alloc_pages_nodemask+0xa85/0xb80
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237972] [<ffffffff89fd9ad8>] alloc_pages_current+0x98/0x110
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237975] [<ffffffff89f84447>] __page_cache_alloc+0x97/0xb0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.237978] [<ffffffff89f87258>] filemap_fault+0x278/0x460
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238014] [<ffffffffc059215e>] __xfs_filemap_fault+0x7e/0x200 [xfs]
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238029] [<ffffffffc059238c>] xfs_filemap_fault+0x2c/0x30 [xfs]
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238047] [<ffffffff89fb18fa>] __do_fault.isra.70+0x8a/0x100
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238050] [<ffffffff89fb1f1c>] do_read_fault.isra.72+0x4c/0x1b0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238053] [<ffffffff89fb8547>] handle_mm_fault+0x557/0xc30
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238059] [<ffffffff89e5fd43>] __do_page_fault+0x1e3/0x440
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238061] [<ffffffff89e60015>] do_page_fault+0x35/0x90
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238064] [<ffffffff8a617648>] page_fault+0x28/0x30
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238066] Mem-Info:
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] active_anon:390090 inactive_anon:2108 isolated_anon:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] active_file:29532 inactive_file:42331 isolated_file:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] unevictable:1356 dirty:39 writeback:0 unstable:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] slab_reclaimable:37926 slab_unreclaimable:115835
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] mapped:19639 shmem:3472 pagetables:9901 bounce:0
2019-06-12T18:48:59.181 compute-0 kernel: warning [12644.238074] free:759204 free_pcp:273 free_cma:0
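
For context when reading this trace: order=0 means a single 4 KiB page, and the filemap_fault -> __xfs_filemap_fault path shows the failure happened while faulting in page cache from an XFS-backed file, i.e. the allocator had no free 4 KiB page to hand out. The hugepage pool is accounted separately from these counters, which is how a node can OOM on a 4 KiB request while tens of GB sit idle in hugepages. Standard interfaces for checking this on a live node (not taken from the attached logs):

grep -i huge /proc/meminfo   # HugePages_Total/HugePages_Free: size of the reserved pool
cat /proc/buddyinfo          # free pages per order; order 0 is the first column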

Revision history for this message
Cindy Xie (xxie1) wrote :

@Wendy, can you please check whether you're seeing the same issue as https://bugs.launchpad.net/starlingx/+bug/1827258?

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

stx-openstack-apply.log (from controller-0)
wcp 63-66 hardware

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Attaching the neutron-ovs-agent-compute-0 pod logs as well.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Attaching the nova-compute-compute-0 pod logs as well.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Configuration was vswitch_type=None (OVS running in a container).
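
As an aside (an assumption, not something captured in this report): on builds of this vintage the configured vswitch type is a field of the system record in the sysinv CLI, so it can be confirmed with:

system show | grep vswitch_type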

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Bin Yang (byangintel)
Changed in starlingx:
assignee: nobody → Bin Yang (byangintel)
Revision history for this message
Bin Yang (byangintel) wrote :

It is highlighted that an "order zero" (4 KB) allocation caused the OOM. It looks unreasonable for a 4 KB allocation to fail.

Here we have two problems:
1. Why does a 4 KB memory allocation trigger the OOM?
2. Why does the system not have enough memory?

For #1, please refer to my comment in another bug: https://bugs.launchpad.net/starlingx/+bug/1827258/comments/15

For #2, I see some unreasonable hugepages info when the OOM is triggered:
Node 0 hugepages_total=14104 hugepages_free=14104 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=14381 hugepages_free=14381 hugepages_surp=0 hugepages_size=2048kB

It looks like hugepages occupy most of the system memory, and they are not used at all.

I will continue to look into it.
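
A quick way to confirm the per-NUMA-node reservation on a live node (standard kernel sysfs paths, not from the attached logs):

# free == total for a node means the pool is reserved but completely unused
grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/*_hugepages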

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 release gating; the system should not run out of memory under any condition.

tags: added: stx.2.0 stx.distro.other
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Bin Yang (byangintel) wrote :

2019-06-12T15:18:39.345 compute-0 kernel: debug [ 0.000000] On node 0 totalpages: 8363883

2019-06-12T18:48:59.204 compute-0 kernel: info [12644.238828] Node 0 hugepages_total=14104 hugepages_free=14104 hugepages_surp=0 hugepages_size=2048kB
2019-06-12T18:48:59.204 compute-0 kernel: info [12644.238829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

total (node 0): 8363883 pages * 4 KiB ≈ 32 GB
hugepages total (node 0): 14104 * 2 MiB ≈ 27.5 GB

The hugepage reservation is unreasonable: unused hugepages tie up almost all of node 0's memory. Same issue as LP 1827258.
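
For anyone reproducing this, the platform-side view of the same numbers is available from the sysinv CLI (assumed standard command, not shown in the original logs):

system host-memory-list compute-0   # per-NUMA-node memory and hugepage totals as sysinv sees them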

Ghada Khalil (gkhalil)
tags: added: stx.config
removed: stx.distro.other
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fixed by https://review.opendev.org/667811
Merged on 2019-07-12

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Performed the following verification and did not see this OOM issue.
PV-1
BUILD_ID="2019-08-07_20-59-00"

Note: Platform CPU threshold alarms were reported against each compute, however.

Thu Aug 8 19:29:01 2019

| application         | version                     | manifest name                 | manifest file      | status  | progress                                                                                                          |
| platform-integ-apps | 1.0-7                       | platform-integration-manifest | manifest.yaml      | applied | completed                                                                                                         |
| stx-openstack       | 1.0-17-centos-stable-latest | armada-manifest               | stx-openstack.yaml | applied | Application update from version 1.0-17-centos-stable-versioned to version 1.0-17-centos-stable-latest completed. |

100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 100.00%

2019-08-08T19:19:09.795 [332552.00967] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-2 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:09.795 [332552.00968] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-2 degraded (seq:105)
2019-08-08T19:19:14.873 [332552.00969] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-7 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:14.873 [332552.00970] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-7 degraded (seq:104)
2019-08-08T19:19:18.949 [332552.00971] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-3 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:18.949 [332552.00972] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-3 degraded (seq:107)
2019-08-08T19:19:28.697 [332552.00973] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-4 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:28.697 [332552.00974] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : compute-4 degraded (seq:106)
2019-08-08T19:19:32.547 [332552.00975] controller-0 mtcAgent hbs nodeClass.cpp (4971) collectd_notify_handler : Info : compute-5 collectd degrade state change ; clear -> assert (due to cpu)
2019-08-08T19:19:32.547 [332552.00976] co...


tags: removed: stx.retestneeded