Large number of nginx workers in one node system contributing to memory alarm condition
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | High | Yi Wang | | |
Bug Description
Brief Description
-----------------
Shortly after the stx-openstack application is deployed in simplex, platform memory is rapidly depleted, leading to an unresponsive system and finally an OOM-induced reboot.
However, the platform memory appears to stabilize after a series of reboots:
controller-0:~$ last reboot
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:52 - 19:44 (09:51)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:13 - 19:44 (10:30)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 07:42 - 19:44 (12:01)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 06:19 - 19:44 (13:24)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:53 - 06:17 (01:23)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:02 - 06:17 (02:14)
controller-0:~$ uptime
19:44:11 up 9:51, 2 users, load average: 3.30, 3.01, 2.69
Severity
--------
Critical
Steps to Reproduce
------------------
Install, configure, unlock AIOSX and apply stx-openstack application
Expected Behavior
------------------
No memory alarms. Ideally, platform memory should stay below 50% to accommodate occasional/periodic surges from audits, VM deployments/
Actual Behavior
----------------
Major memory alarms appear after the stx-openstack app is applied. These alarms are shortly upgraded to critical. Processes (e.g. kube-apiserver, mysqld, etc.) then start getting killed at random due to OOM.
Reproducibility
---------------
Reproducible
System Configuration
-------
One-node system, http, IPv4. The number of nginx workers is likely split between the 2 controllers in duplex configurations. It is highly likely that the memory alarm condition is also observable in AIODX.
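To compare nginx worker counts and memory footprint across nodes, the output of `ps -eo rss,args` can be tallied. The following is a minimal sketch, not part of the original report: the function name and the sample figures are made up for illustration, and it assumes nginx's usual `nginx: worker process` process title.

```python
from typing import Tuple


def summarize_nginx_workers(ps_output: str) -> Tuple[int, int]:
    """Return (worker_count, total_rss_kib) from `ps -eo rss,args` text."""
    count = 0
    total_rss_kib = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and any blank or malformed lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kib, args = parts
        # Count only workers, not the nginx master process.
        if args.startswith("nginx: worker process"):
            count += 1
            total_rss_kib += int(rss_kib)
    return count, total_rss_kib


# Illustrative input only; these numbers are NOT taken from the affected system.
sample = """\
  RSS COMMAND
 2100 nginx: master process nginx -g daemon off;
35000 nginx: worker process
36000 nginx: worker process
 5000 /usr/bin/python example
"""

workers, rss_kib = summarize_nginx_workers(sample)
print(f"{workers} nginx workers, {rss_kib} KiB resident in total")
```

Running the same tally on each controller of a duplex system would show whether the workers are in fact split between the two nodes.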
Branch/Pull Time/Commit
-------
BUILD_ID=
JOB="STX_
<email address hidden>"
Last Pass
---------
The timeframe when this issue might be introduced is unknown.
Timestamp/Logs
--------------
See memory dumps attached
After osh-openstack-
During the deployment of osh-openstack-
Test Activity
-------------
Developer Testing
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Yi Wang (wangyi4)
Marking as release gating; high priority given the system becomes unusable after 10 hours.
This will require an upstream change in openstack-helm, which currently just creates as many workers as there are cores on the node. This doesn't work well on a simplex node, and openstack-helm doesn't currently provide a mechanism to override it. The upstream code change needs to focus on allowing the override so that the values can be customized in StarlingX.
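The requested behavior can be sketched as follows: default to one worker per core, but let an explicit override take precedence. This is only an illustration of the idea, not the openstack-helm implementation; the function name `worker_count` and its parameters are hypothetical.

```python
import os
from typing import Optional


def worker_count(override: Optional[int] = None,
                 cpu_count: Optional[int] = None) -> int:
    """Pick the number of workers to start.

    override:  hypothetical user-supplied value (e.g. from a chart override).
    cpu_count: core count; detected from the host when not given.
    """
    if override is not None:
        if override < 1:
            raise ValueError("worker override must be at least 1")
        return override
    # Current behavior described above: one worker per core.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores)


# Without an override the count follows the core count.
print(worker_count(cpu_count=36))
# With an override the deployment can keep the count small on a simplex node.
print(worker_count(override=2, cpu_count=36))
```

With such an override in place, StarlingX could supply a small fixed value for one-node systems while other deployments keep the per-core default.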