Large number of nginx workers in one node system contributing to memory alarm condition

Bug #1823803 reported by Tee Ngo
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Yi Wang

Bug Description

Brief Description
-----------------
Shortly after the stx-openstack application is deployed in simplex, platform memory is rapidly depleted, leading to an unresponsive system and finally an OOM-induced reboot.

However, the platform memory appears to stabilize after a series of reboots:

controller-0:~$ last reboot
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:52 - 19:44 (09:51)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:13 - 19:44 (10:30)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 07:42 - 19:44 (12:01)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 06:19 - 19:44 (13:24)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:53 - 06:17 (01:23)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:02 - 06:17 (02:14)

controller-0:~$ uptime
 19:44:11 up 9:51, 2 users, load average: 3.30, 3.01, 2.69

Severity
--------
Critical

Steps to Reproduce
------------------
Install, configure, and unlock AIOSX, then apply the stx-openstack application.

Expected Behavior
------------------
No memory alarms. Ideally, platform memory should stay below 50% to accommodate occasional/periodic surges from audits, VM deployments/migrations, and maintenance-related activities.

Actual Behavior
----------------
Major memory alarms appear after the stx-openstack app is applied. These alarms are shortly upgraded to critical. Processes (e.g. kube-apiserver, mysqld) then start getting randomly killed due to OOM.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system, http, IPv4. The nginx workers are likely split between the 2 controllers in duplex configurations. It is highly likely that the memory alarm condition is also observable in AIODX.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190406T203346Z"

JOB="STX_build_master_master"
<email address hidden>"

Last Pass
---------
The timeframe when this issue might be introduced is unknown.

Timestamp/Logs
--------------
See memory dumps attached
After the osh-openstack-ingress chart was processed (around 2019-04-08 05:18:04 in sysinv.log), there were 72 nginx workers (refer to rss.dump) with an average RSS value of 27791.

During the deployment of the osh-openstack-mariadb chart, the number of workers jumped significantly to 144 (Mon Apr 8 05:18:25 in rss.dump) and again to 216 (Mon Apr 8 05:18:25 in rss.dump). The number of workers reached 220 before the first OOM-induced reboot.
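
The worker counts and average RSS quoted above could be reproduced with something like the following. This is only a sketch: the layout of rss.dump is not shown in this report, so the code assumes input shaped like the output of `ps -eo rss,args` (an RSS column followed by the command line), and the function name `summarize_nginx_workers` and the sample data are made up for illustration.

```python
def summarize_nginx_workers(ps_output):
    """Return (worker_count, average_rss) for nginx worker processes.

    Assumes each line looks like "<rss> <command line>", as produced by
    `ps -eo rss,args`. This layout is an assumption, not the rss.dump format.
    """
    rss_values = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row, blank lines, and malformed lines
        rss, command = parts
        # nginx sets the process title of each worker to "nginx: worker process"
        if command.startswith("nginx: worker process"):
            rss_values.append(int(rss))
    if not rss_values:
        return 0, 0.0
    return len(rss_values), sum(rss_values) / len(rss_values)


# Made-up sample data, not taken from the attached dumps.
sample = """\
  RSS COMMAND
 5120 nginx: master process /usr/sbin/nginx
27000 nginx: worker process
28000 nginx: worker process
 9000 /usr/bin/kube-apiserver
"""

count, average = summarize_nginx_workers(sample)
print(count, average)
```

Running the same summary at intervals during the application apply would show the step changes (72, 144, 216) described above.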

Test Activity
-------------
Developer Testing

Tee Ngo (teewrs) wrote :
description: updated
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given the system becomes unusable after 10 hours.

This will require an upstream change in openstack-helm, because it simply creates a number of workers matching the number of cores on the node. This doesn't work well on a simplex node. openstack-helm doesn't currently provide a mechanism to override this. The upstream code change needs to focus on allowing the override so that the values can be customized in StarlingX.
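
To illustrate the sizing behaviour described here: when nginx's worker setting is "auto", it starts one worker per detected CPU core, while an explicit number caps the count regardless of core count. The sketch below models only that rule. The function `resolve_worker_processes` is hypothetical and is not part of nginx, openstack-helm, or StarlingX.

```python
import os


def resolve_worker_processes(setting, cpu_count):
    """Model how a worker-count setting maps to a number of workers.

    "auto" follows the core count; any other value is read as a fixed count.
    This is an illustrative model, not code from nginx or openstack-helm.
    """
    if setting == "auto":
        return cpu_count
    return int(setting)


cores = os.cpu_count() or 1  # cpu_count() may return None on some platforms
print("auto ->", resolve_worker_processes("auto", cores))
print("fixed ->", resolve_worker_processes("4", cores))
```

On a node with many cores, "auto" yields one worker per core in every ingress controller pod, which is why an override mechanism is needed.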

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Yi Wang (wangyi4)
Yi Wang (wangyi4) wrote :

I reproduced this issue on my simplex deployment (two Intel(R) Xeon(R) Gold 6139 CPUs + 192G memory). There are three ingress controller pods. Each pod has one nginx master process and 72 worker processes. In total, the nginx processes in the three pods consumed ~8G of memory. I am working on a patch to modify the nginx configuration.
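
The figures in this comment can be checked with a little arithmetic. The sketch below assumes "~8G" means roughly 8 GiB and that memory is spread evenly across processes. Both are simplifications made for illustration, not measurements.

```python
PODS = 3                 # ingress controller pods, from the comment above
WORKERS_PER_POD = 72     # nginx worker processes per pod
MASTERS_PER_POD = 1      # nginx master process per pod
TOTAL_MEMORY_GIB = 8     # "~8G", assumed here to mean about 8 GiB

total_workers = PODS * WORKERS_PER_POD
total_processes = PODS * (WORKERS_PER_POD + MASTERS_PER_POD)

# Rough per-process footprint if memory were spread evenly (an assumption).
mib_per_process = TOTAL_MEMORY_GIB * 1024 / total_processes

print(total_workers)              # 216
print(total_processes)            # 219
print(round(mib_per_process, 1))  # 37.4
```

The total of 216 workers matches the count recorded in rss.dump in the bug description.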

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659456

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to upstream (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659464

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659533

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/659456
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=f6e7130cabfd85697fd46ed345bbf220b5601360
Submitter: Zuul
Branch: master

commit f6e7130cabfd85697fd46ed345bbf220b5601360
Author: Yi Wang <email address hidden>
Date: Thu May 16 15:26:57 2019 +0800

    Fix large number of nginx worker issue

    Override nginx "worker-processes" setting in ingress controller.
    Default value is changed from auto to 4 to reduce memory
    consumption by nginx worker processes. 4 worker can give 2 per
    platform CPU (in AIO) to avoid blocking all users in case that
    part of workers are blocked.
    The static override is done in the Armada manifest.

    Closes-Bug: #1823803
    Change-Id: I1f92cf0c3fdfde41364abe65e4747d2091c4c3ea
    Signed-off-by: Yi Wang <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to upstream (master)

Reviewed: https://review.opendev.org/659464
Committed: https://git.openstack.org/cgit/starlingx/upstream/commit/?id=6a341bbb5e9adde2e81b52abd0905cfd987f0eab
Submitter: Zuul
Branch: master

commit 6a341bbb5e9adde2e81b52abd0905cfd987f0eab
Author: Yi Wang <email address hidden>
Date: Fri May 17 15:03:41 2019 +0800

    Add a configmap for mariadb ingress controller

    The configmap is for the nginx ingress controller in mariadb
    chart. With it, we enable the capability of overriding default
    nginx configurations in the ingress controller.

    Submitted this patch to upstream openstack-helm-infra also.
    https://review.opendev.org/#/c/659560/

    Closes-Bug: #1823803
    Change-Id: Ibda2aef7413b4bf3cb990600463389a0b3661022
    Signed-off-by: Yi Wang <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/659533
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=6dfb75be644af97f75ca394aa46970c6550ed4aa
Submitter: Zuul
Branch: master

commit 6dfb75be644af97f75ca394aa46970c6550ed4aa
Author: Yi Wang <email address hidden>
Date: Fri May 17 15:14:10 2019 +0800

    Fix mariadb nginx worker number issue

    Override nginx "worker-processes" setting in mariadb ingress
    controller. Default value is changed from auto to 4 to reduce
    memory consumption by nginx worker processes. 4 worker can
    give 2 per platform CPU (in AIO) to avoid blocking all users
    in case that part of workers are blocked.
    The static override is done in the Armada manifest.

    Closes-Bug: #1823803
    Depends-On: https://review.opendev.org/#/c/659464/
    Change-Id: If0e6d2b2ac45dedbd9e67b4f866702d9de1db15c
    Signed-off-by: Yi Wang <email address hidden>
