Activity log for bug #1823803

Date Who What changed Old value New value Message
2019-04-08 22:21:50 Tee Ngo bug added bug
2019-04-08 22:21:50 Tee Ngo attachment added memdump.tgz https://bugs.launchpad.net/bugs/1823803/+attachment/5254315/+files/memdump.tgz
2019-04-08 22:22:48 Tee Ngo description

Brief Description
-----------------
Shortly after the stx-openstack application is deployed in simplex, platform memory is rapidly depleted, leaving the system unresponsive and finally causing an OOM-induced reboot. However, platform memory appears to stabilize after a series of reboots.

controller-0:~$ last reboot
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:52 - 19:44 (09:51)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 09:13 - 19:44 (10:30)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 07:42 - 19:44 (12:01)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 06:19 - 19:44 (13:24)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:53 - 06:17 (01:23)
reboot system boot 3.10.0-957.1.3.e Mon Apr 8 04:02 - 06:17 (02:14)

controller-0:~$ uptime
19:44:11 up 9:51, 2 users, load average: 3.30, 3.01, 2.69

Severity
--------
Critical

Steps to Reproduce
------------------
Install, configure, and unlock AIOSX, then apply the stx-openstack application.

Expected Behavior
-----------------
No memory alarms. Ideally, platform memory should stay below 50% to accommodate occasional/periodic surges from audits, VM deployments/migrations, and maintenance-related activities.

Actual Behavior
---------------
Major memory alarms appear after the stx-openstack app is applied, and are soon escalated to critical. Processes (e.g. kube-apiserver, mysqld) start getting randomly killed due to OOM.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One-node system, http, IPv4. The nginx workers are likely split between the two controllers in duplex configurations, so it is highly likely that the memory alarm condition is also observable in AIODX.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190406T203346Z"
JOB="STX_build_master_master"
BUILD_BY="starlingx.build@cengn.ca"

Last Pass
---------
The timeframe when this issue might have been introduced is unknown.

Timestamp/Logs
--------------
See the attached memory dumps. After the osh-openstack-ingress chart was processed (around 2019-04-08 05:18:04 in sysinv.log), there were 72 nginx workers (refer to rss.dump) with an average RSS value of 27791. During deployment of the osh-openstack-mariadb chart, the number of workers jumped significantly to 144 (Mon Apr 8 05:18:25 in rss.dump) and again to 216 (Mon Apr 8 05:18:25 in rss.dump). The worker count reached 220 before the first OOM-induced reboot.

Test Activity
-------------
Developer Testing
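The reported worker counts are enough to explain the OOM on their own. A minimal sketch of the arithmetic, assuming rss.dump records per-process RSS in KB as `ps -o rss` does (that unit is an assumption; the dump's exact format is not stated in the report):

```python
# Sketch: estimate the aggregate nginx worker footprint from RSS samples.
# Assumes RSS values are in KB (the ps default); this is an assumption,
# since rss.dump's units are not given in the bug report.

def worker_footprint_mb(rss_kb_samples):
    """Return (worker_count, total_rss_mb) for one snapshot."""
    total_kb = sum(rss_kb_samples)
    return len(rss_kb_samples), total_kb / 1024

# 72 workers averaging 27791 (KB), as observed after the ingress chart:
count, total_mb = worker_footprint_mb([27791] * 72)
print(count, round(total_mb))            # 72 1954

# At the peak of 220 workers, the same average implies roughly 6 GB:
count, total_mb = worker_footprint_mb([27791] * 220)
print(count, round(total_mb / 1024, 1))  # 220 5.8
```

Under that assumption, each doubling of the worker pool adds roughly 2 GB of RSS, which is consistent with the rapid platform-memory depletion described above.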
2019-04-15 17:38:46 Ghada Khalil tags stx.2.0 stx.containers
2019-04-15 17:38:52 Ghada Khalil starlingx: importance Undecided High
2019-04-15 17:38:55 Ghada Khalil starlingx: status New Triaged
2019-04-15 17:41:53 Ghada Khalil starlingx: assignee Cindy Xie (xxie1)
2019-04-26 00:38:08 Cindy Xie starlingx: assignee Cindy Xie (xxie1) Yi Wang (wangyi4)
2019-05-16 06:51:48 OpenStack Infra starlingx: status Triaged In Progress
2019-05-31 13:56:45 OpenStack Infra starlingx: status In Progress Fix Released