OOM issues in the gate

Bug #1656386 reported by Dariusz Smigiel
This bug affects 2 people
Affects          Status        Importance   Assigned to   Milestone
OpenStack-Gate   Fix Released  Undecided    Unassigned
neutron          Won't Fix     Wishlist     Unassigned

Bug Description

A couple of examples of recent leaks on the linuxbridge job: [1], [2]

[1] http://logs.openstack.org/73/373973/13/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/295d92f/logs/syslog.txt.gz#_Jan_11_13_56_32
[2] http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/syslog.txt.gz#_Jan_11_15_54_36

Close to the end of the test run, swap consumption grows pretty quickly, exceeding 2 GB.
I didn't find the root cause of that.

tags: added: gate-failure
Changed in neutron:
importance: Undecided → Critical
Revision history for this message
Jakub Libosvar (libosvar) wrote :
summary: - Memory leaks on linuxbridge job
+ Memory leaks on Neutron jobs
Revision history for this message
Jakub Libosvar (libosvar) wrote : Re: Memory leaks on Neutron jobs
Changed in neutron:
status: New → Confirmed
Revision history for this message
Kevin Benton (kevinbenton) wrote :

I mentioned this on the mailing list, but it looks like we are hardly using any swap space at all. @Darek, where did you see that we are using lots of swap?

Revision history for this message
Dariusz Smigiel (smigiel-dariusz) wrote :

Kevin, I plotted a couple of dstat outputs from when the problems started. Every time, I was able to confirm that swap consumption rose around the same time.
Now I see it completely reversed.

For [1] I see this output [2].

[1] http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/dstat-csv_log.txt.gz
[2] https://imgur.com/a/59KYz
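
For reference, a minimal sketch of how the swap column can be pulled out of one of these dstat CSVs for plotting. The number of header rows and the column index for the swap "used" value vary between dstat versions, so check the header lines first; the column 7 used here is an assumption, and gunzip the file first if it really is compressed:

wget -q http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/dstat-csv_log.txt.gz -O dstat.csv
head -n 7 dstat.csv                                # header rows name the columns; find swap "used"
awk -F, 'NR>7 { print $1 "," $7 }' dstat.csv > swap.csv   # time,swap-used pairs, ready for gnuplot or a spreadsheet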

Revision history for this message
Dariusz Smigiel (smigiel-dariusz) wrote :
Changed in neutron:
assignee: nobody → Darek Smigiel (smigiel-dariusz)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Around 17 hits/day now.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

From the neutron-drivers meeting, possible causes:

  * OpenStack services in general using more memory
  * Possible changes in the newest Ubuntu kernel.
  * Something in the kernel triggers the OOM even though there's enough swap available.

Possible ideas for the short term (a rough sketch follows below):

  * Increase swappiness settings
  * Tune down MySQL caching (at the cost of performance)

Long term:
  * we should make a cross-project effort to reduce our memory footprints. (Heat people already worked on that)
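
A rough sketch of what those short-term knobs could look like on a Xenial node; the sysctl is the standard knob, while the MySQL values are purely illustrative and not what devstack actually configures:

sudo sysctl vm.swappiness=100      # Ubuntu default is 60; higher values push anonymous pages to swap more eagerly
printf '[mysqld]\ninnodb_buffer_pool_size = 64M\nperformance_schema = off\n' | sudo tee /etc/mysql/conf.d/gate-memory.cnf
sudo service mysql restart         # a smaller buffer pool trades performance for memory, as noted above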

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

At this point I wonder if something like [1] is worth pursuing. At least it attempts to keep the MySQL memory footprint smaller.

[1] https://review.openstack.org/#/c/426029/2

Revision history for this message
Kevin Benton (kevinbenton) wrote :

When OOM killer was invoked there was 186M swap used and 7626M free.

See 16:15:06 here for that entry: http://logs.openstack.org/93/426793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/2b47c1a/logs/screen-dstat.txt.gz

If we can't use swap space reliably, then saving memory usage is going to be a really difficult battle unless you find something that cuts off a couple of gigs.

Revision history for this message
Antonio Ojea (aojea) wrote :

wget http://logs.openstack.org/93/426793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/2b47c1a/logs/ps.txt.gz

OS_SERVICES="neutron nova swift cinder qemu keystone horizon glance"

bash-3.2$ for i in $OS_SERVICES; do echo "Service $i Memory Usage: $(grep $i ps.txt.gz | awk '{ print $8 }' | paste -sd+ - | bc)" ; done | tee memory_usage.txt
Service neutron Memory Usage: 1756340
Service nova Memory Usage: 1953568
Service swift Memory Usage: 703456
Service cinder Memory Usage: 920088
Service qemu Memory Usage: 169752
Service keystone Memory Usage: 903204
Service horizon Memory Usage: 25236
Service glance Memory Usage: 740652

Total memory usage by OpenStack services:
bash-3.2$ cat memory_usage.txt | awk '{ print $5}' | paste -sd+ - | bc
7172296

Neutron and nova are the most demanding services; I guess that reducing the number of workers could make some room.

bash-3.2$ grep nova-api ps.txt.gz
stack 2111 1 2111 1.6 1.5 112504 124300 /usr/bin/python /usr/local/bin/nova-api
stack 2241 2111 2241 11.9 2.4 186704 197536 /usr/bin/python /usr/local/bin/nova-api
stack 2242 2111 2242 11.1 2.3 180648 191840 /usr/bin/python /usr/local/bin/nova-api
stack 2826 2111 2826 0.2 1.9 147908 155948 /usr/bin/python /usr/local/bin/nova-api
stack 2827 2111 2827 0.2 1.8 146216 154196 /usr/bin/python /usr/local/bin/nova-api
bash-3.2$ grep neutron-api ps.txt.gz
bash-3.2$ grep neutron-server ps.txt.gz
stack 2698 1 2698 0.1 1.2 92744 104268 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
stack 2976 2698 2976 37.3 1.7 131684 139904 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
stack 2977 2698 2977 40.6 1.7 132916 141040 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
stack 2978 2698 2978 28.2 1.6 126208 134348 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
stack 2979 2698 2979 0.1 1.2 96976 102680 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
stack 2980 2698 2980 2.0 1.4 116652 122472 /u
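
To see how much of that comes down to the number of forked workers, a quick and rough count of processes per service from the same ps.txt.gz dump; note this counts any line mentioning the service name, so it is only a loose proxy for worker counts:

for i in $OS_SERVICES; do echo "Service $i process count: $(grep -c $i ps.txt.gz)"; done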

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

RSS is not quite the same as actual memory used: shared pages are counted once per process, so summing RSS across forked workers overestimates it a lot.
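
For completeness, a sketch of a less inflated measurement on a live node (it cannot be reconstructed from ps.txt.gz): PSS splits shared pages across the processes mapping them, so summing it over forked workers avoids the double counting. smaps_rollup needs a reasonably recent kernel (4.14+); on older kernels parse /proc/<pid>/smaps instead:

for pid in $(pgrep -f neutron-server); do
sudo awk '/^Pss:/ { kb += $2 } END { print kb }' /proc/$pid/smaps_rollup   # per-worker PSS in kB
done | paste -sd+ - | bc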

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Either way, I totally agree with the sentiment that we should determine whether we can trim down neutron's memory footprint :)

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I took the liberty of running the above bash commands on [1], [2] and [3] (which are mitaka), and this is what I got:

Service neutron Memory Usage: 1395816
Service nova Memory Usage: 1703248
Service swift Memory Usage: 643940
Service cinder Memory Usage: 731640
Service qemu Memory Usage:
Service keystone Memory Usage: 761212
Service horizon Memory Usage: 17064
Service glance Memory Usage: 538260
-------------
Service neutron Memory Usage: 1391592
Service nova Memory Usage: 1699580
Service swift Memory Usage: 614824
Service cinder Memory Usage: 732296
Service qemu Memory Usage:
Service keystone Memory Usage: 760984
Service horizon Memory Usage: 17088
Service glance Memory Usage: 538680
-------------
Service neutron Memory Usage: 1395148
Service nova Memory Usage: 1700740
Service swift Memory Usage: 639156
Service cinder Memory Usage: 728424
Service qemu Memory Usage:
Service keystone Memory Usage: 779816
Service horizon Memory Usage: 16956
Service glance Memory Usage: 538712

[1] http://logs.openstack.org/90/421990/2/check/gate-tempest-dsvm-neutron-full-ubuntu-trusty/3414755/logs/
[2] http://logs.openstack.org/06/423206/1/check/gate-tempest-dsvm-neutron-full-ubuntu-trusty/8ed673c/logs/ps.txt.gz
[3] http://logs.openstack.org/20/418120/8/check/gate-tempest-dsvm-neutron-full-ubuntu-trusty/463c4d7/logs/ps.txt.gz

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Same campaign on newton stable changes:

Service neutron Memory Usage: 1597120
Service nova Memory Usage: 1672088
Service swift Memory Usage: 779300
Service cinder Memory Usage: 878768
Service qemu Memory Usage:
Service keystone Memory Usage: 919124
Service horizon Memory Usage: 21564
Service glance Memory Usage: 721296
----------
Service neutron Memory Usage: 1590636
Service nova Memory Usage: 1673776
Service swift Memory Usage: 776248
Service cinder Memory Usage: 880852
Service qemu Memory Usage:
Service keystone Memory Usage: 913752
Service horizon Memory Usage: 21288
Service glance Memory Usage: 719128
----------
Service neutron Memory Usage: 1584952
Service nova Memory Usage: 1668928
Service swift Memory Usage: 775976
Service cinder Memory Usage: 870576
Service qemu Memory Usage:
Service keystone Memory Usage: 945896
Service horizon Memory Usage: 33804
Service glance Memory Usage: 718808

[1] http://logs.openstack.org/59/422059/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b4e8762/logs/ps.txt.gz
[2] http://logs.openstack.org/64/422464/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/8b5b7ac/logs/ps.txt.gz
[3] http://logs.openstack.org/57/423557/3/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/d01c20f/logs/ps.txt.gz

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

So in a nutshell neutron has gone from ~1.4 GB (mitaka) to ~1.6 GB (newton) and ~1.8 GB (ocata).

Changed in neutron:
milestone: none → ocata-rc1
tags: added: ocata-rc-potential
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Whether or not RSS is a reliable indication of memory footprint, it has increased since Mitaka.

Revision history for this message
Matt Riedemann (mriedem) wrote :

It'd be good to get an idea of what is taking up the most space; my guess would be versioned objects, but I'm not sure which ones, or what's holding onto them.
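
Not a Python heap profiler, but one cheap first cut on a live worker is to check whether the growth is anonymous heap (objects) rather than mapped libraries; pmap is standard procps, and the pid selection here is just an example:

sudo pmap -x $(pgrep -f neutron-server | head -1) | sort -k3 -n | tail -20   # 20 largest mappings by RSS, plus the total line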

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Running with one API worker per service cuts down the memory footprint quite a bit:

Service neutron Memory Usage: 1537072
Service nova Memory Usage: 1240048
Service swift Memory Usage: 694140
Service cinder Memory Usage: 778824
Service qemu Memory Usage:
Service keystone Memory Usage: 891460
Service horizon Memory Usage: 24348
Service glance Memory Usage: 490388
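
A sketch, not the exact change that was made: in a devstack-based job the worker count can be pinned from localrc (or the [[local|localrc]] section of local.conf), and devstack propagates it to the per-service worker options:

API_WORKERS=1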

summary: - Memory leaks on Neutron jobs
+ Reduce neutron services' memory footprint
Changed in neutron:
assignee: Darek Smigiel (smigiel-dariusz) → nobody
milestone: ocata-rc1 → pike-1
importance: Critical → Wishlist
tags: removed: gate-failure ocata-rc-potential
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Reduce neutron services' memory footprint

This bug requires more attention from *all* OpenStack projects and infra folks, IMO. Related thread: [0].
I suggest trying the fancy bcc tools [1] for BPF in Linux 4.x kernels; see the example at [2]. This could really help to track the root cause down.

[0] http://lists.openstack.org/pipermail/openstack-dev/2017-February/thread.html#111568
[1] https://github.com/iovisor/bcc/blob/master/tools/oomkill.py
[2] https://github.com/iovisor/bcc/blob/master/tools/oomkill_example.txt#L14
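
A sketch of what that would look like on a node with bcc installed and a BPF-capable 4.x kernel; the tool name and path vary by packaging (oomkill-bpfcc from Ubuntu's bpfcc-tools, tools/oomkill.py from a source checkout):

sudo /usr/share/bcc/tools/oomkill
# prints one line per OOM kill: the triggering PID/comm, the killed PID/comm, pages requested and the load average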

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

Gate runs are failing because towards the end of the test phase the VM runs out of memory.

Usually mysql gets killed, and until it is restarted keystone fails to connect, which leads to several test failures, e.g. http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28

A logstash query reveals 90 hits in the past 10 days or so, all on RAX infra.

summary: - Reduce neutron services' memory footprint
+ OOM issues in the gate
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

TripleO is producing a high rate of false positives now because they do this:

Invoked with warn=True executable=None _uses_shell=True _raw_params=grep -v ansible-command /var/log/messages | grep oom-killer && grep -v ansible-command /var/log/messages | grep oom-killer > /var/log/extra/oom-killers.txt removes=None creates=None chdir=None
node_provider

See e.g. http://logs.openstack.org/23/445523/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/e7cb7b3/logs/syslog.txt

So the query needs to be made more specific.
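
A hedged sketch of the narrower signal meant here, counting genuine OOM kills in a job's syslog while skipping the TripleO ansible-command lines that merely quote the string; the logstash query itself would need a similar exclusion on the message field:

zgrep oom-killer syslog.txt.gz | grep -vc ansible-command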

Revision history for this message
Matt Riedemann (mriedem) wrote :

(5:07:19 PM) clarkb: mriedem: there were a lot of small things we did
(5:07:56 PM) clarkb: mriedem: we enabled same page merging or whatever it's called so libvirt VMs would share memory. We reduced the number of swift processes. We reduced the number of apache workers
(5:08:11 PM) clarkb: mriedem: I don't think any openstack projects did anything to reduce their memory consumption though
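
For reference, a sketch of the same-page-merging side of that (standard sysfs knobs; this assumes qemu's mem-merge is left at its default so guest memory is marked mergeable, and is not necessarily the exact tuning infra applied):

echo 1 | sudo tee /sys/kernel/mm/ksm/run       # enable KSM so identical guest pages get merged
cat /sys/kernel/mm/ksm/pages_sharing           # climbs above 0 once merging actually kicks in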

Changed in openstack-gate:
status: New → Fix Released
Changed in neutron:
milestone: pike-1 → pike-2
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: Confirmed → Won't Fix