Memory leak in some neutron agents

Bug #1823818 reported by Lei Zhang
This bug affects 6 people
Affects        Status    Importance  Assigned to  Milestone
kolla          Invalid   Undecided   Unassigned
kolla (Rocky)  Triaged   High        Unassigned
neutron        Invalid   High        Unassigned

Bug Description

We have an OpenStack deployment running the Rocky release. We have seen a memory leak in some neutron agents twice in our environment since it was first deployed this January.

Below are some of the commands we ran to identify the issue and their corresponding output:

This was on one of the compute nodes:
-----------------------------------------------
[root@c1s4 ~]# ps aux --sort -rss|head -n2

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

42435 48229 3.5 73.1 98841060 96323252 pts/13 S+ 2018 1881:25 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
-----------------------------------------------
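A one-off snapshot like this cannot distinguish a leak from a transient spike. A minimal sketch for confirming steady RSS growth over time, assuming the PID from the ps output above (substitute your own):

-----------------------------------------------
# Sample the RSS (in KB) of the suspect agent every 10 minutes.
# 48229 is the PID from the ps output above; a steadily climbing
# value indicates a leak rather than a momentary spike.
while true; do
    printf '%s %s\n' "$(date +%FT%T)" "$(ps -o rss= -p 48229)"
    sleep 600
done
-----------------------------------------------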

And this was on one of the controller nodes:
-----------------------------------------------
[root@r1 neutron]# ps aux --sort -rss|head

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

42435 30940 3.1 48.6 68596320 64144784 pts/37 S+ Jan08 588:26 /usr/bin/python2 /usr/bin/neutron-lbaasv2-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/lbaas_agent.ini --config-file /etc/neutron/neutron_lbaas.conf

42435 20902 2.8 26.1 36055484 34408952 pts/35 S+ Jan08 525:12 /usr/bin/python2 /usr/bin/neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini

42434 34199 7.1 6.0 39420516 8033480 pts/11 Sl+ 2018 3620:08 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/usr/lib64/mysql/plugin --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_on=ON --log-error=/var/log/kolla/mariadb/mariadb.log --pid-file=/var/lib/mysql/mariadb.pid --port=3306 --wsrep_start_position=0809f452-0251-11e9-8e60-6ad108d9be7b:0

42435 8327 2.6 2.2 3546004 3001772 pts/10 S+ Jan17 152:04 /usr/bin/python2 /usr/bin/neutron-l3-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/neutron_vpnaas.conf --config-file /etc/neutron/l3_agent.ini --config-file /etc/neutron/fwaas_driver.ini

42435 40171 2.6 2.1 3893480 2840852 pts/19 S+ Jan16 190:54 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini

root 42430 3.1 0.3 4412216 495492 pts/29 SLl+ Jan16 231:20 /usr/sbin/ovs-vswitchd unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --log-file=/var/log/kolla/openvswitch/ovs-vswitchd.log
---------------------------------------------

When it happened, we saw a lot of 'OSError: [Errno 12] Cannot allocate memory' errors in various neutron-* logs, because there was no free memory left. However, we do not yet know what triggered the memory leak.
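A quick way to gauge how widespread the failure was is to count those errors per log file. A sketch, assuming kolla's neutron log directory is /var/log/kolla/neutron (inferred from the /var/log/kolla paths visible in the ps output above):

-----------------------------------------------
# Count allocation failures per neutron log file; the log directory
# is an assumption based on the kolla log paths seen earlier.
grep -c "Cannot allocate memory" /var/log/kolla/neutron/*.log
-----------------------------------------------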

Here is our globals.yml:
---------------------------------------------
[root@r1 kolla]# cat globals.yml |grep -v "^#"|tr -s "\n"
---
openstack_release: "rocky"
kolla_internal_vip_address: "172.21.69.22"
enable_barbican: "yes"
enable_ceph: "yes"
enable_ceph_mds: "yes"
enable_ceph_rgw: "yes"
enable_cinder: "yes"
enable_neutron_lbaas: "yes"
enable_neutron_fwaas: "yes"
enable_neutron_agent_ha: "yes"
enable_ceph_rgw_keystone: "yes"
ceph_pool_pg_num: 16
ceph_pool_pgp_num: 16
ceph_osd_store_type: "xfs"
glance_backend_ceph: "yes"
glance_backend_file: "no"
glance_enable_rolling_upgrade: "no"
ironic_dnsmasq_dhcp_range:
tempest_image_id:
tempest_flavor_ref_id:
tempest_public_network_id:
tempest_floating_network_name:
-----------------------------------------------

I did some searching on Google and found this OVS bug, which looks highly related: https://bugzilla.redhat.com/show_bug.cgi?id=1667007

Has the fix been included in the latest Rocky kolla images?

Best regards,

Lei

Tags: ovs
Tom Fifield (fifieldt)
Changed in kolla:
status: New → Confirmed
Mark Goddard (mgoddard) wrote:

The linked RH bugzilla bug suggests that the OVS fix is included in v2.11.0. The kolla image just installs these packages:

RPM: openvswitch, python-openvswitch
DEB: openvswitch-switch, python-openvswitch

So it really depends on what is included in the distro packages. On the master branch of kolla, in the CentOS image I see

openvswitch-2.11.0-4.el7.x86_64

which comes from the delorean-master-testing yum repo.

On the kolla rocky branch, in the CentOS image I see

openvswitch-2.10.1-3.el7.x86_64

which comes from the centos-openstack-rocky yum repo.
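To verify which OVS version a given image actually ships, the package can be queried inside the image directly. A sketch; the image name and tag are examples and will vary with your registry, namespace, and install type:

-----------------------------------------------
# Query the openvswitch RPM baked into a kolla image.
# Image name/tag are examples; adjust for your deployment.
docker run --rm kolla/centos-binary-openvswitch-vswitchd:rocky \
    rpm -q openvswitch
-----------------------------------------------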

Slawek Kaplonski (slaweq) wrote:

It looks like this is really the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1667007, so it is not directly an issue in neutron but in Open vSwitch.
I will mark it as invalid for neutron, but feel free to change that if it turns out to be a different issue.

tags: added: ovs
Changed in neutron:
status: New → Invalid
importance: Undecided → High
Bernard Cafarelli (bcafarel) wrote:

And in the meantime, for the kolla side I filed https://bugzilla.redhat.com/show_bug.cgi?id=1697925 to track it for Rocky RDO (other branches are not affected).

Lei Zhang (zhangleiop) wrote:

Great thanks!

Mark Goddard (mgoddard)
Changed in kolla:
status: Confirmed → Invalid