[Queens] Memory leak in pyroute2 0.4.21

Bug #1835044 reported by Candido Campos Rivas
This bug affects 1 person
Affects   Status      Importance   Assigned to      Milestone
neutron   Won't Fix   High         Rodolfo Alonso

Bug Description

Description of problem:
Memory leak in privsep-helper (Neutron agents) when create/destroy actions are executed.

Version-Release number of selected component (if applicable):

How reproducible:
Queens is affected

Steps to Reproduce:
1. Deploy Queens
2. Create the following scripts:
(overcloud) [stack@undercloud-0 ~]$ cat create2.sh
set -x

ips=(0 10.0.0.215 10.0.0.249 10.0.0.223 10.0.0.222 10.0.0.218 10.0.0.247 10.0.0.210 10.0.0.220 10.0.0.246 10.0.0.213 10.0.0.224 10.0.0.212 10.0.0.217 10.0.0.221 10.0.0.216)
ips=(0 10.0.0.220 10.0.0.216 10.0.0.235 10.0.0.232 10.0.0.245 10.0.0.226 10.0.0.217 10.0.0.211 10.0.0.221 10.0.0.230 10.0.0.248 10.0.0.228 10.0.0.223 10.0.0.212 10.0.0.225)
openstack network create net_$1
openstack subnet create --network net_$1 --dns-nameserver 10.0.0.1 --gateway 10.$1.0.1 --subnet-range 10.$1.0.0/16 net_$1
openstack router create router_$1
openstack router add subnet router_$1 net_$1
#openstack router set router_$1 --external-gateway public

#openstack server create --flavor cirros --image cirros --nic net-id=net_$1 --security-group test --key-name mykey vm$1

#openstack server add floating ip vm$1 ${ips[$1]}

#ping ${ips[$1]} -c 10
(overcloud) [stack@undercloud-0 ~]$ cat delete2.sh

#openstack server delete vm$1
openstack router remove subnet router_$1 net_$1
openstack network delete net_$1
openstack router delete router_$1

3. Execute:

for j in $(seq 1 10); do
    for i in $(seq 1 10); do ./create2.sh $i; done
    for i in $(seq 1 10); do ./delete2.sh $i; done
    echo "####LOOP $j ####"
done

Check the memory usage of the privsep-helper processes on the controllers:

[root@controller-1 heat-admin]# top -c -p 125052 -p 278912

top - 22:37:22 up 9 days, 4:59, 1 user, load average: 4.45, 3.71, 3.77
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 19.8 us, 4.1 sy, 0.0 ni, 75.2 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
KiB Mem : 32779936 total, 3150768 free, 16209200 used, 13419968 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15146444 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 125052 root 20 0 2497576 2.2g 2468 S 0.0 7.2 76:06.44 /usr/bin/python2 /bin/privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-fi+
 278912 root 20 0 1048908 897004 2468 S 0.0 2.7 13:55.34 /usr/bin/python2 /bin/privsep-helper --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-fi+

The relevant column is RES (resident memory).
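Because top switches RES to a scaled unit once a value gets large (note `2.2g` next to `897004` above), comparing samples across loops is easier after normalising them. A minimal sketch, assuming procps-ng top, where RES is shown in KiB unless a unit suffix such as `m` or `g` is appended; `res_to_kib` and `res_column` are hypothetical helper names, not part of any tool:

```python
# Hypothetical helpers: normalise top's RES column to KiB.
# Assumption: procps-ng top prints RES in KiB and switches to a scaled
# value with a suffix (m, g, t, ...) when the number is too wide.

_SCALE = {"k": 0, "m": 1, "g": 2, "t": 3, "p": 4, "e": 5}


def res_to_kib(value: str) -> int:
    """Return a RES value in KiB, e.g. '897004' -> 897004, '2.2g' -> 2306867."""
    value = value.strip().lower()
    if value and value[-1] in _SCALE:
        number, power = float(value[:-1]), _SCALE[value[-1]]
    else:
        number, power = float(value), 0
    return int(round(number * 1024 ** power))


def res_column(top_line: str) -> int:
    """Extract RES (6th field: PID USER PR NI VIRT RES ...) from a top process line."""
    fields = top_line.split()
    return res_to_kib(fields[5])
```

For example, `res_column` applied to the line for PID 125052 above yields 2306867 KiB (about 2.2 GiB), which can then be compared directly with the 897004 KiB of PID 278912.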

Actual results:

Memory usage increases with every loop.

Expected results:

Memory usage should remain stable.

Additional info:

The leak is in pyroute2; it needs to be updated to version 0.5.2.x.
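A deployment can check whether it runs an affected release. A minimal Python 3.8+ sketch (Queens agents may still run on Python 2.7, where `importlib.metadata` is unavailable); it treats 0.5.2 as the cutoff only because that is the version reported as leak-free in this bug, not because the exact fixing release is known. `FIXED_VERSION`, `parse_version`, `is_affected` and `installed_pyroute2_affected` are hypothetical names:

```python
# Hypothetical check for the pyroute2 versions discussed in this bug.
# Assumption: 0.5.2 is used as the cutoff because it is the version
# observed without the leak here; the first fixed release is not known.
from importlib import metadata  # Python 3.8+

FIXED_VERSION = (0, 5, 2)


def parse_version(text: str) -> tuple:
    """Turn '0.4.21' into (0, 4, 21); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_affected(version: str) -> bool:
    """True if the version sorts below the assumed cutoff."""
    return parse_version(version) < FIXED_VERSION


def installed_pyroute2_affected() -> bool:
    """Check the pyroute2 distribution installed in this environment, if any."""
    try:
        return is_affected(metadata.version("pyroute2"))
    except metadata.PackageNotFoundError:
        return False
```

With these definitions `is_affected("0.4.21")` is true (the Queens pin) and `is_affected("0.5.2")` is false (the Rocky pin).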

Tags: loadimpact
Candido Campos Rivas (ccamposr) wrote :

The problem only affects Queens; the constraint in Rocky is correct:

https://github.com/openstack/requirements/blob/stable/rocky/upper-constraints.txt

...

SQLAlchemy===1.2.10
pyroute2===0.5.2
google-auth===1.5.0
kazoo===2.5.0
XStatic-roboto-fontface=
....

but the one in Queens is not:

https://github.com/openstack/requirements/blob/stable/queens/upper-constraints.txt:

....

semantic-version===2.6.0
virtualbmc===1.2.0
deprecation===1.0.1
SQLAlchemy===1.2.1
pyroute2===0.4.21
google-auth===1.3.0
kazoo===2.4.0
XStatic-roboto-fontface===0.5.0.0
pyudev===0.21.0

....
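The two pins can be compared programmatically. A minimal sketch that reads `name===version` lines from upper-constraints text; it skips lines without `===` (such as the `....` separators and the truncated `XStatic-roboto-fontface=` entry above) and, as an assumption, drops any `;` environment marker that may follow a version. `parse_constraints` is a hypothetical helper:

```python
# Hypothetical parser for "name===version" pins in upper-constraints text.
# Assumption: an optional environment marker may follow ";" and is ignored.


def parse_constraints(text: str) -> dict:
    """Map lower-cased package names to their pinned versions."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "===" not in line:
            continue  # blank lines, "....", truncated entries
        name, _, version = line.partition("===")
        pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins
```

Applied to the Queens excerpt above, `parse_constraints(text)["pyroute2"]` returns `"0.4.21"`; for the Rocky excerpt it returns `"0.5.2"`.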

description: updated
Candido Campos Rivas (ccamposr) wrote :

This script reproduces the leak (or at least part of it):

    ()[root@controller-0 /]# cat leak3.py
    #!/usr/bin/env python2
    # Reproducer: repeatedly create and release an IPDB instance and read
    # the IPv4 routes from it.

    import socket
    import time

    from pyroute2 import IPDB
    from pyroute2.netlink import rtnl

    family = socket.AF_INET

    def get_scope_name(scope):
        """Return the name of the scope (given as a number), or the scope number
        if the name is unknown.

        For backward compatibility (with "ip" tool) "global" scope is converted to
        "universe" before converting to number
        """
        scope = 'universe' if scope == 'global' else scope
        return rtnl.rt_scope.get(scope, scope)

    while True:
        # A new IPDB is opened on every iteration and released when the
        # "with" block exits; the leak shows up as steady RSS growth.
        with IPDB() as ipdb:
            ipdb_routes = ipdb.routes
            ipdb_interfaces = ipdb.interfaces
            routes = [{'destination': route['dst'],
                       'nexthop': route.get('gateway'),
                       'device': ipdb_interfaces[route['oif']]['ifname'],
                       'scope': get_scope_name(route['scope'])}
                      for route in ipdb_routes if route['family'] == family]
            print(routes)
            time.sleep(0.2)
    # Alternative check of object growth (requires the objgraph package):
    # import objgraph
    # with IPDB() as a:
    #     a.release()
    #     print(objgraph.growth(10))
    #     time.sleep(0.2)
    #     print("...")

()[root@controller-0 /]# ps -ef | grep leak
root 118083 53454 54 10:08 ? 00:00:02 python leak3.py
root 118895 66070 0 10:09 ? 00:00:00 grep --color=auto leak
()[root@controller-0 /]# while true; do top -p 64897 -n1 -b ; sleep 60 ; done
top - 10:09:35 up 15 days, 16:31, 0 users, load average: 5.81, 5.38, 4.75
Tasks: 0 total, 0 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 41.7 us, 4.2 sy, 0.0 ni, 52.5 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 32779936 total, 644984 free, 15815812 used, 16319140 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15120388 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
^C
()[root@controller-0 /]# while true; do top -p 118083 -n1 -b ; sleep 60 ; done
top - 10:09:56 up 15 days, 16:31, 0 users, load average: 5.55, 5.35, 4.75
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 44.5 us, 8.4 sy, 0.0 ni, 45.4 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 32779936 total, 606604 free, 15854788 used, 16318544 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15081184 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 118083 root 20 0 281960 34188 4208 R 106.7 0.1 0:32.64 python
top - 10:10:57 up 15 days, 16:32, 0 users, load average: 6.45, 5.61, 4.88
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 47.1 us, 19.3 sy, 0.0 ni, 31.9 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 32779936 total, 606032 free, 15858556 used, 16315348 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15078224 ...
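
As a stdlib alternative to the objgraph check commented out at the end of leak3.py, Python's `tracemalloc` can report how much Python-allocated memory is still held after running one cycle many times. A minimal Python 3 sketch; `retained_bytes` is a hypothetical helper, and memory leaked inside C code or kept in kernel sockets would not be counted, so RSS from top or pmap remains the reference measurement:

```python
# Hypothetical helper: measure Python-level memory retained by a callable.
# Only allocations traced by tracemalloc are counted.
import gc
import tracemalloc


def retained_bytes(func, iterations=100):
    """Return bytes of traced memory still allocated after calling func repeatedly."""
    func()  # warm-up call so one-time caches are not counted as a leak
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical usage against the loop body of leak3.py:
#   from pyroute2 import IPDB
#   def one_cycle():
#       with IPDB() as ipdb:
#           list(ipdb.routes)
#   print(retained_bytes(one_cycle, iterations=50))
```

A result that keeps growing with `iterations` points at objects kept alive between cycles; a result near zero means the growth, if any, is outside the Python heap.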


summary: - [Rocky]Memory leak in privsep-helper(neutron agents)
+ [Queens] Memory leak in pyroute2 0.4.21
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I've tested the script provided by Candido [1] with several pyroute2 versions, measuring the memory used by the script with:
  watch -n 1 -d "pmap <PID> | ag total"

Pyroute2 0.5.2 (Rocky constraint): I don't see any memory increase in the process.
Pyroute2 0.4.21 (Queens constraint): with both Python 2.7 and 3.6, I can see memory consumption increase on every loop iteration. The memory of the process increases linearly over time.
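
"Increases linearly" can be quantified from periodic samples, for example the pmap total or the RES value taken once per loop. A minimal Python 3 sketch using an ordinary least-squares slope; `growth_per_sample` is a hypothetical helper, and the slope that counts as a leak is left to the caller as an assumption:

```python
# Hypothetical helper: least-squares growth rate of memory samples.


def growth_per_sample(samples):
    """Slope of samples against their index, in sample units per step."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

For instance, `growth_per_sample([100, 110, 120, 130])` is `10.0`. With one sample per loop, a slope that stays clearly positive over many loops matches the 0.4.21 behaviour described here, while a slope near zero matches what was seen with 0.5.2.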

I'll push a patch to bump the Queens pyroute2 constraint to 0.5.2 (same as Rocky) and I'll run a Neutron change, depending on the requirements change, to execute a full CI test.

[1] http://paste.openstack.org/show/753759/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/668677

Bernard Cafarelli (bcafarel) wrote :

Thanks Rodolfo. Marking as High since Queens is still under full support (and memory leaks are bad); let's see how the discussion about bumping pyroute2 in Queens goes.

Changed in neutron:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
tags: added: loadimpact
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Patch to bump the version in requirements: https://review.opendev.org/#/c/668676/
Patch to test the Neutron CI, depending on the previous one: https://review.opendev.org/#/c/668677/

Jeremy Stanley (fungi) wrote :

Stable branches of the requirements repo are effectively frozen for purposes of tracking external dependencies, and are merely used to indicate the known working versions contemporary with the coordinated release corresponding to that branch. Production deployments need to be using a security-supported distribution to obtain those instead. To rephrase, unless the memory leak you've found is preventing OpenStack from testing for potential regressions in proposed stable/queens branch changes, we should not be touching the pyroute2 version in the openstack/requirements repository's upper-constraints.txt file.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

As Jeremy commented, stable branch requirements are only for testing, not for production deployments. That means [1] should not be merged.

However, I want to draw your attention to the existing problem in this specific version of Pyroute2. If you want to deploy a production environment, please be aware of the problem described in this bug.

Regards.

[1] https://review.opendev.org/#/c/668676/

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

For the bug deputy this week: can you close this bug as "Won't fix"?

Thank you in advance.

Changed in neutron:
status: In Progress → Won't Fix
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/668677
