neutron-rootwrap processes not getting cleaned up

Bug #1629097 reported by Corey Bryant
This bug affects 3 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

neutron-rootwrap processes aren't getting cleaned up on Newton. I'm testing with Newton rc3.

I was noticing memory exhaustion on my neutron gateway units, which turned out to be due to compounding neutron-rootwrap processes:
sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovsdb-client monitor Interface name,ofport,external_ids --format=json

$ top -n1 -b -o VIRT
http://paste.ubuntu.com/23252407/

$ ps aux|grep ovsdb-client
http://paste.ubuntu.com/23252658/

Restarting openvswitch cleans up the processes, but they start piling up again soon after:
sudo systemctl restart openvswitch-switch

At first I thought this was an openvswitch issue, however I reverted the code in get_root_helper_child_pid() and neutron-rootwrap processes started getting cleaned up. See corresponding commit for code that possibly introduced this at [1].

This can be recreated with the openstack charms using xenial-newton-staging. On newton deploys, neutron-gateway and nova-compute units will exhaust memory due to compounding ovsdb-client processes.

[1]
commit fd93e19f2a415b3803700fc491749daba01a4390
Author: Assaf Muller <email address hidden>
Date: Fri Mar 18 16:29:26 2016 -0400

    Change get_root_helper_child_pid to stop when it finds cmd

    get_root_helper_child_pid recursively finds the child of pid,
    until it can no longer find a child. However, the intention is
    not to find the deepest child, but to strip away root helpers.
    For example 'sudo neutron-rootwrap x' is supposed to find the
    pid of x. However, in cases 'x' spawned quick lived children of
    its own (For example: ip / brctl / ovs invocations),
    get_root_helper_child_pid returned those pids if called in
    the wrong time.

    Change-Id: I582aa5c931c8bfe57f49df6899445698270bb33e
    Closes-Bug: #1558819
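
The difference the commit message in [1] describes can be illustrated with a toy process table. This is only a sketch and not neutron's code: the table layout, the pids, and the function names `deepest_child_pid` and `find_cmd_pid` are all hypothetical, and a real implementation would ask the OS for child processes instead of reading a dict.

```python
# Hypothetical process table: pid -> (command line, list of child pids).
# 'x' stands for the wrapped command, as in the commit message.
PROC_TABLE = {
    100: ("sudo neutron-rootwrap x", [101]),
    101: ("neutron-rootwrap x", [102]),
    102: ("x", [103]),
    103: ("ip link show", []),  # short-lived child spawned by x
}


def deepest_child_pid(table, pid):
    """Follow first children until no child is left (the old behaviour)."""
    while table[pid][1]:
        pid = table[pid][1][0]
    return pid


def find_cmd_pid(table, pid, cmd):
    """Follow first children, stopping at the first process running exactly cmd."""
    while True:
        cmdline, children = table[pid]
        if cmdline == cmd:
            return pid
        if not children:
            return None  # cmd was never found below the starting pid
        pid = children[0]
```

With this table, the deepest-child walk lands on the short-lived `ip` process, while the walk that stops at `x` returns the pid of the wrapped command itself. If the command line never matches, the second function returns `None`, so a caller would have no pid to signal.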

Tags: ovs
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This may be the issue, though I don't know the code well enough. https://github.com/openvswitch/ovs/commit/fe5593818dca05b03804de5d99a9edd125f2d440

service_start without any corresponding service_stop

Changed in openvswitch (Ubuntu):
importance: Undecided → High
no longer affects: openvswitch (Ubuntu)
summary: - ovsdb-client processes not getting cleaned up
+ neutron-rootwrap processes not getting cleaned up
description: updated
description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I don't think this is an openvswitch issue. I updated the description with the neutron commit that I think caused this regression. commit fd93e19f2a415b3803700fc491749daba01a4390

Revision history for this message
Assaf Muller (amuller) wrote :

I don't see this happening according to gate 'ps' output, or on a local devstack VM. Are you seeing ovsdb-client processes spawning at idle, without doing anything? Are there errors in the OVS agent log?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Assaf, thanks for looking. Yes this occurs at idle. Here's the /var/log/openvswitch/ovs-vswitchd.log:

http://paste.ubuntu.com/23252958/

This is with ovs 2.6.0.

I've reverted and un-reverted the get_root_helper_child_pid() code from that commit about 3 times now to make sure I'm not mistaken, and reverting seems to fix it for me. Interestingly, it is only ovsdb-client that leaks.

Revision history for this message
Assaf Muller (amuller) wrote :

OVS agent log?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

neutron-ovs-agent log:

http://paste.ubuntu.com/23253153/

This appears to show DNS issues that trigger the leak. Still, the processes seem to get cleaned up with the old code.

Changed in neutron:
importance: Undecided → High
tags: added: ovs
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Corey, can you reproduce it with ovs 2.5?

Revision history for this message
Assaf Muller (amuller) wrote :

@Ihar I don't think the OVS version is relevant.

From the OVS agent logs Corey pasted:

"ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: sudo: unable to resolve host juju-59804ab4-57a7-4f07-8669-31b464d37867-machine-11"

I tried reproducing this by editing my local /etc/hosts and changing the line:
127.0.0.1 localhost ubuntu

To:
blah localhost

Afterwards when I execute:
'sudo true'

I get:
sudo: unable to resolve host ubuntu

I then started up the OVS agent and interestingly I don't get the ovsdb-monitor spawned at all.

I noticed that Corey is not using rootwrap-daemon so I commented out daemon mode, started the OVS agent and confirmed that regular rootwrap is being used. I still don't see the ovsdb monitor spawned, nor do I see:

"ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: sudo: unable to resolve host juju-59804ab4-57a7-4f07-8669-31b464d37867-machine-11"

To conclude, I can't reproduce the issue. I'm not ruling one out, though. Corey, can you add any additional info to help out?
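
For anyone checking whether their agent log shows the same sudo error quoted above, a short scan like the following may help. It is a sketch under assumptions: the function name `unresolved_hosts` is made up for illustration, and the pattern only matches the exact error text quoted in this report.

```python
import re

# Matches the sudo error text quoted from the OVS agent log.
UNRESOLVED_HOST = re.compile(r"sudo: unable to resolve host ([\w.-]+)")


def unresolved_hosts(log_lines):
    """Return the distinct hostnames sudo failed to resolve, in first-seen order."""
    hosts = []
    for line in log_lines:
        match = UNRESOLVED_HOST.search(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts
```

Passing it the lines of the agent log returns an empty list when the error never appears.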

Changed in neutron:
status: New → Incomplete
importance: High → Undecided
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This was triggered by a DNS issue on my end, and fixing DNS resolved it.

However, I think it's still a bug, because once it's triggered you can run into the process leak.

@Assaf, if I edit /etc/resolv.conf to use invalid DNS and restart neutron-openvswitch-agent, I see ovsdb-client calls from neutron-rootwrap start to build up (watch 'ps aux | grep ovsdb-client').
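
To track the build-up as a number instead of eyeballing the watch output, the ps text can be counted with a small helper. This is a sketch: `count_matching_processes` is a hypothetical name, and the grep filter is deliberately crude (it drops any line that mentions grep).

```python
def count_matching_processes(ps_output, needle):
    """Count lines of ps output that mention needle, skipping lines that mention grep."""
    count = 0
    for line in ps_output.splitlines():
        if needle in line and "grep" not in line:
            count += 1
    return count
```

On a live host, the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. Sampling the count every few seconds should show whether it keeps growing.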

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I saw this last night and can confirm it's related to DNS issues. A colleague suggested it may be caused by the sudo call returning an error indicating that the hostname could not be found, though I spent no time exploring this today. Restarting the openvswitch-switch service closes all the existing processes, but it seems more like the service can't stop cleanly. Also of note: the service stop took a very long time, whereas with working DNS it took a few seconds at worst.

Changed in neutron:
status: Expired → Confirmed
Revision history for this message
Long Zhang (josie.zhang.long) wrote :

I have been seeing this frequently on a Newton network host running CentOS 7.2.
When memory is exhausted, many neutron-rootwrap processes are running.
Running
# ps aux|grep ip
outputs many lines like the following:
   sudo neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-ba.... ip -o link show qg-2e3ecdcc-06

Neutron's L3 agent can then no longer configure ip netns. For example, binding a floating IP to a fixed IP does not work, and no error is reported.

Once memory is freed, these processes disappear, and ip netns can be configured correctly through the neutron L3 agent again.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Neutron has switched to privsep. If you still see similar issues, please reopen this bug report or open a new one for privsep.

Changed in neutron:
status: Confirmed → Invalid