neutron

neutron-rootwrap processes not getting cleaned up

Bug #1629097 reported by Corey Bryant on 2016-09-29

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Invalid	Undecided	Unassigned

Bug Description

neutron-rootwrap processes aren't getting cleaned up on Newton. I'm testing with Newton rc3.

I was noticing memory exhaustion on my neutron gateway units, which turned out to be due to compounding neutron-rootwrap processes:
sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovsdb-client monitor Interface name,ofport,external_ids --format=json

$ top -n1 -b -o VIRT
http://paste.ubuntu.com/23252407/

$ ps aux|grep ovsdb-client
http://paste.ubuntu.com/23252658/

Restarting openvswitch cleans up the processes, but they start piling up again soon after:
sudo systemctl restart openvswitch-switch

At first I thought this was an openvswitch issue, however I reverted the code in get_root_helper_child_pid() and neutron-rootwrap processes started getting cleaned up. See corresponding commit for code that possibly introduced this at [1].

This can be recreated with the openstack charms using xenial-newton-staging. On newton deploys, neutron-gateway and nova-compute units will exhaust memory due to compounding ovsdb-client processes.

[1]
commit fd93e19f2a415b3803700fc491749daba01a4390
Author: Assaf Muller <email address hidden>
Date: Fri Mar 18 16:29:26 2016 -0400

Change get_root_helper_child_pid to stop when it finds cmd

    get_root_helper_child_pid recursively finds the child of pid,
    until it can no longer find a child. However, the intention is
    not to find the deepest child, but to strip away root helpers.
    For example 'sudo neutron-rootwrap x' is supposed to find the
    pid of x. However, in cases 'x' spawned quick lived children of
    its own (For example: ip / brctl / ovs invocations),
    get_root_helper_child_pid returned those pids if called in
    the wrong time.

Change-Id: I582aa5c931c8bfe57f49df6899445698270bb33e
Closes-Bug: #1558819
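
The difference the commit message describes can be sketched as follows. This is a simplified illustration, not the neutron implementation: the in-memory process table, the `Proc` type, and both function names are hypothetical, and command lines are compared with plain string equality.

```python
from typing import Dict, List, NamedTuple, Optional


class Proc(NamedTuple):
    """Hypothetical process record: command line plus child pids."""
    cmdline: str
    children: List[int]


def find_deepest_child(procs: Dict[int, Proc], pid: int) -> int:
    """Old behaviour: follow the first child until no child is left."""
    while procs[pid].children:
        pid = procs[pid].children[0]
    return pid


def strip_root_helpers(procs: Dict[int, Proc], pid: int,
                       expected_cmd: str) -> Optional[int]:
    """New behaviour: descend past the root helpers and stop at the
    first descendant whose command line is expected_cmd."""
    while True:
        children = procs[pid].children
        if not children:
            return None  # never found a pid running expected_cmd
        pid = children[0]
        if procs[pid].cmdline == expected_cmd:
            return pid


# Assumed process tree: sudo -> neutron-rootwrap -> x -> short-lived child.
procs = {
    100: Proc("sudo neutron-rootwrap x", [101]),
    101: Proc("neutron-rootwrap x", [102]),
    102: Proc("x", [103]),
    103: Proc("ip link show", []),
}
deepest = find_deepest_child(procs, 100)
stripped = strip_root_helpers(procs, 100, "x")
```

With this assumed tree, the old lookup lands on the transient child (pid 103), while the new one stops at `x` itself (pid 102).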

Corey Bryant (corey.bryant) wrote on 2016-09-29:

This may be the issue, though I don't know the code well enough. https://github.com/openvswitch/ovs/commit/fe5593818dca05b03804de5d99a9edd125f2d440

service_start without any corresponding service_stop

Changed in openvswitch (Ubuntu):
importance:	Undecided → High

Corey Bryant (corey.bryant) on 2016-09-29

no longer affects:	openvswitch (Ubuntu)
summary:	- ovsdb-client processes not getting cleaned up
	+ neutron-rootwrap processes not getting cleaned up

Corey Bryant (corey.bryant) on 2016-09-29

description:	updated
description:	updated

Corey Bryant (corey.bryant) wrote on 2016-09-29:

I don't think this is an openvswitch issue. I updated the description with the neutron commit that I think caused this regression. commit fd93e19f2a415b3803700fc491749daba01a4390

Assaf Muller (amuller) wrote on 2016-09-29:

I don't see this happening according to gate 'ps' output, or on a local devstack VM. Are you seeing ovsdb-client processes spawning at idle, without doing anything? Are there errors in the OVS agent log?

Corey Bryant (corey.bryant) wrote on 2016-09-29:

Assaf, thanks for looking. Yes this occurs at idle. Here's the /var/log/openvswitch/ovs-vswitchd.log:

http://paste.ubuntu.com/23252958/

This is with ovs 2.6.0.

I've reverted and re-applied the get_root_helper_child_pid() code from that commit about 3 times now to make sure I'm not mistaken, and reverting it seems to fix the leak for me. Interestingly, it is only ovsdb-client that leaks.

Assaf Muller (amuller) wrote on 2016-09-29:

OVS agent log?

Corey Bryant (corey.bryant) wrote on 2016-09-29:

neutron-ovs-agent log:

http://paste.ubuntu.com/23253153/

which appears to show DNS issues that trigger this. Still, it seems the processes get cleaned up with the old code.

Ihar Hrachyshka (ihar-hrachyshka) on 2016-10-03

Changed in neutron:
importance:	Undecided → High
tags:	added: ovs

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2016-10-03:

Corey, can you reproduce it with ovs 2.5?

Assaf Muller (amuller) wrote on 2016-10-03:

@Ihar I don't think the OVS version is relevant.

From the OVS agent logs Corey pasted:

"ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: sudo: unable to resolve host juju-59804ab4-57a7-4f07-8669-31b464d37867-machine-11"

I tried reproducing this by editing my local /etc/hosts and changing the line:
127.0.0.1 localhost ubuntu

To:
blah localhost

Afterwards when I execute:
'sudo true'

I get:
sudo: unable to resolve host ubuntu

I then started up the OVS agent and interestingly I don't get the ovsdb-monitor spawned at all.

I noticed that Corey is not using rootwrap-daemon so I commented out daemon mode, started the OVS agent and confirmed that regular rootwrap is being used. I still don't see the ovsdb monitor spawned, nor do I see:

To conclude, I can't reproduce the issue. I'm not ruling one out, but I can't trigger it here. Corey, can you add more information to help out?
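
The "unable to resolve host" condition discussed in this comment can be approximated with a small check. This is a minimal sketch, assuming that a failed lookup of the local hostname is what triggers the sudo message; the function name and the injectable `resolver` parameter are hypothetical and exist only to make the check testable.

```python
import socket
from typing import Callable, Optional


def hostname_resolves(name: Optional[str] = None,
                      resolver: Callable[[str], str] = socket.gethostbyname
                      ) -> bool:
    """Return True if the given (or local) hostname resolves."""
    name = name or socket.gethostname()
    try:
        resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are OSError subclasses.
        return False
    return True


if __name__ == "__main__":
    host = socket.gethostname()
    state = "resolves" if hostname_resolves(host) else "does NOT resolve"
    print(f"{host} {state}")
```

Running it on an affected unit before starting the OVS agent would show whether the host is in the state that produced the error in the pasted agent log.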

Changed in neutron:
status:	New → Incomplete
importance:	High → Undecided

Corey Bryant (corey.bryant) wrote on 2016-10-03:

This was triggered by a DNS issue on my end, and fixing DNS resolved it.

However, I think it's still a bug, because once it's triggered you can run into the process leak.

@Assaf, if I edit /etc/resolv.conf to point at an invalid DNS server and restart neutron-openvswitch-agent, I see ovsdb-client calls from neutron-rootwrap start to build up (watch 'ps aux | grep ovsdb-client').
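
The build-up described above can also be tracked with a short script instead of watching the `ps` output by eye. This is a sketch under assumptions: the function names are hypothetical, and the match string is taken from the command line quoted in the bug description.

```python
import subprocess
import time


def count_processes(ps_output: str,
                    needle: str = "ovsdb-client monitor") -> int:
    """Count lines of ps output that mention the given command."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def snapshot() -> int:
    """Run 'ps aux' once and count matching processes."""
    result = subprocess.run(["ps", "aux"], capture_output=True,
                            text=True, check=True)
    return count_processes(result.stdout)


if __name__ == "__main__":
    # Print the count every 10 seconds; a steadily growing number
    # indicates the leak.
    while True:
        print(time.strftime("%H:%M:%S"), snapshot())
        time.sleep(10)
```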

Launchpad Janitor (janitor) wrote on 2016-12-03:

#10

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status:	Incomplete → Expired

Billy Olsen (billy-olsen) wrote on 2017-03-11:

#11

I saw this last night and can confirm it's related to DNS issues. A colleague suggested it may be caused by the sudo call returning an error indicating that the hostname could not be found, though I spent no time exploring this option today. Restarting the openvswitch-switch service closes all the existing processes, but it looks more like the service can't stop cleanly. Also of note: the service stop took a very long time, whereas with working DNS it took a few seconds at worst.

Changed in neutron:
status:	Expired → Confirmed

Long Zhang (josie.zhang.long) wrote on 2018-08-25:

#12

I have been seeing this frequently on a network host running Newton on CentOS 7.2.
When memory is exhausted, many neutron-rootwrap processes are running.
Running
#ps aux|grep ip
outputs many lines like the following:
sudo neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-ba.... ip -o link show qg-2e3ecdcc-06

When this happens, neutron's L3 agent can no longer configure ip netns. For example, associating a floating IP with a fixed IP does not work, and no error is reported.

Once memory is freed, these processes disappear and ip netns can be configured correctly through the neutron L3 agent again.

Lajos Katona (lajos-katona) wrote on 2024-02-22:

#13

Neutron has switched to privsep. If you still see similar issues, please reopen this bug report or open a new one for privsep.

Changed in neutron:
status:	Confirmed → Invalid
