quantum-server creates openssl zombies until process limit is reached

Bug #1074257 reported by Steve Baker
This bug affects 4 people
Affects                        Status        Importance  Assigned to  Milestone
OpenStack Heat                 Fix Released  Undecided   Unassigned
OpenStack Identity (keystone)  Fix Released  Undecided   Adam Young
neutron                        Fix Released  Critical    Gary Kotton

Bug Description

I see this running Quantum master on devstack with Fedora 17.

Over time the count for the following command will rise:
ps -ef |grep openssl | grep defunct | wc -l

Eventually the process limit for the user is reached (insert zombie Halloween joke here)

Killing and restarting quantum-server brings the zombie count back to zero.
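
For context, a zombie is a child process that has exited but whose parent has not yet collected its exit status with wait(). The following is a minimal sketch, not taken from quantum or keystone, that reproduces the effect with the standard library. The helper name process_state is made up for illustration and reads /proc, so it assumes Linux and returns None elsewhere.

```python
import subprocess
import sys
import time


def process_state(pid):
    """Return the one-letter Linux process state ('Z' means zombie), or None."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None  # process is gone, or /proc is not available
    # Format is "pid (comm) state ..."; split after the last ')' because
    # the command name itself may contain spaces or parentheses.
    return stat.rsplit(")", 1)[1].split()[0]


# Start a child that exits immediately, and deliberately do not wait on it yet.
child = subprocess.Popen([sys.executable, "-c", "pass"])

# Give the child up to five seconds to exit and become a zombie.
deadline = time.time() + 5
while process_state(child.pid) not in ("Z", None) and time.time() < deadline:
    time.sleep(0.05)

state_before = process_state(child.pid)  # expected 'Z' on Linux
child.wait()                             # collecting the exit status reaps the child
state_after = process_state(child.pid)   # the entry is gone after reaping

print(state_before, state_after, child.returncode)
```

This also fits the observation that restarting quantum-server clears the count: when the parent exits, its zombies are inherited by init, which reaps them.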

dan wendlandt (danwent)
Changed in quantum:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → grizzly-1
dan wendlandt (danwent) wrote :

I can confirm this on my Ubuntu devstack setup as well, and that the parent pid is the quantum-server process. I see this when using the OVS plugin.

I'm not familiar with any places where quantum-server would invoke openssl directly, so my best guess would be that it's related to openstack-common code being used by quantum, perhaps for RPC.

tags: added: folsom-backport-potential
Alan Pevec (apevec) wrote :

This is most likely PKI tokens in the auth-token middleware from Keystone. Adam, can you have a look?

Changed in keystone:
assignee: nobody → Adam Young (ayoung)
Adam Young (ayoung) wrote :

It is possible that the popen mechanism in Eventlet is broken, or requires some additional cleanup after the validation calls. Can you add the whole output from ps?

ps -ef |grep openssl | grep defunct | head -3

Adam Young (ayoung) wrote :
dan wendlandt (danwent) wrote :

here's the ps output that was requested.

nicira@com-dev:/opt/stack/quantum$ ps -ef |grep openssl | grep defunct | head -3
nicira 366 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
nicira 375 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
nicira 384 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
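
The third column of ps -ef output is the parent PID, which is how the zombies can be tied back to a single parent (46537 here). A small sketch of that grouping, using the three lines above as sample input. The function name defunct_parents is hypothetical.

```python
from collections import Counter


def defunct_parents(ps_output, command="openssl"):
    """Count defunct `command` processes per parent PID in `ps -ef` output."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8:
            continue
        cmd = " ".join(fields[7:])
        if "<defunct>" in cmd and command in cmd:
            counts[int(fields[2])] += 1
    return counts


sample = """\
nicira 366 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
nicira 375 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
nicira 384 46537 0 08:47 pts/10 00:00:00 [openssl] <defunct>
"""

print(defunct_parents(sample))  # Counter({46537: 3})
```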

Gary Kotton (garyk) wrote :

This is starting to ring a bell - in the past when we did the monkey patch the quantum popen used to leave zombies - Bob helped me get through this one.
The solution was to use "from eventlet.green import subprocess"

In the keystone code I see that you guys are using a plain "import subprocess" in keystone/common/utils.py
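
A sketch of the import swap described above. It assumes eventlet is installed. The fallback to the standard module is only there so the snippet still runs without it, and the child command is a placeholder rather than the openssl call keystone makes.

```python
import sys

try:
    # Cooperative variant suggested in this comment.
    from eventlet.green import subprocess
except Exception:
    # eventlet is missing or unusable on this interpreter: fall back to the
    # standard module so the sketch still runs.
    import subprocess

proc = subprocess.Popen(
    [sys.executable, "-c", "print('signed')"],
    stdout=subprocess.PIPE,
)
# communicate() reads stdout and waits for the child, so the child is reaped.
output, _ = proc.communicate()
print(output.decode().strip(), proc.returncode)
```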

Adam Young (ayoung) wrote :
dan wendlandt (danwent)
Changed in quantum:
assignee: nobody → dan wendlandt (danwent)
dan wendlandt (danwent) wrote :

This patch does not seem to do the trick in my setup (or I'm screwing something else up). Is anyone else seeing that it solves the problem for them?

Adam Young (ayoung) wrote :

I don't have the setup to reproduce. The patch was based on the feedback that is out there about other zombie process issues with Eventlet. I wasn't certain it would help, so your setup is likely correct.

Steve Baker (steve-stevebaker) wrote :

I'm testing the patch now.

Steve Baker (steve-stevebaker) wrote :

I think this patch makes no difference. In 2 devstack runs there were 11 and 34 zombies after all services were started. Zombies go to zero when quantum-server is restarted. About 4 are created in the first minute, then an average of 2 per minute after that.

Adam Young (ayoung) wrote :

Seems like the only fix right now is to change cms.py:

change line 2

-import subprocess
+from eventlet.green import subprocess

dan wendlandt (danwent) wrote :

Yeah, I have confirmed that these processes are only created when someone is querying via the webservices API.

And the second patch from Adam succeeds in preventing new zombie openssl processes on my setup.

What I'm confused about is why this is happening with quantum but not other services. Anyone from the keystone team have thoughts on this?

no longer affects: fedora
Steve Baker (steve-stevebaker) wrote :

#12 works for me too, thanks all.

Adam Young (ayoung) wrote :

I suspect that we will see this problem across the board. We'll come up with a solution in Keystone. We are going to try and make the code non-eventlet specific.

dan wendlandt (danwent)
Changed in quantum:
milestone: grizzly-1 → none
no longer affects: quantum
Steve Baker (steve-stevebaker) wrote :

I've just noticed that the workaround in #12 seems to break glance auth.

Glance API calls fail with auth token errors, and /var/cache/glance/api has zero files.

This is with devstack, current git master of everything.

dan wendlandt (danwent) wrote :

yeah, with the change, I'm seeing other services fail with auth errors as well (specifically, the quantum l3-agent making API calls).

Changed in quantum:
milestone: none → grizzly-1
importance: Undecided → High
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystone (master)

Fix proposed to branch: master
Review: https://review.openstack.org/15429

Changed in keystone:
status: New → In Progress
Adam Young (ayoung) wrote :

Note that the change I just proposed for Keystone does not fix it in Quantum, etc., but shows the approach. Basically, it is working around a shortcoming of the eventlet library. You would need a comparable change in the wsgi startup code in Quantum and other services.

dan wendlandt (danwent)
Changed in quantum:
assignee: nobody → dan wendlandt (danwent)
importance: High → Critical
dan wendlandt (danwent) wrote :

Hi Adam,

Can you help me out a bit here on exactly how to update Quantum? Below is the change I made, but I am either testing it incorrectly, or this is the wrong change, as it is not working for me (i.e., I still see a growing number of openssl processes after updating the keystone code, the quantum code, and restarting quantum).

Am I understanding correctly that the changes to cms.py are imported by the keystone middleware, and thus I really just need to port the equivalent of the changes in keystone-all to quantum? Thanks.

dan

diff --git a/quantum/wsgi.py b/quantum/wsgi.py
index af46267..714f17e 100644
--- a/quantum/wsgi.py
+++ b/quantum/wsgi.py
@@ -22,12 +22,14 @@ import sys
 from xml.dom import minidom
 from xml.parsers import expat

+from eventlet.green import subprocess
 import eventlet.wsgi
 eventlet.patcher.monkey_patch(all=False, socket=True)
 import routes.middleware
 import webob.dec
 import webob.exc

+from keystone.common import cms
 from quantum.common import exceptions as exception
 from quantum import context
 from quantum.openstack.common import jsonutils
@@ -35,9 +37,12 @@ from quantum.openstack.common import log as logging

 LOG = logging.getLogger(__name__)

+def monkeypatch_keystone_cms():
+    cms.Popen = subprocess.Popen

 def run_server(application, port):
     """Run a WSGI server with the given application."""
+    monkeypatch_keystone_cms()
     sock = eventlet.listen(('0.0.0.0', port))
     eventlet.wsgi.server(sock, application)
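
The idea in the diff is to rebind the Popen name that keystone's cms module looks up when it spawns openssl. A self-contained sketch of that mechanism follows. Here fake_cms and CooperativePopen are hypothetical stand-ins for keystone.common.cms and eventlet.green.subprocess.Popen, and the sketch assumes the real module resolves Popen as a module-level name.

```python
import subprocess
import sys
import types

# Hypothetical stand-in for keystone.common.cms.
fake_cms = types.ModuleType("fake_cms")
fake_cms.Popen = subprocess.Popen


def run_child(module):
    """Spawn a trivial child with whatever Popen the module currently exposes."""
    proc = module.Popen([sys.executable, "-c", "pass"])
    proc.wait()
    return type(proc).__name__


class CooperativePopen(subprocess.Popen):
    """Hypothetical stand-in for eventlet.green.subprocess.Popen."""


def monkeypatch_cms(module, popen_cls):
    """Rebind the module-level Popen name, as the diff does for cms."""
    module.Popen = popen_cls


early_ref = fake_cms.Popen            # a reference copied before patching
print(run_child(fake_cms))            # Popen
monkeypatch_cms(fake_cms, CooperativePopen)
print(run_child(fake_cms))            # CooperativePopen
print(early_ref is subprocess.Popen)  # True
```

The last line hints at why the place and time of the patch can matter: any code that captured the original Popen before the patch ran keeps using it.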

Adam Young (ayoung) wrote :

The change you made looks correct. Is it possible that you are not running with the modified version of auth_token middleware?

dan wendlandt (danwent) wrote :

mmm.... I am using my devstack env and I just updated keystone in /opt/stack/keystone to include your patch, then restarted quantum. I was thinking that would be enough for the keystone middleware in quantum to get the updated code, but I must be missing something.

Adam Young (ayoung) wrote :

It is also possible that the monkey patch happens at a different point, or with different values. I noticed today that bin/quantum-server calls monkey_patch. It might be necessary to put the code there instead.

OpenStack Infra (hudson-openstack) wrote : Fix merged to keystone (master)

Reviewed: https://review.openstack.org/15429
Committed: http://github.com/openstack/keystone/commit/ef65550328ced10be85da2370dfc64b46dfc6071
Submitter: Jenkins
Branch: master

commit ef65550328ced10be85da2370dfc64b46dfc6071
Author: Adam Young <email address hidden>
Date: Mon Nov 5 12:49:29 2012 -0500

    monkeypatch cms Popen

    Bug 1074257

    Change-Id: I1372204c1e128aa664840e09b76fe979064d9efb

Changed in keystone:
status: In Progress → Fix Committed
Gary Kotton (garyk) wrote :

Hi,
Sadly this does not solve the problem. I am using the latest keystone code.
Thanks
Gary

1000 994 0.0 0.1 115760 3488 pts/0 Ss 08:40 0:00 -bash
1000 1041 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1055 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1086 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1088 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1168 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1169 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1171 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1172 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1176 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1177 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1383 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1384 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1390 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1394 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1395 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1401 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1402 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1729 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1730 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1732 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1733 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1735 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1736 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1763 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1765 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1766 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1957 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1958 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1961 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 1962 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2022 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2039 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2075 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2076 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2090 0.0 0.0 0 0 pts/6 Z+ 11:43 0:00 [openssl] <defunct>
1000 2103 0.0 0.0 0 0 pts/6 Z+ 11:43 0...


Adam Young (ayoung) wrote :

Gary, please confirm that the Eventlet Popen is actually getting called. It is quite possible that the System Popen is still in place, but that needs to be fixed in the Quantum code.

Changed in quantum:
assignee: dan wendlandt (danwent) → Gary Kotton (garyk)
status: Confirmed → In Progress
Terry Wilson (otherwiseguy) wrote :

Adding the monkey patching to quantum/wsgi.py didn't work for me, but moving the monkey patching code to bin/quantum-server stops the zombie process issue for me when testing with devstack.

dan wendlandt (danwent) wrote :

yes, moving the monkeypatch from wsgi.py to bin/quantum-server seems to work for me. I've +2'd the review.

OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/15645
Committed: http://github.com/openstack/quantum/commit/db0846d5000762038db1a4244679d495e5dcc712
Submitter: Jenkins
Branch: master

commit db0846d5000762038db1a4244679d495e5dcc712
Author: Gary Kotton <email address hidden>
Date: Sun Nov 4 07:03:48 2012 +0000

    Fix openssl zombies

    Fixes bug 1074257

    Change-Id: I6a6673ad12dfbd24dc1d02623e2e70068999fe45

Changed in quantum:
status: In Progress → Fix Committed
Terry Wilson (otherwiseguy) wrote :

Just for reference, the eventlet patch at https://bitbucket.org/which_linden/eventlet/pull-request/24/fix-waitpid-returning-0-0-and-add-test/diff causes the zombie issue with quantum to go away as well.
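
The title of that pull request mentions waitpid returning (0, 0). With the WNOHANG flag, os.waitpid returns (0, 0) when the child is still running, and a caller that mistakes this for "exited" would presumably never reap the child. A minimal standard-library sketch of the distinction follows. It is not the eventlet code, and poll_child is a hypothetical helper.

```python
import os
import sys


def poll_child(pid):
    """Return the exit code if the child has exited, else None (non-blocking)."""
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    if reaped_pid == 0:
        # (0, 0): the child exists but has not exited yet; nothing was reaped.
        return None
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


# Start a child that stays alive for about a second.
pid = os.spawnv(
    os.P_NOWAIT,
    sys.executable,
    [sys.executable, "-c", "import time; time.sleep(1)"],
)

first = poll_child(pid)         # None: the child is still sleeping
_, status = os.waitpid(pid, 0)  # block until exit; this reaps the child
print(first, os.WEXITSTATUS(status))
```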

OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/15771

Gary Kotton (garyk)
tags: removed: folsom-backport-potential
Thierry Carrez (ttx)
Changed in quantum:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in keystone:
milestone: none → grizzly-1
status: Fix Committed → Fix Released
Changed in heat:
status: New → Fix Released
Thierry Carrez (ttx)
Changed in keystone:
milestone: grizzly-1 → 2013.1
Thierry Carrez (ttx)
Changed in quantum:
milestone: grizzly-1 → 2013.1