Should RPC consume_in_thread() be more fault tolerant?
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
oslo-incubator |
Fix Released
|
Undecided
|
Raymond Pekowski |
Bug Description
Here is a series of emails that explains the problem.
#######
Ray Pekowski <email address hidden>
Jun 6 (4 days ago)
to OpenStack
I was thinking about the use of RPC consume_in_thread() in the single reply queue changes I made and how the crashing of the created thread would result in RPC call timeout failures in all future attempts to issue RPC calls (not casts). Probably, the worst part is that the RPC calls would be sent and perhaps successfully executed, but the responses would not be received. The RPC would timeout.
There is a similar issue with the consume_in_thread() being done in the RPC dispatcher that receives RPC requests, but in this case the RPCs are not executed.
I looked into consume_in_thread() and realize that the base thread it creates does not in fact do the equivalent of a catch all (except Exception). This means some unexpected exception could result in the death of the consumer thread. I was wondering if we should add an "except Exception" in the consumer infinite loop that perhaps sleeps for a time and goes on. Of course, we would have to make sure in the same loop an exception for a valid kill of the thread was caught. Currently that is "except greenlet.
#######
Mike Wilson via lists.openstack.org
Jun 6 (4 days ago)
to OpenStack
Hey Ray,
I can confirm that this problem at least affects the qpid implementation. I think catching all exceptions and retrying in a backing off manner is great. I have noticed that it seems like we haven't been doing a blanket catch-all for both sql and amqp connections. I'm assuming that was a decision not a mistake, so I'm interested to hear feedback on this also.
Anyway, +1 from me on doing what you are suggesting.
-Mike Wilson
#######
Vishvananda Ishaya via lists.openstack.org
Jun 6 (4 days ago)
to OpenStack
+1 consumer greenthreads have died in the past and while i haven't seen this issue recently, it is a particularly nasty one to debug because often the exception doesn't show up in the logs.
Vish
#######
Ray Pekowski <email address hidden>
Jun 6 (4 days ago)
to OpenStack
On Thu, Jun 6, 2013 at 12:03 PM, Ray Pekowski <email address hidden> wrote:
I looked into consume_in_thread() and realize that the base thread it creates does not in fact do the equivalent of a catch all (except Exception).
To be a little more fair on this, both impl_kombu.py and impl_qpid.py use the _ensure() method inside of their iterconsume() method and that seems to be the place for catch all type work. impl_kombu.py has everything but the call to reconnect() covered in the catch all. reconnect() is used to recover from some exceptions, but reconnect() itself could raise and exception, so there is a hole to be filled. It might just be best to fill that hole closest to the base of the thread simply to catch any coding errors or some future change by someone not aware that everything needs to be caught. impl_qpid.py simply doesn't catch everything in the _ensure() method.
#######
Eric Windisch via lists.openstack.org
Jun 6 (4 days ago)
to OpenStack
Thanks Ray.
I've caught a few ways I could easily crash consumers with ZeroMQ… I'll be following up with patches.
--
Eric Windisch
#######
On Thu, Jun 6, 2013 at 3:22 PM, Chris Behrens <email address hidden> wrote:
? There's a try/except in reconnect() in impl_kombu around the piece that can raise… unless we want to think something like LOG.* calls could fail.
Sure there is a try/except in reconnect(), but the catch all re-raises the exception unless it is a timeout. Here is the code:
except Exception as e:
# NOTE(comstud): Unfortunately it's possible for amqplib
# to return an error not covered by its transport
# connection_errors in the case of a timeout waiting for
# a protocol response. (See paste link in LP888621)
# So, we check all exceptions for 'timeout' in them
# and try to reconnect in this case.
if 'timeout' not in str(e):
Preceding the above is a catch for a few other exception types:
except (IOError, self.connection
but anything else will simply raise an exception in an unprotected piece of code, since ensure() doesn't include the reconnect itself in a try block:
def ensure(self, error_callback, method, *args, **kwargs):
while True:
try:
except (self.connectio
if error_callback:
except Exception as e:
# NOTE(comstud): Unfortunately it's possible for amqplib
# to return an error not covered by its transport
# connection_errors in the case of a timeout waiting for
# a protocol response. (See paste link in LP888621)
# So, we check all exceptions for 'timeout' in them
# and try to reconnect in this case.
if 'timeout' not in str(e):
if error_callback:
See the last line above.
One unknown is whether there are any other exceptions coming from the reconnect() code other than IOError, self.connection
#######
Ray Pekowski <email address hidden>
Jun 8 (2 days ago)
to OpenStack
Here is the first patch set for AMQP based RPCs:
I0d6ec8a5: Make AMQP based RPC consumer threads more robust
https:/
I still need to add a QPID unit test case for it.
#######
Ray Pekowski <email address hidden>
Jun 8 (2 days ago)
to OpenStack
I have an idea. Maybe this code should be implemented as a decorator, for example calling it robust_thread. At least that would make it common between kombu and qpid code. Maybe zmq could use it. Perhaps other components have a similar situation whereby it would be good to restart crashed threads in a controlled manner.
#######
Ray Pekowski <email address hidden>
Jun 9 (1 day ago)
to OpenStack
Just submitted patch set 2 which uses a decorator: https:/
Any preferences or comments on how it is implemented?
Changed in oslo: | |
assignee: | nobody → Raymond Pekowski (pekowski) |
status: | New → In Progress |
Changed in oslo: | |
assignee: | Eric Windisch (ewindisch) → Raymond Pekowski (pekowski) |
Changed in oslo: | |
assignee: | Raymond Pekowski (pekowski) → Eric Windisch (ewindisch) |
Changed in oslo: | |
assignee: | Eric Windisch (ewindisch) → Raymond Pekowski (pekowski) |
Changed in oslo: | |
milestone: | none → havana-2 |
status: | Fix Committed → Fix Released |
Changed in oslo: | |
milestone: | havana-2 → 2013.2 |
Fix proposed to branch: master /review. openstack. org/33481
Review: https:/