Should RPC consume_in_thread() be more fault tolerant?

Bug #1189711 reported by Raymond Pekowski
This bug affects 1 person
Affects: oslo-incubator
Status: Fix Released
Importance: Undecided
Assigned to: Raymond Pekowski
Milestone: 2013.2

Bug Description

Here is a series of emails that explains the problem.

#################################################################################

Ray Pekowski <email address hidden>

Jun 6 (4 days ago)

to OpenStack
I was thinking about the use of RPC consume_in_thread() in the single reply queue changes I made, and how the crashing of the created thread would result in RPC call timeout failures in all future attempts to issue RPC calls (not casts). Probably the worst part is that the RPC calls would be sent and perhaps successfully executed, but the responses would not be received. The RPC would time out.

There is a similar issue with the consume_in_thread() being done in the RPC dispatcher that receives RPC requests, but in this case the RPCs are not executed.

I looked into consume_in_thread() and realized that the base thread it creates does not in fact do the equivalent of a catch-all (except Exception). This means some unexpected exception could result in the death of the consumer thread. I was wondering if we should add an "except Exception" in the consumer infinite loop that perhaps sleeps for a time and goes on. Of course, we would have to make sure the same loop still caught the exception used for a valid kill of the thread; currently that is "except greenlet.GreenletExit". On the other hand, if the thread died for some unexpected reason, it might just die again for the same reason, but that is why I suggest adding a sleep to the loop: to prevent a 100% CPU situation.
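
For illustration, a minimal sketch (not the oslo code) of the kind of loop described above: a catch-all around the consumer body that still lets a deliberate GreenletExit through and sleeps before retrying, so a repeating failure cannot spin the CPU. The function and parameter names are made up for the example.

    import logging
    import time

    import greenlet

    LOG = logging.getLogger(__name__)


    def consumer_loop(consume_once, retry_sleep=1):
        """Illustrative catch-all wrapper around a consumer loop body."""
        while True:
            try:
                consume_once()
            except greenlet.GreenletExit:
                # A deliberate kill of the green thread: let it exit cleanly.
                return
            except Exception:
                # Anything else is unexpected: log it and pause so a
                # repeating failure cannot spin at 100% CPU, then go on.
                LOG.exception('Unexpected error in consumer loop; retrying')
                time.sleep(retry_sleep)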

#################################################################################

Mike Wilson via lists.openstack.org

Jun 6 (4 days ago)

to OpenStack
Hey Ray,

I can confirm that this problem at least affects the qpid implementation. I think catching all exceptions and retrying with backoff is a great idea. I have noticed that it seems we haven't been doing a blanket catch-all for either the sql or the amqp connections. I'm assuming that was a decision, not a mistake, so I'm interested to hear feedback on this as well.

Anyway, +1 from me on doing what you are suggesting.

-Mike Wilson

#####################################################################################

Vishvananda Ishaya via lists.openstack.org

Jun 6 (4 days ago)

to OpenStack

+1. Consumer greenthreads have died in the past, and while I haven't seen this issue recently, it is a particularly nasty one to debug because often the exception doesn't show up in the logs.

Vish

#####################################################################################

Ray Pekowski <email address hidden>

Jun 6 (4 days ago)

to OpenStack
On Thu, Jun 6, 2013 at 12:03 PM, Ray Pekowski <email address hidden> wrote:

    I looked into consume_in_thread() and realized that the base thread it creates does not in fact do the equivalent of a catch-all (except Exception).

To be a little more fair on this, both impl_kombu.py and impl_qpid.py use the _ensure() method inside of their iterconsume() method, and that seems to be the place for catch-all type work. impl_kombu.py has everything but the call to reconnect() covered by the catch-all. reconnect() is used to recover from some exceptions, but reconnect() itself could raise an exception, so there is a hole to be filled. It might just be best to fill that hole closest to the base of the thread, simply to catch any coding errors or some future change by someone not aware that everything needs to be caught. impl_qpid.py simply doesn't catch everything in its _ensure() method.

####################################################################################

Eric Windisch via lists.openstack.org

Jun 6 (4 days ago)

to OpenStack
Thanks Ray.
I've caught a few ways I could easily crash consumers with ZeroMQ… I'll be following up with patches.

--
Eric Windisch

##################################################################################

On Thu, Jun 6, 2013 at 3:22 PM, Chris Behrens <email address hidden> wrote:

    ? There's a try/except in reconnect() in impl_kombu around the piece that can raise… unless we want to think something like LOG.* calls could fail.

Sure, there is a try/except in reconnect(), but the catch-all re-raises the exception unless it is a timeout. Here is the code:

            except Exception as e:
                # NOTE(comstud): Unfortunately it's possible for amqplib
                # to return an error not covered by its transport
                # connection_errors in the case of a timeout waiting for
                # a protocol response. (See paste link in LP888621)
                # So, we check all exceptions for 'timeout' in them
                # and try to reconnect in this case.
                if 'timeout' not in str(e):
                    raise

Preceding the above is a catch for a few other exception types:

            except (IOError, self.connection_errors) as e:
                pass

but anything else will simply raise an exception in an unprotected piece of code, since ensure() doesn't include the reconnect itself in a try block:

    def ensure(self, error_callback, method, *args, **kwargs):
        while True:
            try:
                return method(*args, **kwargs)
            except (self.connection_errors, socket.timeout, IOError) as e:
                if error_callback:
                    error_callback(e)
            except Exception as e:
                # NOTE(comstud): Unfortunately it's possible for amqplib
                # to return an error not covered by its transport
                # connection_errors in the case of a timeout waiting for
                # a protocol response. (See paste link in LP888621)
                # So, we check all exceptions for 'timeout' in them
                # and try to reconnect in this case.
                if 'timeout' not in str(e):
                    raise
                if error_callback:
                    error_callback(e)
            self.reconnect()

See the last line above.

One unknown is whether there are any other exceptions coming from the reconnect() code other than IOError, self.connection_errors and timeout. Maybe, maybe not. That's why I do give some credit to this code, but it wouldn't hurt to plug this hole in a way that would prevent future bugs. In other words, as close to the base of the thread as possible.
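
To see why an escaped exception is so damaging here, a tiny standalone illustration (eventlet only, not oslo code): a green thread whose body raises simply dies, and every later RPC reply would then sit in its queue with no consumer, so the callers time out. The function name and error message are made up.

    import eventlet


    def consume_forever():
        # Stand-in for the consumer loop: an unexpected error escapes.
        raise RuntimeError('reconnect() blew up with something unexpected')


    gt = eventlet.spawn(consume_forever)
    eventlet.sleep(0)   # yield so the green thread runs (and dies)
    print(gt.dead)      # True: nothing is left to consume replies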

#################################################################################

Ray Pekowski <email address hidden>

Jun 8 (2 days ago)

to OpenStack
Here is the first patch set for AMQP based RPCs:

I0d6ec8a5: Make AMQP based RPC consumer threads more robust

https://review.openstack.org/#/c/32235/1

I still need to add a QPID unit test case for it.

####################################################################################

Ray Pekowski <email address hidden>

Jun 8 (2 days ago)

to OpenStack

I have an idea. Maybe this code should be implemented as a decorator, for example one called robust_thread. At least that would make it common between the kombu and qpid code, and maybe zmq could use it too. Perhaps other components have a similar situation in which it would be good to restart crashed threads in a controlled manner.
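
For illustration, a rough sketch of what such a decorator might look like. The name robust_thread comes from the suggestion above; the backoff parameters, logging, and overall structure are assumptions for the example, not the code that was actually submitted in patch set 2.

    import functools
    import logging
    import time

    import greenlet

    LOG = logging.getLogger(__name__)


    def robust_thread(initial_sleep=1, max_sleep=60):
        """Keep re-running the wrapped thread body after unexpected errors."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                sleep = initial_sleep
                while True:
                    try:
                        return func(*args, **kwargs)
                    except greenlet.GreenletExit:
                        # A deliberate kill of the thread: let it propagate.
                        raise
                    except Exception:
                        LOG.exception('Unexpected error in consumer thread; '
                                      'retrying in %d seconds', sleep)
                        time.sleep(sleep)
                        sleep = min(sleep * 2, max_sleep)  # back off
            return wrapper
        return decorator

Applied as @robust_thread() to whatever function each driver hands to its consumer green thread, the catch-all would then sit as close to the base of the thread as possible, which is the goal discussed above.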

##################################################################################

Ray Pekowski <email address hidden>

Jun 9 (1 day ago)

to OpenStack

Just submitted patch set 2 which uses a decorator: https://review.openstack.org/#/c/32235/1
Any preferences or comments on how it is implemented?

Changed in oslo:
assignee: nobody → Raymond Pekowski (pekowski)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33481

Changed in oslo:
assignee: Raymond Pekowski (pekowski) → Eric Windisch (ewindisch)
Revision history for this message
Erica Windisch (ewindisch) wrote:

Ray: FYI, "hudson-openstack" changed the assignee to me. I seem unable to switch this bug back to you.

The proposed fix above (33481) is specific to ZeroMQ, which is parallel to the AMQP fix you're working on.

Changed in oslo:
assignee: Eric Windisch (ewindisch) → Raymond Pekowski (pekowski)
Changed in oslo:
assignee: Raymond Pekowski (pekowski) → Eric Windisch (ewindisch)
Changed in oslo:
assignee: Eric Windisch (ewindisch) → Raymond Pekowski (pekowski)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix merged to oslo-incubator (master)

Reviewed: https://review.openstack.org/32235
Committed: http://github.com/openstack/oslo-incubator/commit/22ec8ff616a799085239e3e529daeeefea6366c4
Submitter: Jenkins
Branch: master

commit 22ec8ff616a799085239e3e529daeeefea6366c4
Author: Raymond Pekowski <email address hidden>
Date: Tue Jun 25 04:24:05 2013 +0000

    Make AMQP based RPC consumer threads more robust

    bug 1189711 Should RPC consume_in_thread() be more fault tolerant?

    There are unprotected holes in the thread kicked off by RPC
    consume_in_thread such that an exception will kill the thread.
    This exists for both the service (TopicConsumer) and the new
    reply proxy (DirectConsumer) consumers. This patch plugs
    those holes as close to the base of the consumer thread as
    possible by catching all non-caught exceptions and retrying
    with sleeps between retries and some pacing of the log
    output to prevent log flooding.

    Change-Id: I0d6ec8a5e3a310314da201656ee862bb40b41616
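
The "pacing of the log output" mentioned in the commit message could be done in several ways; a minimal illustration of the idea (the class, its name, and the 60-second interval are assumptions, not taken from the merged change):

    import time


    class PacedLogger(object):
        """Emit at most one log record per `interval` seconds."""

        def __init__(self, log, interval=60):
            self._log = log
            self._interval = interval
            self._last = 0.0

        def exception(self, msg, *args):
            # Drop repeats of a flapping error so retries cannot flood the log.
            now = time.time()
            if now - self._last >= self._interval:
                self._last = now
                self._log.exception(msg, *args)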

Changed in oslo:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in oslo:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in oslo:
milestone: havana-2 → 2013.2