SWIFT requests hang because the "oslo_messaging.Notifier.info()" interface does not return in some failure scenarios

Bug #1696905 reported by Sanchit Malhotra on 2017-06-09
This bug affects 1 person
Affects: ceilometermiddleware (Importance: Undecided, Assigned to: Unassigned)
Affects: oslo.messaging (Importance: Undecided, Assigned to: Unassigned)

Bug Description

When the Ceilometer middleware is integrated with SWIFT, some user requests hang at the ceilometer middleware level.
With the default connection pool max size of 30, if 31 concurrent requests reach oslo_messaging, an indefinite hang is observed.
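This behaviour is consistent with connection-pool exhaustion: once 30 requests each hold a pooled connection and none is returned, the 31st get() has nothing to hand out and blocks. The following is a minimal stdlib model of such a bounded pool for illustration only (class and method names here are made up; this is not oslo's actual implementation):

```python
import threading
import time


class BoundedPool:
    """Toy bounded connection pool: get() blocks once max_size
    connections are checked out and none has been returned."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._current_size = 0
        self._items = []
        self._cond = threading.Condition()

    def get(self, timeout=None):
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            while True:
                if self._items:
                    return self._items.pop()
                if self._current_size < self._max_size:
                    self._current_size += 1
                    return object()  # stand-in for a new AMQP connection
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise RuntimeError("pool exhausted")
                self._cond.wait(remaining)

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()


pool = BoundedPool(max_size=30)
held = [pool.get() for _ in range(30)]  # 30 in-flight requests hold every slot

try:
    pool.get(timeout=0.2)  # the 31st request blocks; here it times out instead
    exhausted = False
except RuntimeError:
    exhausted = True

pool.put(held.pop())    # once any connection is returned...
recovered = pool.get()  # ...a waiter can proceed again
print(exhausted)  # True
```

In the real service there is no timeout on the pool get, so the 31st request simply waits until some other request returns a connection, which never happens if the holders are themselves stuck.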

Debugging further to identify where the requests are stuck, we observed that the following call into the oslo_messaging Notifier class does not return.

File: https://github.com/openstack/ceilometermiddleware/blob/stable/mitaka/ceilometermiddleware/swift.py#L248
{code}
 47 import oslo_messaging
....
121         self._notifier = oslo_messaging.Notifier(
122             oslo_messaging.get_transport(cfg.CONF, url=conf.get('url')),
123             publisher_id='ceilometermiddleware',
124             driver=conf.get('driver', 'messagingv2'),
125             topic=conf.get('topic', 'notifications'))
126
....
283
284         self._notifier.info(context.get_admin_context().to_dict(),
285                             'objectstore.http.request', event.as_dict())
286
{code}

The call to "self._notifier.info" does not return in some cases, and as a result the user requests are stuck indefinitely.

It is confirmed that there is no problem with the RabbitMQ server or the ceilometer server: both are reachable from another SWIFT cluster at the time the problem is observed on the affected setup.

A network packet capture confirms that the ceilometer middleware does not send any data (metering information) to the RabbitMQ server.

It is thus confirmed that the request is stuck somewhere inside the oslo_messaging notifier API call.

Its effect is also seen in the proxy service: no new user requests are handled once any request is stuck in this manner.

Restarting the proxy service resolves the problem temporarily.

Ken Giusti (kgiusti) wrote :

Hi Sanchit,

Which version of oslo.messaging are you using?

Any chance you can get a traceback of the hung process? Like via pstack?
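(As an aside: on Python 3, the stdlib faulthandler module can dump Python-level thread stacks without an external debugger; the Python 2.6/2.7 deployment here would need pstack or a gdb-based approach instead. A minimal sketch:)

```python
import faulthandler
import signal
import tempfile

# On a live service you could register a signal handler, then
# `kill -USR1 <pid>` dumps every Python thread's stack to stderr
# without stopping the process.
faulthandler.register(signal.SIGUSR1)

# Or dump on demand; faulthandler writes at file-descriptor level,
# so it needs a real file (with a fileno()), not a StringIO.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

print("most recent call first" in dump)  # True
```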

Can you get a dump of rabbitmq queue counters by running the following commands *after* the hang occurs?

$ sudo rabbitmqctl list_queues
$ sudo rabbitmqctl list_queues messages_unacknowledged

And lastly can you capture logs (debug level and up) from the hung service?

thanks

Sanchit Malhotra (isanchitm) wrote :
Download full text (22.1 KiB)

Hi Ken,

>Which version of oslo.messaging are you using?
oslo.messaging-5.12.0

>Any chance you can get a traceback of the hung process? Like via pstack?

The Python traceback, obtained manually, is below:

----
Traceback (most recent call last):
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/notify/messaging.py", line 70, in notify
    retry=retry)
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/transport.py", line 104, in _send_notification
    retry=retry)
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/amqpdriver.py", line 471, in send_notification
    envelope=(version == 2.0), notify=True, retry=retry)
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/amqpdriver.py", line 428, in _send
    with self._get_connection(rpc_common.PURPOSE_SEND) as conn:
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/amqpdriver.py", line 374, in _get_connection
    purpose=purpose)
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/common.py", line 399, in __init__
    self.connection = connection_pool.get()
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/pool.py", line 103, in get
    wait_condition(self._cond)
  File "/usr/lib/python2.6/osd-packages/oslo_messaging/_drivers/pool.py", line 33, in wait_condition
    cond.wait(timeout=1)
  File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 361, in wait
    _sleep(delay)
  File "/opt/objectStorage/osd/proxyService/timeout.py", line 18, in _handle_timeout
    raise TimeoutError(error_message)
TimeoutError: Timer expired (txn: txf02c301ee4fd40bdbfb29-00593ebf05)
----

Apart from above, pstack is below:

====

[root@HN0101 ~]# pstack 29016
Thread 3 (Thread 0x7f234f41a700 (LWP 30639)):
#0 0x00000034c04e8ef3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f2352d173f2 in boost::asio::detail::epoll_reactor<false>::run(bool) () from /opt/objectStorage/lib/release_prod_64/libcommunication.so
#2 0x00007f2352d190ce in boost::asio::detail::task_io_service<boost::asio::detail::epoll_reactor<false> >::run(boost::system::error_code&) () from /opt/objectStorage/lib/release_prod_64/libcommunication.so
#3 0x00007f2352d14024 in CommunicationHelper::invoke() () from /opt/objectStorage/lib/release_prod_64/libcommunication.so
#4 0x00007f235775bd10 in thread_proxy () from /usr/lib64/libboost_thread-mt.so.1.41.0
#5 0x00000034c0c079d1 in start_thread () from /lib64/libpthread.so.0
#6 0x00000034c04e88fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f234e0ae700 (LWP 20244)):
#0 0x00000034c0c0d930 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007f235e835788 in PyThread_acquire_lock () from /opt/rh/python27/root/usr/lib64/libpython2.7.so.1.0
#2 0x00007f235e8398f4 in ?? () from /opt/rh/python27/root/usr/lib64/libpython2.7.so.1.0
#3 0x00007f235e808c5f in PyEval_EvalFrameEx () from /opt/rh/python27/root/usr/lib64/libpython2.7.so.1.0
#4 0x00007f235e80a64e in PyEval_EvalCodeEx () from /opt/rh/python27/root/usr/lib64/libpython2.7.so.1.0
#5 0x00007f235e808cdb in PyEval_EvalFrameEx () from /opt/rh/python27/root/usr/lib64/libpython2.7.so.1.0
#6 0x00...

Sanchit Malhotra (isanchitm) wrote :

It has been further identified that the oslo_messaging is stuck in wait_condition().

oslo_messaging/_drivers/pool.py
{code}
 85     def get(self):
 86         """Return an item from the pool, when one is available.
 87
 88         This may cause the calling thread to block.
 89         """
 90         with self._cond:
 91             while True:
 92                 try:
 93                     ttl_watch, item = self._items.pop()
 94                     self.expire()
 95                     return item
 96                 except IndexError:
 97                     pass
 98
 99                 if self._current_size < self._max_size:
100                     self._current_size += 1
101                     break
102
103                 wait_condition(self._cond)
104
105         # We've grabbed a slot and dropped the lock, now do the creation
106         try:
107             return self.create()
108         except Exception:
109             with self._cond:
110                 self._current_size -= 1
111             raise
{code}

In the problematic case, when requests and the service hang, the following condition state is observed:
self._cond = <Condition(<_RLock owner='MainThread' count=1>, 0)>

In our system, cond.wait() ends up being called with a timeout of 1 second:

{code}
# TODO(harlowja): remove this when we no longer have to support 2.7
 29 if sys.version_info[0:2] < (3, 2):
 30     def wait_condition(cond):
 31         # FIXME(markmc): timeout needed to allow keyboard interrupt
 32         # http://bugs.python.org/issue8844
 33         cond.wait(timeout=1)
 34 else:
 35     def wait_condition(cond):
 36         cond.wait()
{code}
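Note that the 1-second timeout does not bound the overall wait: wait() simply returns after a second so the loop in get() can re-check the pool and, if it is still empty, wait again, so the caller can still block forever. A small stdlib sketch of that wake-and-retry behaviour (the variable names are illustrative, not oslo's):

```python
import threading
import time

cond = threading.Condition()
pool_has_item = False  # stand-in for "a connection was returned to the pool"

start = time.monotonic()
attempts = 0
with cond:
    # Mirrors the get() loop: wake up every second, re-check, wait again.
    while not pool_has_item and attempts < 3:
        cond.wait(timeout=1)  # returns after ~1s even though nobody notified
        attempts += 1
elapsed = time.monotonic() - start

print(attempts)        # 3: each wait timed out and the loop retried
print(elapsed >= 2.5)  # True: the caller was blocked the whole time
```

So the timeout only exists (on Python < 3.2) to let a KeyboardInterrupt through; it does not make get() eventually give up.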

There seems to be some problem with RLock. What do you think?

Sanchit Malhotra (isanchitm) wrote :
Download full text (20.6 KiB)

We suspect that the connection pool implementation within the oslo_messaging library is not greenthread-aware.

We simulated use of the oslo_messaging connection pool from an independent application (run with Python 2.7.8) using two thread models:

1. Parallel requests using green threads.

{code}
#################################################################
# @Usage : python file_name.py hostname port username password #
# hostname : Machine IP on which rabbitMQ server is running #
# port : port on which rabbitMQ server is listening #
# username : username of hostname #
# password : password of hostname #
#################################################################

import sys
import threading
import os

os.environ['PBR_VERSION']='1.10.0'
from oslo_config import cfg
import eventlet
from oslo_messaging._drivers import pool
from oslo_messaging._drivers.impl_rabbit import Connection

from oslo_messaging._drivers import amqp as rpc_amqp
from oslo_messaging._drivers import amqpdriver
from oslo_messaging._drivers import base
from oslo_messaging.transport import get_transport
from oslo_messaging.transport import TransportURL, TransportHost

max_size = 30
min_size = 0
ttl = 1200

"""
The below part of code to initialize rabbit_opts is copy-paste from .../oslo_messaging/_drivers/impl_rabbit.py
Line 56 to 204
version : oslo.messaging-5.12.0
"""
rabbit_opts = [
    cfg.StrOpt('kombu_ssl_version',
               default='',
               deprecated_group='DEFAULT',
               help='SSL version to use (valid only if SSL enabled). '
                    'Valid values are TLSv1 and SSLv23. SSLv2, SSLv3, '
                    'TLSv1_1, and TLSv1_2 may be available on some '
                    'distributions.'
               ),
    cfg.StrOpt('kombu_ssl_keyfile',
               default='',
               deprecated_group='DEFAULT',
               help='SSL key file (valid only if SSL enabled).'),
    cfg.StrOpt('kombu_ssl_certfile',
               default='',
               deprecated_group='DEFAULT',
               help='SSL cert file (valid only if SSL enabled).'),
    cfg.StrOpt('kombu_ssl_ca_certs',
               default='',
               deprecated_group='DEFAULT',
               help='SSL certification authority file '
                    '(valid only if SSL enabled).'),
    cfg.FloatOpt('kombu_reconnect_delay',
                 default=1.0,
                 deprecated_group='DEFAULT',
                 help='How long to wait before reconnecting in response to an '
                      'AMQP consumer cancel notification.'),
    cfg.StrOpt('kombu_compression',
               help="EXPERIMENTAL: Possible values are: gzip, bz2. If not "
                    "set compression will not be used. This option may not "
                    "be available in future versions."),

    cfg.IntOpt('kombu_missing_consumer_retry_timeout',
               deprecated_name="kombu_reconnect_timeout",
               default=60,
               help='How long to wait a missing client before abandoning to '
                    'send it its replies. This value should not be longer '...

Arun (arun-verma) wrote :

We are facing the same issue: Swift proxy requests hang at the ceilometer middleware.

Ken Giusti (kgiusti) wrote :

Hi - sorry for the inexcusably late reply :(

I see the same issue with the reproducer you provided above. However, that reproducer does not properly monkey patch in the eventlet case.

In order to use oslo.messaging with greenthreads (eventlet), the application must override all system library calls that can block the thread. Eventlet provides a method to do exactly that: monkey_patch().

See http://eventlet.net/doc/patching.html#monkey-patch

In the case of your failing script, I added:

{code}
import eventlet
eventlet.sleep()
eventlet.monkey_patch()
{code}

at the very top of the script, before all other imports. Once that is done, the script no longer hangs.

(The second line, eventlet.sleep(), may not be necessary. It is a work-around for a bug in my older version of eventlet: https://github.com/eventlet/eventlet/issues/428)

Arun (arunverma) wrote :

Hi Ken,

Thanks for suggesting this. I'm checking the suggested solution and will get back to you if I face any issues.

Ben Nemec (bnemec) on 2018-04-25
Changed in oslo.messaging:
status: New → Incomplete