[zmq]send rpc.call message failed on direct connections dynamic mode

Bug #1658913 reported by lihong
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
Undecided
Oleksii Zamiatin

Bug Description

2017-01-24 12:04:27.158 28372 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Processing target ROUTER/conductor for round-robin. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:59
2017-01-24 12:04:27.158 28372 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Target ROUTER/conductor is not in cache. Check matchmaker server. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:65
2017-01-24 12:04:27.160 28372 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Updating routing table from the matchmaker. 0 target(s) updated []. _update_routing_table /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:194
2017-01-24 12:04:27.161 28372 DEBUG oslo_messaging._drivers.zmq_driver.matchmaker.zmq_matchmaker_redis [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] [Redis] get_hosts for target <Target topic=conductor, namespace=baseapi, version=1.0>: ['host224:52851', 'host224:60471', 'host224:57940', 'host224:58155'] get_hosts /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/matchmaker/zmq_matchmaker_redis.py:248
2017-01-24 12:04:27.161 28372 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Received hosts ['host224:52851', 'host224:60471', 'host224:57940', 'host224:58155'] _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:68
2017-01-24 12:04:27.162 28372 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Host resolved for the current connection is host224:52851 get_round_robin_host /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:48
2017-01-24 12:04:27.162 28372 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Connecting DEALER socket 0a1743c6-bb9d-423c-9ade-c38b4a82ff31 to tcp://host224:52851 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:193
2017-01-24 12:04:27.162 28372 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Socket 0a1743c6-bb9d-423c-9ade-c38b4a82ff31 closed close /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:184
2017-01-24 12:04:27.163 28372 CRITICAL nova [req-efbf39dc-e816-4232-8904-36063d6c3aff - - - - -] Again: Resource temporarily unavailable
2017-01-24 12:04:27.163 28372 ERROR nova Traceback (most recent call last):
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/bin/nova-compute", line 10, in <module>
2017-01-24 12:04:27.163 28372 ERROR nova sys.exit(main())
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/nova/cmd/compute.py", line 74, in main
2017-01-24 12:04:27.163 28372 ERROR nova db_allowed=CONF.conductor.use_local)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 296, in create
2017-01-24 12:04:27.163 28372 ERROR nova db_allowed=db_allowed)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/nova/service.py", line 191, in __init__
2017-01-24 12:04:27.163 28372 ERROR nova self.conductor_api.wait_until_ready(context.get_admin_context())
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 157, in wait_until_ready
2017-01-24 12:04:27.163 28372 ERROR nova timeout=timeout)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/nova/baserpc.py", line 65, in ping
2017-01-24 12:04:27.163 28372 ERROR nova return cctxt.call(context, 'ping', arg=arg_p)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-01-24 12:04:27.163 28372 ERROR nova retry=self.retry)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2017-01-24 12:04:27.163 28372 ERROR nova timeout=timeout, retry=retry)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_zmq.py", line 161, in send
2017-01-24 12:04:27.163 28372 ERROR nova return client.send_call(target, ctxt, message, timeout, retry)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_client_base.py", line 52, in send_call
2017-01-24 12:04:27.163 28372 ERROR nova return self.call_publisher.send_call(request)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_publisher_manager.py", line 51, in _target_not_found_timeout
2017-01-24 12:04:27.163 28372 ERROR nova return func(self, request, *args, **kwargs)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_publisher_manager.py", line 126, in send_call
2017-01-24 12:04:27.163 28372 ERROR nova self.publisher.send_request(socket, request)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/publishers/dealer/zmq_dealer_publisher_direct.py", line 95, in send_request
2017-01-24 12:04:27.163 28372 ERROR nova self.sender.send(socket, request)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_senders.py", line 148, in send
2017-01-24 12:04:27.163 28372 ERROR nova send_version(socket, request)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_senders.py", line 158, in _send_v_1_0
2017-01-24 12:04:27.163 28372 ERROR nova socket.send(b'', zmq.SNDMORE|zmq.NOBLOCK)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py", line 124, in send
2017-01-24 12:04:27.163 28372 ERROR nova self.handle.send(*args, **kwargs)
2017-01-24 12:04:27.163 28372 ERROR nova File "/usr/lib/python2.7/site-packages/eventlet/green/zmq.py", line 275, in send
2017-01-24 12:04:27.163 28372 ERROR nova result = _Socket_send(self, msg, flags, copy, track)
2017-01-24 12:04:27.163 28372 ERROR nova File "zmq/backend/cython/socket.pyx", line 619, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:6169)
2017-01-24 12:04:27.163 28372 ERROR nova File "zmq/backend/cython/socket.pyx", line 666, in zmq.backend.cython.socket.Socket.send (zmq/backend/cython/socket.c:5931)
2017-01-24 12:04:27.163 28372 ERROR nova File "zmq/backend/cython/socket.pyx", line 201, in zmq.backend.cython.socket._send_copy (zmq/backend/cython/socket.c:2400)
2017-01-24 12:04:27.163 28372 ERROR nova File "zmq/backend/cython/checkrc.pxd", line 19, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:6920)
2017-01-24 12:04:27.163 28372 ERROR nova raise Again(errno)
2017-01-24 12:04:27.163 28372 ERROR nova Again: Resource temporarily unavailable

Revision history for this message
lihong (lihong) wrote :

The establishment of the connection is asynchronous in zeromq.
When completed execution acquire_connection() method in class
DealerPublisherDirect, the connection has not been established,
sending a message at this time will fail.So we need a delay 20ms
to wait for the connection establish.

Changed in oslo.messaging:
assignee: nobody → lihong (lihong)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/424480

Changed in oslo.messaging:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/429639

Changed in oslo.messaging:
assignee: lihong (lihong) → Oleksii Zamiatin (ozamiatin)
Revision history for this message
Oleksii Zamiatin (ozamiatin) wrote :

@lihong, could you please check my fix https://review.openstack.org/#/c/429639 with your deployment. Does it help?

As I understood the root cause is that eventlet zmq forces zmq.NOBLOCK flag (btw you can see this constant in the stack trace) and in such case we need to handle zmq.Again exception. I respectfully disagree to rely on time.sleep in production because sleep doesn't guarantee that connection will be established in that specific time and we have probability of failure anyway.

Thanks for finding this bug, hope to hear that fix helped. Feel free to ping me if it doesn't.

Revision history for this message
lihong (lihong) wrote :

Yes,I add zmq.NOBLOCK flag to socket.send() of RequestSenderDirect class.

When send CALL message,socket's option IMMEDIATE=1.
Messages whill be queued only to completed connections.
This will cause the socket to block if there are no
other connections. We need to avoid it.
It's necessary to add NOBLOCK flag to send() function.
The function will be performed in non-blocking mode.
If the message cannot be queued on the socket,
the send() function will fail with errno set to EAGAIN.

see \eventlet\green\zmq.py:
class Socket(_Socket):
    @_wraps(_Socket.send)
    def send(self, msg, flags=0, copy=True, track=False):
        """A send method that's safe to use when multiple greenthreads
        are calling send, send_multipart, recv and recv_multipart on
        the same socket.
        """
        if flags & NOBLOCK:
            result = _Socket_send(self, msg, flags, copy, track)
            # Instead of calling both wake methods, could call
            # self.getsockopt(EVENTS) which would trigger wakeups if
            # needed.
            self._eventlet_send_event.wake()
            self._eventlet_recv_event.wake()
            return result

        # TODO: pyzmq will copy the message buffer and create Message
        # objects under some circumstances. We could do that work here
        # once to avoid doing it every time the send is retried.
        flags |= NOBLOCK
        with self._eventlet_send_lock:
            while True:
                try:
                    return _Socket_send(self, msg, flags, copy, track)
                except ZMQError as e:
                    if e.errno == EAGAIN:
                        self._eventlet_send_event.block()
                    else:
                        raise
                finally:
                    # The call to send processes 0mq events and may
                    # make the socket ready to recv. Wake the next
                    # receiver. (Could check EVENTS for POLLIN here)
                    self._eventlet_recv_event.wake()

Revision history for this message
lihong (lihong) wrote :

This solution is not feasible,If connected to an abandoned socket, contract socket will fall into the dead loop.

Revision history for this message
Oleksii Zamiatin (ozamiatin) wrote :

@lihong, could you please clarify: did you receive the trace after changing RequestSenderDirect to use zmq.NOBLOCK in sending? Btw, I don't see this particular change in your patch... What kind of failure did you see without zmq.NOBLOCK (what behavior do we have on master branch)? Because the default behavior of dealer socket (storing messages before the first connection appears) is what was really wanted here.

Regarding the dead loop as I see now it is the same thing that eventlet does, handling EAGAIN and retrying. For dynamic connections that means to close by timeout (which will happen anyway when the reply waiting future expires). We also can limit the number of retries to 3-5 for example. But first of all I'd like to figure out do we really need to use zmq.NOBLOCK ...

Revision history for this message
lihong (lihong) wrote :
Download full text (8.6 KiB)

Yes,I receive the trace after changing RequestSenderDirect to use zmq.NOBLOCK in sending.

Here are the reasons for using zmq.NOBLOCK:
1、nova-compute send CALL message to nova-conductor , this is normal procedure:
[root@host224 home(keystone_admin)]# tail -f /var/log/nova/nova-compute.log
2017-02-08 10:59:51.588 28874 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Processing target ROUTER/conductor for round-robin. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:59
2017-02-08 10:59:51.588 28874 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Host resolved for the current connection is host224:62374 get_round_robin_host /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:48
2017-02-08 10:59:51.588 28874 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket 3937d22b-4335-4859-8a32-43a834d13675 to tcp://host224:62374 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-08 10:59:51.588 28874 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Processing target ROUTER/conductor for round-robin. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:59
2017-02-08 10:59:51.589 28874 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket 3937d22b-4335-4859-8a32-43a834d13675 to tcp://host224:53730 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-08 10:59:51.589 28874 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket 3937d22b-4335-4859-8a32-43a834d13675 to tcp://host224:53038 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-08 10:59:51.589 28874 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_senders [-] Sending CALL message a8ef9d4e-edaa-11e6-9e4b-fa163e6b3caa to target <Target topic=conductor, version=3.0> (v1.0) send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_senders.py:155
2017-02-08 10:59:51.589 28874 DEBUG oslo_messaging._drivers.zmq_driver.poller.green_poller [-] Registering socket 3937d22b-4335-4859-8a32-43a834d13675 register /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/poller/green_poller.py:32
2017-02-08 10:59:51.600 28874 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_receivers [-] Received REPLY for a8ef9d4e-edaa-11e6-9e4b-fa163e6b3caa _run_loop /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_receivers.py:125
2017-02-08 10:59:51.605 28874 DEBUG oslo_messaging._drivers.zmq_driver.poller.green_poller [-] Unregistering socket 3937d22b-4335-4859-8a32-43a834d13675 unregister /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/poller/green_poller.py:40
2017-02-08 10:59:51.605 28874 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Socket 3937d22b-4335-4859-8a32-43a834d13675 closed close /...

Read more...

Revision history for this message
Oleksii Zamiatin (ozamiatin) wrote :

So, as far as I can see the socket was blocked because we had unsent message in it's queue and socket didn't close. This behavior in DEALER socket prevents process exiting when something is in the queue and it is controlled by zmq_linger option which by default is infinite waiting. I recommend to set up linger to some definite value with dynamic connections mode to avoid such blocking.

Please update nova.conf with the following:

[oslo_messaging_zmq]
zmq_linger = 60

Revision history for this message
lihong (lihong) wrote :
Download full text (11.6 KiB)

Still not feasible
[root@host224 home(keystone_admin)]# vi /etc/nova/nova.conf
.....
[oslo_messaging_zmq]
rpc_zmq_matchmaker=redis
use_pub_sub=false
use_router_proxy=false
use_dynamic_connections=true
zmq_linger=60
.......

When restart nova-conductor, nova-compute send CALL message to conductor, socket.send() function has still been blocked:
2017-02-09 09:14:32.954 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Processing target ROUTER/conductor for round-robin. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:59
2017-02-09 09:14:32.955 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Host resolved for the current connection is host224:52880 get_round_robin_host /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:48
2017-02-09 09:14:32.955 8054 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket f33e3904-739c-4a7b-90a0-71637a669e1e to tcp://host224:52880 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-09 09:14:32.955 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Processing target ROUTER/conductor for round-robin. _fetch_round_robin_hosts_from_matchmaker /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:59
2017-02-09 09:14:32.955 8054 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket f33e3904-739c-4a7b-90a0-71637a669e1e to tcp://host224:55407 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-09 09:14:32.956 8054 DEBUG oslo_messaging._drivers.zmq_driver.zmq_socket [-] Connecting DEALER socket f33e3904-739c-4a7b-90a0-71637a669e1e to tcp://host224:57659 connect_to_address /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/zmq_socket.py:188
2017-02-09 09:16:47.391 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Updating routing table from the matchmaker. 2 target(s) updated ['ROUTER/scheduler', 'ROUTER/conductor']. _update_routing_table /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:194
2017-02-09 09:19:47.394 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Updating routing table from the matchmaker. 2 target(s) updated ['ROUTER/scheduler', 'ROUTER/conductor']. _update_routing_table /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:194
2017-02-09 09:22:47.397 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Updating routing table from the matchmaker. 2 target(s) updated ['ROUTER/scheduler', 'ROUTER/conductor']. _update_routing_table /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/zmq_driver/client/zmq_routing_table.py:194
2017-02-09 09:25:47.400 8054 DEBUG oslo_messaging._drivers.zmq_driver.client.zmq_routing_table [-] Updating routing table from the matchmaker. 2 target(s) updated ['ROUTER/scheduler', 'ROUTE...

Revision history for this message
lihong (lihong) wrote :

Sorry,this bug should not be reported on which the code has not yet been implemented, it should be closed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/429639
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=4a1679450d72b784272a7201015a5b331cdc09aa
Submitter: Jenkins
Branch: master

commit 4a1679450d72b784272a7201015a5b331cdc09aa
Author: ozamiatin <email address hidden>
Date: Mon Feb 6 12:45:26 2017 +0200

    [zmq] Dynamic connections send failure

    For dynamic connections it is crucial to close connection
    and not to have hanging sockets either we have sent message
    successfully or not.

    eventlet.green.zmq by default blocks the calling thread on sending message
    when connection was not established yet (which is correct DEALER
    socket behavior though), but socket cannot be closed when we hang on
    sending forever (if we never get the valid host to connect).

    eventlet also shields EAGAIN exception in default (blocking) sending mode
    so we need to use async zmq.NOBLOCK flag to receive this exception
    and hanlde it in our own way to not block forever.

    Change-Id: Ib561e061c4b20644213c059a8e8d0efd225edea1
    Closes-Bug: #1658913
    Closes-Bug: #1663459

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.18.0

This issue was fixed in the openstack/oslo.messaging 5.18.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (master)

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/424480

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.