Message was sent to wrong node with zmq as rpc_backend

Bug #1301723 reported by zhangjialong on 2014-04-03
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Low
zhangjialong
oslo.messaging (Ubuntu)
Undecided
Unassigned

Bug Description

In Icehouse b3 and rc1, i use zmq as nova rpc_backend. When i boot vm instances, error occored like follows:

2014-04-03 11:58:35.897 6583 ERROR oslo.messaging.rpc.dispatcher [-] Exception during message handling: computer3 is not a valid node managed by this compute host.
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 133, in _dispatch_and_reply
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 176, in _dispatch
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 122, in _do_dispatch
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/exception.py", line 88, in wrapped
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher payload)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, in __exit__
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/exception.py", line 71, in wrapped
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 244, in decorated_function
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher pass
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, in __exit__
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 230, in decorated_function
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 295, in decorated_function
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher function(self, context, *args, **kwargs)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 272, in decorated_function
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher e, sys.exc_info())
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, in __exit__
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 259, in decorated_function
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1849, in run_instance
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher do_run_instance()
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/openstack/common/lockutils.py", line 249, in inner
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1848, in do_run_instance
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher legacy_bdm_in_spec)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 997, in _run_instance
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher notify("error", fault=e) # notify that build failed
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, in __exit__
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 981, in _run_instance
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher instance, image_meta, legacy_bdm_in_spec)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1035, in _build_instance
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher rt = self._get_resource_tracker(node)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 454, in _get_resource_tracker
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher "compute host.") % nodename)
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher NovaException: computer3 is not a valid node managed by this compute host.
2014-04-03 11:58:35.897 6583 TRACE oslo.messaging.rpc.dispatcher

For more details I found, the message should be send to computer3, while computer2 received.

In the method named _multi_send in the file impl_zmq.py , there is a problem when get and handle the queues .

zhangjialong (zhangjl) on 2014-04-03
Changed in oslo.messaging:
assignee: nobody → zhangjialong (zhangjl)

Fix proposed to branch: master
Review: https://review.openstack.org/84938

Changed in oslo.messaging:
status: New → In Progress
Mark McLoughlin (markmc) on 2014-05-06
Changed in oslo.messaging:
importance: Undecided → Low
tags: added: icehouse-backport-potential

@zhangialong

Can you please provide more context to the stack trace. I can see that an exception is being raised, but I need more of the stack to start trying to look at the problem.

zhangjialong (zhangjl) wrote :

let me describe this problem more detail.

The follow is my match_makering.json file for zmq; http://pastebin.com/zza8KhJi

And the next is the message transport by zmq through impl_zmq.py: http://pastebin.com/TfgTE3rj

For more information , the messagins were logged by the code in impl_zmq.py line 731 and 734

From the log, we can see the messages should be send to computer3 but send to computer2, then ,cause this bug

description: updated
Mark McLoughlin (markmc) wrote :

As I said in the review, I think the bug is that in oslo-incubator, a call to rpc.cast() or rpc.call() for a specific node will be passed topic='topic.server' and we pass that string onto multi_send()

In oslo.messaging the target object contains the topic and server attributes, so we need to concatenate them into a string before passing them to multi_send()

We simply need to add

 elif target.server:
     topic = '%s.%s' % (topic, target.server)

in ZmqDriver._send()

My broader concern is that if this is a genuine bug, then Icehouse Nova cannot possibly work with zmq ... and we only seem to have one report of that fact

Ajaya Agrawal (ajayaa) on 2014-08-25
summary: - Message was sended to wrong node with use zmq as rpc_backend
+ Message was sent to wrong node with zmq as rpc_backend
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in oslo.messaging (Ubuntu):
status: New → Confirmed
James Page (james-page) wrote :

@Mark

"My broader concern is that if this is a genuine bug, then Icehouse Nova cannot possibly work with zmq ... and we only seem to have one report of that fact"

AFAICT the messaging flow that generates this error is hit when using nova+neutron with notifications enabled for port creation etc..; the proposed patch fixes this problem but it does highlight some fragility as its really important that the zmq_host configuration matches the nova host configuration exactly and that the host configuration is resolvable/addressable for the message dispatch.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.4.0.0~a3-0ubuntu2

---------------
oslo.messaging (1.4.0.0~a3-0ubuntu2) utopic; urgency=medium

  * d/p/zmq-server-routing.patch: Cherry pick WIP to ensure that ZMQ
    messages get routed correctly for call's (LP: #1301723).
 -- James Page <email address hidden> Thu, 11 Sep 2014 09:58:18 +0100

Changed in oslo.messaging (Ubuntu):
status: Confirmed → Fix Released
Changed in oslo.messaging:
milestone: none → next-kilo

Reviewed: https://review.openstack.org/84938
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=3e6c0b38a36af5276dd4ce7a1bd0d0dfc5a3d03f
Submitter: Jenkins
Branch: master

commit 3e6c0b38a36af5276dd4ce7a1bd0d0dfc5a3d03f
Author: zhangjl <email address hidden>
Date: Thu Apr 3 12:44:23 2014 +0800

    Message was send to wrong node with use zmq as rpc_backend

    zmq: Send message to correct endpoint for directed messages

    If set the zmq driver needs to use the topic.server
    attribute to ensure that messages directed to a
    specific endpoint are sent to the correct server rather
    than being randomly sent to endpoints advertising the
    required topic.

    Change-Id: If63235640e0b8e6ec8690a4edaefa1b303705ac6
    Closes-bug: #1301723

Changed in oslo.messaging:
status: In Progress → Fix Committed
Mehdi Abaakouk (sileht) on 2014-12-02
Changed in oslo.messaging:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers