At least one part of this problem has been narrowed down to an unfortunate side effect of the oslo.messaging change that went into Juno. To understand the problem, a simple repro was created. You can recreate the problem and experiment with the proposed fix by un-commenting a line in the test program.

To run the repro, copy server.py, server.conf, client.py and client.conf to a Linux machine (I used Ubuntu) and execute:

python client.py --config-file client.conf

You will get output like this:

amrith@amrith-work:~/source/rabbit$ python client.py --config-file client.conf
2016-01-12 13:30:06.948 101415 DEBUG __main__ [-] client starting main client.py:27
2016-01-12 13:30:07.005 101415 DEBUG __main__ [-] Now calling get_rpc_server() main client.py:56
2016-01-12 13:30:07.010 101415 DEBUG __main__ [-] Now calling server.start() main client.py:59
2016-01-12 13:30:07.010 101415 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:30:07.024 101415 DEBUG __main__ [-] Now calling rpcserver.stop() main client.py:67
2016-01-12 13:30:07.025 101415 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:30:07.031 101415 DEBUG oslo_messaging._drivers.amqpdriver [-] CAST unique_id: 435d43890c394e5ab41bdd3b336012e7 exchange 'openstack' topic 'fbb9b530-3299-44d3-b048-ee2411afce0c' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
2016-01-12 13:30:07.033 101415 DEBUG __main__ [-] Going to sleep for 240s main client.py:89
2016-01-12 13:30:07.033 101415 DEBUG oslo_messaging._drivers.amqpdriver [-] received message unique_id: 435d43890c394e5ab41bdd3b336012e7 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:195

In another window, execute:

python server.py --config-file server.conf

You will get something like this:

amrith@amrith-work:~/source/rabbit$ python server.py --config-file server.conf
2016-01-12 13:31:01.532 101448 DEBUG __main__ [-] server starting main server.py:54
2016-01-12 13:31:01.589 101448 DEBUG __main__ [-] Server listening for server = guestclient.fbb9b530-3299-44d3-b048-ee2411afce0c, topic = fbb9b530-3299-44d3-b048-ee2411afce0c main server.py:64
2016-01-12 13:31:01.594 101448 DEBUG __main__ [-] Calling server.start() main server.py:76
2016-01-12 13:31:01.594 101448 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqp.py:103
2016-01-12 13:31:01.606 101448 DEBUG __main__ [-] Going to sleep for 240s now main server.py:79
2016-01-12 13:33:07.023 101448 DEBUG oslo_messaging._drivers.amqpdriver [-] received message unique_id: 435d43890c394e5ab41bdd3b336012e7 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:195
2016-01-12 13:33:07.024 101448 DEBUG __main__ [-] got a ping from 2016-01-12T13:30:07.024895 at 2016-01-12 13:33:07.024389 ping server.py:41

Observe that the message cast by the client at about 2016-01-12 13:30:07 was not delivered to the server until roughly three minutes later, at 13:33:07. This mirrors what Trove does: the code in the task manager pushes a prepare() message to the guest agent before the guest agent has come up.
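The shape of the repro client can be inferred from the log above; the following is a minimal sketch only (the actual client.py is not reproduced here, and the topic value, config handling, and method name 'ping' are illustrative). The key point it demonstrates is: start an RPC server on a topic, stop it without waiting, then immediately cast on that same topic before the real server exists.

# Minimal sketch of what the repro client does (illustrative, not the actual client.py).
import datetime
import time

from oslo_config import cfg
import oslo_messaging as messaging

TOPIC = 'fbb9b530-3299-44d3-b048-ee2411afce0c'  # the repro generates a uuid topic

cfg.CONF()  # picks up --config-file from the command line
transport = messaging.get_transport(cfg.CONF)

# Start and immediately stop an RPC server on the topic, just to create the queue.
target = messaging.Target(topic=TOPIC, server='guestclient.' + TOPIC)
server = messaging.get_rpc_server(transport, target, [])
server.start()
server.stop()
# Un-commenting the next line is the proposed fix being tested:
# server.wait()

# Cast a ping on the same topic; the real server (server.py) is not up yet.
client = messaging.RPCClient(transport, messaging.Target(topic=TOPIC))
client.cast({}, 'ping', sent=str(datetime.datetime.now()))

time.sleep(240)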
This poses a challenge for oslo.messaging, especially with RabbitMQ: the server for the queue should be the guest agent, and in the normal course of events the queue would be created by that server. In the absence of the server (the guest agent), the taskmanager, when sending the prepare message (trove/guestagent/api.py), first creates the guest queue by calling _create_guest_queue(), which launches an RPC server on the target and immediately shuts it down. That code is:

server = None
target = messaging.Target(topic=self._get_routing_key(),
                          server=self.id,
                          version=rpc_version.RPC_API_VERSION)
try:
    server = rpc.get_server(target, [])
    server.start()
finally:
    if server is not None:
        server.stop()

As soon as server.stop() returns, the very next thing the taskmanager does is cast the prepare message on the wire. The problem is that server.stop() is not a synchronous call: it returns almost immediately, and the actual shutdown completes in the background some seconds later. Therefore, when the prepare message is cast on the queue, RabbitMQ still considers the taskmanager a viable consumer for that message, since it is still listening on the queue. You can observe this when you attempt to launch a Trove instance; the following shows up in the taskmanager log:

2016-01-12 08:08:13.355 125933 DEBUG oslo_messaging._drivers.amqpdriver [-] CAST unique_id: ff86886711914dee927f2150a91d72c7 exchange 'openstack' topic 'guestagent.aa8ce7f5-2146-43a1-a594-2f66b702e8d7' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
2016-01-12 08:08:13.357 125933 INFO trove.instance.models [-] Resetting task status to NONE on instance aa8ce7f5-2146-43a1-a594-2f66b702e8d7.
2016-01-12 08:08:13.359 125933 DEBUG oslo_messaging._drivers.amqpdriver [-] received message unique_id: ff86886711914dee927f2150a91d72c7 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:195

When the prepare message is cast (at 08:08:13.355), server.stop() has already been called, yet RabbitMQ still regarded the taskmanager as a viable handler; the "received message" entry at 08:08:13.359 shows the message being delivered back to the taskmanager itself. As a result, when the guest agent eventually launches, it takes a full three minutes before RabbitMQ finally decides that the message must be handled by someone else. The issue, therefore, is that server.stop() is not synchronous, and it should be followed by server.wait().
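A sketch of the proposed change in _create_guest_queue() follows. This assumes the object returned by rpc.get_server() is an oslo.messaging MessageHandlingServer, whose wait() blocks until the server has finished stopping; it is a sketch of the idea, not the exact patch.

server = None
target = messaging.Target(topic=self._get_routing_key(),
                          server=self.id,
                          version=rpc_version.RPC_API_VERSION)
try:
    server = rpc.get_server(target, [])
    server.start()
finally:
    if server is not None:
        server.stop()
        # Block until the short-lived server has actually stopped consuming,
        # so the prepare() cast that follows cannot be delivered back to the
        # taskmanager's own (now dead) listener.
        server.wait()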