Comment 1 for bug 1338732

Revision history for this message
Noel Burton-Krahn (noelbk) wrote :

The code above was missing a call to acknowledge(), so it kept reading stale messages. However, with acknowedge(), it still reproduces the original bug. Here's how I reproduce the original problem. I verify the cluster is working, then kill rabbit, then try again.

    1. start cluster, create vms, migrate: ok
    2. kill and restart rabbit
    3. migrate vm: timeout

Here's a trace from https://gist.github.com/noelbk/619426fa88c3bdd0534c after rabbit was restarted. This
pattern repeats a few times during a migration after rabbit was
restarted:

    19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047
    19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047

I'm instrumenting the rpc calls now to see if they all actually do retry and complete after the timeout errors. I'm trying to get a trace of all the rpc calls to see if they're being acknowledged but not replied to in time.

While digging through oslo.messaging, I noticed that in amqpdriver.py, the incoming queues in ReplyWaiter and AMQPListener are plain arrays, not thread-safe Queues. ReplyWaiter does acquire a lock, but I'm not sure if the plain arrays are always thread-safe. Not sure if this is causing my issue.

Still digging...