The code above was missing a call to acknowledge(), so it kept reading stale messages. However, with acknowedge(), it still reproduces the original bug. Here's how I reproduce the original problem. I verify the cluster is working, then kill rabbit, then try again.
1. start cluster, create vms, migrate: ok
2. kill and restart rabbit
3. migrate vm: timeout
19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c00b0b0e55579b8e047
19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c00b0b0e55579b8e047
I'm instrumenting the rpc calls now to see if they all actually do retry and complete after the timeout errors. I'm trying to get a trace of all the rpc calls to see if they're being acknowledged but not replied to in time.
While digging through oslo.messaging, I noticed that in amqpdriver.py, the incoming queues in ReplyWaiter and AMQPListener are plain arrays, not thread-safe Queues. ReplyWaiter does acquire a lock, but I'm not sure if the plain arrays are always thread-safe. Not sure if this is causing my issue.
The code above was missing a call to acknowledge(), so it kept reading stale messages. However, with acknowedge(), it still reproduces the original bug. Here's how I reproduce the original problem. I verify the cluster is working, then kill rabbit, then try again.
1. start cluster, create vms, migrate: ok
2. kill and restart rabbit
3. migrate vm: timeout
Here's a trace from https:/ /gist.github. com/noelbk/ 619426fa88c3bdd 0534c after rabbit was restarted. This
pattern repeats a few times during a migration after rabbit was
restarted:
19:24:56 10.35.0.13 sends _msg_id = 29c6579c5de24c0 0b0b0e55579b8e0 47 0b0b0e55579b8e0 47 0b0b0e55579b8e0 47 0b0b0e55579b8e0 47
19:24:56 10.35.0.14 receives _msg_id = 29c6579c5de24c0
19:24:56 10.35.0.14 acknowledges _msg_id = 29c6579c5de24c0
19:25:56 10.35.0.13 Timed out waiting for a reply to message ID 29c6579c5de24c0
I'm instrumenting the rpc calls now to see if they all actually do retry and complete after the timeout errors. I'm trying to get a trace of all the rpc calls to see if they're being acknowledged but not replied to in time.
While digging through oslo.messaging, I noticed that in amqpdriver.py, the incoming queues in ReplyWaiter and AMQPListener are plain arrays, not thread-safe Queues. ReplyWaiter does acquire a lock, but I'm not sure if the plain arrays are always thread-safe. Not sure if this is causing my issue.
Still digging...