Comment 15 for bug 1463802

Roman Podoliaka (rpodolyaka) wrote :

Upon request from Eugene Bogdanov, I'll provide a summary here.

User impact:

Occasionally, after a successful failover of the RabbitMQ server, OpenStack services may start failing RPC calls with messages like the following in the logs: "Queue not found:Basic.consume: (404) NOT_FOUND - no queue 'reply_f7cac1a2428d414bb8b9e0a612". This happens, though infrequently, when one or more controllers are brought down. The workaround is to restart the affected service; it becomes operational immediately after the restart.

What we've done so far:

- analyzed the existing logs from reproductions (Artem Panchenko's and Leonid Istomin's environments)
- audited the oslo.messaging code to check whether failover is handled correctly (it should be: queues are explicitly redeclared on reconnect)
- tried to reproduce the issue 'synthetically' without MOS on a RabbitMQ cluster (as well as on a one-node RabbitMQ server), without any luck - oslo.messaging performs as expected and redeclares all the queues it uses
- tried to reproduce the issue on a MOS installation with oslo.messaging debug logs enabled - unfortunately, the whole cluster went into a weird state, with both the Galera and RabbitMQ clusters failing to start
- reproduced the issue once on a small MOS installation (3 controllers + 2 computes), but with oslo.messaging debug logs disabled - we are confident in what we observed (a reply queue is missing and is not re-created on the next RPC call, leaving the affected service unable to make RPC calls until it is restarted or a reconnect is triggered), despite what Alexey/Bogdan suggested in the comments
- when we managed to reproduce the issue once, we tried to trigger a reconnect (gdb -p $PID; call close($FD)) - oslo.messaging correctly reconnected and redeclared all the queues

The plan is to get a simple repro with oslo.messaging debug logs enabled (ideally without MOS at all, just plain oslo.messaging) and find out why the queue redeclare code path may not be executed properly on reconnect after a failover.
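
To make the suspected failure mode concrete, here is a toy in-memory model (not oslo.messaging or RabbitMQ code; ToyBroker, ToyClient and the redeclare_on_reconnect flag are hypothetical names invented for illustration). It only shows the difference between a client that redeclares its queues on reconnect and one that, for whatever reason, skips that step:

```python
# Toy model only: these classes are illustrative assumptions and do not
# mirror the real oslo.messaging or RabbitMQ APIs.

class QueueNotFound(Exception):
    pass


class ToyBroker:
    def __init__(self):
        self.queues = set()

    def declare(self, name):
        self.queues.add(name)

    def consume(self, name):
        if name not in self.queues:
            raise QueueNotFound("(404) NOT_FOUND - no queue '%s'" % name)
        return "consuming %s" % name

    def failover(self):
        # Assumption: the failover loses every queue the client declared.
        self.queues.clear()


class ToyClient:
    def __init__(self, broker, redeclare_on_reconnect=True):
        self.broker = broker
        self.redeclare_on_reconnect = redeclare_on_reconnect
        self.declared = []  # queues this client is responsible for

    def declare(self, name):
        self.declared.append(name)
        self.broker.declare(name)

    def reconnect(self):
        # Expected behaviour: redeclare everything we had before.
        if self.redeclare_on_reconnect:
            for name in self.declared:
                self.broker.declare(name)

    def consume(self, name):
        return self.broker.consume(name)
```

In this model a client created with redeclare_on_reconnect=False keeps raising QueueNotFound after a failover until something redeclares the queue, which is the same shape as the symptom described above.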

Possible workarounds:

1) restart the affected services

2) make sure the reply queues are durable and thus survive a RabbitMQ restart (so that even if oslo.messaging fails to redeclare the queues explicitly, they are persisted in RabbitMQ itself). <-- the problem with this is that it does not fix the root cause of the issue
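
As a rough illustration of workaround 2, the following toy model (again not real RabbitMQ or oslo.messaging code; ToyBroker and its methods are hypothetical) treats a restart as dropping every queue that was not declared durable:

```python
# Toy model only: a restart keeps durable queues and forgets the rest.
# Names and behaviour are simplified assumptions for illustration.

class ToyBroker:
    def __init__(self):
        self.queues = {}  # queue name -> durable flag

    def declare(self, name, durable=False):
        self.queues[name] = durable

    def restart(self):
        # Only queues declared as durable survive the restart.
        self.queues = {n: d for n, d in self.queues.items() if d}

    def has_queue(self, name):
        return name in self.queues
```

In this model a durable reply queue is still there after restart() even if the client never redeclares it, which hides the missing redeclare rather than explaining it.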