Comment 8 for bug 1890759

Nobuto Murata (nobuto) wrote :

Just for the record again,

> We found out the underlying bcache device for control plane including RabbitMQ wasn't set as writeback accidentally. So the whole race condition might have been caused by IO contention and starvation. So the new config and the new default value may not be the culprit here.

Even after switching bcache to writeback mode, the deployment still wasn't reliable. With bionic's rabbitmq at least, other services sometimes ended up in an error status. The following change to the charm config made it reliable in the end:

known-wait: 180
queue-master-locator: client-local
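
For anyone who wants to try the same settings, they could be written as a Juju bundle overlay, roughly as sketched here. The application name rabbitmq-server is an assumption; use whatever name the charm is deployed under in your model. Only the two option values come from our deployment.

```yaml
# Sketch of a bundle overlay; "rabbitmq-server" is an assumed application name.
applications:
  rabbitmq-server:
    options:
      # Values taken from the charm config change described above.
      known-wait: 180
      queue-master-locator: client-local
```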

I'm not saying queue-master-locator is the culprit, but we need to keep an eye on it, especially with large-scale deployments.