Comment 7 for bug 1577239

Denis Meltsaykin (dmeltsaykin) wrote:

Let me describe my findings from the logs and my thoughts on the topic.

1. Steps to reproduce: Bartosz, we started dozens of VMs in parallel and this worked like a charm on a fully operational OpenStack environment. The behavior you observe is an effect of the issue, not "steps to reproduce": failures during VM startup are the outcome of the issue, not its root cause.

2. As per comment #2, these "crashes" are harmless, and the fix you are asking to backport is a purely cosmetic change; it will not give you stability. The root cause of the issue is not in these crashes, the crashes are only a signal of it.

3. Digging through the logs, I found the following:

   * There were about 1.2 million outstanding messages in the queues before the crash, mostly in the scheduler_fanout_* queues (a diagnostic sketch follows this list).
   * The first cluster partition was caused by a missed net_tick (heartbeat) that arrived too late: node-3 was marked "down" on node-2, while node-1 received the tick on time. This caused node-2 to intentionally disconnect from node-1, creating a network partition.
   * Pacemaker, via the OCF script, noticed the network partition and started to reassemble the cluster.
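
For reference, here is a rough sketch of how such a state can be spotted on a controller node. The queue-name pattern and the backlog threshold are illustrative values, not something taken from the affected environment:

    # List queues with their message counts and highlight deep backlogs
    # (the 10000 threshold is arbitrary, pick whatever makes sense)
    rabbitmqctl list_queues name messages | awk '$2 > 10000'

    # Focus on the scheduler fanout queues mentioned above
    rabbitmqctl list_queues name messages | grep scheduler_fanout_

    # The "partitions" section of the cluster status shows split nodes, if any
    rabbitmqctl cluster_status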

From my point of view, these findings together constitute the root cause of the issue. We discussed them with the oslo team, and here is what can be done to avoid this in the future:

   * increase the net_ticktime parameter of RabbitMQ/Erlang to 60s (the Erlang default); see the sketch after this list
   * introduce an expiration policy on RabbitMQ's queues and exchanges, which should help against ever-growing fanout queues (also sketched below)
   * increase the number of rabbitmqctl timeouts that the pacemaker OCF script tolerates (https://docs.mirantis.com/openstack/fuel/fuel-7.0/operations.html#how-to-make-rabbitmq-ocf-script-tolerate-rabbitmqctl-timeouts)
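
To make the first two items more concrete, here is a minimal sketch. The config file location, the queue-name pattern and the TTL values are assumptions about a typical controller node, not tested recommendations; the policy should be applied to whatever vhost OpenStack actually uses:

    # 1) Raise the Erlang distribution tick interval back to the Erlang
    #    default of 60s by adding a kernel setting to
    #    /etc/rabbitmq/rabbitmq.config and restarting the broker:
    #        [{kernel, [{net_ticktime, 60}]}].
    #
    # 2) Add an expiration policy so idle fanout queues and stale messages
    #    are dropped instead of piling up (TTLs below are examples only)
    rabbitmqctl set_policy expire_fanouts '^scheduler_fanout_' \
        '{"expires":600000, "message-ttl":120000}' --apply-to queues

The policy in particular targets the scheduler_fanout_* backlog seen in the logs; the pattern would have to match whichever fanout queues actually grow in a given deployment.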

None of these workarounds has been properly tested, and they change the reference design of the product. Including them in an MU is subject to a wider discussion.