Comment 7 for bug 1902793

Trent Lloyd (lathiat) wrote :

I've seen similar behaviour. One thing I found is that while known-wait/modulo-nodes is used to stagger restarts by 30 seconds between units, the restart work often takes longer than 30 seconds, so it collides with the other units. I also found that the extra work done by the leader, which is often the first unit in that list, takes even longer, making a collision with the other units even more likely.
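To illustrate the collision, here is a rough sketch of the timing. It assumes each unit's delay is (unit number % modulo-nodes) * known-wait, which is my understanding of how the stagger is calculated and may not match the charm exactly. The function names and the per-unit work durations are made up for illustration.

```python
def restart_windows(unit_numbers, modulo_nodes, known_wait, work_seconds):
    """Return {unit: (start, end)} for each unit's restart window, in seconds.

    Assumption: a unit starts after (unit % modulo_nodes) * known_wait seconds.
    work_seconds maps each unit to how long its restart work takes
    (hypothetical figures, not measurements).
    """
    windows = {}
    for unit in unit_numbers:
        start = (unit % modulo_nodes) * known_wait
        windows[unit] = (start, start + work_seconds[unit])
    return windows


def overlapping_pairs(windows):
    """Return the pairs of units whose restart windows overlap in time."""
    units = sorted(windows)
    pairs = []
    for i, a in enumerate(units):
        for b in units[i + 1:]:
            a_start, a_end = windows[a]
            b_start, b_end = windows[b]
            # Two half-open intervals overlap if each starts before the other ends.
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs


# Unit 0 plays the leader and takes longer; the others take 45s each.
slow = restart_windows([0, 1, 2], modulo_nodes=3, known_wait=30,
                       work_seconds={0: 75, 1: 45, 2: 45})
# If every unit finished within 20s, the 30s stagger would be enough.
fast = restart_windows([0, 1, 2], modulo_nodes=3, known_wait=30,
                       work_seconds={0: 20, 1: 20, 2: 20})
```

With those made-up durations every pair of units overlaps in the slow case, and none do in the fast case.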

I would suggest that, absent coordinating the restarts between nodes, each unit could wait up to a longer default timeout (maybe 60-180 seconds) for the cluster to be healthy and all queues to be synchronised before doing a local restart. Or at least wait for the number of nodes online and the number of queues synchronising to stop changing for 30+ seconds, indicating the cluster is likely in a steady state.
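A minimal sketch of the "stop changing for 30+ seconds" idea, not the charm's actual code. The probe callable is hypothetical: it stands in for whatever returns the current (nodes online, queues synchronising) pair, for example by parsing rabbitmqctl output. The clock and sleep parameters are only there so the loop can be tested without real waiting.

```python
import time


def wait_for_steady_state(probe, stable_for=30.0, timeout=180.0,
                          poll_interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Wait until probe() has returned the same value for stable_for seconds.

    probe is a hypothetical callable returning a comparable snapshot of
    cluster state, e.g. (nodes_online, queues_synchronising).
    Returns True once the snapshot has been unchanged for stable_for
    seconds, or False if timeout seconds pass first.
    """
    deadline = clock() + timeout
    last = probe()
    stable_since = clock()
    while True:
        now = clock()
        if now - stable_since >= stable_for:
            return True
        if now >= deadline:
            return False
        sleep(poll_interval)
        current = probe()
        if current != last:
            # State changed, so restart the stability timer.
            last = current
            stable_since = clock()
```

A unit would call this before its local restart and go ahead when it returns True. What to do on False (restart anyway, or defer) is a separate policy decision.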