Comment 15 for bug 1903745

Trent Lloyd (lathiat) wrote : Re: upgrade from 1.1.14-2ubuntu1.8 to 1.1.14-2ubuntu1.9 breaks clusters

For the fix to Bug #1654403, charm-hacluster sets TimeoutStartSec and TimeoutStopSec to the same values for both corosync and pacemaker.

system-wide default (xenial, bionic): TimeoutStopSec=90s TimeoutStartSec=90s
corosync package default: system-wide default (no changes)
pacemaker package default: TimeoutStopSec=30min TimeoutStartSec=60s

charm-hacluster corosync+pacemaker override: TimeoutStopSec=60s TimeoutStartSec=180s

effective changes:
corosync TimeoutStopSec=90s -> 60s TimeoutStartSec=90s -> 180s
pacemaker TimeoutStopSec=30min -> 60s TimeoutStartSec=60s -> 180s
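
To double-check the arithmetic above, here is a minimal sketch that layers the three sources of settings (system-wide default, package default, charm override) and reports what changes per service. The dictionary and function names are made up for illustration; the numbers are only the ones quoted in this comment, converted to seconds.

```python
# Timeout values in seconds, taken from the figures quoted in this comment.
SYSTEM_DEFAULT = {"TimeoutStartSec": 90, "TimeoutStopSec": 90}
PACKAGE_DEFAULTS = {
    "corosync": {},  # no changes from the system-wide default
    "pacemaker": {"TimeoutStartSec": 60, "TimeoutStopSec": 30 * 60},
}
CHARM_OVERRIDE = {"TimeoutStartSec": 180, "TimeoutStopSec": 60}


def effective(service, with_charm):
    """Merge settings in precedence order: system < package < charm."""
    merged = dict(SYSTEM_DEFAULT)
    merged.update(PACKAGE_DEFAULTS[service])
    if with_charm:
        merged.update(CHARM_OVERRIDE)
    return merged


def changes(service):
    """Return {setting: (before, after)} for settings the charm alters."""
    before = effective(service, with_charm=False)
    after = effective(service, with_charm=True)
    return {k: (before[k], after[k]) for k in sorted(before) if before[k] != after[k]}


for name in ("corosync", "pacemaker"):
    print(name, changes(name))
```

The pacemaker stop timeout is the outlier: it drops from 1800 seconds to 60, while every other value moves by a factor of two or less.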

The original bug description was "On corosync restart, corosync may take longer than a minute to come up. The systemd start script times out too soon. Then pacemaker which is dependent on corosync is immediatly started and fails as corosync is still in the process of starting."

So the TimeoutStartSec increase from 60s/90s to 180s was the only change needed. I believe the TimeoutStopSec change for pacemaker was made in error, at least going by the bug as described.
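
As a rough illustration of what a start-only override could look like, the sketch below renders the body of a systemd drop-in that sets TimeoutStartSec and leaves TimeoutStopSec at the package default unless one is passed explicitly. The function name is hypothetical and this is not how charm-hacluster generates its override; it only shows the shape of the file.

```python
def render_dropin(start_timeout_sec, stop_timeout_sec=None):
    """Render a systemd drop-in body (hypothetical helper, not charm code).

    When stop_timeout_sec is None, TimeoutStopSec is omitted so the
    package or system-wide default stays in effect.
    """
    lines = ["[Service]", f"TimeoutStartSec={start_timeout_sec}"]
    if stop_timeout_sec is not None:
        lines.append(f"TimeoutStopSec={stop_timeout_sec}")
    return "\n".join(lines) + "\n"


# Start-only override: pacemaker would keep its 30min stop timeout.
print(render_dropin(180))

# The override as currently applied by the charm, for comparison.
print(render_dropin(180, 60))
```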

Having said that, I can imagine charm failures during deployment or reconfiguration where the charm tries to stop pacemaker and it fails to stop fast enough because the resources won't migrate away. That could happen because all the nodes are trying to stop at the same time: charm-hacluster doesn't seem to stagger changes across units, and it currently restarts corosync to apply changes to the ring. So the shorter stop timeout may well have fixed other charm-related problems that the previous bug did not accurately describe. That bug does specifically mention cases where the expected cluster_count is not set, in which case the charm tries to set up corosync/pacemaker before all 3 nodes are up, which might lead to this scenario. So before we go ahead and change the stop_timeout back to 30min, we probably need to validate various scenarios for that issue.