OpenStack HA Cluster Charm

Bug #1903745
Comment #13

Trent Lloyd (lathiat) wrote on 2020-11-12: Re: upgrade from 1.1.14-2ubuntu1.8 to 1.1.14-2ubuntu1.9 breaks clusters

Analysed the logs for an occurrence of this. The problem appears to be that pacemaker doesn't stop within 1 minute, so systemd gives up and starts a new instance anyway, noting that all of the existing processes are left behind.

I am awaiting the extra rotated logs to confirm, but from what I can see the new pacemaker fails to start because the old one is still running, and then the old one eventually exits, leaving you with no instance of pacemaker (which is the state we found it in: pacemaker was stopped).

06:13:44 systemd[1]: pacemaker.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
06:13:44 pacemakerd[427]: notice: Caught 'Terminated' signal
06:14:44 systemd[1]: pacemaker.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
06:14:44 systemd[1]: pacemaker.service: Failed with result 'timeout'.
06:14:44 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 445 (cib) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 449 (attrd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 450 (pengine) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 451 (crmd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 427 (pacemakerd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 447 (stonithd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 448 (lrmd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Failed to reset devices.list: Operation not permitted
06:14:45 systemd[1]: Started Pacemaker High Availability Cluster Manager.
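
For anyone checking their own journal for the same pattern, here is a minimal Python sketch that pulls the timed-out stop states and the left-over processes out of lines like the ones above. The function name `summarize`, the two regexes and the `SAMPLE` list are my own illustration, not part of any pacemaker or systemd tooling, and the patterns assume the message wording matches this excerpt.

```python
import re

# Patterns modelled on the systemd messages in the excerpt above.
# They are assumptions about the wording, not a stable systemd interface.
TIMEOUT_RE = re.compile(r"State '(?P<state>[\w-]+)' timed out")
LEFTOVER_RE = re.compile(
    r"Found left-over process (?P<pid>\d+) \((?P<name>[^)]+)\) in control group"
)


def summarize(lines):
    """Return (timed_out_states, leftovers) found in the given journal lines.

    timed_out_states is a list of state names in log order;
    leftovers maps PID -> process name.
    """
    timed_out = []
    leftovers = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.append(match.group("state"))
        match = LEFTOVER_RE.search(line)
        if match:
            leftovers[int(match.group("pid"))] = match.group("name")
    return timed_out, leftovers


# A few lines copied from the excerpt above.
SAMPLE = [
    "06:13:44 systemd[1]: pacemaker.service: State 'stop-sigterm' timed out. Skipping SIGKILL.",
    "06:14:44 systemd[1]: pacemaker.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.",
    "06:14:45 systemd[1]: pacemaker.service: Found left-over process 445 (cib) in control group while starting unit. Ignoring.",
    "06:14:45 systemd[1]: pacemaker.service: Found left-over process 427 (pacemakerd) in control group while starting unit. Ignoring.",
]

if __name__ == "__main__":
    states, leftovers = summarize(SAMPLE)
    print("timed-out states:", states)
    print("left-over processes:", leftovers)
```

If both a timed-out stop state and a left-over `pacemakerd` show up just before a "Started" line, the node has probably hit the sequence described here.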

Likely the solution here is some combination of tweaking the systemd config to wait longer, force kill if necessary, and possibly reap all processes if it does force a restart. It's not a native systemd unit, though some of this can be tweaked via comments. I'll look a little further at that.
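
To make the "wait longer, then force kill" idea concrete, here is a rough Python sketch of that escalation for a single child process. It is only an illustration of the pattern, not the systemd configuration change itself: `stop_with_escalation` and `spawn_stubborn_child` are hypothetical names, the stand-in child simply ignores SIGTERM, and nothing here reaps the rest of a control group. It assumes a POSIX system.

```python
import subprocess
import sys


def stop_with_escalation(proc, term_timeout):
    """Ask proc to stop with SIGTERM; if it is still alive after
    term_timeout seconds, send SIGKILL. Returns "terminated" or "killed".
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, cannot be ignored
        proc.wait()  # collect the exit status so no zombie is left
        return "killed"


def spawn_stubborn_child():
    """Start a stand-in child that ignores SIGTERM (hypothetical test helper)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    return proc


if __name__ == "__main__":
    child = spawn_stubborn_child()
    print(stop_with_escalation(child, term_timeout=1.0))
    child.stdout.close()
```

The point is that the old instance is guaranteed to be gone before anything new is started, which is the step that did not happen in the log above, where systemd skipped SIGKILL and started the unit with the old processes still present.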
