pacemaker left stopped after unattended-upgrade of pacemaker (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)

Bug #1903745 reported by Laurent Sesquès
This bug affects 9 people
Affects                      Status        Importance  Assigned to   Milestone
OpenStack HA Cluster Charm   Fix Released  Critical    Billy Olsen
pacemaker (Ubuntu)           Invalid       Undecided   Unassigned

Bug Description

On several machines running pacemaker with corosync, after the package was upgraded by unattended-upgrades, the VIPs were gone.
Restarting pacemaker and corosync didn't help, because some processes (lrmd) remained after the stop.
Manually killing the leftover processes allowed pacemaker and corosync to be restarted cleanly.

This is on Ubuntu xenial (EDIT: and bionic)

Tags: sts
Revision history for this message
Dan Ackerson (dan.ackerson) wrote :

We've been working all day on fixing pacemaker/corosync issues on several of our clouds due to this issue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in pacemaker (Ubuntu):
status: New → Confirmed
Revision history for this message
Dan Streetman (ddstreet) wrote :

possibly related to bug 1815101

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

Does anyone have any pertinent log files that were generated during the upgrade process?

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Can someone confirm that the affected systems are running systemd-networkd? That would more strongly suggest that https://bugs.launchpad.net/netplan/+bug/1815101 is related.

Based on the description and the other bug, the security update doesn't seem to have regressed pacemaker; instead, it's a preexisting pacemaker/systemd interaction that causes this problem under certain circumstances for any pacemaker update.

If you believe this assessment is in error, please provide more details. I'm concerned that without resolving the systemd issue, anything we do with the pacemaker packaging in an SRU/follow-up update would simply reintroduce the condition that prompted filing this bug.

Revision history for this message
Zachary Zehring (zzehring) wrote :

There was a mix of bionic and xenial clouds that experienced this issue. Doesn't look to be tied to distro, so not all affected systems were running systemd-networkd.

Revision history for this message
Zachary Zehring (zzehring) wrote :

I have uploaded sosreports from 2 lxds that would have been affected by the issue:

https://private-fileshare.canonical.com/~zzehring/sosreport-xenial-2020-11-10.tar.xz
https://private-fileshare.canonical.com/~zzehring/sosreport-bionic-20201110230352.tar.xz

Let me know if you need other logs.

Revision history for this message
Trent Lloyd (lathiat) wrote :

I reviewed the sosreports and provide some general analysis below.

[sosreport-juju-machine-2-lxc-1-2020-11-10-tayyude]

I don't see any sign in this log of package upgrades or VIP stop/starts; I suspect this host may be unrelated.

[sosreport-juju-caae6f-19-lxd-6-20201110230352.tar.xz]

This is a charm-keystone node.

Looking at this sosreport my general finding is that everything worked correctly on this specific host.

unattended-upgrades.log:
We can see the upgrade starts at 2020-11-10 06:17:03 and finishes at "2020-11-10 06:17:48"

syslog.1:
Nov 10 06:17:41 juju-caae6f-19-lxd-6 crmd[41203]: notice: Result of probe operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 7 (not running)
Nov 10 06:19:44 juju-caae6f-19-lxd-6 crmd[41203]: notice: Result of start operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 0 (ok)

We also see that the VIP moved around to different hosts a few times, likely as a result of each host successively upgrading. Which makes sense. I don't see any sign in this log of the mentioned lrmd issue.

[mysql issue]

What we do see, however, is "Too many connections" errors from MySQL in the keystone logs. This generally happens because when the VIP moves from one host to another, all the old connections are left behind and go stale (because the VIP was removed, the traffic for those connections simply disappears and is sent to the new VIP owner, which doesn't have those TCP connections). They sit there until wait_timeout is reached (typically either 180s/3 min or 3600s/1 hour in our deployments), as the node will never get the TCP reset the remote end sends. The problem happens when the VIP fails *back* to a host it already failed away from: many of the connection slots are still used by the stale connections, and you run out of connections if your max_connections limit is not at least double your normal connection count. This problem will eventually resolve itself once the connections time out, but that may take an hour.
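
As a rough way to check whether a node is close to that limit, the live connection count can be compared against the configured limits on the mysql nodes. These are illustrative commands only; exact limits and credentials depend on the deployment:

$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'"
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'"
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'wait_timeout'"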

Note that this sosreport is from a keystone node that *also* has charm-hacluster/corosync/pacemaker but the above discussed mysql issue would have occurred on the percona mysql nodes. To analyse the number of failovers we would need to get sosreports from the mysql node(s).

[summary]

From what has been described so far, I think there are likely two potential issues here.

Firstly, the networkd issue is likely not related to this specific case: it happens specifically when systemd is upgraded and networkd is therefore restarted, which shouldn't have happened here.

(Issue 1) The first is that we hit max_connections due to the multiple successive MySQL VIP failovers where max_connections is not at least 2x the steady-state connection count. It also seems possible in some cases that the VIP may shift back to the same host a third time by chance, and you may end up needing 3x. I think we could potentially improve that by modifying the pacemaker resource scripts to kill active connections when the VIP departs, or by ensuring that max_connections is 2-3x the steady-state active connection count. That should go into a new bug, likely against charm-percona-cluster as it ships its own resource agent. We...


Revision history for this message
Ante Karamatić (ivoks) wrote :

A big problem with the corosync and pacemaker packages is that they don't put resources into unmanaged state before the upgrade. Hitting an issue like this is also tightly related to the configuration of pacemaker/corosync. Upgrading corosync without stopping pacemaker (and therefore lrmd) has a high probability of resulting in this behaviour. In the end, corosync is only a messaging bus.

So the upgrade procedure should really be:
1) put pacemaker's resource into unmanaged mode
2) stop pacemaker
3) upgrade corosync
4) restart corosync
5) start pacemaker
6) put resources into managed mode

Steps 3 & 4 are done automatically by the corosync package upgrade, but all the other steps are left to the operator. If the upgrade is done by unattended-upgrades, this will quite likely happen again. (A rough sketch of the full procedure follows.)
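
As an illustration only, a manual run of the procedure above using crmsh (assuming crmsh is installed; pcs has equivalents) might look roughly like this; it is a sketch, not a verified procedure:

$ sudo crm configure property maintenance-mode=true    # 1) resources become unmanaged
$ sudo systemctl stop pacemaker                        # 2) stop pacemaker
$ sudo apt-get install --only-upgrade corosync         # 3+4) package upgrade restarts corosync
$ sudo systemctl start pacemaker                       # 5) start pacemaker
$ sudo crm configure property maintenance-mode=false   # 6) resources managed again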

Revision history for this message
James Troup (elmo) wrote :

Er, but unattended upgrades are on by default?

Revision history for this message
Ante Karamatić (ivoks) wrote :

Well, yes. Therefore, if unattended-upgrades wants to update things at random times and mimic an operator, it should act as an operator would, not do the job partially. I guess what I'm saying is that this is a bug in unattended-upgrades rather than in pacemaker/corosync. It should have a way of defining an upgrade procedure, or at least blacklist corosync and pacemaker and the many other services that require updates in a specific order/way.
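
For reference, unattended-upgrades does already support excluding packages; a minimal sketch of such an exclusion (the exact file and syntax should be checked against the installed unattended-upgrades version) would be:

// e.g. in /etc/apt/apt.conf.d/50unattended-upgrades (or a local override snippet)
Unattended-Upgrade::Package-Blacklist {
    "corosync";
    "pacemaker";
};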

Revision history for this message
Felipe Reyes (freyes) wrote :

Just some history, in the past we attempted to disable unattended-upgrades (as a config option) in the hacluster charm, but it was decided that it wasn't the right place to get this addressed.

Bug https://bugs.launchpad.net/charm-hacluster/+bug/1826898

tags: added: sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

I analysed the logs for an occurrence of this. The problem appears to be that pacemaker doesn't stop within 1 minute, so systemd gives up and just starts a new instance anyway, noting that all of the existing processes are left behind.

I am awaiting the extra rotated logs to confirm, but from what I can see the new pacemaker basically fails to start because the old one is still running, and then the old one eventually exits, leaving you with no instance of pacemaker (which is the state we found it in: pacemaker was stopped).

06:13:44 systemd[1]: pacemaker.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
06:13:44 pacemakerd[427]: notice: Caught 'Terminated' signal
06:14:44 systemd[1]: pacemaker.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
06:14:44 systemd[1]: pacemaker.service: Failed with result 'timeout'.
06:14:44 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 445 (cib) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 449 (attrd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 450 (pengine) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 451 (crmd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 427 (pacemakerd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 447 (stonithd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Found left-over process 448 (lrmd) in control group while starting unit. Ignoring.
06:14:45 systemd[1]: pacemaker.service: Failed to reset devices.list: Operation not permitted
06:14:45 systemd[1]: Started Pacemaker High Availability Cluster Manager.

Likely the solution here is some combination of tweaking the systemd config to wait longer, force kill if necessary and possibly reap all processes if it does force a restart. It's not a native systemd unit though some of this stuff can be tweaked by comments. I'll look a little further at that.

Revision history for this message
Trent Lloyd (lathiat) wrote :

I misread and the systemd unit is native, and it already sets the following settings:
SendSIGKILL=no
TimeoutStopSec=30min
TimeoutStartSec=60s

The problem is that most of these failures have been experienced on juju hacluster charm installations, which override these values:

$ cat ./systemd/system/pacemaker.service.d/overrides.conf
[Service]
TimeoutStartSec=180
TimeoutStopSec=60

This was apparently done to fix the following bug:
https://bugs.launchpad.net/charms/+source/hacluster/+bug/1654403

FWIW these values are configurable via charm config options. It seems that earlier fix needs to be revisited, and/or this bug may need to be retargeted at least in part to charm-hacluster.
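
To confirm which values are actually in effect for a given unit (package default vs. charm override), the merged configuration can be queried from systemd, for example:

$ systemctl show pacemaker.service -p TimeoutStartUSec -p TimeoutStopUSec -p SendSIGKILL
$ systemctl cat pacemaker.service    # prints the unit file plus any drop-in overrides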

Revision history for this message
Trent Lloyd (lathiat) wrote :

For the fix to Bug #1654403 charm-hacluster sets TimeoutStartSec and TimeoutStopSec for both corosync and pacemaker, to the same value.

system-wide default (xenial, bionic): TimeoutStopSec=90s TimeoutStartSec=90s
corosync package default: system-wide default (no changes)
pacemaker package default: TimeoutStopSec=30min TimeoutStartSec=60s

charm-hacluster corosync+pacemaker override: TimeoutStopSec=60s TimeoutStartSec=180s

effective changes:
corosync TimeoutStopSec=90s -> 60s TimeoutStartSec=90s -> 180s
pacemaker TimeoutStopSec=30min -> 60s TimeoutStartSec=60s -> 180s

The original bug description was "On corosync restart, corosync may take longer than a minute to come up. The systemd start script times out too soon. Then pacemaker which is dependent on corosync is immediatly started and fails as corosync is still in the process of starting."

So the TimeoutStartSec increase from 60/90 -> 180 was the only thing needed. I believe the TimeoutStopSec change for pacemaker is in error at least as the bug is described.

Having said that, I can imagine charm failures during deployment or reconfiguration where it tries to stop pacemaker for various reasons and fails to stop fast enough because the resources won't migrate away (possibly because all the nodes are trying to stop at the same time, as charm-hacluster doesn't seem to stagger its changes), and it currently restarts corosync to effect changes to the ring. So this may well have fixed other charm-related problems not really accurately described in the previous bug. That bug does specifically mention cases where the expected cluster_count is not set - in that case it tries to set up corosync/pacemaker before all 3 nodes are up - which might get into this scenario. So before we go ahead and change the stop timeout back to 30min, we probably need to validate the various scenarios for that issue.

Trent Lloyd (lathiat)
Changed in charm-hacluster:
status: New → Confirmed
Changed in pacemaker (Ubuntu):
status: Confirmed → Invalid
summary: - upgrade from 1.1.14-2ubuntu1.8 to 1.1.14-2ubuntu1.9 breaks clusters
+ pacemaker left stopped after unattended-upgrade of pacemaker
+ (1.1.14-2ubuntu1.8 -> 1.1.14-2ubuntu1.9)
Trent Lloyd (lathiat)
description: updated
Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I have an environment with the same behavior in a Bionic VM that does not use the charm.

Preparing to unpack .../pacemaker_1.1.18-0ubuntu1.3_amd64.deb ...
Unpacking pacemaker (1.1.18-0ubuntu1.3) over (1.1.18-0ubuntu1.1) ...
Log ended: 2020-11-11 06:54:05

After the upgrade, there was an IP(master/vip) switch between the nodes that left the environment unavailable.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

I'm not sure why the pacemaker task was marked as invalid. The issues that Trent identified in comment #15 are a problem, but I'm not entirely convinced that it's *the* problem that was encountered here (as also evidenced by Pedro in comment #16).

While packages do typically restart services automatically, not all package upgrades trigger this particular behavior. For example, the ceph packages upgrade the software but do not actually restart the services, as that could be very disruptive to the storage provided.

I agree with Ante in comment #9 that the packages aren't doing all of the steps necessary to properly manage the upgrade. While I'm not necessarily convinced this is something that should be handled by unattended-upgrades, I'm also not convinced that it will be easy to add this logic to the packages. For example, pacemaker depends on corosync, but so does dlm-controld, and it is not reasonable for the corosync package to make assumptions about how to treat software built on top of it.

Rather, I think we should consider changing the behavior of the corosync/pacemaker packages so that they do not automatically restart the services, or providing an option by which the operator can control this behavior.
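
One existing mechanism in that general direction (not specific to pacemaker, and only a sketch of the idea, not a recommendation) is a policy-rc.d helper, which invoke-rc.d consults before acting on services during package operations:

#!/bin/sh
# sketch of /usr/sbin/policy-rc.d: exit code 101 tells invoke-rc.d that the
# requested action on these services is forbidden during package operations
case "$1" in
    pacemaker|corosync) exit 101 ;;
esac
exit 0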

Changed in pacemaker (Ubuntu):
status: Invalid → Confirmed
Changed in charm-hacluster:
assignee: nobody → Billy Olsen (billy-olsen)
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I should note that one of the reasons I do not suspect the charm is at fault here is that in the bionic sosreport linked in the bug, I do not see the delay that I would expect in a timeout scenario. If the stop timeout were coming into play, I would expect to see a long duration on the restart.

However, we can see the package was upgraded at 06:17:34 :

...
Start-Date: 2020-11-10 06:17:34
Commandline: /usr/bin/unattended-upgrade
Upgrade: pacemaker:amd64 (1.1.18-0ubuntu1.1, 1.1.18-0ubuntu1.3)
End-Date: 2020-11-10 06:17:36
...

The pacemaker service is sent a SIGTERM at 06:17:34 as part of the restart:

Nov 10 06:17:34 [51765] juju-caae6f-19-lxd-6 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)

and is restarted at 06:17:35

Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
Nov 10 06:17:36 [41195] juju-caae6f-19-lxd-6 pacemakerd: error: sysrq_init: Cannot write to /proc/sys/kernel/sysrq: Permission denied (13)

This doesn't look like it's hindered by the timeout configuration.

Revision history for this message
Trent Lloyd (lathiat) wrote :

With regards to Billy's comment #18, my analysis of that bionic sosreport is in comment #8, where I found that specific sosreport didn't experience this issue - but most likely that node was suffering from the issue occurring on the MySQL nodes it was connected to, and the service couldn't connect to MySQL as a result. We'd need the full logs (sosreport --all-logs) from all related keystone nodes and mysql nodes in the environment to be sure, but I am 95% sure that is the case there.

I think there is some argument to be made for improving the package restart process for the pacemaker package itself; however, based on the logs here and in a couple of environments I analysed, I am finding that the primary problem is specifically related to the reduced stop timeout set by charm-hacluster. So I think we should focus on that issue here, and if we decide it makes sense to make improvements to the pacemaker package process itself, that should be opened as a separate bug, as I haven't seen any evidence of that issue in the logs here so far.

For anyone else experiencing this bug, please take a *full* copy of /var/log (or sosreport --all-logs) from -all- nodes in that specific pacemaker cluster and upload them and I am happy to analyse them - if you need a non-public location to share the files feel free to e-mail them to me. It would be great to receive that from any nodes already recovered so we can ensure we fully understand all the cases that happened.

Revision history for this message
Trent Lloyd (lathiat) wrote :

For clarity my findings so far are that:
 - The package upgrade stops pacemaker

 - After 30 seconds (customised down from 30min by charm-hacluster), the stop times out and pretends to have finished, but leaves pacemaker running (due to SendSIGKILL=no in the .service intentionally set upstream to prevent fencing)

 - Pacemaker is started again, but fails to start because the old copy is still running, so exits and the systemd service is left 'stopped'

 - The original "unmanaged" pacemaker copy eventually exits sometime later (usually once the resources have all transitioned away), leaving no running pacemaker at all

Compounding this issue is that:
 - Pacemaker won't stop until it confirms all local services have stopped and transitioned away to other nodes (and possibly that it won't destroy quorum by going down, but I am not sure about that bit) - in some cases this just takes more than 30 seconds, in other cases the cluster may be in such a state that it will never happen, e.g. another node was already down or trying to shut down.

 - All unattended-upgrades happen within a randomized 60 minute window (apt-daily-upgrade.timer), and they all just try to stop pacemaker without regard to whether that is possible or likely to succeed - after a while all three nodes will be attempting to stop, so none of them will succeed.
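
For reference, that window comes from the timer's RandomizedDelaySec setting; it can be inspected with, for example:

$ systemctl cat apt-daily-upgrade.timer          # look for RandomizedDelaySec
$ systemctl list-timers apt-daily-upgrade.timer  # next/last activation times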

Current Thoughts:
 - Adjust the charm-hacluster StopTimeout=30 back to some value (possibly the default) after testing this does not break the charm from doing deploy/scale-up/scale-down [as noted in previous bugs where it was originally added, but the original case was supposedly fixed by adding the cluster_count option].

 - Consider whether we need to override SendSIGKILL in the charm - changing it as a global package default seems like a bad idea

 - Research an improvement to the pacemaker dpkg scripts to do something smarter than just running stop; for example, the preinst script could ask for a transition away without actually running stop on pacemaker (a rough sketch of that idea follows this list) and/or abort the upgrade if it is obvious that the transition will fail.

 - As a related note, the patch to set BindsTo=corosync on pacemaker.service was removed in Groovy due to debate with Debian over this change (but still exists in Xenial-Focal). This is something that will need to be dealt with for the next LTS. This override should probably be added to charm-hacluster at a minimum.
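
As a rough illustration only of the "transition away without running stop" idea, using crmsh syntax (assuming crmsh is available; this is a sketch, not a proposed implementation for the dpkg scripts):

$ sudo crm node standby     # move resources off this node; pacemaker itself keeps running
$ sudo apt-get install --only-upgrade pacemaker
$ sudo crm node online      # allow resources back once the upgrade completes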

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Based on discussion with Trent, who has access to more logs and data than I currently do for this, all signs are indeed pointing to the override timeouts provided by the charm itself.

A viable work-around to prevent this is to tweak the service_stop_timeout config option on the hacluster charm to be higher than the 60 second default. Setting it to 1800 would restore the package's default.
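
For example (assuming the application is deployed under the name 'hacluster'; adjust to the actual application name in the model):

$ juju config hacluster service_stop_timeout=1800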

I am also going to invalidate the pacemaker task, as this wasn't caused by the change to pacemaker; a more targeted bug to tweak whether the packaging starts/stops the service should be raised instead.

An investigation on possible alternatives for dealing with the upgrades and maintenance mode of the cluster should be pursued outside the bounds of this particular bug.

As a work-around is available, I'll reduce subscribe field-high/remove field-critical while working on a patch to change the service timeout defaults.

Changed in pacemaker (Ubuntu):
status: Confirmed → Invalid
Changed in charm-hacluster:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/763077

Changed in charm-hacluster:
status: Confirmed → In Progress
Revision history for this message
Martin Benicek (umabeni) wrote :

Hi, when can we expect the pacemaker 1.10 version? I work for a company developing and delivering cloud software, and upgrading pacemaker to version 1.9 is causing VMs to crash. The problem occurs very frequently (8-9 times out of 10).
Our temporary solution is that we have blacklisted pacemaker 1.9, so we are staying on 1.8 and waiting for 1.10.
thanks
Martin

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Martin - hi, given the state of the bug it isn't considered an issue/fixable-bug in pacemaker and a fix will be deployed via the charms controlling it in this case.

I must admit I don't know enough about your particular case, but is it really the very same one?
Would it make sense to report a new bug with plenty of detail on the setup, what/why it happens, and what fails?

Revision history for this message
Martin Benicek (umabeni) wrote :

Hello,
I cannot share the sos report due to sensitive information. The symptoms are as follows: updating pacemaker on Ubuntu 16.04 to pacemaker 1.9 causes the VMs to crash. Also, the pacemaker error log and syslog are flooded with thousands of messages per second - the disk fills up quite quickly.
'reboot -f' helps (the VMs boot as well). An ordinary reboot (without -f) is stuck for about 30 minutes, then the machine boots and the problem is gone.

snippet of logs from syslog:
194551 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition aborted by lrm_rsc_op.te-res_last_failure_0: Event failed (cib=0.0.0, source=match_graph_event:381, path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rte_res']/lrm_rsc_op[@id='te-res_last_failure_0'], 0)
194552 Jan 8 09:14:31 te-primary crmd[3948]: warning: FSA: Input I_FAIL from get_lrm_resource() received in state S_TRANSITION_ENGINE
194553 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition aborted: Peer Cancelled (source=do_te_invoke:161, 0)
194554 Jan 8 09:14:31 te-primary crmd[3948]: notice: Transition 6451494 (Complete=3, Pending=0, Fired=0, Skipped=4, Incomplete=28, Source=/var/lib/pacemaker/pengine/pe-input-19.bz2): Stopped
194555 Jan 8 09:14:31 te-primary pengine[3947]: notice: On loss of DCM Quorum: Ignore
194556 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start te-res#011(te-primary)
194557 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start fs_res#011(te-primary)
194558 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start VM_DCM_res#011(te-primary)
194559 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start VM_DAC_1_res#011(te-primary)
194560 Jan 8 09:14:31 te-primary pengine[3947]: notice: Start drbd_res:0#011(te-primary)
194561 Jan 8 09:14:31 te-primary pengine[3947]: notice: Promote drbd_res:0#011(Stopped -> Master te-primary)
194562 Jan 8 09:14:31 te-primary pengine[3947]: notice: Calculated Transition 6451495: /var/lib/pacemaker/pengine/pe-input-19.bz2
194563 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.10" subt="request" reference="lrm_invoke-tengine-1610093671-19354500" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="te-primary" acl_target="hacluster" crm_user="hacluster">
194564 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input <crm_xml>
194565 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input <rsc_op id="2" operation="monitor" operation_key="te-res_monitor_0" on_node="te-primary" on_node_uuid="1" transition-key="2:6451495:7:ea06a2cc-9174-42a2-93ee-121424209102">
194566 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input <primitive id="te-res" class="ocf" provider="copmany_name" type="hdsrte_ra"/>
194567 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input <attributes CRM_meta_op_target_rc="7" CRM_meta_timeout="20000" crm_feature_set="3.0.10"/>
194568 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input </rsc_op>
194569 Jan 8 09:14:31 te-primary crmd[3948]: warning: bad input </crm_xml>
194570 Jan 8 09:14:31 ...


Revision history for this message
Billy Olsen (billy-olsen) wrote :

Charm fix was released in the 21.01 release with this patch https://review.opendev.org/c/openstack/charm-hacluster/+/763077

Changed in charm-hacluster:
milestone: none → 21.01
status: In Progress → Fix Released