resource timeout not respecting units
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pacemaker (Ubuntu) | Fix Released | Undecided | Unassigned |
Focal | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* Cluster resource operation timeouts do not work correctly for systemd resources. Timeouts matter because they keep the actions pacemaker executes for the systemd resource in question from waiting forever to start (or stop) a service, and they let the policy engine take the correct decisions (like trying to start the resource somewhere else).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-
location fence-focal02-
location fence-focal03-
# properties
property cib-bootstrap-
* Try to stop an already started resource group: the "op stop timeout=2s" set for the systemd resource is not accounted as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, over and over).
Increasing timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-
even though the systemd resource starts in much less than 20 seconds.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(
* So, there is no easy path: either we disable clock_gettime() support, by defining PCMK_TIME_
Now... to the potential issues:
* After SRU review it was decided that, instead of cherry-picking the 2 upstream merges pointed out by the upstream maintainer (#1992 and #1997), we would only backport changes that affect the clock_gettime() code base and execution path. This follows the SRU guidelines, which aim to minimize the amount of changes to be reviewed and merged.
* The original fixes (merges #1992 and #1997) were not merged in 2.0.3 because they were missed (it is as if a "half fix" for clock_gettime() was done before the release).
* There are 2 possible clocking choices for pacemaker in 2.0.3: use ftime() if supported (the upstream default), or use clock_gettime() if selected (it became the upstream default after upstream merges #1992 and #1997, which landed after the 2.0.3 release).
* I confined all changes inside the "#ifdef PCMK__TIME_USE_CGT" scope and made sure that one could compile the same source code with ftime() support (just because I cared about not breaking compilation for someone else if needed).
* Fixes are done in execd_commands, the code responsible for starting and stopping resources and/or fencing agents (anything that will need an execv() basically). This could jeopardize other agents (other than systemd ones) so functional and regression tests are needed.
* Anyone who increased systemd resource timeouts because they were not being respected (a 2 second timeout had to be written as 2000 in the previous version, since seconds were not being respected) could now see those timeouts become 2000 seconds. Such users are advised to review their timeout settings and always use an explicit time unit (like suffixing the timeout with "s").
* This change has been recommended by upstream maintainer (from 2 merge numbers he pointed out in the upstream bug = https:/
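The clock_gettime() path discussed above measures durations from (seconds, nanoseconds) pairs, and one of the backported patches is titled "correctly convert ns to ms". A minimal sketch of that kind of conversion follows. It is illustrative Python, not the pacemaker C code; the names `elapsed_ms` and `timed_out` and the tuple layout are assumptions made for this example.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_PER_MSEC = 1_000_000


def elapsed_ms(start, end):
    """Return whole milliseconds between two (seconds, nanoseconds) pairs.

    Hypothetical helper for illustration; the pairs mimic the tv_sec/tv_nsec
    fields of a struct timespec filled in by clock_gettime().
    """
    start_sec, start_nsec = start
    end_sec, end_nsec = end
    # Work in nanoseconds first so a smaller end_nsec borrows from the seconds.
    total_nsec = (end_sec - start_sec) * NSEC_PER_SEC + (end_nsec - start_nsec)
    return total_nsec // NSEC_PER_MSEC


def timed_out(start, now, timeout_ms):
    """True once the elapsed time reaches the configured timeout (in ms)."""
    return elapsed_ms(start, now) >= timeout_ms
```

Mixing up the units at either step (for example comparing elapsed seconds against a timeout in milliseconds) would produce the kind of mismatch described in this bug, which is why functional tests on real timeouts are suggested.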
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-
This led me to find that setting a unit on the timeout value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https:/
but this led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) Setting "100s" as the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout.
4) The unit suffix "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout.
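The expectation in points 3 and 4 can be written down as a small unit-aware conversion. The sketch below is illustrative Python, not pacemaker's parser; the function name `timeout_to_ms`, the set of accepted suffixes, and the rule that a bare number means seconds are assumptions made for this example.

```python
import re

# Multipliers to milliseconds for the suffixes this sketch accepts (assumed set).
_UNIT_TO_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def timeout_to_ms(value):
    """Convert a timeout such as "100s" or "500ms" to milliseconds.

    A bare number is treated as seconds (an assumption for this sketch).
    """
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None:
        raise ValueError(f"unparsable timeout: {value!r}")
    number, unit = match.groups()
    if unit == "":
        unit = "s"
    if unit not in _UNIT_TO_MS:
        raise ValueError(f"unknown unit in timeout: {value!r}")
    return int(number) * _UNIT_TO_MS[unit]
```

Under this rule, "100s" and "100" both come out as 100000 ms, and reaching 10 seconds never requires writing 10000.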
Related branches
- Rafael David Tinoco (community): Approve
- Robie Basak: Approve (sru)
- Canonical Server: Pending requested
Diff: 466 lines (+419/-2), 6 files modified
debian/changelog (+10/-0)
debian/patches/series (+7/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-01-Executor-systemd-execution-time-fixes.patch (+167/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-02-Low-executor-record-correct-last-run-and-last-rc.patch (+133/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-03-Refactor-executor-functionize-getting-current.patch (+88/-0)
debian/rules (+14/-2)
- Rafael David Tinoco (community): Approve
- Christian Ehrhardt (community): Approve
- Canonical Server: Pending requested
Diff: 1090 lines (+1032/-0), 9 files modified
debian/changelog (+14/-0)
debian/patches/series (+10/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-01-b5ff0e4-Build-finalize-restore-buildability.patch (+404/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-02-1f79b43-Refactor-executor-systemd-is-no-longer-supported-without.patch (+110/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-03-0772292-Fix-executor-handle-systemd-execution-times.patch (+186/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-04-08e3f7e-Fix-executor-correctly-convert-ns-to-ms.patch (+30/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-05-c9ce7ed-Low-executor-correctly-set-first-run-time.patch (+39/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-06-9075ad9-Low-executor-record-correct-last-run-and-last-rc.patch (+155/-0)
debian/patches/ubuntu-2.0.3-fixes/lp1881762-07-71ae72d-Refactor-executor-functionize-getting-current-time.patch (+84/-0)
Changed in pacemaker (Ubuntu): | |
status: | New → Triaged |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
description: | updated |
Changed in pacemaker (Ubuntu): | |
status: | Triaged → Fix Released |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
Changed in pacemaker (Ubuntu Focal): | |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
importance: | Undecided → High |
status: | New → In Progress |
tags: | added: server-next |
description: | updated |
Changed in pacemaker (Ubuntu Focal): | |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
importance: | High → Undecided |
description: | updated |
This bug is preventing Charmed Kubernetes from working with hacluster on Ubuntu 20.04 (Focal).