resource timeout not respecting units

Bug #1881762 reported by Jason Grammenos on 2020-06-02
36
This bug affects 4 people
Affects Status Importance Assigned to Milestone
pacemaker (Ubuntu)
Undecided
Unassigned
Focal
Undecided
Unassigned

Bug Description

[Impact]

 * Cluster resource operation timeouts are not working correctly for systemd resources and should be working. Timeouts are important in order for the actions executed by pacemaker - for the systemd resource in question - don't wait forever to start (or stop) a service, causing the police engine to take the correct decisions (like trying to start the resource somewhere else).

[Test Case]

 * configure correctly a pacemaker cluster and add the following resources:

# fencing

primitive fence-focal01 stonith:fence_virsh \
        params ipaddr=192.168.100.202 \
        secure=true plug=focal01 login=fenceuser \
        op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
        params ipaddr=192.168.100.202 \
        secure=true plug=focal02 login=fenceuser \
        op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
        params ipaddr=192.168.100.202 \
        secure=true plug=focal03 login=fenceuser \
        op monitor interval=30s

# resources

primitive virtual_ip IPaddr2 \
        params ip=10.250.92.90 nic=public01 \
        op monitor interval=5s

primitive webserver systemd:lighttpd \
        op monitor interval=5s \
        op start interval=0s timeout=2s \
        op stop interval=0s timeout=2s \
        meta migration-threshold=2

# resource group

group webserver_vip webserver virtual_ip \
        meta target-role=Stopped

# locations

location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03

# properties

property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.0.3-4b1f869f0f \
        cluster-infrastructure=corosync \
        stonith-enabled=on \
        stonith-action=reboot \
        no-quorum-policy=stop \
        cluster-name=focal

* Try to stop an already started resource group with "op stop timeout=2s" for the systemd resource will not be accounted as 2 seconds:

Failed Resource Actions:
  * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms

* Watch the cluster collapse.. (fencing nodes, trying to start resources, fencing nodes again, and over)

Increasing timeout to 20s does not help:

Failed Resource Actions:
  * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
  * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms

and the systemd resources startup is much less than 20 seconds.

[Regression Potential]

 * Debian was still using ftime() for pacemaker 2.0.3, and, because of deprecation warnings, wgrant has changed it in: pacemaker (2.0.3-3ubuntu2):

This was "bad" because it made this issue to appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime(). But.. it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).

 * So, there is no easy path: Its either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3.

Now... to the potential issues:

 * After SRU review it was decided that, instead of cherry-picking the 2 upstream merges pointed by upstream maintainer (#1992 and #1997) we would only backport changes that affect clock_gettime() code base and execution path. This is per SRU guidelines, trying to minimize amount of changes to be reviewed and merged.

 * The original fix (merges #1992 and #1997) were not merged in 2.0.3 because they were missed (it is like "half fix" for clock_gettime() was done before the release).

 * There are 2 possible clocking choices for pacemaker in 2.0.3: To use ftime() if supported (the upstream default) OR to use clock_gettime() if selected (it becomes the upstream default after upstream merges #1992 and #1997 were done, but after 2.0.3 release).

 * I confined all chances inside "#ifdef PCMK__TIME_USE_CGT" scope and made sure that one could compile same source code with ftime() support (just because I cared about not braking compilation for someone else if needed).

 * Fixes are done in execd_commands, the code responsible for starting and stopping resources and/or fencing agents (anything that will need an execv() basically). This could jeopardize other agents (other than systemd ones) so functional and regression tests are needed.

 * Someone that has increased systemd resources timeouts because they were not being respected (2 would have to be value 2000 in previous version, since seconds were not being respected) could have their timeout settings increased now to 2000sec... so it is advised that those should review their timeout settings and always use a timing metric (like suffixing timeout with "s").

 * This change has been recommended by upstream maintainer (from 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429).

[Other Info]

 * Original Description (from the reporter):

While working on pacemaker, i discovered a issue with timeouts

haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms

this lead me down the path of finding that setting a timeout unit value was not doing anything

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500s \
        op stop interval=0s timeout=500s \
        meta migration-threshold=2

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500 \
        op stop interval=0s timeout=500 \
        meta migration-threshold=2

the two above configs result in the same behavior, pacemaker/crm seems to be ignoring the "s"

I file a bug with pacemaker itself
https://bugs.clusterlabs.org/show_bug.cgi?id=5429

but this lead to the following responsed, copied from the ticket:

<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:

<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms

<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.

<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.

It appears the the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of ubuntu pacemaker

1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 milisecond timeout
4) the settings unit value "s", is being ignored. force me to set the timeout to 10000 to get a 10 second timeout

Related branches

Changed in pacemaker (Ubuntu):
status: New → Triaged
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
George Kraft (cynerva) wrote :

This bug is preventing Charmed Kubernetes from working with hacluster on Ubuntu 20.04 (Focal).

Because of the blocking for charmed kubernetes on focal; we're raising this bug to field critical

I'm on it now...

I'll provide a PPA with:

commit c9be5b271
Merge: 72a1f4834 08e3f7e44
Author: Ken Gaillot <email address hidden>
Date: Fri Feb 21 14:57:57 2020

    Merge pull request #1992 from kgaillot/times

    Fix systemd support when using clock_gettime()

and

commit 8254f3d25
Merge: 0625eb1a2 71ae72df9
Author: Ken Gaillot <email address hidden>
Date: Thu Feb 27 00:24:38 2020

    Merge pull request #1997 from kgaillot/times

    Fix regression in reported last run / rc change times in status displays

And Ken Gaillot suggested in upstream bug.

After PPA is ready, I'll do the SRU and we can lower priority as it will depend on the SRU process.

Thanks for the work on that, very nice turn around time;
lowering to field-high

These are the patches being backported from upstream:

  * Post 2.0.3 release fixes backported to Ubuntu: Fix resource timeout
    functionality (LP: #1881762)
    debian/patches/ubuntu-2.0.3-fixes/:
    - lp1881762-0001-e0e59163-Build-configure-remove-no-longer-needed-LIBADD_INTL-.patch
    - lp1881762-0002-6a6ba482-Build-finalize-restore-buildability-in-the-face-of-o.patch
    - lp1881762-0003-720b9d53-Build-configure-fix-syntax-error.patch
    - lp1881762-0004-7f07480c-Low-configure-missing-monotonic-clock-isn-t-fatal-fo.patch
    - lp1881762-0005-794aad8b-Log-executor-improve-process-completion-message.patch
    - lp1881762-0006-441dc18e-Refactor-executor-systemd-is-no-longer-supported-wit.patch
    - lp1881762-0007-dd1756c2-Fix-executor-handle-systemd-execution-times-under-1-.patch
    - lp1881762-0008-7969c205-Fix-executor-correctly-convert-nanoseconds-to-millis.patch
    - lp1881762-0009-310616af-Low-executor-correctly-set-first-run-time.patch
    - lp1881762-0010-e0f20f51-Low-executor-record-correct-last-run-and-last-rc-cha.patch
    - lp1881762-0011-a3050274-Refactor-executor-functionize-getting-current-time-i.patch

And they come from the fixes mentioned from upstream maintainer (merges #1992 and #1997, and some other pre requirements). Let's see how that goes, you can test this hotfix from:

https://launchpad.net/~rafaeldtinoco/+archive/ubuntu/lp1881762

once it is finished building. I'll link a merge request to this public bug so it gets reviewed by another engineer from ubuntu server team before we ask for the SRU.

Let me know if the fix is not good (any time) and I'll provide you something else.

Cheers o/

description: updated
Changed in pacemaker (Ubuntu):
status: Triaged → Fix Released
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in pacemaker (Ubuntu Focal):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
importance: Undecided → High
status: New → In Progress
tags: added: server-next
description: updated

Hello,

Do you have any updates on this ?

Thank you

I can confirm that this fixes the issue

For the SRU reviewer:

I have uploaded the source code with the orig.tar.gz together (not sure if this is a blocker as the package is waiting for approval in queue and has not been rejected). Please find me on IRC so I can re-upload it if needed.

Thank you!

description: updated

SRU reviewer:

https://pastebin.ubuntu.com/p/rgVVb66xGd/

These are the regression tests results. All good.

SRU reviewer:

https://pastebin.ubuntu.com/p/PcZq7Hr3NQ/

These are the functional tests results.

There was an issue running one test, I had to re-ran, and it is all good:

Oct 01 04:09:09 Running test SimulStop (focal02) [ 1]
Oct 01 04:09:24 Stopping Cluster Manager on all nodes
Oct 01 04:09:31 ****************
Oct 01 04:09:31 Overall Results:{'success': 1, 'failure': 0, 'BadNews': 0, 'skipped': 0}
Oct 01 04:09:31 ****************
Oct 01 04:09:31 Test Summary
Oct 01 04:09:31 Test SimulStop: {'calls': 1, 'failure': 0, 'skipped': 0, 'auditfail': 0}
Oct 01 04:09:31 <<<<<<<<<<<<<<<< TESTS COMPLETED

Nothing to worry as it wasn't a failure.

The other one, being a failure, is already expected because we don't package cts-exec-helper command on purpose.

So, with those 2 set of tests I'm confident this fix is stable enough.

Thanks for reviewing it.

Robie Basak (racb) wrote :

I spoke to Rafael and we agreed that he would prepare a more minimal SRU.

description: updated
description: updated

After having my new fix (about to be uploaded):

[rafaeldtinoco@focal01 ~]$ crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: focal01 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Sun Oct 4 14:19:09 2020
  * Last change: Sun Oct 4 14:19:07 2020 by root via cibadmin on focal01
  * 3 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ focal01 focal02 focal03 ]

Full List of Resources:
  * fence-focal01 (stonith:fence_virsh): Started focal02
  * fence-focal02 (stonith:fence_virsh): Started focal01
  * fence-focal03 (stonith:fence_virsh): Started focal01
  * Resource Group: webserver_vip:
    * webserver (systemd:lighttpd): Started focal03
    * virtual_ip (ocf::heartbeat:IPaddr2): Started focal03

The same 2s (from the test case) worked fine.

[rafaeldtinoco@gitubuntu pacemaker]$ git push pkg --delete upload/2.0.3-3ubuntu4
To ssh://git.launchpad.net/ubuntu/+source/pacemaker
 - [deleted] upload/2.0.3-3ubuntu4

[rafaeldtinoco@gitubuntu pacemaker]$ git tag -d upload/2.0.3-3ubuntu4
Deleted tag 'upload/2.0.3-3ubuntu4' (was 198170f66)

[rafaeldtinoco@focal00 pacemaker]$ git describe
upload/2.0.3-3ubuntu4

[rafaeldtinoco@focal00 pacemaker]$ git push pkg upload/2.0.3-3ubuntu4
Enumerating objects: 24, done.
Counting objects: 100% (24/24), done.
Delta compression using up to 8 threads
Compressing objects: 100% (18/18), done.
Writing objects: 100% (18/18), 14.73 KiB | 4.91 MiB/s, done.
Total 18 (delta 11), reused 0 (delta 0)
To ssh://git.launchpad.net/ubuntu/+source/pacemaker
 * [new tag] upload/2.0.3-3ubuntu4 -> upload/2.0.3-3ubuntu4

[rafaeldtinoco@focal00 ubuntu]$ debdiff pacemaker_2.0.3-3ubuntu3.dsc pacemaker_2.0.3-3ubuntu4.dsc 2>&1 | diffstat -l
changelog
patches/series
patches/ubuntu-2.0.3-fixes/lp1881762-01-b5ff0e4-Build-finalize-restore-buildability.patch
patches/ubuntu-2.0.3-fixes/lp1881762-02-1f79b43-Refactor-executor-systemd-is-no-longer-supported-without.patch
patches/ubuntu-2.0.3-fixes/lp1881762-03-0772292-Fix-executor-handle-systemd-execution-times.patch
patches/ubuntu-2.0.3-fixes/lp1881762-04-08e3f7e-Fix-executor-correctly-convert-ns-to-ms.patch
patches/ubuntu-2.0.3-fixes/lp1881762-05-c9ce7ed-Low-executor-correctly-set-first-run-time.patch
patches/ubuntu-2.0.3-fixes/lp1881762-06-9075ad9-Low-executor-record-correct-last-run-and-last-rc.patch
patches/ubuntu-2.0.3-fixes/lp1881762-07-71ae72d-Refactor-executor-functionize-getting-current-time.patch

[rafaeldtinoco@focal00 ubuntu]$ ls -tr1 *_source.changes | tail -1
pacemaker_2.0.3-3ubuntu4_source.changes

[rafaeldtinoco@focal00 ubuntu]$ dput ubuntu pacemaker_2.0.3-3ubuntu4_source.changes
Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading pacemaker_2.0.3-3ubuntu4.dsc: done.
  Uploading pacemaker_2.0.3-3ubuntu4.debian.tar.xz: done.
  Uploading pacemaker_2.0.3-3ubuntu4_source.buildinfo: done.
  Uploading pacemaker_2.0.3-3ubuntu4_source.changes: done.
Successfully uploaded packages.

@racb could you, please, review it again for me ? I have done what we agreed on.

Thanks a lot!

-rafaeldtinoco

Changed in pacemaker (Ubuntu Focal):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
importance: High → Undecided
description: updated

Hello Jason, or anyone else affected,

Accepted pacemaker into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/2.0.3-3ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in pacemaker (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal

@jason,

I could verify this SRU but I'll let you handle it as you were the reporter. Could you please change tags from "needed" to "done" when you're good ? This will sinalize the update can migrate from -proposed to -updates.

Thanks again for reporting this!

-rafaeldtinoco

All autopkgtests for the newly accepted pacemaker (2.0.3-3ubuntu4) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

crmsh/4.2.0-2ubuntu1 (arm64, amd64, s390x, armhf, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#pacemaker

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

The autopkgtests results above look like a regression in crmsh DEP8 because of some other change (an apt-cache | grep "Package" command is not working for crmsh) and unrelated to this upload (as other tests have passed).

After testing I can confirm that this new version by all accounts fixes the reported bug.
Stopping and starting a systemd resource with timeout = "10s" does not fail with timeouts

I tested this by installing the original 2.0.3-3ubuntu3, and reproducing the bug.
then I upgraded to 2.0.3-3ubuntu4 (the proposed) and cannot reproduce the bug.

Marking this as resolved

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers