Activity log for bug #1881762

Date Who What changed Old value New value Message
2020-06-02 13:59:09 Jason Grammenos bug added bug
2020-06-17 18:10:34 Rafael David Tinoco pacemaker (Ubuntu): status New Triaged
2020-06-17 18:10:39 Rafael David Tinoco pacemaker (Ubuntu): assignee Rafael David Tinoco (rafaeldtinoco)
2020-06-17 18:10:48 Rafael David Tinoco bug added subscriber Ubuntu Server
2020-07-04 16:32:55 Peter Kasza bug added subscriber Peter Kasza
2020-09-25 15:24:11 Alexander Balderson bug added subscriber Canonical Field Critical
2020-09-25 19:43:04 Alexander Balderson removed subscriber Canonical Field Critical
2020-09-25 19:43:12 Alexander Balderson bug added subscriber Canonical Field High
2020-09-25 20:42:11 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391397
2020-09-25 20:54:45 Rafael David Tinoco description

Old value:

While working on pacemaker, I discovered an issue with timeouts:

haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms

This led me down the path of finding that setting a timeout unit value was not doing anything:

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500s \
        op stop interval=0s timeout=500s \
        meta migration-threshold=2

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500 \
        op stop interval=0s timeout=500 \
        meta migration-threshold=2

The two configs above result in the same behaviour; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), which led to the following response, copied from the ticket:

<< Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<< haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<< The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<< The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.

It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package:

1) lsb_release -d
   Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) Setting "100s" as the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout.
4) The unit suffix "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout.

New value:

[Impact]

* Cluster resource timeouts are not working and should be. Timeouts matter so that actions performed for a resource do not time out earlier than expected (starting a resource can sometimes take longer than the default because of configuration files, caches to be loaded, etc.).

[Test Case]

* Create a pacemaker cluster on Ubuntu Focal and configure a primitive with:

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500s \
        op stop interval=0s timeout=500s \
        meta migration-threshold=2

or even

primitive haproxy systemd:haproxy \
        op monitor interval=2s \
        op start interval=0s timeout=500 \
        op stop interval=0s timeout=500 \
        meta migration-threshold=2

and observe that the timeouts are not respected.

[Regression Potential]

* The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and re-organizing timing for resources.

* TBD (more info to come)

[Other Info]

* Original Description (from the reporter): quoted in full in the old value above.
2020-09-25 21:01:26 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398
2020-09-25 21:03:42 Rafael David Tinoco nominated for series Ubuntu Focal
2020-09-25 21:03:42 Rafael David Tinoco bug task added pacemaker (Ubuntu Focal)
2020-09-25 21:03:51 Rafael David Tinoco pacemaker (Ubuntu): status Triaged Fix Released
2020-09-25 21:03:53 Rafael David Tinoco pacemaker (Ubuntu): assignee Rafael David Tinoco (rafaeldtinoco)
2020-09-25 21:03:56 Rafael David Tinoco pacemaker (Ubuntu Focal): assignee Rafael David Tinoco (rafaeldtinoco)
2020-09-25 21:03:57 Rafael David Tinoco pacemaker (Ubuntu Focal): importance Undecided High
2020-09-25 21:04:02 Rafael David Tinoco pacemaker (Ubuntu Focal): status New In Progress
2020-09-25 21:04:09 Rafael David Tinoco tags server-next
2020-09-25 21:04:15 Rafael David Tinoco merge proposal unlinked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391397
2020-09-25 21:05:32 Rafael David Tinoco bug added subscriber Ubuntu HA Interest
2020-09-26 05:45:36 Rafael David Tinoco description

New value: the [Impact], [Test Case] and [Other Info] sections are unchanged; [Regression Potential] is expanded to:

[Regression Potential]

* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed that in pacemaker 2.0.3-3ubuntu2. This was "bad" because it made this issue appear (we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()), but it was also good: for pacemaker to support systemd resources a monotonic clock is required, and this change enabled it.

* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (as 2.0.3 does by default) and stick with broken systemd resources + FTBFS, or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3. Now, to the potential issues:

* This patchset was not done in 2.0.3 because it was also missed (it is as if only a "half fix" for clock_gettime() was done before the release).

* The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (plus configure.ac, to control macros).

* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.

* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug, https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
2020-09-28 14:49:04 George Kraft bug added subscriber George Kraft
2020-09-29 01:30:05 Rafael David Tinoco description

New value: identical to the previous description, with a note for the SRU reviewer prepended:

SRU reviewer: the merge request has been reviewed by @paelzer initially, before the SRU review. The most important comment is this one, clarifying why the commits were picked: https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398/comments/1030355 Thanks for reviewing this.
2020-10-01 20:03:57 Rafael David Tinoco description SRU reviewer: The merge request has been reviewed by @paelzer initially, before the SRU review. The most important comment is this: https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398/comments/1030355 Clarifying why the commits were picked. Thanks for reviewing this [Impact]  * Cluster resource timeouts are not working and should be working. Timeouts are important in order for the actions (done by the resource) don't timeout before we're expecting (sometimes starting a resource can take more time than the default time because of configuration files, or cache to be loaded, etc). [Test Case]  * Create a pacemaker cluster with Ubuntu focal and configure a primitive with: primitive haproxy systemd:haproxy \         op monitor interval=2s \         op start interval=0s timeout=500s \         op stop interval=0s timeout=500s \         meta migration-threshold=2 or even primitive haproxy systemd:haproxy \         op monitor interval=2s \         op start interval=0s timeout=500 \         op stop interval=0s timeout=500 \         meta migration-threshold=2 and observe timeouts are not being respected. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3, and, because of deprecation warnings, wgrant has changed it in: pacemaker (2.0.3-3ubuntu2): This was "bad" because it made this issue to appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime(). But.. it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).  * So, there is no easy path: Its either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now... 
to the potential issues:  * This patchset was not done in 2.0.3 because it was missed also (it is like "half fix" for clock_gettime() was done before the release).  * The number of patches are not small but they're ALL related to the same thing: fixing timeout not working and re-organizing timing for resources. They're also mostly touching the same file: execd_commands.c (and configure.ac to control macros).  * timeouts are confirmed broken for systemd resources (like the test case shows). We could, perhaps, brake for OCF resorces and/or fencing as well.  * This change has been recommended by upstream maintainer (from 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429). [Other Info]  * Original Description (from the reporter): While working on pacemaker, i discovered a issue with timeouts haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms this lead me down the path of finding that setting a timeout unit value was not doing anything primitive haproxy systemd:haproxy \         op monitor interval=2s \         op start interval=0s timeout=500s \         op stop interval=0s timeout=500s \         meta migration-threshold=2 primitive haproxy systemd:haproxy \         op monitor interval=2s \         op start interval=0s timeout=500 \         op stop interval=0s timeout=500 \         meta migration-threshold=2 the two above configs result in the same behavior, pacemaker/crm seems to be ignoring the "s" I file a bug with pacemaker itself https://bugs.clusterlabs.org/show_bug.cgi?id=5429 but this lead to the following responsed, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The 
incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it. It appears the the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of ubuntu pacemaker 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 milisecond timeout 4) the settings unit value "s", is being ignored. force me to set the timeout to 10000 to get a 10 second timeout [Impact]  * Cluster resource timeouts are not working and should be working. Timeouts are important in order for the actions (done by the resource) don't timeout before we're expecting (sometimes starting a resource can take more time than the default time because of configuration files, or cache to be loaded, etc). 
[Test Case] * configure correctly a pacemaker cluster and add the following resources: # fencing primitive fence-focal01 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal01 login=fenceuser \ op monitor interval=30s primitive fence-focal02 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal02 login=fenceuser \ op monitor interval=30s primitive fence-focal03 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal03 login=fenceuser \ op monitor interval=30s # resources primitive virtual_ip IPaddr2 \ params ip=10.250.92.90 nic=public01 \ op monitor interval=5s primitive webserver systemd:lighttpd \ op monitor interval=5s \ op start interval=0s timeout=2s \ op stop interval=0s timeout=2s \ meta migration-threshold=2 # resource group group webserver_vip webserver virtual_ip \ meta target-role=Stopped # locations location fence-focal01-location fence-focal01 -inf: focal01 location fence-focal02-location fence-focal02 -inf: focal02 location fence-focal03-location fence-focal03 -inf: focal03 # properties property cib-bootstrap-options: \ have-watchdog=false \ dc-version=2.0.3-4b1f869f0f \ cluster-infrastructure=corosync \ stonith-enabled=on \ stonith-action=reboot \ no-quorum-policy=stop \ cluster-name=focal * Try to stop an already started resource group with "op stop timeout=2s" for the systemd resource will not be accounted as 2 seconds: Failed Resource Actions: * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms * Watch the cluster collapse.. 
(fencing nodes, trying to start resources, fencing nodes again, and over) Increasing timeout to 20s does not help: Failed Resource Actions: * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms and the systemd resources startup is much less than 20 seconds. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3, and, because of deprecation warnings, wgrant has changed it in: pacemaker (2.0.3-3ubuntu2): This was "bad" because it made this issue to appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime(). But.. it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).  * So, there is no easy path: Its either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now... to the potential issues:  * This patchset was not done in 2.0.3 because it was missed also (it is like "half fix" for clock_gettime() was done before the release).  * The number of patches are not small but they're ALL related to the same thing: fixing timeout not working and re-organizing timing for resources. They're also mostly touching the same file: execd_commands.c (and configure.ac to control macros).  * timeouts are confirmed broken for systemd resources (like the test case shows). We could, perhaps, brake for OCF resorces and/or fencing as well.  * This change has been recommended by upstream maintainer (from 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429). 
[Other Info]  * Original Description (from the reporter): While working on pacemaker, I discovered an issue with timeouts: haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms This led me down the path of finding that setting a timeout unit value was not doing anything: primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500s \ op stop interval=0s timeout=500s \ meta migration-threshold=2 primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500 \ op stop interval=0s timeout=500 \ meta migration-threshold=2 The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), but this led to the following response, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package: 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout 4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout
2020-10-01 20:12:47 Rafael David Tinoco description [Impact]  * Cluster resource timeouts are not working and should be. Timeouts are important so that the actions (done by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, cache to be loaded, etc.). [Test Case] * correctly configure a pacemaker cluster and add the following resources: # fencing primitive fence-focal01 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal01 login=fenceuser \ op monitor interval=30s primitive fence-focal02 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal02 login=fenceuser \ op monitor interval=30s primitive fence-focal03 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal03 login=fenceuser \ op monitor interval=30s # resources primitive virtual_ip IPaddr2 \ params ip=10.250.92.90 nic=public01 \ op monitor interval=5s primitive webserver systemd:lighttpd \ op monitor interval=5s \ op start interval=0s timeout=2s \ op stop interval=0s timeout=2s \ meta migration-threshold=2 # resource group group webserver_vip webserver virtual_ip \ meta target-role=Stopped # locations location fence-focal01-location fence-focal01 -inf: focal01 location fence-focal02-location fence-focal02 -inf: focal02 location fence-focal03-location fence-focal03 -inf: focal03 # properties property cib-bootstrap-options: \ have-watchdog=false \ dc-version=2.0.3-4b1f869f0f \ cluster-infrastructure=corosync \ stonith-enabled=on \ stonith-action=reboot \ no-quorum-policy=stop \ cluster-name=focal * Try to stop the already started resource group; the "op stop timeout=2s" for the systemd resource will not be honored as 2 seconds: Failed Resource Actions: * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms * Watch the
cluster collapse... (fencing nodes, trying to start resources, fencing nodes again, and so on). Increasing the timeout to 20s does not help: Failed Resource Actions: * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms and the systemd resources' startup takes much less than 20 seconds. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2). This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But... it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).  * So, there is no easy path: it's either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now... on to the potential issues:  * This patchset was not included in 2.0.3 because it was missed as well (it is as if only half of the clock_gettime() fix landed before the release).  * The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (plus configure.ac, to control macros).  * Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.  * This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]  * Original Description (from the reporter): While working on pacemaker, I discovered an issue with timeouts: haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms This led me down the path of finding that setting a timeout unit value was not doing anything: primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500s \ op stop interval=0s timeout=500s \ meta migration-threshold=2 primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500 \ op stop interval=0s timeout=500 \ meta migration-threshold=2 The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), but this led to the following response, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package: 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout 4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout [Impact]  * Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker - for the systemd resource in question - don't wait forever to start (or stop) a service, allowing the policy engine to make the correct decisions (like trying to start the resource somewhere else). [Test Case]  * correctly configure a pacemaker cluster and add the following resources: # fencing primitive fence-focal01 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal01 login=fenceuser \ op monitor interval=30s primitive fence-focal02 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal02 login=fenceuser \ op monitor interval=30s primitive fence-focal03 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal03 login=fenceuser \ op monitor interval=30s # resources primitive virtual_ip IPaddr2 \ params ip=10.250.92.90 nic=public01 \ op monitor interval=5s primitive webserver systemd:lighttpd \ op monitor interval=5s \ op start interval=0s timeout=2s \ op stop interval=0s timeout=2s \ meta migration-threshold=2 # resource group group webserver_vip webserver virtual_ip \ meta target-role=Stopped # locations location fence-focal01-location fence-focal01 -inf: focal01 location fence-focal02-location fence-focal02 -inf:
focal02 location fence-focal03-location fence-focal03 -inf: focal03 # properties property cib-bootstrap-options: \ have-watchdog=false \ dc-version=2.0.3-4b1f869f0f \ cluster-infrastructure=corosync \ stonith-enabled=on \ stonith-action=reboot \ no-quorum-policy=stop \ cluster-name=focal * Try to stop the already started resource group; the "op stop timeout=2s" for the systemd resource will not be honored as 2 seconds: Failed Resource Actions:   * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms * Watch the cluster collapse... (fencing nodes, trying to start resources, fencing nodes again, and so on). Increasing the timeout to 20s does not help: Failed Resource Actions:   * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms   * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms and the systemd resources' startup takes much less than 20 seconds. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2). This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But... it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).  * So, there is no easy path: it's either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now...
on to the potential issues:  * This patchset was not included in 2.0.3 because it was missed as well (it is as if only half of the clock_gettime() fix landed before the release).  * The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (plus configure.ac, to control macros).  * Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.  * This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429). [Other Info]  * Original Description (from the reporter): While working on pacemaker, I discovered an issue with timeouts: haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms This led me down the path of finding that setting a timeout unit value was not doing anything: primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500s \ op stop interval=0s timeout=500s \ meta migration-threshold=2 primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500 \ op stop interval=0s timeout=500 \ meta migration-threshold=2 The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), but this led to the following response, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The
incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it. It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package: 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout 4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout
2020-10-06 16:56:16 Rafael David Tinoco pacemaker (Ubuntu Focal): assignee Rafael David Tinoco (rafaeldtinoco)
2020-10-06 16:56:25 Rafael David Tinoco pacemaker (Ubuntu Focal): importance High Undecided
2020-10-08 05:42:12 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391960
2020-10-08 14:11:13 Rafael David Tinoco description [Impact]  * Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker - for the systemd resource in question - don't wait forever to start (or stop) a service, allowing the policy engine to make the correct decisions (like trying to start the resource somewhere else). [Test Case]  * correctly configure a pacemaker cluster and add the following resources: # fencing primitive fence-focal01 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal01 login=fenceuser \ op monitor interval=30s primitive fence-focal02 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal02 login=fenceuser \ op monitor interval=30s primitive fence-focal03 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal03 login=fenceuser \ op monitor interval=30s # resources primitive virtual_ip IPaddr2 \ params ip=10.250.92.90 nic=public01 \ op monitor interval=5s primitive webserver systemd:lighttpd \ op monitor interval=5s \ op start interval=0s timeout=2s \ op stop interval=0s timeout=2s \ meta migration-threshold=2 # resource group group webserver_vip webserver virtual_ip \ meta target-role=Stopped # locations location fence-focal01-location fence-focal01 -inf: focal01 location fence-focal02-location fence-focal02 -inf: focal02 location fence-focal03-location fence-focal03 -inf: focal03 # properties property cib-bootstrap-options: \ have-watchdog=false \ dc-version=2.0.3-4b1f869f0f \ cluster-infrastructure=corosync \ stonith-enabled=on \ stonith-action=reboot \ no-quorum-policy=stop \ cluster-name=focal * Try to stop the already started resource group; the "op stop timeout=2s" for the
systemd resource will not be honored as 2 seconds: Failed Resource Actions:   * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms * Watch the cluster collapse... (fencing nodes, trying to start resources, fencing nodes again, and so on). Increasing the timeout to 20s does not help: Failed Resource Actions:   * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms   * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms and the systemd resources' startup takes much less than 20 seconds. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2). This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But... it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).  * So, there is no easy path: it's either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now... on to the potential issues:  * This patchset was not included in 2.0.3 because it was missed as well (it is as if only half of the clock_gettime() fix landed before the release).  * The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (plus configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.  * This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429). [Other Info]  * Original Description (from the reporter): While working on pacemaker, I discovered an issue with timeouts: haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms This led me down the path of finding that setting a timeout unit value was not doing anything: primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500s \ op stop interval=0s timeout=500s \ meta migration-threshold=2 primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500 \ op stop interval=0s timeout=500 \ meta migration-threshold=2 The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), but this led to the following response, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release.
If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it. It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package: 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout 4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout [Impact]  * Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker - for the systemd resource in question - don't wait forever to start (or stop) a service, allowing the policy engine to make the correct decisions (like trying to start the resource somewhere else). [Test Case]  * correctly configure a pacemaker cluster and add the following resources: # fencing primitive fence-focal01 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal01 login=fenceuser \ op monitor interval=30s primitive fence-focal02 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal02 login=fenceuser \ op monitor interval=30s primitive fence-focal03 stonith:fence_virsh \ params ipaddr=192.168.100.202 \ secure=true plug=focal03 login=fenceuser \ op monitor interval=30s # resources primitive virtual_ip IPaddr2 \ params ip=10.250.92.90 nic=public01 \ op monitor interval=5s primitive webserver systemd:lighttpd \ op monitor interval=5s \ op start interval=0s timeout=2s \ op stop interval=0s timeout=2s \ meta migration-threshold=2 # resource group group webserver_vip webserver virtual_ip \ meta target-role=Stopped #
locations location fence-focal01-location fence-focal01 -inf: focal01 location fence-focal02-location fence-focal02 -inf: focal02 location fence-focal03-location fence-focal03 -inf: focal03 # properties property cib-bootstrap-options: \ have-watchdog=false \ dc-version=2.0.3-4b1f869f0f \ cluster-infrastructure=corosync \ stonith-enabled=on \ stonith-action=reboot \ no-quorum-policy=stop \ cluster-name=focal * Try to stop the already started resource group; the "op stop timeout=2s" for the systemd resource will not be honored as 2 seconds: Failed Resource Actions:   * webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms * Watch the cluster collapse... (fencing nodes, trying to start resources, fencing nodes again, and so on). Increasing the timeout to 20s does not help: Failed Resource Actions:   * webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms   * webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms and the systemd resources' startup takes much less than 20 seconds. [Regression Potential]  * Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2). This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But... it was good, because in order for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So, there is no easy path: it's either we disable clock_gettime() support, by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) enabled by wgrant in 2.0.3. Now... on to the potential issues: * After SRU review it was decided that, instead of cherry-picking the 2 upstream merges pointed out by the upstream maintainer (#1992 and #1997), we would only backport the changes that affect the clock_gettime() code base and execution path. This is per SRU guidelines, trying to minimize the amount of changes to be reviewed and merged.  * The original fix (merges #1992 and #1997) was not merged in 2.0.3 because it was missed (it is as if only half of the clock_gettime() fix landed before the release). * There are 2 possible clocking choices for pacemaker in 2.0.3: to use ftime() if supported (the upstream default) OR to use clock_gettime() if selected (it became the upstream default once upstream merges #1992 and #1997 were done, but only after the 2.0.3 release). * I confined all changes inside the "#ifdef PCMK__TIME_USE_CGT" scope and made sure that one could still compile the same source code with ftime() support (just because I cared about not breaking compilation for someone else if needed). * Fixes are done in execd_commands, the code responsible for starting and stopping resources and/or fencing agents (anything that needs an execv(), basically). This could jeopardize agents other than the systemd ones, so functional and regression tests are needed. * Someone who has increased systemd resource timeouts because they were not being respected (a value of 2 would have had to be 2000 in the previous version, since seconds were not being respected) could now have their timeout settings effectively increased to 2000 seconds... so it is advised that such users review their timeout settings and always use a time unit (like suffixing the timeout with "s").
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug = https://bugs.clusterlabs.org/show_bug.cgi?id=5429). [Other Info]  * Original Description (from the reporter): While working on pacemaker, I discovered an issue with timeouts: haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms This led me down the path of finding that setting a timeout unit value was not doing anything: primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500s \ op stop interval=0s timeout=500s \ meta migration-threshold=2 primitive haproxy systemd:haproxy \ op monitor interval=2s \ op start interval=0s timeout=500 \ op stop interval=0s timeout=500 \ meta migration-threshold=2 The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s". I filed a bug with pacemaker itself (https://bugs.clusterlabs.org/show_bug.cgi?id=5429), but this led to the following response, copied from the ticket: <<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug: <<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms <<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it. <<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package: 1) lsb_release -d Description: Ubuntu 20.04 LTS 2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager 3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout 4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout
2020-10-08 19:31:26 Robie Basak pacemaker (Ubuntu Focal): status In Progress Fix Committed
2020-10-08 19:31:28 Robie Basak bug added subscriber Ubuntu Stable Release Updates Team
2020-10-08 19:31:31 Robie Basak bug added subscriber SRU Verification
2020-10-08 19:31:34 Robie Basak tags server-next server-next verification-needed verification-needed-focal
2020-10-15 12:49:07 Jason Grammenos tags server-next verification-needed verification-needed-focal server-next verification-done verification-done-focal
2020-10-20 04:26:06 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/392509
2020-10-20 04:26:48 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/392510
2020-10-26 10:17:20 Launchpad Janitor pacemaker (Ubuntu Focal): status Fix Committed Fix Released
2020-10-26 10:17:26 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team