2020-06-02 13:59:09 |
Jason Grammenos |
bug |
|
|
added bug |
2020-06-17 18:10:34 |
Rafael David Tinoco |
pacemaker (Ubuntu): status |
New |
Triaged |
|
2020-06-17 18:10:39 |
Rafael David Tinoco |
pacemaker (Ubuntu): assignee |
|
Rafael David Tinoco (rafaeldtinoco) |
|
2020-06-17 18:10:48 |
Rafael David Tinoco |
bug |
|
|
added subscriber Ubuntu Server |
2020-07-04 16:32:55 |
Peter Kasza |
bug |
|
|
added subscriber Peter Kasza |
2020-09-25 15:24:11 |
Alexander Balderson |
bug |
|
|
added subscriber Canonical Field Critical |
2020-09-25 19:43:04 |
Alexander Balderson |
removed subscriber Canonical Field Critical |
|
|
|
2020-09-25 19:43:12 |
Alexander Balderson |
bug |
|
|
added subscriber Canonical Field High |
2020-09-25 20:42:11 |
Launchpad Janitor |
merge proposal linked |
|
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391397 |
|
2020-09-25 20:54:45 |
Rafael David Tinoco |
description |
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behaviour; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed for the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources.
* TBD (more info to come)
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
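The behavior described above amounts to a unit-suffix parsing bug: Pacemaker stores operation timeouts internally in milliseconds, so a parser that silently drops the "s" suffix turns "500s" (500 seconds) into 500 ms. As a rough illustration only (a hypothetical Python sketch, not Pacemaker's actual C code), the difference between a correct and a buggy parser:

```python
import re

# Hypothetical sketch -- NOT Pacemaker's actual parser (which lives in C).
# Pacemaker keeps operation timeouts in milliseconds internally.
def parse_timeout_ms(value):
    """Parse '500s', '500ms' or a bare '500' (meaning ms) into milliseconds."""
    m = re.fullmatch(r"(\d+)\s*(ms|s|m|h)?", value.strip())
    if not m:
        raise ValueError(f"unparsable timeout: {value!r}")
    number, unit = int(m.group(1)), m.group(2) or "ms"
    return number * {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}[unit]

# A buggy parser that ignores the suffix treats "500s" and "500" alike,
# which is exactly the observed symptom: "500s" becomes 500 ms.
def buggy_parse_timeout_ms(value):
    return int(re.match(r"\d+", value.strip()).group(0))
```

With the correct parser, "500s" yields 500000 ms; the buggy one yields 500 ms for both spellings, matching the report.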
|
2020-09-25 21:01:26 |
Launchpad Janitor |
merge proposal linked |
|
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398 |
|
2020-09-25 21:03:42 |
Rafael David Tinoco |
nominated for series |
|
Ubuntu Focal |
|
2020-09-25 21:03:42 |
Rafael David Tinoco |
bug task added |
|
pacemaker (Ubuntu Focal) |
|
2020-09-25 21:03:51 |
Rafael David Tinoco |
pacemaker (Ubuntu): status |
Triaged |
Fix Released |
|
2020-09-25 21:03:53 |
Rafael David Tinoco |
pacemaker (Ubuntu): assignee |
Rafael David Tinoco (rafaeldtinoco) |
|
|
2020-09-25 21:03:56 |
Rafael David Tinoco |
pacemaker (Ubuntu Focal): assignee |
|
Rafael David Tinoco (rafaeldtinoco) |
|
2020-09-25 21:03:57 |
Rafael David Tinoco |
pacemaker (Ubuntu Focal): importance |
Undecided |
High |
|
2020-09-25 21:04:02 |
Rafael David Tinoco |
pacemaker (Ubuntu Focal): status |
New |
In Progress |
|
2020-09-25 21:04:09 |
Rafael David Tinoco |
tags |
|
server-next |
|
2020-09-25 21:04:15 |
Rafael David Tinoco |
merge proposal unlinked |
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391397 |
|
|
2020-09-25 21:05:32 |
Rafael David Tinoco |
bug |
|
|
added subscriber Ubuntu HA Interest |
2020-09-26 05:45:36 |
Rafael David Tinoco |
description |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed for the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources.
* TBD (more info to come)
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good, because for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it was also missed there (it is as if a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (and configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
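The monotonic-clock point in the regression analysis above can be illustrated with a small sketch (a Python stand-in, not Pacemaker's C code): elapsed time for an operation is measured against its timeout with a monotonic clock, which, unlike wall-clock time, cannot jump on NTP or date adjustments. On Linux, Python's time.monotonic() maps to clock_gettime(CLOCK_MONOTONIC).

```python
import time

# Sketch only: how a timeout check is typically done with a monotonic
# clock. On Linux, time.monotonic() is clock_gettime(CLOCK_MONOTONIC).
def run_with_timeout(operation, timeout_s):
    start = time.monotonic()            # immune to wall-clock jumps
    operation()                         # stand-in for the resource action
    elapsed = time.monotonic() - start
    return elapsed, elapsed > timeout_s

# e.g. a ~50 ms "operation" against a 2 s timeout should not time out:
elapsed, timed_out = run_with_timeout(lambda: time.sleep(0.05), 2.0)
```

The design point is that subtracting two monotonic readings always gives a valid duration, whereas subtracting two wall-clock readings can go negative or wildly wrong if the system time is stepped mid-operation.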
|
2020-09-28 14:49:04 |
George Kraft |
bug |
|
|
added subscriber George Kraft |
2020-09-29 01:30:05 |
Rafael David Tinoco |
description |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good, because for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it was also missed there (it is as if a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (and configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
SRU reviewer:
The merge request was initially reviewed by @paelzer, before the SRU review. The most important comment is this one:
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398/comments/1030355
It clarifies why the commits were picked. Thanks for reviewing this.
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good, because for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it was also missed there (it is as if a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (and configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
|
2020-10-01 20:03:57 |
Rafael David Tinoco |
description |
SRU reviewer:
The merge request was initially reviewed by @paelzer, before the SRU review. The most important comment is this one:
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391398/comments/1030355
It clarifies why the commits were picked. Thanks for reviewing this.
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Create a pacemaker cluster with Ubuntu focal and configure a primitive with:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
or even
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
and observe timeouts are not being respected.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good, because for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it was also missed there (it is as if a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (and configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of discovering that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
This led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that actions (performed by the resource) don't time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-threshold=2
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03
# properties
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
stonith-enabled=on \
stonith-action=reboot \
no-quorum-policy=stop \
cluster-name=focal
* Try to stop an already started resource group: the "op stop timeout=2s" set for the systemd resource will not be honored as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, and so on).
Increasing the timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms
and the systemd resources' startup takes much less than 20 seconds.
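The 'last-rc-change=1970-01-01 ...' dates in the failed actions above are the telltale symptom of mixing the two clocks: a CLOCK_MONOTONIC reading counts seconds since boot, so formatting it as an absolute timestamp produces a date just after the Unix epoch. A minimal Python illustration (an assumption about the failure mode, consistent with the upstream reply, not traced from Pacemaker's code):

```python
from datetime import datetime, timezone

# A monotonic reading is "seconds since boot", not "seconds since 1970".
# Formatting such a value as a wall-clock timestamp yields an epoch date,
# e.g. ~2 minutes of uptime renders as 1970-01-01 00:01:57Z.
monotonic_seconds = 117
wrong = datetime.fromtimestamp(monotonic_seconds, tz=timezone.utc)
print(wrong.strftime("%Y-%m-%d %H:%M:%SZ"))  # -> 1970-01-01 00:01:57Z
```

This matches the log lines above, where uptimes of a few minutes appear as timestamps a few minutes past 1970-01-01 00:00:00Z.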
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (as we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good, because for pacemaker to support systemd resources a monotonic clock is required (and this change enabled it).
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (like 2.0.3 does by default) - and stick with broken systemd resources + FTBFS - or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it was also missed there (it is as if a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the non-working timeouts and re-organizing timing for resources. They also mostly touch the same file: execd_commands.c (and configure.ac, to control macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change has been recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of finding that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
but it led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
|
2020-10-01 20:12:47 |
Rafael David Tinoco |
description |
[Impact]
* Cluster resource timeouts are not working and should be. Timeouts are important so that the actions (done by the resource) do not time out before we expect them to (sometimes starting a resource can take more time than the default because of configuration files, caches to be loaded, etc.).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-threshold=2
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03
# properties
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
stonith-enabled=on \
stonith-action=reboot \
no-quorum-policy=stop \
cluster-name=focal
* Try to stop an already started resource group: the "op stop timeout=2s" set on the systemd resource is not accounted as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, over and over).
Increasing the timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms
and the systemd resources' startup takes much less than 20 seconds.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good: pacemaker needs a monotonic clock in order to support systemd resources, and this change enabled that.
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (as 2.0.3 does by default) and stick with broken systemd resources + FTBFS, or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it, too, was missed (only a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and reorganizing timing for resources. They also mostly touch the same file, execd_commands.c (plus configure.ac, to control the macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change was recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of finding that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
but it led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
[Impact]
* Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker for the systemd resource in question do not wait forever to start (or stop) a service, allowing the policy engine to take the correct decisions (like trying to start the resource somewhere else).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-threshold=2
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03
# properties
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
stonith-enabled=on \
stonith-action=reboot \
no-quorum-policy=stop \
cluster-name=focal
* Try to stop an already started resource group: the "op stop timeout=2s" set on the systemd resource is not accounted as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, over and over).
Increasing the timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms
and the systemd resources' startup takes much less than 20 seconds.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good: pacemaker needs a monotonic clock in order to support systemd resources, and this change enabled that.
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (as 2.0.3 does by default) and stick with broken systemd resources + FTBFS, or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it, too, was missed (only a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and reorganizing timing for resources. They also mostly touch the same file, execd_commands.c (plus configure.ac, to control the macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change was recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of finding that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
but it led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
|
2020-10-06 16:56:16 |
Rafael David Tinoco |
pacemaker (Ubuntu Focal): assignee |
Rafael David Tinoco (rafaeldtinoco) |
|
|
2020-10-06 16:56:25 |
Rafael David Tinoco |
pacemaker (Ubuntu Focal): importance |
High |
Undecided |
|
2020-10-08 05:42:12 |
Launchpad Janitor |
merge proposal linked |
|
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/391960 |
|
2020-10-08 14:11:13 |
Rafael David Tinoco |
description |
[Impact]
* Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker for the systemd resource in question do not wait forever to start (or stop) a service, allowing the policy engine to take the correct decisions (like trying to start the resource somewhere else).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-threshold=2
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03
# properties
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
stonith-enabled=on \
stonith-action=reboot \
no-quorum-policy=stop \
cluster-name=focal
* Try to stop an already started resource group: the "op stop timeout=2s" set on the systemd resource is not accounted as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, over and over).
Increasing the timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms
and the systemd resources' startup takes much less than 20 seconds.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good: pacemaker needs a monotonic clock in order to support systemd resources, and this change enabled that.
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (as 2.0.3 does by default) and stick with broken systemd resources + FTBFS, or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* This patchset was not included in 2.0.3 because it, too, was missed (only a "half fix" for clock_gettime() was done before the release).
* The number of patches is not small, but they are ALL related to the same thing: fixing the broken timeouts and reorganizing timing for resources. They also mostly touch the same file, execd_commands.c (plus configure.ac, to control the macros).
* Timeouts are confirmed broken for systemd resources (as the test case shows). We could, perhaps, break OCF resources and/or fencing as well.
* This change was recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, I discovered an issue with timeouts:
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
This led me down the path of finding that setting a timeout unit value was not doing anything:
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
The two configs above result in the same behavior; pacemaker/crm seems to be ignoring the "s".
I filed a bug with pacemaker itself:
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
but it led to the following response, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears that the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of the Ubuntu pacemaker package.
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 millisecond timeout
4) the setting's unit value "s" is being ignored, forcing me to set the timeout to 10000 to get a 10 second timeout |
[Impact]
* Cluster resource operation timeouts are not working correctly for systemd resources and should be. Timeouts are important so that the actions executed by pacemaker for the systemd resource in question do not wait forever to start (or stop) a service, allowing the policy engine to take the correct decisions (like trying to start the resource somewhere else).
[Test Case]
* Correctly configure a pacemaker cluster and add the following resources:
# fencing
primitive fence-focal01 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal01 login=fenceuser \
op monitor interval=30s
primitive fence-focal02 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal02 login=fenceuser \
op monitor interval=30s
primitive fence-focal03 stonith:fence_virsh \
params ipaddr=192.168.100.202 \
secure=true plug=focal03 login=fenceuser \
op monitor interval=30s
# resources
primitive virtual_ip IPaddr2 \
params ip=10.250.92.90 nic=public01 \
op monitor interval=5s
primitive webserver systemd:lighttpd \
op monitor interval=5s \
op start interval=0s timeout=2s \
op stop interval=0s timeout=2s \
meta migration-threshold=2
# resource group
group webserver_vip webserver virtual_ip \
meta target-role=Stopped
# locations
location fence-focal01-location fence-focal01 -inf: focal01
location fence-focal02-location fence-focal02 -inf: focal02
location fence-focal03-location fence-focal03 -inf: focal03
# properties
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=2.0.3-4b1f869f0f \
cluster-infrastructure=corosync \
stonith-enabled=on \
stonith-action=reboot \
no-quorum-policy=stop \
cluster-name=focal
* Try to stop an already started resource group: the "op stop timeout=2s" set on the systemd resource is not accounted as 2 seconds:
Failed Resource Actions:
* webserver_stop_0 on focal03 'OCF_TIMEOUT' (198): call=29, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:01:57Z', queued=1828ms, exec=204557ms
* Watch the cluster collapse (fencing nodes, trying to start resources, fencing nodes again, over and over).
Increasing the timeout to 20s does not help:
Failed Resource Actions:
* webserver_stop_0 on focal01 'OCF_TIMEOUT' (198): call=47, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:10:35Z', queued=20ms, exec=236013ms
* webserver_start_0 on focal03 'OCF_TIMEOUT' (198): call=22, status='Timed Out', exitreason='', last-rc-change='1970-01-01 00:05:09Z', queued=33ms, exec=241831ms
and the systemd resources' startup takes much less than 20 seconds.
[Regression Potential]
* Debian was still using ftime() for pacemaker 2.0.3 and, because of deprecation warnings, wgrant changed it in pacemaker (2.0.3-3ubuntu2):
This was "bad" because it made this issue appear (we started using clock_gettime(CLOCK_MONOTONIC) instead of ftime()). But it was also good: pacemaker needs a monotonic clock in order to support systemd resources, and this change enabled that.
* So there is no easy path: either we disable clock_gettime() support by defining PCMK_TIME_EMERGENCY_CGT (as 2.0.3 does by default) and stick with broken systemd resources + FTBFS, or we fix the clock_gettime() support (with this patchset) that wgrant enabled in 2.0.3.
Now, on to the potential issues:
* After SRU review it was decided that, instead of cherry-picking the 2 upstream merges pointed out by the upstream maintainer (#1992 and #1997), we would only backport the changes that affect the clock_gettime() code base and execution path. This is per SRU guidelines, to minimize the amount of change to be reviewed and merged.
* The original fix (merges #1992 and #1997) was not merged into 2.0.3 because it was missed (only a "half fix" for clock_gettime() was done before the release).
* There are 2 possible clock choices for pacemaker in 2.0.3: use ftime() if supported (the upstream default), OR use clock_gettime() if selected (which became the upstream default once merges #1992 and #1997 landed, after the 2.0.3 release).
* I confined all changes inside the "#ifdef PCMK__TIME_USE_CGT" scope and made sure the same source code still compiles with ftime() support (so as not to break compilation for anyone else who might need it).
* The fixes are in execd_commands, the code responsible for starting and stopping resources and/or fencing agents (basically anything that needs an execv()). This could jeopardize agents other than the systemd ones, so functional and regression tests are needed.
* Anyone who inflated their systemd resource timeouts because the unit was not being respected (a 2-second timeout had to be written as 2000 in the previous version) could now see those settings effectively become 2000 seconds... so it is advised that such users review their timeout settings and always use a time unit (e.g. suffix the timeout with "s").
* This change was recommended by the upstream maintainer (via the 2 merge numbers he pointed out in the upstream bug: https://bugs.clusterlabs.org/show_bug.cgi?id=5429).
[Other Info]
* Original Description (from the reporter):
While working on pacemaker, i discovered a issue with timeouts
haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
this lead me down the path of finding that setting a timeout unit value was not doing anything
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500s \
op stop interval=0s timeout=500s \
meta migration-threshold=2
primitive haproxy systemd:haproxy \
op monitor interval=2s \
op start interval=0s timeout=500 \
op stop interval=0s timeout=500 \
meta migration-threshold=2
the two above configs result in the same behavior, pacemaker/crm seems to be ignoring the "s"
I file a bug with pacemaker itself
https://bugs.clusterlabs.org/show_bug.cgi?id=5429
but this lead to the following responsed, copied from the ticket:
<<Looking back on your irc chat, I see you have a version of Pacemaker with a known bug:
<<haproxy_stop_0 on primary 'OCF_TIMEOUT' (198): call=583, status='Timed Out', exitreason='', last-rc-<<change='1970-01-04 17:21:18 -05:00', queued=44ms, exec=176272ms
<<The incorrect date is a result of bugs that occur in systemd resources when Pacemaker 2.0.3 is built <<with the -UPCMK_TIME_EMERGENCY_CGT C flag (which is not the default). I was only aware of that being the <<case in one Fedora release. If those are stock Ubuntu packages, please file an Ubuntu bug to make sure <<they are aware of it.
<<The underlying bugs are fixed as of the Pacemaker 2.0.4 release. If anyone wants to backport specific <<commits instead, the github pull requests #1992 and #1997 should take care of it.
It appears the the root cause of my issue with setting timeout values with units ("600s") is a bug in the build process of ubuntu pacemaker
1) lsb_release -d Description: Ubuntu 20.04 LTS
2) ii pacemaker 2.0.3-3ubuntu3 amd64 cluster resource manager
3) setting "100s" in the timeout of a resource should result in a 100 second timeout, not a 100 milisecond timeout
4) the settings unit value "s", is being ignored. force me to set the timeout to 10000 to get a 10 second timeout |
|
2020-10-08 19:31:26 |
Robie Basak |
pacemaker (Ubuntu Focal): status |
In Progress |
Fix Committed |
|
2020-10-08 19:31:28 |
Robie Basak |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2020-10-08 19:31:31 |
Robie Basak |
bug |
|
|
added subscriber SRU Verification |
2020-10-08 19:31:34 |
Robie Basak |
tags |
server-next |
server-next verification-needed verification-needed-focal |
|
2020-10-15 12:49:07 |
Jason Grammenos |
tags |
server-next verification-needed verification-needed-focal |
server-next verification-done verification-done-focal |
|
2020-10-20 04:26:06 |
Launchpad Janitor |
merge proposal linked |
|
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/392509 |
|
2020-10-20 04:26:48 |
Launchpad Janitor |
merge proposal linked |
|
https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/pacemaker/+git/pacemaker/+merge/392510 |
|
2020-10-26 10:17:20 |
Launchpad Janitor |
pacemaker (Ubuntu Focal): status |
Fix Committed |
Fix Released |
|
2020-10-26 10:17:26 |
Łukasz Zemczak |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|