Ubuntu
ceph package

Bug #1922561
Activity log

Activity log for bug #1922561

Date	Who	What changed	Old value	New value	Message
2021-04-05 13:16:43	Mauricio Faria de Oliveira	bug			added bug
2021-04-05 13:17:42	Mauricio Faria de Oliveira	nominated for series		Ubuntu Hirsute
2021-04-05 13:17:42	Mauricio Faria de Oliveira	bug task added		ceph (Ubuntu Hirsute)
2021-04-05 13:17:42	Mauricio Faria de Oliveira	nominated for series		Ubuntu Focal
2021-04-05 13:17:42	Mauricio Faria de Oliveira	bug task added		ceph (Ubuntu Focal)
2021-04-05 13:17:42	Mauricio Faria de Oliveira	nominated for series		Ubuntu Groovy
2021-04-05 13:17:42	Mauricio Faria de Oliveira	bug task added		ceph (Ubuntu Groovy)
2021-04-05 13:18:22	Mauricio Faria de Oliveira	bug task added		cloud-archive
2021-04-05 13:20:49	Mauricio Faria de Oliveira	nominated for series		cloud-archive/ussuri
2021-04-05 13:20:49	Mauricio Faria de Oliveira	bug task added		cloud-archive/ussuri
2021-04-05 13:20:49	Mauricio Faria de Oliveira	nominated for series		cloud-archive/wallaby
2021-04-05 13:20:49	Mauricio Faria de Oliveira	bug task added		cloud-archive/wallaby
2021-04-05 13:20:49	Mauricio Faria de Oliveira	nominated for series		cloud-archive/victoria
2021-04-05 13:20:49	Mauricio Faria de Oliveira	bug task added		cloud-archive/victoria
2021-04-05 13:20:49	Mauricio Faria de Oliveira	nominated for series		cloud-archive/train
2021-04-05 13:20:49	Mauricio Faria de Oliveira	bug task added		cloud-archive/train
2021-04-05 13:21:54	Mauricio Faria de Oliveira	description	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] - Octopus 15.2.11 [2] - Nautilus TBD (at least 14.2.20) This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.11 [2] [Focal/Groovy; Ussuri+] - Nautilus TBD (at least 14.2.20) [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682
2021-04-05 13:22:08	Mauricio Faria de Oliveira	cloud-archive/train: status	New	Confirmed
2021-04-05 13:22:26	Mauricio Faria de Oliveira	cloud-archive/ussuri: status	New	Confirmed
2021-04-05 13:22:39	Mauricio Faria de Oliveira	cloud-archive/victoria: status	New	Invalid
2021-04-05 13:22:43	Mauricio Faria de Oliveira	cloud-archive/wallaby: status	New	Invalid
2021-04-05 13:22:51	Mauricio Faria de Oliveira	ceph (Ubuntu Focal): status	New	Confirmed
2021-04-05 13:22:55	Mauricio Faria de Oliveira	ceph (Ubuntu Groovy): status	New	Confirmed
2021-04-05 13:22:59	Mauricio Faria de Oliveira	ceph (Ubuntu Hirsute): status	New	Confirmed
2021-04-05 13:25:13	Mauricio Faria de Oliveira	description	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.11 [2] [Focal/Groovy; Ussuri+] - Nautilus TBD (at least 14.2.20) [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.11 [2] [Focal/Groovy; Ussuri+] - Nautilus TBD (at least 14.2.20) [3] [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682
2021-04-21 11:36:15	Mauricio Faria de Oliveira	description	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.11 [2] [Focal/Groovy; Ussuri+] - Nautilus TBD (at least 14.2.20) [3] [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.12 [2] [Focal/Groovy; Ussuri+] - Nautilus 14.2.21) [3] [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682
2021-04-21 11:36:40	Mauricio Faria de Oliveira	ceph (Ubuntu Hirsute): status	Confirmed	Fix Released
2021-05-14 12:05:22	Mauricio Faria de Oliveira	description	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.12 [2] [Focal/Groovy; Ussuri+] - Nautilus 14.2.21) [3] [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682	[Impact] The `osd_fast_shutdown` option (available in Nautilus and enabled by default) may cause the cluster log to receive too many entries of `osd.X reported immediately failed by osd.Y` on large clusters. This happens as the monitor no longer receives the OSD message to notify that the OSD is shutting down and now relies on other OSDs telling it about the failed OSD (not really 'failed' but shutdown.) This might be an issue for LMA stacks/tools that check ceph logs for failed lines, and then require additional logic to filter on an intended OSD (fast) shutdown; might not be an option/possible, and require an admin to analyze. [Fix] The `osd_fast_shutdown_notify_mon` option for OSDs (disabled by default) can tell the monitor it is shutting down (done in slow/ non-fast shutdown) under `osd_fast_shutdown`. This introduces minimal delay (the ack from the mon is required to prevent the messages), and addresses the cluster log issue. PS: the `osd_mon_shutdown_timeout` option can be used to control the maximum amount of time waiting for the monitor ack to arrive. The new option should be available in the following Ceph releases: - Pacific 16.2.0 [1] [Hirsute+] - Octopus 15.2.13 [2] [Focal/Groovy; Ussuri+] - Nautilus 14.2.22) [3] [Eoan is EOL; Train] This bug tracks the release of this patch in Ubuntu/Cloud Archive. [Test Case] - Stop an OSD and watch the OSD and MON logs. - Before / or with `osd_fast_shutdown_notify_mon = false`: ``` osd log: 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 received signal: Terminated from -bash (PID: 408) UID: 1000 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Got signal Terminated * 2021-01-09T18:59:52.448+0000 7f937fcdc700 -1 osd.2 22 * Immediate shutdown (osd_fast_shutdown=true) * mon log: $ cat out/mon.a.log \| grep '^2021-01-09T18:59:' \| grep 'osd.0 reported immediately failed by osd.' \| rev \| cut -d: -f1 \| rev \| sort \| uniq -c 4 osd.0 reported immediately failed by osd.1 4 osd.0 reported immediately failed by osd.2 4 osd.0 reported immediately failed by osd.3 4 osd.0 reported immediately failed by osd.4 4 osd.0 reported immediately failed by osd.5 4 osd.0 reported immediately failed by osd.6 4 osd.0 reported immediately failed by osd.7 4 osd.0 reported immediately failed by osd.8 4 osd.0 reported immediately failed by osd.9 ``` - After / with `osd_fast_shutdown_notify_mon = true`: ``` osd log: 2021-01-14T17:21:10.825+0000 7feceded1700 -1 received signal: Terminated from -bash (PID: 1750) UID: 1000 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Got signal Terminated * 2021-01-14T17:21:10.825+0000 7feceded1700 -1 osd.0 80 * Immediate shutdown (osd_fast_shutdown=true) * 2021-01-14T17:21:10.825+0000 7feceded1700 0 osd.0 80 prepare_to_stop telling mon we are shutting down ... 2021-01-14T17:21:11.021+0000 7fecdac0c700 0 osd.0 80 got_stop_ack starting shutdown 2021-01-14T17:21:11.021+0000 7feceded1700 0 osd.0 80 prepare_to_stop starting shutdown mon log: 2021-01-14T17:21:10.829+0000 7f62fce61700 0 log_channel(cluster) log [INF] : osd.0 marked itself down 2021-01-14T17:21:10.885+0000 7f62ff666700 1 mon.a@0(leader).osd e80 do_prune osdmap full prune enabled 2021-01-14T17:21:10.889+0000 7f62ff666700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2021-01-14T17:21:10.957+0000 7f62fb65e700 1 mon.a@0(leader).osd e81 e81: 10 total, 9 up, 10 in 2021-01-14T17:21:11.013+0000 7f62fb65e700 0 log_channel(cluster) log [DBG] : osdmap e81: 10 total, 9 up, 10 in ``` [Where problems could occur] Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled. The patch is quite small and contained to the OSD shutdown path. It is effectively a nop when the option is disabled (by default). It is a small change from the newly introduced default behavior, but it just re-introduces a message in the shutdown path, which is how it used to be done on previous releases and even earlier stable releases in Nautilus. [1] https://tracker.ceph.com/issues/49683 [2] https://tracker.ceph.com/issues/49681 [3] https://tracker.ceph.com/issues/49682
2021-06-15 15:24:25	Mauricio Faria de Oliveira	cloud-archive/train: status	Confirmed	Invalid
2021-06-15 15:26:28	Mauricio Faria de Oliveira	ceph (Ubuntu Groovy): status	Confirmed	Invalid
2021-11-03 14:52:43	Mauricio Faria de Oliveira	ceph (Ubuntu Focal): status	Confirmed	Fix Released
2021-11-03 14:52:45	Mauricio Faria de Oliveira	cloud-archive/ussuri: status	Confirmed	Fix Released

Ubuntuceph package

Activity log for bug #1922561

Ubuntu
ceph package