osd shutdown may flood the cluster log with 'osd.X reported immediately failed by osd.Y'
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned | |
Train | Invalid | Undecided | Unassigned | |
Ussuri | Fix Released | Undecided | Unassigned | |
Victoria | Invalid | Undecided | Unassigned | |
Wallaby | Invalid | Undecided | Unassigned | |
ceph (Ubuntu) | Fix Released | Undecided | Unassigned | |
Focal | Fix Released | Undecided | Unassigned | |
Groovy | Invalid | Undecided | Unassigned | |
Hirsute | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
The `osd_fast_shutdown` option (available in Nautilus and enabled
by default) may flood the cluster log with entries of
`osd.X reported immediately failed by osd.Y` on large clusters.
This happens because the monitor no longer receives the OSD's message
announcing that it is shutting down, and instead relies on other OSDs
reporting the OSD as failed (it has not really failed, only shut down).
This can be an issue for LMA stacks/tools that check the Ceph logs
for 'failed' lines: they need additional logic to filter out
an intended (fast) OSD shutdown, which might not be possible,
and the entries would otherwise require an admin to analyze.
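To illustrate the kind of extra filtering logic such a tool would need, here is a minimal sketch in Python. It assumes only the message text from the bug title (`osd.X reported immediately failed by osd.Y`) and makes no assumption about the rest of the log line layout. The function names and the `planned_osds` set are hypothetical, not part of Ceph or of any LMA tool.

```python
import re
from collections import Counter, defaultdict

# Message text as quoted in this bug; anything before or after it on the
# log line (timestamp, log level, ...) is deliberately not matched.
FAILED_RE = re.compile(r"osd\.(\d+) reported immediately failed by osd\.(\d+)")


def count_immediate_failures(lines):
    """Return {failed_osd_id: Counter({reporter_osd_id: count})}."""
    reports = defaultdict(Counter)
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failed = int(match.group(1))
            reporter = int(match.group(2))
            reports[failed][reporter] += 1
    return reports


def unexpected_failures(reports, planned_osds):
    """Drop OSDs whose shutdown was intended (hypothetical allow-list)."""
    return {osd: c for osd, c in reports.items() if osd not in planned_osds}


if __name__ == "__main__":
    sample = [
        "osd.0 reported immediately failed by osd.1",
        "osd.0 reported immediately failed by osd.2",
        "osd.7 reported immediately failed by osd.3",
    ]
    reports = count_immediate_failures(sample)
    # osd.0 was stopped on purpose, so only osd.7 should be flagged.
    print(unexpected_failures(reports, planned_osds={0}))
```

The allow-list is the weak point: the tool has to know in advance which OSDs were stopped intentionally, which is exactly the information the monitor no longer gets from the OSD itself.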
[Fix]
The `osd_fast_
default) can tell the monitor it is shutting down (done in slow/
non-fast shutdown) under `osd_fast_
This introduces minimal delay (the ack from the mon is required
to prevent the messages), and addresses the cluster log issue.
PS: the `osd_mon_
the maximum amount of time waiting for the monitor ack to arrive.
The new option should be available in the following Ceph releases:
- Pacific 16.2.0 [1] [Hirsute+]
- Octopus 15.2.13 [2] [Focal/Groovy; Ussuri+]
- Nautilus 14.2.22 [3] [Eoan is EOL; Train]
This bug tracks the release of this patch in Ubuntu/Cloud Archive.
[Test Case]
- Stop an OSD and watch the OSD and MON logs.
- Before / or with `osd_fast_
```
osd log:
2021-01-
2021-01-
2021-01-
mon log:
$ cat out/mon.a.log | grep '^2021-
4 osd.0 reported immediately failed by osd.1
4 osd.0 reported immediately failed by osd.2
4 osd.0 reported immediately failed by osd.3
4 osd.0 reported immediately failed by osd.4
4 osd.0 reported immediately failed by osd.5
4 osd.0 reported immediately failed by osd.6
4 osd.0 reported immediately failed by osd.7
4 osd.0 reported immediately failed by osd.8
4 osd.0 reported immediately failed by osd.9
```
- After / with `osd_fast_
```
osd log:
2021-01-
2021-01-
2021-01-
2021-01-
...
2021-01-
2021-01-
mon log:
2021-01-
2021-01-
2021-01-
2021-01-
2021-01-
```
[Where problems could occur]
Any regression from this patch should manifest in OSD shutdown, but only when the option is enabled.
The patch is quite small and contained to the OSD shutdown path.
It is effectively a no-op when the option is disabled (the default).
It is a small change from the newly introduced default behavior,
but it just re-introduces a message in the shutdown path, which
is how it used to be done on previous releases and even earlier
stable releases in Nautilus.
[1] https:/
[2] https:/
[3] https:/
Adding target Ubuntu/Cloud Archive releases.
(note: Victoria and Wallaby have no ceph packages yet, thus marking as Invalid.)
```
The new option should be available in the following Ceph releases:
- Pacific 16.2.0 [Hirsute+]
- Octopus 15.2.11 [Focal/Groovy; Ussuri+]
- Nautilus TBD (at least 14.2.20) [Eoan is EOL; Train]
This bug tracks the release of this patch in Ubuntu/Cloud Archive.
```