[SRU] ceph-osd takes all memory at boot
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Ubuntu Cloud Archive | Invalid | Undecided | Unassigned | |
| Queens | Invalid | Undecided | Unassigned | |
| Ussuri | Fix Released | High | Unassigned | |
| Wallaby | Invalid | Undecided | Unassigned | |
| Xena | Invalid | Undecided | Unassigned | |
| Yoga | Invalid | Undecided | Unassigned | |
| ceph (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Bionic | Invalid | Undecided | Unassigned | |
| Focal | Fix Released | High | nikhil kshirsagar | |
| Jammy | Invalid | Undecided | Unassigned | |
| Kinetic | Invalid | Undecided | Unassigned | |
Bug Description
[Impact]
The OSD will fail to trim pg log dup entries, which can result in millions of dup entries for a PG when it should hold at most 3000 (controlled by the option osd_pg_log_dups_tracked).
This can cause the OSD to run out of memory and crash, and it may then be unable to start up again because it has to load those millions of dup entries at boot. This can happen to multiple OSDs at the same time (as also reported by many community users), so a cluster can become completely unusable when it hits this issue.
The currently known trigger for this problem is pg split, because the whole set of dup entries is copied during a pg split. The reason this was not observed as often before is that pg autoscaling was not previously enabled by default; it is on by default since Octopus.
Note that there is also no way to check the number of dups in a PG online.
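The dups can, however, be inspected offline with ceph-objectstore-tool while the OSD is stopped. A minimal sketch, assuming the data path and pgid below (both illustrative) match your deployment:

# Dump the pg log, including dup entries, from a stopped OSD.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 2.0 --op log > pg_2.0_log.json
# Count the dups; a healthy PG should stay near osd_pg_log_dups_tracked (3000).
jq '.pg_log_t.dups | length' pg_2.0_log.json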
[Test Plan]
To see the problem, follow this approach on a test cluster with, e.g., 3 OSDs:
#ps -eaf | grep osd
root 334891 1 0 Sep21 ? 00:42:03 /home/nikhil/
root 335541 1 0 Sep21 ? 00:40:20 /home/nikhil/
kill all OSDs, so they're down,
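One way to do that on a cluster like this one, where the OSD binaries were started by hand (a sketch; on a systemd-managed host you would stop the units instead):

# Kill the manually started ceph-osd processes.
pkill -f ceph-osd
# On a systemd-managed host this would instead be:
#   systemctl stop ceph-osd.target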
root@focal-
2022-09-
2022-09-
cluster:
id: 9e7c0a82-
health: HEALTH_WARN
2 osds down
1 host (3 osds) down
1 root (3 osds) down
Reduced data availability: 169 pgs stale
services:
mon: 3 daemons, quorum a,b,c (age 3s)
mgr: x(active, since 28h)
mds: a:1 {0=a=up:active}
osd: 3 osds: 0 up (since 83m), 2 in (since 91m)
rgw: 1 daemon active (8000)
task status:
data:
pools: 7 pools, 169 pgs
objects: 255 objects, 9.5 KiB
usage: 4.1 GiB used, 198 GiB / 202 GiB avail
pgs: 255/765 objects degraded (33.333%)
105 stale+active+
64 stale+active+
Then inject dups into all OSDs using this JSON,
root@nikhil-
[
{"reqid": "client.4177.0:0",
"version": "3'0",
"user_version": "0",
"generate": "500000",
"return_code": "0"}
]
Use the ceph-objectstore-tool to inject the dups into each OSD (see the sketch after these commands):
root@focal-
root@focal-
root@focal-
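A sketch of what these injection commands look like, assuming the pg-log-inject-dups operation of ceph-objectstore-tool and vstart-style data paths; the paths, pgid, and file name are illustrative:

# Inject the dups from dups.json into one PG on each (stopped) OSD.
for osd in 0 1 2; do
    ceph-objectstore-tool --data-path dev/osd$osd \
        --op pg-log-inject-dups --pgid 2.0 --file dups.json
done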
Then set the osd debug level to 20 (this is the log line that actually does the trimming: https:/
Set debug osd=20 under [global] in ceph.conf:
root@focal-
debug osd=20
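A minimal ceph.conf fragment for this, assuming the [global] section as stated above:

[global]
        debug osd = 20

After restarting an OSD, the active level can be confirmed with: ceph daemon osd.0 config get debug_osd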
Then bring up the OSDs
/home/nikhil/
/home/nikhil/
/home/nikhil/
Run some IO on the OSDs. Wait at least a few hours.
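A hedged example of generating that IO with rbd bench-write, assuming a pool named testpool and a freshly created image (both names illustrative):

# Create a test image and drive sustained writes to it.
rbd create testpool/benchimg --size 10G
rbd bench-write testpool/benchimg --io-size 4096 --io-threads 16 --io-total 2G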
Then take the OSDs down (so the command below can be run), and run,
root@focal-
At the end of that output, in the file op.log, you will see that the number of dups is still what it was when they were injected (no trimming has taken place); a one-liner to count them follows the excerpt below:
{
},
{
}
]
},
"pg_missing_t": {
"missing": [],
}
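To count them from that output, a sketch using jq, assuming the pg_log_t.dups structure shown above:

jq '.pg_log_t.dups | length' op.log

With 500000 dups injected per PG and no trimming, the count should still be of that order.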
To verify the patch:
With the patch in place, once the dups are injected, verify them in the output of ./bin/ceph-objectstore-tool as above.
Then bring up the OSDs and start IO using rbd bench-write; leave the IO running for a few hours, until these logs appear (https:/
root@focal-
2022-09-
...
...
2022-09-
# grep -ri "trim dup " *.log | grep 4177 | wc -l
390001 <-- total across all OSDs; with 3 OSDs, for example, this should be roughly 3x what is seen in the output below (dups trimmed up to 130001). This count of trimmed dup log entries is from all OSDs combined.
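A per-OSD breakdown can be obtained the same way; the log file names here are illustrative:

for f in osd.0.log osd.1.log osd.2.log; do
    echo -n "$f: "; grep "trim dup " "$f" | grep -c 4177
done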
And the output of ./bin/ceph-objectstore-tool will show:
"dups": [
{
},
{
},
This verifies that the dups are being trimmed by the patch and that it works correctly. And of course, the OSDs should not go OOM at boot time!
[Where problems could occur]
This is not a clean cherry-pick, due to differences between the Octopus and master codebases related to RocksDBStore and ObjectStore (see https:/
Also, an earlier attempt to fix this issue upstream was reverted, as discussed at https:/
While this fix has been tested and validated after building it into the upstream 15.2.17 release (see the [Test Plan] section), we still need to proceed with extreme caution: allow some time for any problems to surface before going ahead with this SRU, and run our QA tests on the packages that carry this fix on top of 15.2.17 before releasing it to the customers who await this fix on Octopus.
[Other Info]
The fix is for PGLog to trim duplicates by the number of entries rather than by version, which prevents unbounded duplicate growth.
Reported upstream at https:/
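For clusters already hit by this, where OSDs cannot boot, newer ceph-objectstore-tool builds also provide an offline trim operation. A sketch, assuming the trim-pg-log-dups op is available; the path, pgid, and limits are illustrative:

# With the OSD stopped, trim the accumulated dup entries offline.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op trim-pg-log-dups --pgid 2.0 \
    --osd_max_pg_log_entries=100 --osd_pg_log_dups_tracked=100 \
    --osd_pg_log_trim_max=500000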
tags: added: seg
description: updated (×5)
tags: added: verification-done-focal; removed: verification-failed-focal
tags: added: verification-ussuri-done; removed: verification-ussuri-needed
Changed in ceph (Ubuntu Bionic): status: New → Invalid
Upstream reverted https://github.com/ceph/ceph/pull/45529 and https://github.com/ceph/ceph/pull/46253 (see https://github.com/ceph/ceph/pull/46610 and https://github.com/ceph/ceph/pull/46611).
The proper fix now is through https://github.com/ceph/ceph/pull/47046