ceph: nautilus: backport fixes for msgr/eventcenter
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Undecided | Mauricio Faria de Oliveira |
Train | Fix Released | High | Unassigned |
Ussuri | Fix Released | Undecided | Unassigned |
Victoria | Fix Released | Undecided | Mauricio Faria de Oliveira |
ceph (Ubuntu) | Fix Released | Undecided | Unassigned |
Eoan | Won't Fix | Undecided | Unassigned |
Focal | Fix Released | Undecided | Unassigned |
Groovy | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* Ceph Nautilus/14 may hit daemon crashes in msgr/eventcenter
because it lacks backported fixes that properly protect multiple
threads in the connection close/reset/reuse paths.
* Once a daemon crash occurs, the cluster health becomes HEALTH_WARN,
and the status reports: "N daemons have recently crashed"
* Example:
$ juju run --unit ceph-mon/0 "sudo ceph -s"
cluster:
id: ...
health: HEALTH_WARN
1 daemons have recently crashed
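A check for this symptom can be scripted against the status text. The following is a minimal sketch, assuming the plain-text 'ceph -s' output looks like the example above; the helper names (health_state, recent_crash_count) and the sample string are illustrative only and not part of Ceph.

```python
import re

# Sample text mirroring the status output quoted in this report.
# The cluster id is elided here, as it is in the report.
SAMPLE_STATUS = """\
  cluster:
    id:     ...
    health: HEALTH_WARN
            1 daemons have recently crashed
"""


def health_state(status_text):
    """Return the HEALTH_* token from the status text, or None if absent."""
    match = re.search(r"health:\s*(HEALTH_\w+)", status_text)
    return match.group(1) if match else None


def recent_crash_count(status_text):
    """Return N from 'N daemons have recently crashed', or 0 if absent."""
    match = re.search(r"(\d+) daemons? have recently crashed", status_text)
    return int(match.group(1)) if match else 0


if __name__ == "__main__":
    print(health_state(SAMPLE_STATUS), recent_crash_count(SAMPLE_STATUS))
```

In practice the text would come from the 'juju run' command shown above rather than from a hard-coded sample.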
[Fix]
* The backport patches in Ceph PR #33820 [1] fix this problem.
* There are 8 patches in it, but only 5 are strictly required
(3 are related to testcases/
and 1 is already applied, so only 4 patches are actually needed
(the 'msg/async:' patches).
[1] https:/
[Test Case]
* The test-case patch in the PR is a reliable reproducer; it
can be applied then built with -DWITH_TESTS=ON in d/rules;
found in 'obj-x86_
* On a test ceph cluster (e.g., 1 MON, 3 OSDs) in the mon node:
$ sudo LD_LIBRARY_
./ceph_
* With the original package, this hits segfaults with the stack
traces seen by the reporter, and other traces as well; with the
patched package, it runs with no errors.
* Attached the test-case binary 'ceph_test_
the juju bundle for the test ceph cluster 'ceph-lp1890334
[Regression Potential]
* These patches change the connection close/reset/reuse logic,
so regressions would likely originate in those functions but
surface as errors in daemon communication.
* There are no further related fixes upstream.
[Other Info]
* Patches already available on Ceph Octopus/15 on Focal.
* Not reporting against Eoan (Train) as it is EOL.
[Original Description]
Ceph Nautilus in bionic-train may hit daemon crashes (e.g., ceph-mgr)
in msgr/eventcenter as it lacks backports of the following set of fixes:
https:/
Reporting the bug against UCA since Ubuntu Eoan (Train) is EOL.
Working on the debdiffs and tests.
Example stack trace as reported by 'ceph crash info' and GDB:
$ sudo ceph crash info <crash ID>
...
"process_name": "ceph-mgr",
...
"backtrace": [
"(bool ProtocolV2:
]
...
(gdb) bt
#0 raise (sig=sig@entry=11) at ../sysdeps/
#1 0x000055b9deda9140 in reraise_fatal (signum=11) at ./src/global/
#2 handle_fatal_signal (signum=11) at ./src/global/
#3 <signal handler called>
#4 ceph::msgr:
#5 ProtocolV2:
#6 0x00007f8e4bf249dd in ProtocolV2:
at ./src/msg/
#7 0x00007f8e4bf39d55 in ProtocolV2:
#8 0x00007f8e4bef89e3 in AsyncConnection
#9 0x00007f8e4bf51157 in EventCenter:
timeout_
#10 0x00007f8e4bf55848 in NetworkStack:
#11 std::_Function_
at /usr/include/
#12 0x00007f8e4a9b06df in ?? () from /usr/lib/
#13 0x00007f8e4ae876db in start_thread (arg=0x7f8e466d
#14 0x00007f8e4a06da3f in clone () at ../sysdeps/
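To triage whether a given crash matches this bug, the crash metadata can be filtered for messenger frames. The following is a rough sketch, assuming 'ceph crash info' prints a JSON document with the "process_name" and "backtrace" keys shown above. The frame strings in the sample are placeholders (the real frames are truncated in this report), and msgr_frames and MSGR_MARKERS are hypothetical names chosen for illustration.

```python
import json

# Hypothetical sample: only the "process_name" and "backtrace" keys come
# from the report; the frame strings are placeholders, not real output.
SAMPLE_CRASH_INFO = """
{
  "process_name": "ceph-mgr",
  "backtrace": [
    "(placeholder frame mentioning ProtocolV2)",
    "(placeholder frame mentioning EventCenter)",
    "(placeholder frame from another component)"
  ]
}
"""

# Class names that appear in the GDB backtrace quoted in this report.
MSGR_MARKERS = ("ProtocolV2", "AsyncConnection", "EventCenter")


def msgr_frames(crash_info_json, markers=MSGR_MARKERS):
    """Return (process_name, backtrace frames that mention any marker)."""
    info = json.loads(crash_info_json)
    frames = [
        frame
        for frame in info.get("backtrace", [])
        if any(marker in frame for marker in markers)
    ]
    return info.get("process_name"), frames


if __name__ == "__main__":
    name, frames = msgr_frames(SAMPLE_CRASH_INFO)
    print(name, len(frames))
```

A non-empty frame list only suggests the crash is in the messenger code; comparing against the full backtrace above is still needed to confirm it is this bug.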
Changed in cloud-archive: | |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
status: | New → In Progress |
description: | updated |
description: | updated |
description: | updated |
tags: | added: sts |
Changed in ceph (Ubuntu Eoan): | |
status: | New → Won't Fix |
Test case: juju bundle for ceph cluster