ceph: nautilus: backport fixes for msgr/eventcenter

Bug #1890334 reported by Mauricio Faria de Oliveira
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
Undecided
Mauricio Faria de Oliveira
Train
Fix Released
High
Unassigned
Ussuri
Fix Released
Undecided
Unassigned
Victoria
Fix Released
Undecided
Mauricio Faria de Oliveira
ceph (Ubuntu)
Fix Released
Undecided
Unassigned
Eoan
Won't Fix
Undecided
Unassigned
Focal
Fix Released
Undecided
Unassigned
Groovy
Fix Released
Undecided
Unassigned

Bug Description

[Impact]

 * Ceph Nautilus/14 may hit daemon crashes in msgr/eventcenter
   as it lacks backport fixes to properly protect many threads
   in the connection close/reset/reuse paths.

 * Once a daemon crash occurs, the cluster becomes HEALTH_WARN,
   and reports in status: "N daemons have recently crashed"

 * Example:

    $ juju run --unit ceph-mon/0 "sudo ceph -s"
    cluster:
    id: ...
    health: HEALTH_WARN
    1 daemons have recently crashed

[Fix]

 * The backport patches in Ceph PR #33820 [1] fix this problem.

 * There are 8 patches in it, but only 5 are strictly required
   (3 are related to testcases/sanitizers, not used in package),
   and 1 is already applied; so actually only 4 patches needed
   (the 'msg/async:' patches.)

  [1] https://github.com/ceph/ceph/pull/33820

[Test Case]

 * The test-case patch in the PR is a reliable reproducer; it
   can be applied then built with -DWITH_TESTS=ON in d/rules;
   found in 'obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc'

 * On a test ceph cluster (e.g., 1 MON, 3 OSDs) in the mon node:

   $ sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
    ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace

 * This hits segfaults with the stack traces seen by the reporter,
   and other traces as well in the original package, and no errors
   in the patched package.

 * Attached the test-case binary 'ceph_test_rados_api_misc' and
   the juju bundle for the test ceph cluster 'ceph-lp1890334.yaml'.

[Regression Potential]

 * These patches change the connection close/reset/reuse logic,
   so regressions would likely manifest in such functions but
   be exposed/hit errors actually in daemon communication.

 * There are no further related fixes upstream.

[Other Info]

 * Patches already available on Ceph Octopus/15 on Focal.
 * Not reporting against Eoan (Train) as it is EOL.

[Original Description]

Ceph Nautilus in bionic-train may hit daemon crashes (e.g., ceph-mgr)
in msgr/eventcenter as it lacks the following set of fixes backports:

  https://github.com/ceph/ceph/pull/33820

Reporting the bug against UCA since Ubuntu Eoan (Train) is EOL.
Working on the debdiffs and tests.

Example stack trace as reported by 'ceph crash info' and GDB:

$ sudo ceph crash info <crash ID>
...
    "process_name": "ceph-mgr",
...
    "backtrace": [
        "(()+0x128a0) [0x7f8e4ae928a0]",
        "(bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x48a) [0x7f8e4bf4219a]",
        "(ProtocolV2::write_message(Message*, bool)+0x4dd) [0x7f8e4bf249dd]",
        "(ProtocolV2::write_event()+0x2c5) [0x7f8e4bf39d55]",
        "(AsyncConnection::handle_write()+0x43) [0x7f8e4bef89e3]",
        "(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd57) [0x7f8e4bf51157]",
        "(()+0x59b848) [0x7f8e4bf55848]",
        "(()+0xbd6df) [0x7f8e4a9b06df]",
        "(()+0x76db) [0x7f8e4ae876db]",
        "(clone()+0x3f) [0x7f8e4a06da3f]"
    ]
...

(gdb) bt
#0 raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x000055b9deda9140 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:81
#2 handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:326
#3 <signal handler called>
#4 ceph::msgr::v2::Frame<ceph::msgr::v2::MessageFrame, (unsigned short)8, (unsigned short)8, (unsigned short)8, (unsigned short)4096>::get_buffer (session_stream_handlers=..., this=<optimized out>) at ./src/msg/async/frames_v2.h:273
#5 ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame> (this=this@entry=0x55b9e4830680, frame=...) at ./src/msg/async/ProtocolV2.cc:552
#6 0x00007f8e4bf249dd in ProtocolV2::write_message (this=this@entry=0x55b9e4830680, m=m@entry=0x55b9e596da40, more=more@entry=false)
    at ./src/msg/async/ProtocolV2.cc:515
#7 0x00007f8e4bf39d55 in ProtocolV2::write_event (this=0x55b9e4830680) at ./src/msg/async/ProtocolV2.cc:627
#8 0x00007f8e4bef89e3 in AsyncConnection::handle_write (this=0x55b9e73ec480) at ./src/msg/async/AsyncConnection.cc:692
#9 0x00007f8e4bf51157 in EventCenter::process_events (this=this@entry=0x55b9e05502c0, timeout_microseconds=<optimized out>,
    timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8e466d5828) at ./src/msg/async/Event.cc:441
#10 0x00007f8e4bf55848 in NetworkStack::<lambda()>::operator() (__closure=0x55b9e05feff8) at ./src/msg/async/Stack.cc:53
#11 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/7/bits/std_function.h:316
#12 0x00007f8e4a9b06df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f8e4ae876db in start_thread (arg=0x7f8e466d8700) at pthread_create.c:463
#14 0x00007f8e4a06da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Tags: sts
Changed in cloud-archive:
assignee: nobody → Mauricio Faria de Oliveira (mfo)
status: New → In Progress
description: updated
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test case: juju bundle for ceph cluster

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test case: reproducer, compressed

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test case: steps to reproduce

$ juju add-model ceph-lp1890334
$ juju deploy ./ceph-lp1890334.yaml

$ watch -c juju status --color # wait all green/active/idle

$ juju scp ceph_test_rados_api_misc ceph-mon/0:~
$ juju ssh ceph-mon/0

With the original package:
---

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.9-0ubuntu0.19.10.1~cloud0

$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ ~/ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace'

Note: Google Test filter = LibRadosMisc.ShutdownRace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosMisc
[ RUN ] LibRadosMisc.ShutdownRace
*** Caught signal (Segmentation fault) **
 in thread 7f1a88953700 thread_name:msgr-worker-1
 ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
 1: (()+0x128a0) [0x7f1bc19708a0]
 2: (bool ProtocolV2::append_frame<ceph::msgr::v2::MessageFrame>(ceph::msgr::v2::MessageFrame&)+0x48a) [0x7f1bb908e19a]
 3: (ProtocolV2::write_message(Message*, bool)+0x4dd) [0x7f1bb90709dd]
 4: (ProtocolV2::write_event()+0x2c5) [0x7f1bb9085d55]
 5: (AsyncConnection::handle_write()+0x43) [0x7f1bb90449e3]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xd57) [0x7f1bb909d157]
 7: (()+0x59b848) [0x7f1bb90a1848]
 8: (()+0xbd6df) [0x7f1bb842e6df]
 9: (()+0x76db) [0x7f1bc19656db]
 10: (clone()+0x3f) [0x7f1bb7e89a3f]
2020-08-06 17:01:07.630 7f1a88953700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f1a88953700 thread_name:msgr-worker-1
 ...
Segmentation fault

real 0m4.786s
user 0m1.546s
sys 0m1.449s

With the patched package:
---

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.9-0ubuntu0.19.10.1~cloud1

$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace
Note: Google Test filter = LibRadosMisc.ShutdownRace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosMisc
[ RUN ] LibRadosMisc.ShutdownRace
[ OK ] LibRadosMisc.ShutdownRace (28182 ms)
[----------] 1 test from LibRadosMisc (28182 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (30586 ms total)
[ PASSED ] 1 test.

real 0m30.627s
user 0m51.452s
sys 0m20.738s

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test case: pass/fail ratio over multiple runs

The reproducer hits the issue most of the time.

To confirm this, here is a 15-test loop.

Note it needs to be able to delete test pools, otherwise the test fails in the exit path.
And the loop deletes test pools after the test-case runs anyway, in case it segfaulted and didn't clean up.

$ for i in $(seq 1 15); do
  sudo ceph tell 'mon.*' injectargs '--mon-allow-pool-delete=true'

  sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
    ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace \
    && pass=$((pass+1)) \
    || fail=$((fail+1))

  for test_pool in $(sudo ceph osd pool ls | grep '^test-rados-api'); do
    sudo ceph osd pool rm $test_pool $test_pool --yes-i-really-really-mean-it
  done
done 2>&1 | tee test.log

With the original package: (pass = 5, fail = 10, total = 15)
---

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.9-0ubuntu0.19.10.1~cloud0

$ grep -c RUN test.log
15
$ grep -c PASSED test.log
5

With the patched package: (pass = 15, fail = 0, total = 15)
---

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.9-0ubuntu0.19.10.1~cloud1

$ grep -c RUN test.log
15
$ grep -c PASSED test.log
15

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test packages built successfully on PPA (ppa:mfo/lp1890334)
with Dependency on ~ubuntu-cloud-archive/train-staging.

$ juju ssh ceph-mon/0

$ sudo add-apt-repository ppa:mfo/lp1890334
$ sudo apt install ceph -y

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.9-0ubuntu0.19.10.1~cloud1

All good with daemons/services:

$ sudo systemctl restart 'ceph*.target'
$ systemctl --failed | grep ceph
$

All good with test case:

$ sudo ceph tell mon.* injectargs '--mon-allow-pool-delete=true'
$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ ./ceph_test_rados_api_misc --gtest_filter=LibRadosMisc.ShutdownRace
Note: Google Test filter = LibRadosMisc.ShutdownRace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosMisc
[ RUN ] LibRadosMisc.ShutdownRace
[ OK ] LibRadosMisc.ShutdownRace (41067 ms)
[----------] 1 test from LibRadosMisc (41067 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (43965 ms total)
[ PASSED ] 1 test.

real 0m44.029s
user 0m57.915s
sys 0m27.113s

tags: added: sts
Changed in ceph (Ubuntu Eoan):
status: New → Won't Fix
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Fix already exists in Groovy and Focal/Ussuri (since 15.2.0).

Changed in ceph (Ubuntu Groovy):
status: New → Fix Committed
Changed in ceph (Ubuntu Focal):
status: New → Fix Committed
status: Fix Committed → Fix Released
Changed in ceph (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Mauricio,

Thank you for this. This issue appears to be fixed in 14.2.10 via [1] so I think we should just pick up the new stable point release. I've opened LP: #1891077 to track the stable point release for 14.2.10.

[1] https://ceph.io/releases/v14-2-10-nautilus-released/
"core: msgr: backport the EventCenter-related fixes (pr#33820, Radoslaw Zarzynski, Jeff Layton, Kefu Chai)"

Thank you,
Corey

Revision history for this message
Dan Hill (hillpd) wrote :

Due a critical regression in 14.2.10 [0], this will be fixed in the SRU for 14.2.11.

[0] https://bugs.launchpad.net/cloud-archive/+bug/1891077/comments/3

Revision history for this message
Corey Bryant (corey.bryant) wrote :

14.2.11 is in train-proposed

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Corey, thanks.

The packages in train-proposed look good -- the test-case always passes.

cheers,
Mauricio

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.11-0ubuntu0.19.10.1~cloud2

$ sudo ceph mon versions
{
    "ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)": 1
}

$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ ~/ceph_test_rados_api_misc-v14.2.11 --gtest_filter=LibRadosMisc.ShutdownRace
Note: Google Test filter = LibRadosMisc.ShutdownRace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosMisc
[ RUN ] LibRadosMisc.ShutdownRace
[ OK ] LibRadosMisc.ShutdownRace (54014 ms)
[----------] 1 test from LibRadosMisc (54014 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (56782 ms total)
[ PASSED ] 1 test.

real 0m56.832s
user 1m13.559s
sys 0m40.346s

...

$ grep -c RUN test.log
15

$ grep -c PASSED test.log
15

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Just for documentation purposes,

The test-case has to be rebuilt with the new source package; attaching it.
The old test-case/v14.2.9-based doesn't run in new packages/v14.2.11-based.

Old test-case:

$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
  ~/ceph_test_rados_api_misc \
  --gtest_filter=LibRadosMisc.ShutdownRace
terminate called after throwing an instance of 'std::system_error'
  what(): Invalid argument
Aborted

real 0m0.361s
user 0m0.019s
sys 0m0.020s

New test-case:

$ time sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
  ~/ceph_test_rados_api_misc-v14.2.11 \
  --gtest_filter=LibRadosMisc.ShutdownRace
Note: Google Test filter = LibRadosMisc.ShutdownRace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LibRadosMisc
[ RUN ] LibRadosMisc.ShutdownRace
[ OK ] LibRadosMisc.ShutdownRace (54014 ms)
[----------] 1 test from LibRadosMisc (54014 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (56782 ms total)
[ PASSED ] 1 test.

real 0m56.832s
user 1m13.559s
sys 0m40.346s

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Steps to build the test case:

$ sudo add-apt-repository -s cloud-archive:train-proposed
$ sudo apt update

$ apt source ceph
$ cd ceph-14.2.11

$ sed 's/-DWITH_TESTS=OFF/-DWITH_TESTS=ON/' -i debian/rules
$ dch -l '+withtestson' 'Enable building tests.'

$ sudo apt build-dep -y ceph
$ DEB_BUILD_OPTIONS=parallel=20 dpkg-buildpackage -us -uc

$ ls -lh obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc
-rwxrwxr-x 1 ubuntu ubuntu 13M Aug 24 15:36 obj-x86_64-linux-gnu/bin/ceph_test_rados_api_misc

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

With the -proposed package: (pass = 15, fail = 0, total = 15)
---

$ dpkg -s librados2 | grep ^Version:
Version: 14.2.11-0ubuntu0.19.10.1~cloud2

$ for i in $(seq 1 15); do
  sudo ceph tell 'mon.*' injectargs '--mon-allow-pool-delete=true'

  sudo LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/ceph/ \
    ~/ceph_test_rados_api_misc-v14.2.11 \
    --gtest_filter=LibRadosMisc.ShutdownRace

  for test_pool in $(sudo ceph osd pool ls | grep '^test-rados-api'); do
    sudo ceph osd pool rm $test_pool $test_pool --yes-i-really-really-mean-it
  done
done 2>&1 | tee test.log

$ grep -c RUN test.log
15

$ grep -c PASSED test.log
15

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Mauricio, thanks for the detailed work on this.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 14.2.11-0ubuntu0.19.10.1~cloud2
---------------

 ceph (14.2.11-0ubuntu0.19.10.1~cloud2) bionic-train; urgency=medium
 .
   * d/p/align-write-zeroes-declarations.patch: Fix failing i386 build
     caused by mismatched function declarations, updated based on
     upstream review comments (LP: #1891661).
 .
 ceph (14.2.11-0ubuntu0.19.10.1~cloud1) bionic-train; urgency=medium
 .
   * New upstream point release (LP: #1891077).
   * d/p/align-write-zeroes-declarations.patch: Fix failing i386 build
     caused by mismatched function declarations (LP: #1891661).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.