"crash" module is always on but not properly configured

Bug #2000630 reported by Nobuto Murata
This bug affects 1 person
Affects             Status        Importance  Assigned to   Milestone
Ceph Monitor Charm  Fix Released  Undecided   Samuel Allan
  Quincy.2          Fix Released  Undecided   Unassigned
Ceph OSD Charm      Fix Released  Undecided   Samuel Allan
  Quincy.2          Fix Released  Undecided   Unassigned
ceph (Ubuntu)       New           Undecided   Unassigned

Bug Description

cloud:focal-yoga (quincy)

$ juju ssh ceph-mon/leader -- sudo ceph version
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)

How to reproduce:

1. Make sure the "crash" module is on (it is one of the "always on" modules).

    https://docs.ceph.com/en/quincy/mgr/crash/

    $ juju ssh ceph-mon/leader -- sudo ceph mgr module ls | grep crash
    crash on (always on)

2. Intentionally crash a ceph-osd process (this example uses SIGSEGV).

    $ juju ssh ceph-osd/leader -- sudo pkill --signal SIGSEGV ceph-osd

3. Make sure both a normal crash file for apport *and* a set of files for the ceph crash module are generated.

    # ll -h /var/crash/
    total 121M
    drwxrwxrwt 2 root root 4.0K Dec 28 10:42 ./
    drwxr-xr-x 13 root root 4.0K Dec 12 21:41 ../
    -rw-r----- 1 ceph ceph 121M Dec 28 10:42 _usr_bin_ceph-osd.64045.crash

    # ll -h /var/lib/ceph/crash/*
    '/var/lib/ceph/crash/2022-12-28T10:42:04.661282Z_51be6c87-4a42-4fbb-afe5-264d94cd6c79':
    total 1.6M
    drwx------ 2 ceph ceph 4.0K Dec 28 10:42 ./
    drwxr-xr-x 4 ceph ceph 4.0K Dec 28 10:42 ../
    -r--r--r-- 1 ceph ceph 0 Dec 28 10:42 done
    -rw-r--r-- 1 ceph ceph 1.6M Dec 28 10:42 log
    -rw------- 1 ceph ceph 926 Dec 28 10:42 meta

    /var/lib/ceph/crash/posted:
    total 8.0K
    drwxr-xr-x 2 root root 4.0K Sep 13 17:47 ./
    drwxr-xr-x 4 ceph ceph 4.0K Dec 28 10:42 ../

4. Check syslog for failures to post the crash to the MON units.

Dec 28 10:51:18 famous-skunk ceph-crash[10667]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-28T10:42:04.661282Z_51be6c87-4a42-4fbb-afe5-264d94cd6c79 as client.crash.famous-skunk failed: (None, b'2022-12-28T10:51:18.368+0000 7f427dc2f700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory\n2022-12-28T10:51:18.368+0000 7f427dc2f700 -1 AuthRegistry(0x7f427805f4f0) no keyring found at /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx\n2022-12-28T10:51:18.376+0000 7f427c9cd700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory\n2022-12-28T10:51:18.376+0000 7f427c9cd700 -1 AuthRegistry(0x7f4278065748) no keyring found at /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx\n2022-12-28T10:51:18.376+0000 7f427c9cd700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory\n2022-12-28T10:51:18.376+0000 7f427c9cd700 -1 AuthRegistry(0x7f427c9cc000) no keyring found at /etc/ceph/ceph.client.crash.famous-skunk.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx\n[errno 2] RADOS object not found (error connecting to the cluster)\n')
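
The log above lists the keyring paths that were searched for client.crash.famous-skunk. As a rough sketch only (not part of ceph-crash), the same lookup can be replayed by hand to confirm that none of those files exist on an OSD host. The function name find_keyring and the CEPH_CONF_DIR override are hypothetical, introduced here for illustration; the candidate paths are copied from the log.

```shell
#!/bin/sh
# Print the first existing keyring candidate for the given client name,
# or return 1 if none of the candidates exists.
# The candidate list mirrors the paths shown in the syslog message.
find_keyring() {
    name="$1"                            # e.g. client.crash.famous-skunk
    dir="${CEPH_CONF_DIR:-/etc/ceph}"    # CEPH_CONF_DIR: hypothetical override for testing
    for candidate in \
        "$dir/ceph.$name.keyring" \
        "$dir/ceph.keyring" \
        "$dir/keyring" \
        "$dir/keyring.bin"
    do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

For example, `find_keyring "client.crash.$(hostname -s)" || echo "no keyring found"` would be expected to print "no keyring found" on a host in the state described in this report.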

Nobuto Murata (nobuto) wrote :

After configuring authentication for the crash module, the OSD nodes successfully posted a recent crash to the MONs. Real outages can follow a few crashes of MONs or OSDs, so this gives operators an early warning and a starting point for diagnosing a recent crash.

https://docs.ceph.com/en/quincy/mgr/crash/

$ juju ssh ceph-mon/leader -- sudo ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash'
[client.crash]
     key = AQCRI6xje9HrHxAAU20bKTeL3k2pIlPNazeVfQ==

$ juju run -a ceph-osd '
cat <<EOF | sudo tee /etc/ceph/ceph.client.crash.keyring
[client.crash]
     key = AQCRI6xje9HrHxAAU20bKTeL3k2pIlPNazeVfQ==
EOF
'

$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.2 crashed on host famous-skunk at 2022-12-28T10:42:04.661282Z
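
Once crashes are posted, the per-daemon lines in `ceph health detail` could be picked out for alerting. A minimal sketch, assuming the output keeps the "crashed on host" wording shown above; the function name recent_crashes is hypothetical.

```shell
#!/bin/sh
# Read `ceph health detail` output on stdin and print only the
# "<daemon> crashed on host <host> at <timestamp>" lines, with the
# leading indentation stripped.
recent_crashes() {
    sed -n 's/^[[:space:]]*\(.* crashed on host .*\)$/\1/p'
}
```

It would be used as `sudo ceph health detail | recent_crashes`; an empty result means no such line was present in the output.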

Changed in charm-ceph-mon:
assignee: nobody → Samuel Walladge (swalladge)
Changed in charm-ceph-mon:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Samuel Allan (samuelallan) wrote :
Changed in charm-ceph-osd:
status: New → In Progress
assignee: nobody → Samuel Walladge (swalladge)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/869138
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/b2408e9dd7130e3edf977c62533e6ea408c589eb
Submitter: "Zuul (22348)"
Branch: master

commit b2408e9dd7130e3edf977c62533e6ea408c589eb
Author: Samuel Walladge <email address hidden>
Date: Wed Jan 4 15:59:34 2023 +1030

    Create a key for ceph-osd for crash module auth

    This will be set on the osd relation,
    so the ceph-osd charm can use this key for auth
    by the crash reporting module.

    ref. https://docs.ceph.com/en/latest/mgr/crash/

    See https://review.opendev.org/c/openstack/charm-ceph-osd/+/869139
    for how this key is used by ceph-osd.

    Closes-Bug: #2000630
    Change-Id: Ic95aae6b5981a6df1e0b3c310bcef8018c494a24

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/869139
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/97be046f9b541067193c50323e00657110184f97
Submitter: "Zuul (22348)"
Branch: master

commit 97be046f9b541067193c50323e00657110184f97
Author: Samuel Walladge <email address hidden>
Date: Wed Jan 4 15:59:04 2023 +1030

    Save the crash module auth key

    Read the key set on the mon relation,
    and use ceph-authtool to save it to a keyring,
    for use by the crash module for crash reporting.

    When this auth key is set, the crash module (enabled by default)
    will update ceph-mon with a report.
    It also results in a neat summary of recent crashes
    that can be viewed by `ceph health detail`.
    For example:

    ```
    $ juju ssh ceph-mon/leader -- sudo ceph health detail

    HEALTH_WARN 1 daemons have recently crashed
    [WRN] RECENT_CRASH: 1 daemons have recently crashed
        osd.1 crashed on host node-3 at 2023-01-04T05:25:18.218628Z
    ```

    ref. https://docs.ceph.com/en/latest/mgr/crash/

    See also https://review.opendev.org/c/openstack/charm-ceph-mon/+/869138
    for where the client_crash_key relation data set is implemented.

    Depends-On: https://review.opendev.org/c/openstack/charm-ceph-mon/+/869138

    Closes-Bug: #2000630
    Change-Id: I77c84c368e6665e4988ebe9a735f000f99d0b78e

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/876456

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/876457

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/876457
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/9df0390d3f17251e3fb664cb90341a86435f764c
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 9df0390d3f17251e3fb664cb90341a86435f764c
Author: Samuel Walladge <email address hidden>
Date: Wed Jan 4 15:59:34 2023 +1030

    Create a key for ceph-osd for crash module auth

    This will be set on the osd relation,
    so the ceph-osd charm can use this key for auth
    by the crash reporting module.

    ref. https://docs.ceph.com/en/latest/mgr/crash/

    See https://review.opendev.org/c/openstack/charm-ceph-osd/+/869139
    for how this key is used by ceph-osd.

    Closes-Bug: #2000630
    Change-Id: Ic95aae6b5981a6df1e0b3c310bcef8018c494a24
    (cherry picked from commit b2408e9dd7130e3edf977c62533e6ea408c589eb)

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/876456
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/ad59f1bda762cbb75ed2cd2be758135eee2ffc36
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit ad59f1bda762cbb75ed2cd2be758135eee2ffc36
Author: Samuel Walladge <email address hidden>
Date: Wed Jan 4 15:59:04 2023 +1030

    Save the crash module auth key

    Read the key set on the mon relation,
    and use ceph-authtool to save it to a keyring,
    for use by the crash module for crash reporting.

    When this auth key is set, the crash module (enabled by default)
    will update ceph-mon with a report.
    It also results in a neat summary of recent crashes
    that can be viewed by `ceph health detail`.
    For example:

    ```
    $ juju ssh ceph-mon/leader -- sudo ceph health detail

    HEALTH_WARN 1 daemons have recently crashed
    [WRN] RECENT_CRASH: 1 daemons have recently crashed
        osd.1 crashed on host node-3 at 2023-01-04T05:25:18.218628Z
    ```

    ref. https://docs.ceph.com/en/latest/mgr/crash/

    See also https://review.opendev.org/c/openstack/charm-ceph-mon/+/869138
    for where the client_crash_key relation data set is implemented.

    Depends-On: https://review.opendev.org/c/openstack/charm-ceph-mon/+/869138

    Closes-Bug: #2000630
    Change-Id: I77c84c368e6665e4988ebe9a735f000f99d0b78e
    (cherry picked from commit 97be046f9b541067193c50323e00657110184f97)

Nobuto Murata (nobuto) wrote :

Opening a packaging task. With the charm patches applied, posting the crash data itself succeeds.

However, the ceph-crash process cannot move the posted crash to the "posted/" directory due to a permission issue.

Oct 24 07:21:39 more-llama ceph-crash[27895]: ERROR:ceph-crash:Error scraping /var/lib/ceph/crash: [Errno 13] Permission denied: '/var/lib/ceph/crash/2023-10-24T07:15:21.937207Z_f08b6b76-fae0-458a-969b-e105dab0b327' -> '/var/lib/ceph/crash/posted/2023-10-24T07:15:21.937207Z_f08b6b76-fae0-458a-969b-e105dab0b327'

# ll /var/lib/ceph/crash/
total 28
drwxr-xr-x 7 ceph ceph 4096 Oct 24 07:15 ./
drwxr-x--- 15 ceph ceph 4096 Oct 24 04:21 ../
drwx------ 2 ceph ceph 4096 Oct 24 07:15 2023-10-24T07:15:21.937207Z_f08b6b76-fae0-458a-969b-e105dab0b327/
drwx------ 2 ceph ceph 4096 Oct 24 07:15 2023-10-24T07:15:21.937914Z_f9eccf77-39fa-440f-8b26-c410edece34a/
drwx------ 2 ceph ceph 4096 Oct 24 07:15 2023-10-24T07:15:51.704821Z_05bd68eb-4da6-4e6c-a7fa-3399c6a9d1cc/
drwx------ 2 ceph ceph 4096 Oct 24 07:15 2023-10-24T07:15:51.705072Z_47e7d9a9-8f61-402e-b62f-d2abbea11735/
drwxr-xr-x 2 root root 4096 May 26 14:42 posted/

# dpkg -S /var/lib/ceph/crash/posted/
ceph-base: /var/lib/ceph/crash/posted

/var/lib/ceph/crash/posted/ should be owned by ceph:ceph instead of root:root as the process is running as the ceph user.
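
To confirm the ownership problem on a given host, a small check along these lines may help. This is a sketch under assumptions: the function name check_owner is hypothetical, and it relies on GNU stat as shipped on Ubuntu.

```shell
#!/bin/sh
# Return 0 if the path is owned by the expected user:group.
# Otherwise print the actual owner and return 1 (or 2 if stat fails).
check_owner() {
    path="$1"
    expected="$2"                        # e.g. ceph:ceph
    actual=$(stat -c '%U:%G' "$path") || return 2
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    printf '%s is owned by %s, expected %s\n' "$path" "$actual" "$expected"
    return 1
}
```

On an affected host, `check_owner /var/lib/ceph/crash/posted ceph:ceph` would be expected to report root:root. Running `sudo chown ceph:ceph /var/lib/ceph/crash/posted` afterwards is only a local workaround, not the packaging fix requested here.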

# apt policy ceph-base
ceph-base:
  Installed: 17.2.6-0ubuntu0.22.04.1
  Candidate: 17.2.6-0ubuntu0.22.04.1
  Version table:
 *** 17.2.6-0ubuntu0.22.04.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     17.2.5-0ubuntu0.22.04.3 500
        500 http://archive.ubuntu.com/ubuntu jammy-security/main amd64 Packages
     17.1.0-0ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released
Nobuto Murata (nobuto) wrote :

In the postinst script, chown is run only against /var/lib/ceph/crash itself, not recursively, so the packaged posted/ subdirectory keeps its root:root ownership.

+ dpkg-statoverride --list /var/lib/ceph/crash
+ [ -d /run/systemd/system ]
+ [ crash = mon ]
+ chown ceph:ceph /var/lib/ceph/crash
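
Because the chown is not recursive, anything below the crash directory can keep a different owner. One way to list such entries is sketched below; the function name list_not_owned_by is hypothetical, and the owner may be given as a user name or a numeric uid.

```shell
#!/bin/sh
# Print every entry at or below the given directory that is not owned
# by the given user (name or numeric uid). Prints nothing if all match.
list_not_owned_by() {
    dir="$1"
    owner="$2"
    find "$dir" ! -user "$owner" -print
}
```

On a host in the state shown in this report, `sudo sh -c '. ./sketch.sh; list_not_owned_by /var/lib/ceph/crash ceph'` (with sketch.sh standing in for wherever the function is saved) would be expected to include /var/lib/ceph/crash/posted.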
