NRPE: nag for mismatched versions of connected ceph clients

Bug #1943628 reported by Adam Dyess
This bug affects 2 people
Affects: Ceph Monitor Charm
Status: Confirmed
Importance: Wishlist
Assigned to: Hicham El Gharbi
Milestone: (none)

Bug Description

Please create an NRPE check that inspects the "overall" field of 'ceph versions' and alerts when the reported versions diverge:

WARN -- any versions are 1 release behind the mon
CRIT -- any versions are 2 releases behind the mon

Bonus points if the check can tell you which OSDs (and host) are the laggers
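
For illustration, a minimal sketch of the two thresholds above, assuming the JSON layout printed by 'ceph versions' (per-daemon-type maps plus an "overall" map, keyed by strings such as "ceph version 17.2.5 (...) quincy (stable)"). The function names and the choice to compare against the newest mon are assumptions, not the charm's implementation. It relies on Ceph major version numbers increasing by one per release.

```python
import json
import re

# Nagios exit codes used by this sketch.
OK, WARN, CRIT = 0, 1, 2


def major(version_string):
    """Return the major version number from a 'ceph version X.Y.Z ...' string."""
    match = re.match(r"ceph version (\d+)\.", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1))


def classify(versions_json):
    """Map the output of 'ceph versions' to OK, WARN or CRIT."""
    report = json.loads(versions_json)
    mon_majors = {major(v) for v in report.get("mon", {})}
    if not mon_majors:
        raise ValueError("no mon versions reported")
    # Assumption: when mons disagree, compare against the newest one.
    mon_major = max(mon_majors)
    all_majors = [major(v) for v in report.get("overall", {})] or [mon_major]
    lag = mon_major - min(all_majors)
    if lag >= 2:
        return CRIT
    if lag == 1:
        return WARN
    return OK
```

Identifying which OSDs and hosts lag would need per-daemon data, which the "overall" map does not carry, so it is left out of this sketch.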

Tags: bseng-4 sts
Dan Hill (hillpd) wrote :

I would also suggest:

CRIT -- any non-mon versions are ahead of the mon

tags: added: sts
Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → Wishlist
Alvaro Uria (aluria)
tags: added: bseng-4
Alvaro Uria (aluria) wrote :

This check should be created on a single unit. Usually three or more ceph-mon units are deployed, but the check should run from only one of them. Using "is-leader" should help in this regard. Leadership changes when a unit is unreachable for more than 30s. In that case, once the ex-leader recovers, it is no longer the leader and should remove the NRPE check (the new Juju leader should add it).
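
A rough sketch of that leader-only behaviour, assuming the code runs inside a Juju hook context where the "is-leader" hook tool is available. The helper names and the install/remove callbacks are hypothetical placeholders for whatever the charm uses to manage NRPE checks.

```python
import json
import subprocess


def is_leader(run=subprocess.check_output):
    """Ask Juju whether this unit is currently the leader.

    Only works inside a hook context; 'run' is injectable for testing.
    """
    output = run(["is-leader", "--format=json"])
    return json.loads(output.decode("utf-8"))


def reconcile_check(leader, installed, install, remove):
    """Install the check on the leader and remove it everywhere else.

    'install' and 'remove' are hypothetical callbacks supplied by the caller.
    """
    if leader and not installed:
        install()
    elif not leader and installed:
        remove()
```

Calling reconcile_check from every hook run would let an ex-leader drop the check as soon as it next executes a hook after recovering.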

Changed in charm-ceph-mon:
assignee: nobody → Hicham El Gharbi (hicham-el-gharbi)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/844604

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-mon (master)

Change abandoned by "Hicham El Gharbi <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/847871
Reason: unexpected merge ID

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/844604
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/dfbda68e1add1e8a31ef0e14c043b584532fcd03
Submitter: "Zuul (22348)"
Branch: master

commit dfbda68e1add1e8a31ef0e14c043b584532fcd03
Author: Hicham El Gharbi <email address hidden>
Date: Tue Jul 19 12:18:06 2022 +0200

    Create NRPE check to verify ceph daemons versions

    This NRPE check reports whether the versions of cluster daemons have diverged.

    WARN - any minor version diverged
    WARN - any versions are 1 release behind the mon
    CRIT - any versions are 2 releases behind the mon
    CRIT - any versions are ahead of the mon

    A juju action, 'get-versions-report', is also provided.
    It gives users a quick way to see the daemon versions
    running on cluster hosts.

    Closes-Bug: #1943628
    Change-Id: I41b5c8576dc9cf885fa813a93e6d51e8804eb9d8

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Vern Hart (vern) wrote :

I'm seeing this test fail in a new deployment with the following nagios alert:

    UNKNOWN: could not determine OSDs versions, error: Command '['ceph', 'versions']' returned non-zero exit status 1.

Line 98 of files/nagios/check_ceph_status.py tries to run "ceph versions" but this file is executed by nrpe as the nagios user. This user does not have access to the ceph keys so cannot run ceph commands.

This needs to be relegated to a cron job so it can run as root, the output saved to a text file, and then that text file slurped in by the nrpe check. This is how all the other checks work that rely on the output of ceph commands.
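
A minimal sketch of that split, assuming a root cron job calls collect() and the NRPE check calls read_cached(). The file path, the maximum age and both function names are hypothetical and not taken from the charm.

```python
import json
import os
import subprocess
import tempfile
import time

STATUS_FILE = "/var/lib/nagios/ceph-versions.json"  # hypothetical path
MAX_AGE = 600  # seconds; hypothetical staleness limit


def collect(path=STATUS_FILE, run=subprocess.check_output):
    """Run as root from cron: save the output of 'ceph versions' to a file."""
    output = run(["ceph", "versions"])
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(output)
    os.chmod(tmp, 0o644)  # readable by the nagios user
    os.replace(tmp, path)  # atomic swap, so readers never see a partial file


def read_cached(path=STATUS_FILE, max_age=MAX_AGE, now=None):
    """Run as nagios from NRPE: return (nagios_code, data_or_message)."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as handle:
            data = json.load(handle)
    except (OSError, ValueError) as err:
        return 3, f"UNKNOWN: cannot read {path}: {err}"
    if age > max_age:
        return 3, f"UNKNOWN: {path} is {int(age)}s old"
    return 0, data
```

The staleness test matters because a broken cron job would otherwise leave the check reporting the last good result indefinitely.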

Nobuto Murata (nobuto) wrote :

I think this is from the quincy -> quincy.2 switch. The patch was not included in the quincy branch.

$ git branch -rv --contains dfbda68e1add1e8a31ef0e14c043b584532fcd03
  origin/HEAD -> origin/master
  origin/master 87600a9 Merge "Create a key for ceph-osd for crash module auth"
  origin/stable/quincy.2 eaa57e8 Merge "Make sure lockfile-progs package is installed" into stable/quincy.2

Nobuto Murata (nobuto) wrote :

Subscribing ~field-high to request a revert for the time being.

Nobuto Murata (nobuto) wrote :

Reproduced on my end:

$ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions
unit-nrpe-0:
  UnitId: nrpe/0
  id: "20"
  results:
    Stderr: |
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f467005f540) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.556+0000 7f4677361700 -1 AuthRegistry(0x7f4670064d88) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
      2023-02-01T03:03:09.560+0000 7f4677361700 -1 AuthRegistry(0x7f4677360000) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
      [errno 2] RADOS object not found (error connecting to the cluster)
    check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'',
      ''versions'']'' returned non-zero exit status 1.'
  status: completed
  timing:
    completed: 2023-02-01 03:03:10 +0000 UTC
    enqueued: 2023-02-01 03:03:09 +0000 UTC
    started: 2023-02-01 03:03:09 +0000 UTC

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/872302

Nobuto Murata (nobuto) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/872301
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/c9389a8cd0165aa8718e6f8aac61a21af99c9144
Submitter: "Zuul (22348)"
Branch: master

commit c9389a8cd0165aa8718e6f8aac61a21af99c9144
Author: Nobuto Murata <email address hidden>
Date: Wed Feb 1 03:11:22 2023 +0000

    Revert "Create NRPE check to verify ceph daemons versions"

    This reverts commit dfbda68e1add1e8a31ef0e14c043b584532fcd03.

    Reason for revert:

    The Ceph version check does not appear to account for the user that
    executes the NRPE check. It fails to find the keyrings needed to run
    the command because it is run by a non-root user.

    $ juju run-action --wait nrpe/0 run-nrpe-check name=check-ceph-daemons-versions
    unit-nrpe-0:
      UnitId: nrpe/0
      id: "20"
      results:
        Stderr: |
          2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find
          a keyring on
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
          (2) No such file or directory
          2023-02-01T03:03:09.556+0000 7f4677361700 -1
          AuthRegistry(0x7f467005f540) no keyring found at
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
          disabling cephx
          2023-02-01T03:03:09.556+0000 7f4677361700 -1 auth: unable to find
          a keyring on
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
          (2) No such file or directory
          2023-02-01T03:03:09.556+0000 7f4677361700 -1
          AuthRegistry(0x7f4670064d88) no keyring found at
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
          disabling cephx
          2023-02-01T03:03:09.560+0000 7f4677361700 -1 auth: unable to find
          a keyring on
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
          (2) No such file or directory
          2023-02-01T03:03:09.560+0000 7f4677361700 -1
          AuthRegistry(0x7f4677360000) no keyring found at
          /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
          disabling cephx
          [errno 2] RADOS object not found (error connecting to the cluster)
        check-output: 'UNKNOWN: could not determine OSDs versions, error: Command ''[''ceph'',
          ''versions'']'' returned non-zero exit status 1.'
      status: completed
      timing:
        completed: 2023-02-01 03:03:10 +0000 UTC
        enqueued: 2023-02-01 03:03:09 +0000 UTC
        started: 2023-02-01 03:03:09 +0000 UTC

    Related-Bug: #1943628
    Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/872302
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/ddfa1d4b8ab4e209c85ed2630c0b4fd6d12db7d2
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit ddfa1d4b8ab4e209c85ed2630c0b4fd6d12db7d2
Author: Nobuto Murata <email address hidden>
Date: Wed Feb 1 03:11:22 2023 +0000

    Revert "Create NRPE check to verify ceph daemons versions"

    This reverts commit dfbda68e1add1e8a31ef0e14c043b584532fcd03.

    Related-Bug: #1943628
    Change-Id: I84b306e84661e6664e8a69fa93dfdb02fa4f1e7e

Nobuto Murata (nobuto) wrote :

The change in question has been reverted, so I'm setting the status back to Confirmed and will unsubscribe ~field-high.

Changed in charm-ceph-mon:
status: Fix Committed → Confirmed