PrometheusJobMissing alert - no such job as job="ceph"

Bug #2044062 reported by Nobuto Murata
Affects: Ceph Monitor Charm
Status: Fix Committed
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

ceph-mon quincy/stable 194

The converted alert rules try to find a job as:

> absent(up{job="ceph",juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="57cd5e92-34f9-4176-813f-7ad333194608"})

However, in the charm world there is no such job as job="ceph", so it triggers the following alert out of the box.

====
PrometheusJobMissing (1 active)

name: PrometheusJobMissing
expr: absent(up{job="ceph",juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="57cd5e92-34f9-4176-813f-7ad333194608"})
for: 30s
labels:
juju_application: ceph-mon
juju_charm: ceph-mon
juju_model: ceph
juju_model_uuid: 57cd5e92-34f9-4176-813f-7ad333194608
oid: 1.3.6.1.4.1.50495.1.2.1.12.1
severity: critical
type: ceph_default
annotations:
description: The prometheus job that scrapes from Ceph is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance.
summary: The scrape job for Ceph is missing from Prometheus
====

The expected job name is "juju_ceph_57cd5e92_ceph-mon_prometheus_scrape-{0,1,2}" in this case.
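
For illustration, the scrape job generated through the prometheus_scrape relation ends up in prometheus.yml roughly like the sketch below. This is only a sketch: the job name format and topology labels are taken from this report, while the target address, port, and exact label-injection mechanism are assumptions.

====
scrape_configs:
  - job_name: juju_ceph_57cd5e92_ceph-mon_prometheus_scrape-0
    static_configs:
      - targets: ['<ceph-mon-unit-address>:9283']  # assumed ceph-mgr prometheus module endpoint
        labels:
          juju_model: ceph
          juju_model_uuid: 57cd5e92-34f9-4176-813f-7ad333194608
          juju_application: ceph-mon
====

So a matcher on job="ceph" can never match, while matchers on the juju_* topology labels always will.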

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record, other real alerts are mostly fine, because those alerts do not use "job=" in the query.

Revision history for this message
Nobuto Murata (nobuto) wrote :

cos-tool's documentation has an example of converting a query, but it doesn't say anything about the job name.

https://github.com/canonical/cos-tool/blob/main/README.md

> job:request_latency_seconds:mean5m{job="myjob"} > 0.5

> job:request_latency_seconds:mean5m{job="myjob",juju_application="proxy",juju_model="lma",juju_model_uuid="12345",juju_unit="proxy/1"} > 0.5

I'm not sure how the job name is supposed to be written in the COS world, e.g. with a special variable, or whether a job name shouldn't be used at all.
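
One way to see what the generated job labels actually look like (a sketch, assuming access to the Prometheus UI or HTTP API of the COS deployment) is to group the up series by job:

> count by (job) (up{juju_application="ceph-mon"})

This should return one series per generated scrape job, e.g. job="juju_ceph_57cd5e92_ceph-mon_prometheus_scrape-0".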

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Triaged.
The bug seems clear; as a workaround, one could upload a fixed rule file resource.

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

I'm sorry, this cannot be Low. Basically,

$ juju deploy ceph-mon && juju relate ceph-mon cos-prometheus

will fire this false alarm straightaway and it says:

"this will effectively mean you'll have no metrics or alerts for the cluster"

The response cannot be "you can upload a file to override what's defined in the charm". Please figure out a path forward (with the COS team as necessary) and update the alert definition in/for the charm.
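
For reference, a minimal sketch of one possible fix: drop the static job matcher entirely and let the COS conversion scope the rule by the juju topology labels alone (labels/annotations trimmed for brevity):

====
alert: PrometheusJobMissing
expr: absent(up{})
for: 30s
labels:
  severity: critical
  type: ceph_default
annotations:
  summary: The scrape job for Ceph is missing from Prometheus
====

After conversion this would become absent(up{juju_application="ceph-mon",...}), i.e. "no Ceph target is being scraped at all" rather than "no job literally named ceph exists".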

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Triaged → In Progress
Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record, with:

> absent(up{})

converted to:

> absent(up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="a23fcb4b-992d-40bd-820b-6ac0f69db5e2"})

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record, with:

> up{} == 0

converted to:

> up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="a23fcb4b-992d-40bd-820b-6ac0f69db5e2"} == 0

Healthy status.
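
To double-check the same thing from the command line (a sketch, assuming the Prometheus API is reachable on its default port 9090):

====
curl -s 'http://<prometheus-address>:9090/api/v1/query' \
  --data-urlencode 'query=up{juju_application="ceph-mon"}'
====

A value of 1 for each ceph-mon target means the scrape is healthy; 0 would trigger the converted up{} == 0 rule.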

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the error status when the prometheus module is down.

Revision history for this message
Ethan Myers (ethanmyers) wrote :

Is there an ETA for when the fix will be available? This is currently affecting a project that will be handed over to managed services.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/901826
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/fb3262183102171da5704868d7522290b3a9ede4
Submitter: "Zuul (22348)"
Branch: master

commit fb3262183102171da5704868d7522290b3a9ede4
Author: Nobuto Murata <email address hidden>
Date: Sun Mar 17 22:55:39 2024 +0900

    Don't expect a static job name

    A job name passed via the prometheus_scrape library doesn't end up as a
    static job name in the prometheus configuration file in the COS world
    even though COS expects a fixed string. Practically we cannot have a
    static job name like job=ceph in any of the alert rules in COS since the
    charms will convert the string "ceph" into:

    > juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N

    Let's give up the possibility of the static job name and use "up{}" so
    it will be annotated with the model name/ID, etc. without any specific
    job related condition. It will break the alert rules when one unit have
    more than one scraping endpoint because there will be no way to
    distinguish multiple scraping jobs. Ceph MON only has one prometheus
    endpoint for the time being so this change shouldn't cause an immediate
    issue. Overall, it's not ideal but at least better than the current
    status, which is an alert error out of the box.

    The following alert rule:
    > up{} == 0
    will be converted and annotated as:
    > up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0

    Closes-Bug: #2044062

    Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
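
As a side note, a rule file edited this way can be sanity-checked before it is attached anywhere (a sketch; the file name is hypothetical):

====
promtool check rules ceph-mon-alert-rules.yaml
====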

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913814

Revision history for this message
Nobuto Murata (nobuto) wrote :

The actual alert rules come from the charm resource, so another manual step will be necessary to make the fix available.
https://charmhub.io/ceph-mon/resources/alert-rules?channel=latest/edge
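
For an already deployed ceph-mon, that manual step would look roughly like the sketch below (the resource name alert-rules is from the Charmhub page above; the channel and file path are assumptions):

====
# pick up the resource revision published for the channel:
juju refresh ceph-mon --channel quincy/stable

# or attach a locally downloaded rules file explicitly:
juju attach-resource ceph-mon alert-rules=./ceph-mon-alert-rules.yaml
====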

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

For reference, here is the current revision / resource mapping:

Track   Base                    Channel    Version  Revision  Resources
latest  ubuntu 20.04 (amd64)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 20.04 (arm64)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 20.04 (ppc64el)  stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 20.04 (s390x)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 22.04 (amd64)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 22.04 (arm64)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 22.04 (ppc64el)  stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 22.04 (s390x)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       202      202       alert-rules (r3)
        ubuntu 22.10 (amd64)    stable     -        -         -
                                candidate  -        -         -
                                beta       -        -         -
                                edge       179      179       alert-rules (r1)
        ubuntu 22.10 (arm64)    stable     -        -         -
                                ...

Revision history for this message
Nobuto Murata (nobuto) wrote :

Thanks. Looks like the quincy channel already refers to the latest revision of the resource.

Track   Base                    Channel    Version  Revision  Resources
quincy  ubuntu 20.04 (amd64)    stable     201      201       alert-rules (r3)
                                candidate  ↑        ↑         ↑
                                beta       ↑        ↑         ↑
                                edge       ↑        ↑         ↑
...
        ubuntu 22.04 (amd64)    stable     201      201       alert-rules (r3)
                                candidate  ↑        ↑         ↑
                                beta       ↑        ↑         ↑
                                edge       ↑        ↑         ↑
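
To confirm which revision of the rules an existing deployment is actually using (a sketch, assuming a Juju client with access to the model):

====
juju resources ceph-mon
====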

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/913814
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/61defed938f49f338bc5df9a9a8794dfae2021e6
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 61defed938f49f338bc5df9a9a8794dfae2021e6
Author: Nobuto Murata <email address hidden>
Date: Sun Mar 17 22:55:39 2024 +0900

    Don't expect a static job name

    A job name passed via the prometheus_scrape library doesn't end up as a
    static job name in the prometheus configuration file in the COS world
    even though COS expects a fixed string. Practically we cannot have a
    static job name like job=ceph in any of the alert rules in COS since the
    charms will convert the string "ceph" into:

    > juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N

    Let's give up the possibility of the static job name and use "up{}" so
    it will be annotated with the model name/ID, etc. without any specific
    job related condition. It will break the alert rules when one unit have
    more than one scraping endpoint because there will be no way to
    distinguish multiple scraping jobs. Ceph MON only has one prometheus
    endpoint for the time being so this change shouldn't cause an immediate
    issue. Overall, it's not ideal but at least better than the current
    status, which is an alert error out of the box.

    The following alert rule:
    > up{} == 0
    will be converted and annotated as:
    > up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0

    Closes-Bug: #2044062

    Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
    (cherry picked from commit fb3262183102171da5704868d7522290b3a9ede4)
