Exporter returns zero values for osdmap (osds_up, etc) using latest/edge channel against bionic-ussuri/octopus

Bug #1899030 reported by Vladimir Grevtsev
This bug affects 2 people
Affects: Prometheus Ceph Exporter Charm
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

$ sudo snap list
Name Version Rev Tracking Publisher Notes
core 16-2.46.1 9993 latest/stable canonical✓ core
core18 20200724 1885 latest/stable canonical✓ base
prometheus-ceph-exporter 3.0.0-nautilus 21 latest/edge xavpaice -

# Exporter returns wrong metrics

$ curl -s 127.0.0.1:9128/metrics | grep ceph_osds
# HELP ceph_osds Count of total OSDs in the cluster
# TYPE ceph_osds gauge
ceph_osds{cluster="ceph"} 0
# HELP ceph_osds_down Count of OSDs that are in DOWN state
# TYPE ceph_osds_down gauge
ceph_osds_down{cluster="ceph"} 0
# HELP ceph_osds_in Count of OSDs that are in IN state and available to serve requests
# TYPE ceph_osds_in gauge
ceph_osds_in{cluster="ceph"} 0
# HELP ceph_osds_up Count of OSDs that are in UP state
# TYPE ceph_osds_up gauge
ceph_osds_up{cluster="ceph"} 0

# Real cluster state:

$ sudo ceph -s
  cluster:
    id: d3cf4c68-07e0-11eb-9f24-00163ea7a656
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum juju-e75ae8-5-lxd-0,juju-e75ae8-4-lxd-0,juju-e75ae8-3-lxd-0 (age 2h)
    mgr: juju-e75ae8-3-lxd-0(active, since 2h), standbys: juju-e75ae8-4-lxd-0, juju-e75ae8-5-lxd-0
    osd: 25 osds: 25 up (since 2h), 25 in (since 46h)
    rgw: 3 daemons active (juju-e75ae8-3-lxd-1, juju-e75ae8-4-lxd-1, juju-e75ae8-5-lxd-1)

  task status:

  data:
    pools: 19 pools, 961 pgs
    objects: 1.05k objects, 2.6 GiB
    usage: 35 GiB used, 54 TiB / 54 TiB avail
    pgs: 961 active+clean

  io:
    client: 17 KiB/s rd, 17 op/s rd, 0 op/s wr

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

The above discrepancy leads to the wrong data rendering in the Grafana dashboard as well (see screenshot attached).

Revision history for this message
Drew Freiberger (afreiberger) wrote :

While I work to reproduce this, which version(s) of ceph did you run into this issue on?

Changed in charm-prometheus-ceph-exporter:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Drew Freiberger (afreiberger) wrote :

ceph source cloud:bionic-ussuri

Changed in charm-prometheus-ceph-exporter:
status: Incomplete → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

It appears that the gauges changed from ceph_osds_in to ceph_osd_in, so we may have to update some of the dashboards to account for this change.

# HELP ceph_osd_in OSD In Status
# TYPE ceph_osd_in gauge
ceph_osd_in{cluster="ceph",device_class="hdd",host="juju-13f892-test-1",osd="osd.2",rack="",root="default"} 1
ceph_osd_in{cluster="ceph",device_class="hdd",host="juju-13f892-test-2",osd="osd.0",rack="",root="default"} 1
ceph_osd_in{cluster="ceph",device_class="hdd",host="juju-13f892-test-3",osd="osd.1",rack="",root="default"} 1

ceph_osds_in{cluster="ceph"} 0

On the dashboard Ceph Cluster, the ceph_osds_in query needs to change to:

sum(ceph_osd_in{job="$job"})

The remnant ceph_osds_in in the metrics endpoint is odd; I'm thinking it's due to a cache in prometheus-ceph-exporter that survives a restart of the service. I'm attempting to reproduce now by deploying a new p-c-e unit that starts on the edge snap v3.0.0-nautilus.

Here's the upstream code for osd_in and osds_down, but note the lack of osds_in or osds_up.
https://github.com/digitalocean/ceph_exporter/blob/3.0.0-nautilus/collectors/osd.go

The dashboard is already updated to run sum(ceph_osd_up{job="$job"}); I think the osds_in and osds_out queries just need the same treatment in the dashboard.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Okay, the up, in, down, and total OSD counts are in health.go and should still exist.

https://github.com/digitalocean/ceph_exporter/blob/339d20fc237bf35a62631eba366b5b347209d57f/collectors/health.go#L701

I do see that ceph status does change format from:
     osdmap e36520: 18 osds: 18 up, 18 in
to
    osd: 3 osds: 3 up (since 75m), 3 in (since 75m)

which may not be accounted for in the code. The 3.0.0 code supports nautilus, but bionic-ussuri is ceph octopus.

Going to test nautilus, luminous, and jewel ceph against the p-c-e edge snap now to see if this is an ussuri/octopus-only issue.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Okay, digging further through the code, p-c-e is running:
https://github.com/digitalocean/ceph_exporter/blob/339d20fc237bf35a62631eba366b5b347209d57f/collectors/health.go#L1480

Command is ultimately:
ceph status --format json

unmarshalling the output here:
https://github.com/digitalocean/ceph_exporter/blob/339d20fc237bf35a62631eba366b5b347209d57f/collectors/health.go#L983-L990

using this struct:
https://github.com/digitalocean/ceph_exporter/blob/339d20fc237bf35a62631eba366b5b347209d57f/collectors/health.go#L913

As you can see, this expects a struct that includes:
osdmap:{osdmap:{*data here*}}

This is the format of that data for jewel, luminous, mimic, and nautilus, but octopus changes the ceph status output to report only:

osdmap:{*data here*}

and does away with the second layer of osdmap, which means cluster health data is not collected properly by this revision of prometheus-ceph-exporter on octopus.
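
As a minimal sketch (not the upstream code; field names and types are simplified for illustration), unmarshalling octopus-style JSON into the nested struct leaves the inner counters at their zero values, which is exactly what the exporter then publishes:

package main

import (
    "encoding/json"
    "fmt"
)

// Nautilus-and-earlier shape the 3.0.0 exporter expects: an osdmap object
// nested inside osdmap. Field names/types here are simplified for the sketch.
type cephStatus struct {
    OSDMap struct {
        OSDMap struct {
            NumOSDs   float64 `json:"num_osds"`
            NumUpOSDs float64 `json:"num_up_osds"`
            NumInOSDs float64 `json:"num_in_osds"`
        } `json:"osdmap"`
    } `json:"osdmap"`
}

func main() {
    // Octopus `ceph status --format json` reports the counters directly under
    // "osdmap" (values made up for illustration).
    octopus := []byte(`{"osdmap": {"num_osds": 25, "num_up_osds": 25, "num_in_osds": 25}}`)

    var s cephStatus
    if err := json.Unmarshal(octopus, &s); err != nil {
        panic(err)
    }
    // The inner struct is never populated, so the exporter ends up with zeros.
    fmt.Println("osds_up:", s.OSDMap.OSDMap.NumUpOSDs) // prints "osds_up: 0"
}

Pre-octopus output, which still carries the extra osdmap layer, fills the inner struct and the counters come through correctly.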

This divergence likely means we will have to keep the prometheus-ceph-exporter version closely matched to the version of ceph it monitors, and no released revision of prometheus-ceph-exporter works with octopus yet.

While we can certainly work around the issue by updating dashboards to use the per-osd metrics, it would be nice to have the overall cluster health metrics reported correctly in prometheus, as the data is currently misleading for ceph octopus.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

If we want to implement ceph-specific snap channels, we can query the ceph version from the p-c-e unit with:

cd /var/snap/prometheus-ceph-exporter/current
ceph -c ceph.conf --id prometheus-ceph-exporter version

This will return a string like:
ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)

Item 5 returned is the codename of the version on all supported revs.
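
As a sketch of that parsing (a hypothetical helper for illustration, not existing charm code), splitting the output on whitespace and taking the fifth field gives the codename:

package main

import (
    "fmt"
    "strings"
)

// cephCodename extracts the release codename from `ceph version` output.
// Hypothetical helper, not part of the charm.
func cephCodename(versionOutput string) string {
    fields := strings.Fields(versionOutput)
    if len(fields) < 5 {
        return ""
    }
    return fields[4] // item 5, e.g. "mimic", "nautilus", "octopus"
}

func main() {
    out := "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)"
    fmt.Println(cephCodename(out)) // prints "mimic"
}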

summary: Exporter returns malformed or invalid data using latest/edge channel against bionic-ussuri/octopus
summary: - Exporter returns malformed or invalid data using latest/edge channel against bionic-ussuri/octopus
+ Exporter returns zero values for osdmap (osds_up, etc) using latest/edge channel against bionic-ussuri/octopus
Changed in charm-prometheus-ceph-exporter:
status: Triaged → Confirmed
Revision history for this message
Drew Freiberger (afreiberger) wrote :

Subscribing field-medium as there are issues with p-c-e and ussuri clouds affecting live customers.

James Troup (elmo)
Changed in charm-prometheus-ceph-exporter:
status: Confirmed → In Progress
Revision history for this message
Kellen Renshaw (krenshaw) wrote :

Upstream now has support for Nautilus/Octopus/Pacific. https://github.com/digitalocean/ceph_exporter/issues/182 has been resolved.

The snap needs to be updated; I'm unsure whether the tracks idea is still needed.

Revision history for this message
Kellen Renshaw (krenshaw) wrote :

I was informed that the prometheus-ceph-exporter charm has been deprecated in favor of the ceph-dashboard charm that uses the built-in Ceph Dashboard.

https://opendev.org/openstack/charm-ceph-dashboard

Revision history for this message
Eric Chen (eric-chen) wrote :

This charm is no longer being actively maintained.

Changed in charm-prometheus-ceph-exporter:
status: In Progress → Won't Fix