[COS] Ceph dashboard metrics aggregation time does not match default prometheus scrape interval

Bug #2056301 reported by Gaetan Gouzi
Affects: Ceph Dashboard Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

*Description*

The COS Ceph dashboards use Prometheus queries that apply a rate function (irate) over a 1m range, which is exactly the default Prometheus scrape interval. irate needs a range vector with at least 2 datapoints to compute a result.

*More information*

The PromQL queries used in the Ceph Grafana dashboards compute their rates over a [1m] range.

Example:
```
"expr": "avg(irate(ceph_osd_op_r_latency_sum{juju_application=~\"$juju_application\",juju_model=~\"$juju_model\",juju_model_uuid=~\"$juju_model_uuid\",juju_unit=~\"$juju_unit\"}[1m]) / on(ceph_daemon) irate(ceph_osd_op_r_latency_count{juju_application=~\"$juju_application\",juju_model=~\"$juju_model\",juju_model_uuid=~\"$juju_model_uuid\",juju_unit=~\"$juju_unit\"}[1m]) * 1000)"
```

Such a query fails because, by default, Prometheus is configured to scrape metrics every 1m, so a 1m range normally contains only 1 datapoint.
irate cannot compute a rate from a vector with a single datapoint (it needs at least 2).

The query irate(ceph_osd_op_r_latency_sum[1m]) fails on Prometheus, but the query irate(ceph_osd_op_r_latency_sum[2m]) succeeds.
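
To illustrate why the [1m] query has nothing to work with while the [2m] query does, here is a minimal Python sketch. It is not Prometheus code: `sample_times_in_window` and `irate` are hypothetical helpers that model scrapes at a fixed interval and a range that covers (eval_time - window, eval_time]. Counter resets and scrape jitter are deliberately ignored.

```python
def sample_times_in_window(eval_time, window, scrape_interval, first_scrape=0):
    """Return the scrape timestamps (seconds) inside (eval_time - window, eval_time].

    Assumes perfectly regular scrapes starting at first_scrape.
    """
    start = eval_time - window
    times = []
    t = first_scrape
    while t <= eval_time:
        if t > start:
            times.append(t)
        t += scrape_interval
    return times


def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples.

    Returns None when fewer than two samples are available, which is the
    situation described in this report. Counter resets are not handled.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


if __name__ == "__main__":
    # Scrape every 60s, evaluate the query 10s after the latest scrape.
    for window in (60, 120):
        times = sample_times_in_window(eval_time=610, window=window, scrape_interval=60)
        # Made-up counter that grows by 6 per scrape, for illustration only.
        samples = [(t, 6.0 * (t // 60)) for t in times]
        print(f"[{window}s] samples={len(samples)} irate={irate(samples)}")
```

With a 60s scrape interval, the 60s window holds one sample and the sketch returns None, while the 120s window holds two samples and yields a rate.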

Ideally, the range used in the queries should be at least 2 times the scraping interval (equivalently, the scraping interval should be at most half of the range), so that there are always at least 2 datapoints to compute from.
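
Going by the observation that [1m] fails and [2m] succeeds with a 1m scrape interval, a rule of thumb is that the query range should be at least twice the scrape interval. The sketch below checks that rule for a given pair of settings. `parse_duration` and `window_fits_two_samples` are hypothetical helpers, and the parser is a simplification that only accepts single-unit durations such as 30s, 1m, or 2h.

```python
import re

# Seconds per unit for simple, single-unit duration strings.
_UNIT_SECONDS = {
    "ms": 0.001,
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 604800,
    "y": 31536000,
}


def parse_duration(text):
    """Convert a duration such as '30s' or '2m' to seconds.

    Simplified: compound durations like '1h30m' are rejected.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d|w|y)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def window_fits_two_samples(window, scrape_interval):
    """True when the range is at least twice the scrape interval."""
    return parse_duration(window) >= 2 * parse_duration(scrape_interval)


if __name__ == "__main__":
    for window, interval in [("1m", "1m"), ("2m", "1m"), ("1m", "30s")]:
        ok = window_fits_two_samples(window, interval)
        print(f"range={window} scrape_interval={interval} ok={ok}")
```

This covers both directions mentioned in this report: widening the range to 2m with the default 1m scrape interval, or keeping the 1m range and lowering the scrape interval to 30s.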

Workaround:
- Change the scraping interval to something lower than 30s using https://github.com/canonical/prometheus-scrape-config-k8s-operator
- Manually adjust the Grafana dashboards to increase the aggregation time of each PromQL query in each panel
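
The second workaround can be scripted rather than done panel by panel. The following is a rough sketch, assuming the dashboard is available as exported JSON and that queries live in string fields named "expr", as in the example above. `widen_range_selectors` is a hypothetical helper, not part of Grafana or the charm.

```python
import json


def widen_range_selectors(node, old="[1m]", new="[2m]"):
    """Return a copy of a dashboard structure with range selectors replaced.

    Only string values stored under an "expr" key are rewritten, so titles
    and descriptions that happen to contain "[1m]" are left alone.
    """
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                result[key] = value.replace(old, new)
            else:
                result[key] = widen_range_selectors(value, old, new)
        return result
    if isinstance(node, list):
        return [widen_range_selectors(item, old, new) for item in node]
    return node


if __name__ == "__main__":
    # Tiny stand-in for an exported dashboard; real dashboards are much larger.
    dashboard = json.loads(
        '{"panels": [{"title": "Read latency", "targets": '
        '[{"expr": "irate(ceph_osd_op_r_latency_sum[1m])"}]}]}'
    )
    print(json.dumps(widen_range_selectors(dashboard), indent=2))
```

The replacement value is only an example. Depending on the Grafana version in use, a variable such as `$__rate_interval` might be a better fit than a hard-coded range, but that has not been tested against these dashboards.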

Proposal:
- Change the default aggregation time in the Grafana dashboards from 1m to at least 2m, so that it covers two scrapes at the default 1m scraping interval.

ceph-dashboard: quincy/stable
cos-proxy: latest/stable (rev 58)
prometheus: 2.48.0
juju: 3.1.7-genericlinux-amd64

Tags: cos
Gaetan Gouzi (ggouzi) wrote :

Adding a screenshot of the RGW dashboard with a 2m aggregation time.

tags: added: cos
Peter Sabaini (peter-sabaini) wrote :

Thanks Gaetan for the report.

Changed in charm-ceph-dashboard:
status: New → Triaged
importance: Undecided → Medium