Losing the k8s-master leader causes duplicate prometheus manual-jobs

Bug #1899706 reported by Chris Johnston
This bug affects 2 people
Affects                         Status        Importance  Assigned to  Milestone
Charm Helpers                   Fix Released  Undecided   Unassigned
Kubernetes Control Plane Charm  Triaged       Medium      Unassigned
Prometheus2 charm               Triaged       Wishlist    Unassigned

Bug Description

If the kubernetes-master leader unit is lost, the new leader provides an additional set of relation data to prometheus without the original leader's data being removed. This causes duplicate manual-jobs in prometheus, which in turn causes queries to produce duplicate results. The root cause appears to be how prometheus prevents job names from clashing by including the request_id in the job name [1]: the new leader uses a new request_id, so its jobs are not recognised as duplicates of the original ones.
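
To make the failure mode concrete, here is a toy illustration (not the actual interface code; the helper below is hypothetical) of why appending a fresh per-request UUID to the job name defeats de-duplication when the leader changes:

    import uuid

    def job_name(base, request_id):
        # Hypothetical stand-in for the behaviour described above: the
        # request ID is appended to keep job names unique per request.
        return "{}-{}".format(base, request_id)

    # The original leader registers its jobs under one request ID...
    old_job = job_name("kubernetes-nodes", uuid.uuid4())
    # ...and the newly elected leader registers the same job under a new one.
    new_job = job_name("kubernetes-nodes", uuid.uuid4())

    # The names differ, so prometheus ends up with two scrape configs
    # for what is logically the same job.
    assert old_job != new_job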

App                Version  Status  Scale  Charm              Store  Rev  OS      Notes
kubernetes-master  1.17.12  active  3      kubernetes-master  local  0    ubuntu
kubernetes-worker  1.17.12  active  4      kubernetes-worker  local  0    ubuntu  exposed
prometheus                  active  1      prometheus2        local  0    ubuntu

1) Deploy kubernetes-master in HA with 3 units and add the relation with prometheus as documented. I used hacluster; I haven't tested other methods.

2) Restart prometheus to get around bug #1891942 [2]:
juju ssh prometheus/0 "sudo systemctl restart snap.prometheus.prometheus.service"

3) grep 'job_name' from the prometheus config:
juju ssh prometheus/0 "sudo grep 'job_name' /var/snap/prometheus/current/prometheus.yml"

Note that there is one instance of each of:
"job_name": "kube-state-metrics-UUID"
"job_name": "k8s-api-endpoints-UUID"
"job_name": "kubernetes-nodes-UUID"
"job_name": "kube-state-telemetry-UUID"
"job_name": "kubernetes-cadvisor-UUID"

Record the UUID for each.

4) Run this command and notice that only the kubernetes-master leader unit provides data:
for rid in $(juju run -u prometheus/0 'relation-ids manual-jobs'); do echo $rid; for runit in $(juju run -u prometheus/0 "relation-list -r $rid"); do echo $rid - $runit; juju run -u prometheus/0 "relation-get -r $rid - $runit"; done; done

5) Stop the jujud kubernetes-master service on the leader:
juju ssh kubernetes-master/1 "sudo systemctl stop jujud-unit-kubernetes-master-1"

6) Allow things to settle and a new leader to be elected

7) Restart prometheus to get around bug #1891942 [2]:
juju ssh prometheus/0 "sudo systemctl restart snap.prometheus.prometheus.service"

8) grep 'job_name' from the prometheus config:
juju ssh prometheus/0 "sudo grep 'job_name' /var/snap/prometheus/current/prometheus.yml"

Note that there are now two instances of each of:
"job_name": "kube-state-metrics-UUID"
"job_name": "k8s-api-endpoints-UUID"
"job_name": "kubernetes-nodes-UUID"
"job_name": "kube-state-telemetry-UUID"
"job_name": "kubernetes-cadvisor-UUID"

9) Run this command and notice that there are now two kubernetes-master units which provide data:
for rid in $(juju run -u prometheus/0 'relation-ids manual-jobs'); do echo $rid; for runit in $(juju run -u prometheus/0 "relation-list -r $rid"); do echo $rid - $runit; juju run -u prometheus/0 "relation-get -r $rid - $runit"; done; done

This causes a number of prometheus queries to return duplicate (2x) results, since each target is now scraped twice.

[1] https://github.com/juju-solutions/interface-prometheus-manual/blob/master/common.py#L22
[2] https://bugs.launchpad.net/charm-prometheus2/+bug/1891942

Tags: sts
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the detailed report and reproduction steps!

I think this will need to be fixed in the prometheus charm. The code that injects the UUID [1] is part of the to_json function that's invoked by prometheus [2]. This is supposed to "ensure uniqueness", i.e. prevent multiple requests from clashing. A side effect is that when different units request the same job, the prometheus charm generates multiple copies of that job.

The request ID that's used is part of the request-response pattern from charms.reactive. It is stored in relation data, and relation data is unit-scoped. Updating that pattern to allow re-use of a request ID across units would be doable but challenging. I strongly believe that, instead, this is a case where the request ID is being misused by the receiving side. Perhaps the relation ID should be used instead, which seems to be unique per application?

I am adding prometheus to this issue, but also leaving it open against kubernetes-master, since I would like another engineer from our team to look at this.

[1]: https://github.com/juju-solutions/interface-prometheus-manual/blob/3f775242c16d53243c993d7ba0c896169ad1639e/common.py#L36
[2]: https://git.launchpad.net/charm-prometheus2/tree/src/reactive/prometheus.py#n748

Changed in charm-kubernetes-master:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

When I looked at the code from a charm-prometheus2 support perspective, my take was that the kubernetes-master charm sets up the prometheus manual jobs on leader-elected [1], but the demoted leader does not stop advertising its jobs.

Because charm-prometheus2 is architected to monitor each unit of a related application, the request UUID is tagged onto the job name to ensure that each unit announcing itself is monitored individually. Imagine 100 units of telegraf, each with an additional manual-job metric for a per-unit counter.

In the case of the kubernetes-master charm providing details for monitoring a single pod (per manual-job) behind the 3 HA k8s-master units, only one monitoring endpoint is needed (announced through the kubeapi-load-balancer, IIRC); but with multiple units advertising themselves to prometheus, the data is duplicated.

I believe the solution is to add a function that removes the prometheus manual scrape job details from the relation @when_not('leadership.is_leader'): basically, the opposite of this function [1], run when leadership is lost (a rough sketch follows the reference below):

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/master/reactive/kubernetes_master.py#L3128
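
A rough sketch of what such a handler might look like in the reactive kubernetes-master charm (the flag and method names here are assumptions, not the actual charm or interface code):

    from charms.reactive import when, when_not, endpoint_from_flag

    @when_not('leadership.is_leader')
    @when('endpoint.prometheus.joined')  # assumed flag name for the manual-jobs relation
    def deregister_prometheus_jobs():
        """Counterpart to register_prometheus_jobs(): when this unit is no
        longer the leader, withdraw the manual scrape jobs it advertised."""
        prometheus = endpoint_from_flag('endpoint.prometheus.joined')
        # clear_requests() is a hypothetical helper; interface-prometheus-manual
        # would need to grow something like it to drop this unit's requests
        # from the relation data.
        prometheus.clear_requests()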

Changed in charm-prometheus2:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Drew Freiberger (afreiberger) wrote :

I've marked this bug as wishlist against charm-prometheus2, as we could implement a scope variable in the relation that would allow unit- or app-based monitors to be defined. This likely won't make it into the charm until the rewrite into the operator framework, which is not yet on the roadmap.

Revision history for this message
Chris Johnston (cjohnston) wrote :

Removing and re-adding the kubernetes-master/prometheus relation cleared the issue for now.

Seyeong Kim (seyeongkim)
tags: added: sts
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

How about this approach? Please give me any advice.

Put a timestamp in register_prometheus_jobs in the kubernetes-master charm:

https://pastebin.ubuntu.com/p/btZxVBHxn8/

and in the relation provider:

https://pastebin.ubuntu.com/p/hD7Jh5x3qf/

Then read that timestamp in the prometheus charm as below, skipping data with an older timestamp and keeping the newer one.

That way we take the manual_jobs from the most recent leader.

https://pastebin.ubuntu.com/p/Rthx7yWsSz/

I don't think it is possible to detect a trigger when the leader goes down (maybe I'm wrong).
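
For illustration, the prometheus-side selection could look roughly like this (a sketch of the idea in the pastebins above, not the actual patch; field names are assumptions):

    def pick_newest_jobs(unit_payloads):
        """Given one manual-jobs payload per remote unit, keep only the one
        with the newest 'timestamp' (epoch seconds, set by the hypothetical
        register_prometheus_jobs change above). Payloads without a timestamp
        are kept as-is, for compatibility with older charms."""
        stamped = [p for p in unit_payloads if 'timestamp' in p]
        unstamped = [p for p in unit_payloads if 'timestamp' not in p]
        if not stamped:
            return unstamped
        newest = max(stamped, key=lambda p: float(p['timestamp']))
        return unstamped + [newest]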

Revision history for this message
Cory Johns (johnsca) wrote :

A better solution for this would be to use application-level relation data, but that requires a newer version of Juju, and support for it isn't currently available in charmhelpers or charms.reactive.
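
For reference, once charm support exists, the mechanism could be driven from the hook tool directly; a minimal sketch, assuming Juju 2.7+ and leader-only writes to the application data bag (the names here are illustrative):

    import json
    import subprocess

    def set_app_relation_data(relation_id, data):
        # 'relation-set --app' writes application-scoped relation data, which
        # only the leader may set, so a newly elected leader overwrites the
        # previous leader's jobs instead of adding a second copy.
        args = ['relation-set', '--app', '-r', relation_id]
        args += ['{}={}'.format(k, v) for k, v in data.items()]
        subprocess.check_call(args)

    # e.g. set_app_relation_data('manual-jobs:4',
    #          {'jobs': json.dumps([{'job_name': 'kubernetes-nodes'}])})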

Revision history for this message
Cory Johns (johnsca) wrote :

Having the non-leader K8s unit clear its request from the relation data could still end up with duplicate data in the case of a stuck unit, but it would work in other cases without changes to the interface or prometheus2 charm.

Adding a timestamp would lead to complications when one side or the other is an older version of the charm which doesn't know about the timestamp, but could probably be handled gracefully enough to at least end up with the current behavior.

However, it looks like the interface could be changed to use the relation ID as the request ID instead of relying on the default UUID creation, as George suggested. That way, the existing de-duplication should work as-is and should even handle stuck non-leaders since the rest of the request body should be identical.

Revision history for this message
Cory Johns (johnsca) wrote :

To add to my last comment, in theory changing the request ID to be the relation ID should be as simple as adding a `request_id=relation.relation_id` param to [1] and I *think* it might even handle upgrades fine, too, since it selects the request to update based on the `job_name` field rather than the request ID. However, I'm probably missing something in how that would actually be handled on one side or the other if the request ID for an existing request is changed, so that would definitely need some testing.

[1]: https://github.com/juju-solutions/interface-prometheus-manual/blob/master/provides.py#L37-L41
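
A toy illustration of why that de-duplicates (not the interface code; the dict below just stands in for the prometheus charm's job registry):

    def register(store, relation_id, base_name, job):
        # With the relation ID (shared by all units on the relation) as the
        # request ID, every unit derives the same job name.
        store['{}-{}'.format(base_name, relation_id)] = job

    store = {}
    register(store, 'manual-jobs:4', 'kubernetes-nodes', {'leader': 'kubernetes-master/1'})
    # A newly elected leader re-registers the same job...
    register(store, 'manual-jobs:4', 'kubernetes-nodes', {'leader': 'kubernetes-master/2'})
    # ...and there is still exactly one scrape config for it.
    assert len(store) == 1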

Revision history for this message
Seyeong Kim (seyeongkim) wrote :
Revision history for this message
John A Meinel (jameinel) wrote :

Note that pull/3 has been superseded by pull/6
https://github.com/juju-solutions/interface-prometheus-manual/pull/6

Changed in charm-helpers:
status: New → Fix Released