Grafana dashboard missing k8s metrics

Bug #1866672 reported by David Coronel
This bug affects 4 people
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: George Kraft
Milestone: 1.18+ck1

Bug Description

After following the instructions from https://ubuntu.com/kubernetes/docs/monitoring#monitoring-with-prometheus-grafana-and-telegraf and adding the Juju relations, the Charmed Kubernetes dashboard appears in Grafana, but the Kubernetes metrics are missing (i.e. Cluster memory usage, Cluster CPU usage, and Cluster filesystem usage all show N/A).

Looking at the journalctl logs on prometheus/0 shows the following errors:

root@prometheus-1:~# journalctl | grep -i prometheus | tail -20

Mar 09 17:36:39 prometheus-1 prometheus.prometheus[13048]: level=error ts=2020-03-09T17:36:39.644551854Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:334: Failed to list *v1.Node: nodes is forbidden: User \"system:monitoring\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"

Mar 09 17:36:39 prometheus-1 prometheus.prometheus[13048]: level=error ts=2020-03-09T17:36:39.647647282Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Pod: pods is forbidden: User \"system:monitoring\" cannot list resource \"pods\" in API group \"\" at the cluster scope"

Mar 09 17:36:39 prometheus-1 prometheus.prometheus[13048]: level=error ts=2020-03-09T17:36:39.648514145Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:262: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:monitoring\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"

Mar 09 17:36:39 prometheus-1 prometheus.prometheus[13048]: level=error ts=2020-03-09T17:36:39.650738053Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Service: services is forbidden: User \"system:monitoring\" cannot list resource \"services\" in API group \"\" at the cluster scope"

This Charmed Kubernetes environment has custom Keystone and Keystone-LDAP charms for Kerberos integration. The ~/.kube/config file is modified with a custom auth command that uses openstack token issue to get a token.

Additional details:

grafana charm revision 38
kubernetes-master charm revision 808
prometheus charm revision 14

Tags: cpe-onsite
David Coronel (davecore) wrote:

subscribed ~field-high

George Kraft (cynerva) wrote:

> This Charmed Kubernetes environment has custom Keystone and Keystone-LDAP charms for Kerberos integration.

Can you tell us more about this environment? Specifically, what are the values of the authorization-mode and enable-keystone-authorization charm configs on kubernetes-master?
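
Both can be read straight off the model with the standard Juju CLI:

=====
$ juju config kubernetes-master authorization-mode
$ juju config kubernetes-master enable-keystone-authorization
=====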

David Coronel (davecore) wrote:

Thanks George.

authorization-mode is "RBAC,Node" and enable-keystone-authorization is "true".

This is a Charmed Kubernetes 1.17 environment with custom keystone and keystone-ldap charm forks to which we added Kerberos support.

Our RC file looks like this:

=====
#!/usr/bin/env bash
export OS_AUTH_URL=http://<keystone hostname>:5000/krb/v3
export OS_PROJECT_ID=<project_id>
export OS_PROJECT_NAME="k8s"
export OS_PROJECT_DOMAIN_ID="<domain_id>"
export OS_REGION_NAME="RegionOne"
export OS_INTERFACE=public
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_TYPE=v3kerberos
=====

We can test our Kerberos environment by doing:

$ kinit <username>
$ openstack token issue

To authenticate to Kubernetes through the kubectl command, we need a script that retrieves the Keystone token and returns it inside an ExecCredential JSON structure, as follows:

=====
#!/bin/bash
# Fetch a Keystone token and wrap it in the ExecCredential JSON
# structure that kubectl's exec credential plugin expects.
TOKEN=$(openstack token issue -f value -c id)
JSON_STRING=$(jq -n \
                 --arg bn "$TOKEN" \
                 '{"apiVersion": "client.authentication.k8s.io/v1beta1", "kind": "ExecCredential", "status": {"token": $bn}}')

echo "$JSON_STRING"
=====

We save the script as "auth", make it executable, and add it to our $PATH. We then retrieve the kubeconfig file from the kubernetes-master unit and change the user entry to use that auth script:

=====
$ juju scp kubernetes-master/0:config ~/.kube/config
=====

=====
- name: keystone-user
  user:
    exec:
      command: "auth"
      apiVersion: "client.authentication.k8s.io/v1beta1"
=====

David Coronel (davecore) wrote:

Prometheus shows that the kube-state-metrics and kube-state-telemetry targets are down/unhealthy:

=====
kube-state-metrics-4468db51-9ca3-4561-b8f7-8e8c74c8c4d0 (0/1 up)

Endpoint:
https://10.109.12.12:6443/api/v1/namespaces/kube-system/services/kube-state-metrics:8080/proxy/metrics

State: DOWN

Labels:
instance="10.109.12.12:6443"
job="kube-state-metrics-4468db51-9ca3-4561-b8f7-8e8c74c8c4d0"

Last Scrape: 182ms ago

Scrape Duration: 11.85ms

Error: server returned HTTP status 403 Forbidden
=====

=====
kube-state-telemetry-9440e9d5-4394-4e9d-8cd1-8a234f5411a6 (0/1 up)

Endpoint:
https://10.109.12.12:6443/api/v1/namespaces/kube-system/services/kube-state-metrics:8081/proxy/metrics

State: DOWN

Labels:
instance="10.109.12.12:6443"
job="kube-state-telemetry-9440e9d5-4394-4e9d-8cd1-8a234f5411a6"

Last Scrape: 28.147s ago

Scrape Duration: 13.31ms

Error: server returned HTTP status 403 Forbidden
=====

This looks like a cluster role binding issue, but I don't know yet whether it's caused by this unique keystone/ldap/kerberos setup or not.
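
A quick way to confirm it's RBAC rather than the Keystone path is to impersonate the scrape user with plain kubectl (admin credentials required; the user name is the one from the Prometheus errors):

=====
$ kubectl auth can-i list nodes --as=system:monitoring
$ kubectl auth can-i list pods --as=system:monitoring
=====

Given the 403s above, both presumably answer "no".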

David Coronel (davecore) wrote:

I see that system:monitoring comes from /root/cdk/known_tokens.csv on the kubernetes-master units.

But there's no cluster role binding that gives this user any permissions.
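
For context, known_tokens.csv is Kubernetes' standard static token auth file; each line has the form token,user,uid[,"group1,group2"], so a (made-up) entry like the one below only establishes the identity, not any permissions:

=====
<token>,system:monitoring,<uid>
=====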

As a workaround (I know it's overkill and I'm sure there's a better role), I added the cluster-admin role to the system:monitoring user, and now this works. I'm seeing another issue with the Prometheus targets, but that looks like a separate problem.
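
Something along these lines grants it (the binding name is arbitrary; anything unused works):

=====
$ kubectl create clusterrolebinding monitoring-cluster-admin \
    --clusterrole=cluster-admin \
    --user=system:monitoring
=====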

David Coronel (davecore) wrote:

The issue I see now (after adding the cluster-admin role to the system:monitoring user as a workaround) is that Prometheus tries to scrape the k8s metrics from the k8s masters on port 443, but the masters are not listening on port 443. Should it be 6443?
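
A quick way to check from the Juju side (standard tooling, nothing custom; assumes the master unit is kubernetes-master/0):

=====
$ juju run --unit kubernetes-master/0 -- sudo ss -tlnp | grep -E ':(443|6443)'
=====

which, per the above, shows kube-apiserver on 6443 and nothing on 443.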

David Coronel (davecore) wrote:

I also notice that the kube-state-metrics and kube-state-telemetry targets point to port 6443, but kubernetes-cadvisor and kubernetes-nodes point to 443.

David Coronel (davecore) wrote:

Could the differences between the cadvisor and state-telemetry templates explain this behavior?

https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/master/templates/prometheus/kubernetes-cadvisor.yaml.j2

https://github.com/charmed-kubernetes/charm-kubernetes-master/blob/master/templates/prometheus/kube-state-telemetry.yaml.j2

Maybe the relabel_configs part in kubernetes-cadvisor.yaml.j2 is causing the 6443 to be replaced by the standard port 443?
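
For illustration, the failure mode I suspect, sketched (this is not the exact charm template): when a relabel rule rewrites __address__ to a bare host with no port, Prometheus ends up scraping the scheme's default port, which is 443 for https:

=====
- job_name: kubernetes-cadvisor
  scheme: https
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - target_label: __address__
      replacement: <k8s vip>   # no ":6443", so the scrape falls back to 443
=====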

David Coronel (davecore) wrote:

Changing the replacement field from "<k8s vip>" to "<k8s vip>:6443" in /var/snap/prometheus/26/prometheus.yml and restarting the Prometheus snap with systemctl restart snap.prometheus.prometheus.service fixes the targets in Prometheus, and my graphs in the k8s dashboard in Grafana now show data.
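
For anyone else hitting this, the edit amounts to something like the following (the sed pattern is illustrative and assumes the bare VIP appears only as a replacement value; note the charm may regenerate this file later):

=====
$ sudo sed -i 's|replacement: <k8s vip>$|replacement: <k8s vip>:6443|' /var/snap/prometheus/26/prometheus.yml
$ sudo systemctl restart snap.prometheus.prometheus.service
=====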

I'm still not sure if this is a configuration issue on my side or a real bug.

Xav Paice (xavpaice) wrote:

I have reproduced this at another site with a similar configuration.

Changed in charm-kubernetes-master:
status: New → Confirmed
Camille Rodriguez (camille.rodriguez) wrote:

I hit the same issue with a fresh deploy, no ldap integration.

tags: added: cpe-onsite
Camille Rodriguez (camille.rodriguez) wrote:

I tried the workaround suggested by David (#9) and it did not work for me.

Changed in charm-kubernetes-master:
importance: Undecided → High
assignee: nobody → George Kraft (cynerva)
Camille Rodriguez (camille.rodriguez) wrote:

I was able to apply the workaround. As David mentioned, this behavior is caused by two issues:
1) the lack of permissions, and
2) the missing port in the snap configuration file.
In my context, the VIP was on the kubeapi-load-balancer, so the port to add was :443.

Here are some logs to help with troubleshooting:

● snap.prometheus.prometheus.service - Service for snap application prometheus.prometheus
   Loaded: loaded (/etc/systemd/system/snap.prometheus.prometheus.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-04-21 11:44:54 EDT; 27min ago
 Main PID: 7639 (prometheus.wrap)
    Tasks: 12 (limit: 4915)
   CGroup: /system.slice/snap.prometheus.prometheus.service
           ├─7639 /bin/sh /snap/prometheus/26/bin/prometheus.wrapper
           └─7682 /snap/prometheus/26/bin/prometheus --web.listen-address :9090 --config.file=/var/snap/prometheus/26/prometheus.yml --storage.tsdb.path=/var/snap/prometheus/common

Apr 21 12:12:15 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:15.904161751Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:16 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:16.482846654Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:16 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:16.483401075Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:16 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:16.485616729Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:16 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:16.488234098Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:16 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:16.906305123Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kub
Apr 21 12:12:17 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:17.48514382Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kube
Apr 21 12:12:17 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:17.48603904Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kube
Apr 21 12:12:17 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:17.48715521Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kube
Apr 21 12:12:17 juju-157c1c-18 prometheus.prometheus[7639]: level=error ts=2020-04-21T16:12:17.489873098Z caller=klog.go:94 component=k8s_client_runtime func=Er...

George Kraft (cynerva) wrote:

I'm able to reproduce this and am working on a fix.

Changed in charm-kubernetes-master:
status: Confirmed → In Progress
George Kraft (cynerva) wrote:
tags: added: review-needed
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: removed: review-needed
Changed in charm-kubernetes-master:
milestone: none → 1.18+ck1
George Kraft (cynerva) wrote:
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released