certificates-relation-changed fails during apply_system_monitoring_rbac_role

Bug #1922578 reported by Joshua Genet
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This is Kubernetes 1.21 on baremetal. We've had 2 successful runs on baremetal, so if this is a real issue, it's not a consistent one.

Test run here:
https://solutions.qa.canonical.com/testruns/testRun/ba6e543a-9fb8-4f88-9422-20a75b81626e

Artifacts/Logs/Bundles here:
https://oil-jenkins.canonical.com/artifacts/ba6e543a-9fb8-4f88-9422-20a75b81626e/index.html

Juju k8s model crashdump here:
https://oil-jenkins.canonical.com/artifacts/ba6e543a-9fb8-4f88-9422-20a75b81626e/generated/generated/kubernetes/juju-crashdump-kubernetes-2021-04-03-12.56.58.tar.gz

juju status for kubernetes-master/2 reports: "Stopped services: kube-apiserver"

---

Juju unit logs show this:

2021-04-03 08:55:04 INFO juju-log certificates:24: Executing ['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/system-monitoring-rbac-role.yaml']
2021-04-03 08:55:04 WARNING certificates-relation-changed error: stat /root/.kube/config: no such file or directory
2021-04-03 08:55:04 DEBUG jujuc server.go:211 running hook tool "juju-log" for kubernetes-master/2-certificates-relation-changed-4328544975840366337
2021-04-03 08:55:04 INFO juju-log certificates:24: Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-master-2/charm/reactive/kubernetes_master.py", line 1873, in apply_system_monitoring_rbac_role
    kubectl("apply", "-f", path)
  File "lib/charms/layer/kubernetes_common.py", line 241, in kubectl
    return check_output(command)
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['kubectl', '--kubeconfig=/root/.kube/config', 'apply', '-f', '/root/cdk/system-monitoring-rbac-role.yaml']' returned non-zero exit status 1.

Cory Johns (johnsca) wrote:

That function is automatically retried and succeeds later in the log, so that error is a red herring. (Perhaps we could improve the charm code so that it doesn't bother attempting that step until after the kubeconfig has been written?) The actual issue is that the API service seems to have been running fine for about 10 minutes and is then killed with SIGTERM, seemingly out of nowhere (see below). I have not seen that before and don't know what might have caused it.
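
For illustration, a guard along those lines might look roughly like the sketch below. The handler name, the kubectl() helper, and the file paths come from the traceback above; the kubeconfig existence check, early return, and hookenv.log call are hypothetical additions, not the charm's actual code.

# Hypothetical variant of the handler in reactive/kubernetes_master.py -- a
# sketch only, not the charm's real implementation.
import os

from charmhelpers.core import hookenv
from charms.layer.kubernetes_common import kubectl  # wraps subprocess.check_output

KUBECONFIG = "/root/.kube/config"
RBAC_ROLE_YAML = "/root/cdk/system-monitoring-rbac-role.yaml"

def apply_system_monitoring_rbac_role():
    # Don't attempt the kubectl call until the kubeconfig has actually been
    # written; otherwise kubectl exits 1 with "stat /root/.kube/config: no
    # such file or directory" and the hook logs a spurious traceback before
    # the automatic retry eventually succeeds.
    if not os.path.exists(KUBECONFIG):
        hookenv.log("kubeconfig not written yet; deferring RBAC role apply")
        return
    kubectl("apply", "-f", RBAC_ROLE_YAML)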

---

Apr 03 09:05:39 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:39.844792 115586 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dextension-apiserver-authentication&resourceVersion=965&timeout=8m43s&timeoutSeconds=523&watch=true" latency="10.000475255s" userAgent="kube-controller-manager/v1.21.0 (linux/amd64) kubernetes/6b526c5" srcIP="192.168.33.168:49464" resp=0
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.849764 115586 client.go:360] parsed scheme: "passthrough"
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.849839 115586 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://192.168.33.172:2379 <nil> 0 <nil>}] <nil> <nil>}
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.849858 115586 clientconn.go:948] ClientConn switching balancer to "pick_first"
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.849999 115586 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc011e8f090, {CONNECTING <nil>}
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.866723 115586 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc011e8f090, {READY <nil>}
Apr 03 09:05:46 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:46.868046 115586 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.278603 115586 client.go:360] parsed scheme: "passthrough"
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.278681 115586 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://192.168.33.173:2379 <nil> 0 <nil>}] <nil> <nil>}
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.278700 115586 clientconn.go:948] ClientConn switching balancer to "pick_first"
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.278826 115586 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc011c924e0, {CONNECTING <nil>}
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.296269 115586 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc011c924e0, {READY <nil>}
Apr 03 09:05:51 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:51.297634 115586 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closi...


Cory Johns (johnsca) wrote:

More readable logs in pastebin: https://pastebin.ubuntu.com/p/TpyG5jHsTX/

Joshua Genet (genet022) wrote:

We got 2 *more* successful runs using this SKU on Baremetal. So it very well could just be a network hiccup or something along those lines.

We'll keep an eye out and update here if we see it happen again.

George Kraft (cynerva) wrote:

In /var/log/juju/unit-kubernetes-master-2.log we see that the charm tries to restart kube-apiserver:

2021-04-03 09:05:30 INFO juju-log certificates:24: Invoking reactive handler: reactive/kubernetes_master.py:2192:configure_apiserver
2021-04-03 09:05:34 INFO juju-log certificates:24: status-set: maintenance: Restarting snap.kube-apiserver.daemon service

In /var/log/syslog we see the stop attempt:

Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 systemd[1]: Stopping Cluster Controlled snap.kube-apiserver.daemon...
Apr 3 09:05:35 juju-8dd1b7-4-lxd-1 pacemaker-controld[82061]: notice: Initiating stop operation res_kube_apiserver_snap.kube_apiserver.daemon_stop_0 locally on juju-8dd1b7-4-lxd-1

kube-apiserver starts shutting things down; it has clearly received sigterm:

Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:34.434449 115586 dynamic_cafile_content.go:182] Shutting down client-ca-bundle::/root/cdk/ca.crt
Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:34.434560 115586 dynamic_serving_content.go:145] Shutting down aggregator-proxy-cert::/root/cdk/client.crt::/root/cdk/client.key
Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:34.434576 115586 controller.go:123] Shutting down OpenAPI controller

Weirdly, it stops listening but then logs an error trying to talk to itself:

Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: I0403 09:05:34.434762 115586 secure_serving.go:241] Stopped listening on [::]:6443
Apr 3 09:05:34 juju-8dd1b7-4-lxd-1 kube-apiserver.daemon[115586]: E0403 09:05:34.457316 115586 controller.go:184] Get "https://[::1]:6443/api/v1/namespaces/default/endpoints/kubernetes": dial tcp [::1]:6443: connect: connection refused

After 20 seconds, pacemaker gives up on waiting:

Apr 3 09:05:55 juju-8dd1b7-4-lxd-1 pacemaker-execd[82058]: notice: Giving up on res_kube_apiserver_snap.kube_apiserver.daemon stop (rc=196): timeout (elapsed=19980ms, remaining=20ms)
Apr 3 09:05:55 juju-8dd1b7-4-lxd-1 pacemaker-controld[82061]: error: Result of stop operation for res_kube_apiserver_snap.kube_apiserver.daemon on juju-8dd1b7-4-lxd-1: Timed Out

After 10 more seconds, systemd falls back to sigkill:

Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: State 'stop-sigterm' timed out. Killing.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115586 (kube-apiserver) with signal SIGKILL.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115611 (kube-apiserver) with signal SIGKILL.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115612 (n/a) with signal SIGKILL.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115620 (kube-apiserver) with signal SIGKILL.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115625 (n/a) with signal SIGKILL.
Apr 3 09:06:04 juju-8dd1b7-4-lxd-1 systemd[1]: snap.kube-apiserver.daemon.service: Killing process 115627 (n/a) with signal SI...


George Kraft (cynerva) wrote:

To summarize the above, two things went wrong here that could use further investigation:
1) kube-apiserver took over 30 seconds to shut down after receiving sigterm. Why?
2) Following the sigterm timeout, pacemaker never restarted kube-apiserver. Why?
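
For anyone digging into question 2, a rough way to compare pacemaker's and systemd's views of the service on the affected unit is sketched below. crm_mon and systemctl are the standard tools; the wrapper script itself is hypothetical and only illustrates the cross-check.

#!/usr/bin/env python3
# Hypothetical diagnostic sketch (not part of the charm): compare what systemd
# and pacemaker each think the kube-apiserver service/resource is doing.
import subprocess

SERVICE = "snap.kube-apiserver.daemon"   # service name from the syslog above
# Corresponding pacemaker resource seen in the logs:
#   res_kube_apiserver_snap.kube_apiserver.daemon

def run(cmd):
    """Run a command and return its stdout, ignoring non-zero exit codes."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# systemd's view: is the unit actually running?
print("systemd says:", run(["systemctl", "is-active", SERVICE]) or "unknown")

# pacemaker's view: one-shot status, including inactive resources (-r) and
# fail counts (-f), which should show whether it ever retried the stop/start.
print(run(["crm_mon", "-1", "-r", "-f"]))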

George Kraft (cynerva) wrote:

It looks like the sigterm timeout is actually pretty common. I can reproduce it on a standard cs:charmed-kubernetes deployment (no hacluster, running Kubernetes 1.20.5) by running `sudo service snap.kube-apiserver.daemon stop`.
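
A minimal sketch for quantifying that reproduction is below; the service name comes from the logs above, and the stop/start timing wrapper is purely illustrative.

#!/usr/bin/env python3
# Illustrative sketch: time how long stopping the kube-apiserver snap service
# takes, to see whether it blows past the stop-sigterm window seen above.
import subprocess
import time

SERVICE = "snap.kube-apiserver.daemon"

start = time.monotonic()
# Equivalent to `sudo service snap.kube-apiserver.daemon stop`.
subprocess.run(["sudo", "systemctl", "stop", SERVICE], check=True)
elapsed = time.monotonic() - start

# In the failed run above, the stop dragged on for roughly 30 s before systemd
# escalated to SIGKILL; anything in that range suggests the shutdown is hanging.
print(f"{SERVICE} stopped in {elapsed:.1f}s")

subprocess.run(["sudo", "systemctl", "start", SERVICE], check=True)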

In the originally reported crashdump, I see that kubernetes-master/0 also hit the stop-sigterm timeout on kube-apiserver. But it recovered just fine because kube-apiserver was restarted shortly after.

So the weird part here is that pacemaker never restarted kube-apiserver on kubernetes-master/2, even after hours of it being stopped.

George Kraft (cynerva) wrote:

Weirdly, I'm not able to reproduce the sigterm timeout when I remove kubeapi-load-balancer from the deployment. That makes the original problem quite difficult to reproduce, since the pacemaker part of the issue requires running hacluster instead of kubeapi-load-balancer.

I'm going to mark this as incomplete for now since we don't have enough info to act on this report. We'll need either clear reproduction steps, or more detailed log output from pacemaker during a failure event, to figure out what's really going on here.
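
If it helps whoever hits this next, something along these lines could capture the pacemaker and kube-apiserver journals around a failure window (an illustrative sketch; the timestamps match the failed stop in this report and would need adjusting).

#!/usr/bin/env python3
# Illustrative sketch (run as root on the affected unit): dump the pacemaker
# and kube-apiserver journals for the window around a failed stop.
import subprocess

UNITS = ["pacemaker", "snap.kube-apiserver.daemon"]
SINCE = "2021-04-03 09:05:00"   # adjust to the failure window being inspected
UNTIL = "2021-04-03 09:10:00"

for unit in UNITS:
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", SINCE, "--until", UNTIL,
         "--no-pager", "-o", "short-iso"],
        capture_output=True, text=True,
    )
    out_path = f"{unit}.journal.log"
    with open(out_path, "w") as f:
        f.write(result.stdout)
    print(f"wrote {out_path}")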

Changed in charm-kubernetes-master:
status: New → Incomplete