kubelet loses contact with kubernetes API, spams "use of closed network connection" errors

Bug #1884292 reported by Joshua Genet
This bug affects 4 people
Affects: Kubernetes Worker Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We're unsure whether this is directly related to test_dns_provider itself. It seems to be an issue with kube-apiserver getting into a funky state, but we're not sure how to narrow it down from there.

var/log/syslog:Jun 19 04:17:07 ip-172-31-42-92 kube-apiserver.daemon[26472]: logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"no endpoints available for service \\\"kube-state-metrics\\\"\",\"reason\":\"ServiceUnavailable\",\"code\":503}\n"

---

The metrics server seems to have some issues. These may just be due to kube-apiserver being unable to reach it?

pod-logs/kube-system-metrics-server-v0.3.6-74c87686d-v4s4j-metrics-server-nanny.log:ERROR: logging before flag.Parse: I0619 04:55:53.602595 1 nanny_lib.go:108] Resources are not within the expected limits, updating the deployment. Actual: {Limits:map[] Requests:map[]} Expected: {Limits:map[cpu:{i:{value:0 scale:0} d:{Dec:0xc420407ce0} s: Format:DecimalSI} memory:{i:{value:0 scale:0} d:{Dec:0xc420407e90} s: Format:BinarySI}] Requests:map[cpu:{i:{value:0 scale:0} d:{Dec:0xc420407ce0} s: Format:DecimalSI} memory:{i:{value:0 scale:0} d:{Dec:0xc420407e90} s: Format:BinarySI}]}

---

We're also seeing nginx errors like this on the kubeapi-load-balancer unit.

var/log/nginx.error.log:2020/06/19 04:40:40 [error] 20694#20694: *1351 no live upstreams while connecting to upstream, client: 3.87.109.158, server: _, request: "GET /api/v1/endpoints?limit=500&resourceVersion=0 HTTP/1.1", upstream: "https://target_service/api/v1/endpoints?limit=500&resourceVersion=0", host: "3.81.51.24:443"
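
One way to narrow that down (a sketch, assuming a default Charmed Kubernetes layout where nginx on the load balancer proxies port 443 to kube-apiserver on 6443; the unit name and backend address are placeholders):

    # From the kubeapi-load-balancer unit, probe each kube-apiserver backend
    # from the nginx upstream configuration directly.
    juju ssh kubeapi-load-balancer/0
    curl -k https://<kube-apiserver-ip>:6443/healthz
    # A connection error or timeout here matches the "no live upstreams" symptom;
    # any HTTP response (even 401/403) means the backend itself is reachable.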

---

Here's a link to the artifacts from this run:
https://oil-jenkins.canonical.com/artifacts/aae83527-ae89-44e6-a57a-93a0974b1263/index.html

George Kraft (cynerva) wrote:

kube-state-metrics errors are a red herring.

test_dns_provider timed out after 15 minutes. The test waits for coredns pods to be removed, but it looks like a coredns pod is stuck terminating. The node that hosts the pod is in Unknown status.

Kubelet on that node is failing to communicate with kube-apiserver due to "use of closed network connection" errors:

E0619 04:57:09.410568 14620 server.go:269] Authorization error (user=system:kube-apiserver, verb=get, resource=nodes, subresource=metrics)%!(EXTRA *url.Error=Post https://172.31.39.158:443/apis/authorization.k8s.io/v1/subjectaccessreviews: write tcp 172.31.39.53:57246->172.31.39.158:443: use of closed network connection)
E0619 04:57:11.237701 14620 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "ip-172-31-39-53.ec2.internal": an error on the server ("") has prevented the request from succeeding (get nodes ip-172-31-39-53.ec2.internal)
E0619 04:57:11.237742 14620 kubelet_node_status.go:389] Unable to update node status: update node status exceeds retry count
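
For anyone hitting this, a quick way to confirm the symptom (a sketch; it assumes CoreDNS runs in kube-system and that kubelet runs as the snap.kubelet.daemon service, as on a default Charmed Kubernetes worker; <N> is a placeholder for the affected unit):

    # Spot the affected node and the stuck pod.
    kubectl get nodes                                        # affected node shows NotReady/Unknown
    kubectl get pods -n kube-system -o wide | grep coredns   # stuck pod sits in Terminating on that node
    # Check kubelet logs on that worker for the tell-tale error.
    juju ssh kubernetes-worker/<N> -- sudo journalctl -u snap.kubelet.daemon | grep "use of closed network connection"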

This is a known issue in Kubernetes/Golang:

https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/golang/go/issues/39750

summary: - kube-apiserver 503 during test_dns_provider
         + kubelet loses contact with kube-apiserver with "use of closed network connection" errors
summary: - kubelet loses contact with kube-apiserver with "use of closed network connection" errors
         + kubelet loses contact with kubernetes API, spams "use of closed network connection" errors
no longer affects: charmed-kubernetes-testing
Changed in charm-kubernetes-worker:
importance: Undecided → High
status: New → Triaged
George Kraft (cynerva) wrote:

The workaround is to restart Kubelet.
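
In case it helps others hitting this, a minimal sketch of the workaround (assuming kubelet runs as the snap.kubelet.daemon systemd service, as on a default Charmed Kubernetes worker; <N> is a placeholder for the affected unit):

    # Restart kubelet on the affected worker.
    juju ssh kubernetes-worker/<N> -- sudo systemctl restart snap.kubelet.daemon
    # The node should return to Ready shortly afterwards and the stuck pod should finish terminating.
    kubectl get nodes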

Dorina Timbur (dorina-t) wrote:

Hi, this bug has affected a live production environment a few times already, causing customer outages. The workaround was of course applied. We'd appreciate a quick turnaround on this bug.

Peter Sabaini (peter-sabaini) wrote:

Subscribing field-high as we are hitting this in production

Chris Sanders (chris.sanders) wrote:

I'm unsubscribing field-high from this bug. After investigation, it doesn't meet the definition: this isn't a broken or newly released feature, just a long-standing upstream bug in kubelet.

The upstream bug from above is: https://github.com/kubernetes/kubernetes/issues/87615
The current work-in-progress fix is: https://github.com/kubernetes/kubernetes/pull/95981

When a fix is released we will update applicable packages to make them available.

Chris Johnston (cjohnston) wrote:

This should be released in 1.19.5 and 1.20.
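
To check whether a deployment already carries the fix (a sketch; the channel config option name is an assumption based on the kubernetes-worker charm's usual snap-channel setting):

    # Kubelet version per node, as reported by the API server.
    kubectl get nodes -o wide
    # Snap revision actually installed on a worker.
    juju ssh kubernetes-worker/0 -- snap list kubelet
    # Moving to a channel that contains the fix, if needed.
    juju config kubernetes-worker channel=1.19/stable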
