kubelet loses contact with kubernetes API, spams "use of closed network connection" errors

Bug #1884292 reported by Joshua Genet
This bug affects 4 people
Affects: Kubernetes Worker Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We're unsure whether this is directly related to test_dns_provider itself. It seems to be an issue with kube-apiserver getting into a funky state, but we're not sure how to narrow it down from there.

var/log/syslog:Jun 19 04:17:07 ip-172-31-42-92 kube-apiserver.daemon[26472]: logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"no endpoints available for service \\\"kube-state-metrics\\\"\",\"reason\":\"ServiceUnavailable\",\"code\":503}\n"

---

The metrics server seems to have some issues. These may just be due to kube-apiserver being unable to reach it?

pod-logs/kube-system-metrics-server-v0.3.6-74c87686d-v4s4j-metrics-server-nanny.log:ERROR: logging before flag.Parse: I0619 04:55:53.602595 1 nanny_lib.go:108] Resources are not within the expected limits, updating the deployment. Actual: {Limits:map[] Requests:map[]} Expected: {Limits:map[cpu:{i:{value:0 scale:0} d:{Dec:0xc420407ce0} s: Format:DecimalSI} memory:{i:{value:0 scale:0} d:{Dec:0xc420407e90} s: Format:BinarySI}] Requests:map[cpu:{i:{value:0 scale:0} d:{Dec:0xc420407ce0} s: Format:DecimalSI} memory:{i:{value:0 scale:0} d:{Dec:0xc420407e90} s: Format:BinarySI}]}

---

We're also seeing nginx errors like this on the kubeapi-load-balancer unit.

var/log/nginx.error.log:2020/06/19 04:40:40 [error] 20694#20694: *1351 no live upstreams while connecting to upstream, client: 3.87.109.158, server: _, request: "GET /api/v1/endpoints?limit=500&resourceVersion=0 HTTP/1.1", upstream: "https://target_service/api/v1/endpoints?limit=500&resourceVersion=0", host: "3.81.51.24:443"
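
One way to narrow that down (a sketch, assuming a default Charmed Kubernetes layout where nginx on the load balancer proxies port 443 to kube-apiserver on 6443; the unit name and backend address are placeholders):

    # From the kubeapi-load-balancer unit, probe each kube-apiserver backend
    # from the nginx upstream configuration directly.
    juju ssh kubeapi-load-balancer/0
    curl -k https://<kube-apiserver-ip>:6443/healthz
    # A connection error or timeout here matches the "no live upstreams" symptom;
    # any HTTP response (even 401/403) means the backend itself is reachable.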

---

Here's a link to the artifacts from this run:
https://oil-jenkins.canonical.com/artifacts/aae83527-ae89-44e6-a57a-93a0974b1263/index.html

George Kraft (cynerva) wrote:

kube-state-metrics errors are a red herring.

test_dns_provider timed out after 15 minutes. The test waits for coredns pods to be removed, but it looks like a coredns pod is stuck terminating. The node that hosts the pod is in Unknown status.

Kubelet on that node is failing to communicate with kube-apiserver due to "use of closed network connection" errors:

E0619 04:57:09.410568 14620 server.go:269] Authorization error (user=system:kube-apiserver, verb=get, resource=nodes, subresource=metrics)%!(EXTRA *url.Error=Post https://172.31.39.158:443/apis/authorization.k8s.io/v1/subjectaccessreviews: write tcp 172.31.39.53:57246->172.31.39.158:443: use of closed network connection)
E0619 04:57:11.237701 14620 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "ip-172-31-39-53.ec2.internal": an error on the server ("") has prevented the request from succeeding (get nodes ip-172-31-39-53.ec2.internal)
E0619 04:57:11.237742 14620 kubelet_node_status.go:389] Unable to update node status: update node status exceeds retry count
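
For anyone hitting this, a quick way to confirm the symptom (a sketch; it assumes CoreDNS runs in kube-system and that kubelet runs as the snap.kubelet.daemon service, as on a default Charmed Kubernetes worker; <N> is a placeholder for the affected unit):

    # Spot the affected node and the stuck pod.
    kubectl get nodes                                        # affected node shows NotReady/Unknown
    kubectl get pods -n kube-system -o wide | grep coredns   # stuck pod sits in Terminating on that node
    # Check kubelet logs on that worker for the tell-tale error.
    juju ssh kubernetes-worker/<N> -- sudo journalctl -u snap.kubelet.daemon | grep "use of closed network connection"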

This is a known issue in Kubernetes/Golang:

https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/golang/go/issues/39750

summary: - kube-apiserver 503 during test_dns_provider
         + kubelet loses contact with kube-apiserver with "use of closed network connection" errors
summary: - kubelet loses contact with kube-apiserver with "use of closed network connection" errors
         + kubelet loses contact with kubernetes API, spams "use of closed network connection" errors
no longer affects: charmed-kubernetes-testing
Changed in charm-kubernetes-worker:
importance: Undecided → High
status: New → Triaged
George Kraft (cynerva) wrote:

The workaround is to restart Kubelet.
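
In case it helps others hitting this, a minimal sketch of the workaround (assuming kubelet runs as the snap.kubelet.daemon systemd service, as on a default Charmed Kubernetes worker; <N> is a placeholder for the affected unit):

    # Restart kubelet on the affected worker.
    juju ssh kubernetes-worker/<N> -- sudo systemctl restart snap.kubelet.daemon
    # The node should return to Ready shortly afterwards and the stuck pod should finish terminating.
    kubectl get nodes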

Dorina Timbur (dorina-t) wrote:

Hi, this bug has affected a live production environment a few times already, causing customer outages. The workaround was of course applied. We'd appreciate a quick turnaround on this bug.

Peter Sabaini (peter-sabaini) wrote:

Subscribing field-high as we are hitting this in production

Chris Sanders (chris.sanders) wrote:

I'm unsubscribing field-high from this bug. After investigation, it doesn't meet the definition: this isn't a broken or newly released feature, just a long-standing upstream bug in kubelet.

The upstream bug from above is: https://github.com/kubernetes/kubernetes/issues/87615
The current work-in-progress fix is: https://github.com/kubernetes/kubernetes/pull/95981

When a fix is released we will update applicable packages to make them available.

Chris Johnston (cjohnston) wrote:

This should be released in 1.19.5 and 1.20.
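
To check whether a deployment already carries the fix (a sketch; the channel config option name is an assumption based on the kubernetes-worker charm's usual snap-channel setting):

    # Kubelet version per node, as reported by the API server.
    kubectl get nodes -o wide
    # Snap revision actually installed on a worker.
    juju ssh kubernetes-worker/0 -- snap list kubelet
    # Moving to a channel that contains the fix, if needed.
    juju config kubernetes-worker channel=1.19/stable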
